[R-bloggers] Why R? 2020 (Remote) Call for Papers Extended (and 4 more aRticles)

Why R? 2020 (Remote) Call for Papers Extended
fairmodels: let’s fight with biased Machine Learning models (part 1 — detection)
Spatial GLMM(s) using the INLA Approximation
rfm 0.2.2
FNN-VAE for noisy time series forecasting

Why R? 2020 (Remote) Call for Papers Extended

Posted: 31 Jul 2020 09:00 AM PDT

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This decided to give you one more week to submit a talk or a workshop to Call for Papers for 2020.whyr.pl remote conference. Please fill this form 2020.whyr.pl/submit/ if you are interested in an active participation.

The new deadline for submissions is 2020-08-07 23:59 CEST (UTC+2)!

Looking forward to your submissions!

As the meeting is held in English we invite R users from all over the globe!

We will stream the conference on youtube.com/WhyRFoundation. The channel already contains past R webinars and keynote talks from previous conferences.

Find out our Social Media channels:

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

fairmodels: let’s fight with biased Machine Learning models (part 1 — detection)

Posted: 31 Jul 2020 08:36 AM PDT

[This article was first published on Stories by Przemyslaw Biecek on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

fairmodels: let's fight with biased Machine Learning models (part 1 — detection)

Author: Jakub Wiśniewski

TL;DR

The fairmodels R Package facilitates bias detection through model visualizations. It implements few mitigation strategies that could reduce the bias. It enables easy to use checks for fairness metrics and comparison between different Machine Learning (ML) models.

Longer version

Fairness in ML is a quickly emerging field. Big companies like IBM or Google developed some tools already (see AIF360) with growing community of users. Unfortunately, there aren't many tools enabling to discover bias and discrimination in machine learning models created in R. Therefore, checking the fairness of the classifier created in R might be a difficult task. This is why R package fairmodels was created.

Introduction to fairness concepts

What does it mean that model is fair? Imagine we have a classification model which decisions would have some impact on a human. For example, the model must decide whether some individuals will get a loan or not. What we don't want is our model predictions to be based on sensitive (later called protected) attributes such as sex, race, nationality, etc… because it could potentially harm some unprivileged groups of people. However, not using such variables might not be enough because the correlations are usually hidden deep inside the data. That is what fairness in ML is for. It checks if privileged and unprivileged groups are treated similarly and if not, it offers some bias mitigation techniques.

There are numerous fairness metrics such as Statistical Parity, Equalized odds, Equal opportunity, and more. They check if model properties on privileged and unprivileged groups are the same

Equal opportunity criterion is satisfied when probabilities for 2 subgroups where A = 1 denotes privileged one are equal.

Many of these metrics can be derived from the confusion matrix. For example, Equal opportunity is ensuring the equal rate of TPR (True Positive Rate) among subgroups in the protected variable. However, knowing these rates is not essential information for us. We would like to know whether the difference between these rates between the privileged group and the unprivileged ones is significant. Let's say that the acceptable difference in fairness metrics is 0.1. We will call this epsilon. TPR criterion for this metric would be:

For all subgroups (unique values in the protected variable) the fairness metric difference between subgroup denoted as i and the **privileged** one must be lower than some acceptable value epsilon ( 0.1 in our case ).

Such a criterion is double-sided. It also ensures that there is not much difference in favour of the unprivileged group.

fairmodels as bias detection tool

fairmodels is R package for discovering, eliminating, and visualizing bias. Its main function — fairness_check() enables the user to quickly check if popular fairness metrics are satisfied. fairness_check() return an object called fairness_object. It wraps models together with metrics in useful structure. To create this object we need to provide:

Classification model explained with DALEX
The protected variable in the form of a factor. Unlike in other fairness tools, it doesn't need to be binary, just discrete.
The privileged group in the form of character or factor

So let's see how it works in practice. We will make a linear regression model with german credit data predicting whether a certain person makes more or less than 50k annually. Sex will be used as a protected variable.

Create a model

library(fairmodels)
data("german")
y_numeric <- as.numeric(german$Risk) -1
lm_model <- glm(Risk~., data = german, family=binomial(link="logit"))

2. Create an explainer

library(DALEX)
explainer_lm <- explain(lm_model, data = german[,-1], y = y_numeric)

3. Use the fairness_check(). Here the epsilon value is set to default which is 0.1

fobject <- fairness_check(explainer_lm,
                          protected = german$Sex,
                          privileged = "male")

Now we can check the level of bias

print(fobject)- prints information in console. Tells us how many metrics model passes and what is the total difference (loss) in all metrics

plot(object) — returns ggplot object. Shows red and green areas, where the red field signifies bias. If the bar reaches the left red field it means that the unprivileged group is discriminated, if it reaches the right red zone it means that there is a bias towards the privileged group.

As we can see checking fairness is not difficult. What is more complicated is comparing the discrimination between models. But even this can be easily done with fairmodels!

fairmodels is flexible

When we have many models, they can be passed into one fairness_check() together. Moreover, there is possible an iterative approach. As we explain the model and it does not satisfy fairness criteria, we can add other models along with fairness_object to fairness_check(). That way even the same model with different parameters and/or trained on different data can be compared with the previous one(s).

library(ranger)

rf_model     <- ranger(Risk ~., data = german, probability = TRUE)
explainer_rf <- explain(rf_model, data = german[,-1], y = y_numeric)
fobject      <- fairness_check(explainer_rf, fobject)

print(fobject) with additional explainer

That is it. Ranger model passes our fairness criteria (epsilon = 0.1) and therefore is fair.

Summary

fairmodels is flexible and easy to use tool for asserting that the ML model is fair. It can handle multiple models, trained on different data no matter if it was encoded, features were standardized, etc… It facilitates the bias detection process in multiple models allowing at the same time to compare those models with each other.

Learn more

Check the package's GitHub website for more details
Tutorial on full capabilities of the fairmodels package
Tutorial on bias mitigation techniques
Keynote of Erin LeDell at UseR2020

To leave a comment for the author, please follow the link and comment on their blog: Stories by Przemyslaw Biecek on Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Spatial GLMM(s) using the INLA Approximation

Posted: 30 Jul 2020 05:00 PM PDT

[This article was first published on Corey Sparks R blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The INLA Approach to Bayesian models

The Integrated Nested Laplace Approximation, or INLA, approach is a recently developed, computationally simpler method for fitting Bayesian models [(Rue et al., 2009, compared to traditional Markov Chain Monte Carlo (MCMC) approaches. INLA fits models that are classified as latent Gaussian models, which are applicable in many settings (Martino & Rue, 2010. In general, INLA fits a general form of additive models such as:

\(\eta = \alpha + \sum_{j=1}^{nf} f^{(j)}(u_{ij}) + \sum_{k=1}^{n\beta}\beta_k z_{ki} + \epsilon_i\)

where \(\eta\) is the linear predictor for a generalized linear model formula, and is composed of a linear function of some variables u, \(\beta\) are the effects of covariates, z, and \(\epsilon\) is an unstructured residual (Rue et al., 2009). As this model is often parameterized as a Bayesian one, we are interested in the posterior marginal distributions of all the model parameters. Rue and Martino (2007) show that the posterior marginal for the random effects (x) in such models can be approximated as:

\(\tilde{p}(x_i|y) = \sum_k \tilde{p}(x_i|\theta_k, y) \tilde{p}(\theta_k|y) \Delta_k\)

via numerical integration (Rue & Martino, 2007; Schrodle & Held, 2011a, 2011b). The posterior distribution of the hyperparameters (\(\theta\)) of the model can also be approximated as:

\(\tilde{p}(\theta | y)) \propto \frac{p(x, \theta, y)}{\tilde{p}G(x| \theta,y)} \mid _{x} = x^*(\theta)\)

, where G is a Gaussian approximation of the posterior and \(x^*(\theta)\) is the mode of the conditional distribution of \(p(x|\theta,y)\). Thus, instead of using MCMC to find an iterative, sampling-based estimate of the posterior, it is arrived at numerically. This method of fitting the spatial models specified above has been presented by numerous authors (Blangiardo & Cameletti, 2015; Blangiardo et al., 2013; Lindgren & Rue, 2015; Martins et al., 2013; Schrodle & Held, 2011a, 2011b), with comparable results to MCMC.

Libraries

#library(rgdal)  library(spdep)  library(RColorBrewer)  library(lattice)  library(INLA)  library(tigris)  library(tidycensus)  library(ggplot2)  library(dplyr)

Data

I have the data on my github site under the nhgis_vs page. These are data from the NHGIS project by IPUMS who started providing birth and death data from the US Vital statistics program.

The data we will use here are infant mortality rates in US counties between 2000 and 2007.

files<-list.files("~/ExpanDrive/Google Drive/classes/dem7473/data/nhgis0022_csv/nhgis0022_csv/", pattern = "*.csv", full.names = T)  vital<-lapply(files, read.csv, header=T)      library(plyr)  df <- ldply(vital, data.frame)  df$cofips<-paste(substr(df$GISJOIN, 2,3), substr(df$GISJOIN, 5,7), sep="")    df<-df%>%    filter(YEAR %in%2000:2007)%>%    mutate(rate=as.numeric(AGWJ001) )%>%    select(YEAR, cofips,rate)  head(df)

##   YEAR cofips rate  ## 1 2000  01001   34  ## 2 2000  01003   61  ## 3 2000  01005  125  ## 4 2000  01007   70  ## 5 2000  01009   89  ## 6 2000  01011  242

Census intercensus population estimates

From the Census population estimates program

popurl<-url("http://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/county/co-est00int-tot.csv")  pops<-read.csv(popurl)  names(pops)<-tolower(names(pops))  pops<-pops%>%    mutate(cofips = paste(sprintf(fmt = "%02d", state), sprintf(fmt = "%03d",county), sep=""))%>%    filter(sumlev==50, !state%in%c(2, 15))    head(pops)

##   sumlev region division state county  stname        ctyname estimatesbase2000  ## 1     50      3        6     1      1 Alabama Autauga County             43751  ## 2     50      3        6     1      3 Alabama Baldwin County            140416  ## 3     50      3        6     1      5 Alabama Barbour County             29042  ## 4     50      3        6     1      7 Alabama    Bibb County             19856  ## 5     50      3        6     1      9 Alabama  Blount County             50982  ## 6     50      3        6     1     11 Alabama Bullock County             11603  ##   popestimate2000 popestimate2001 popestimate2002 popestimate2003  ## 1           44021           44889           45909           46800  ## 2          141342          144875          147957          151509  ## 3           29015           28863           28653           28594  ## 4           19913           21028           21199           21399  ## 5           51107           51845           52551           53457  ## 6           11581           11358           11256           11316  ##   popestimate2004 popestimate2005 popestimate2006 popestimate2007  ## 1           48366           49676           51328           52405  ## 2          156266          162183          168121          172404  ## 3           28287           28027           27861           27757  ## 4           21721           22042           22099           22438  ## 5           54124           54624           55485           56240  ## 6           11056           11011           10776           11011  ##   popestimate2008 popestimate2009 census2010pop popestimate2010 cofips  ## 1           53277           54135         54571           54632  01001  ## 2          175827          179406        182265          183195  01003  ## 3           27808           27657         27457           27411  01005  ## 4           22705           22941         22915           22867  01007  ## 5           57055           57341         57322           57338  01009  ## 6           10953           10987         10914           10890  01011

Data prep

pops.long<-reshape(data = pops, idvar = "cofips", varying = list(names(pops)[9:16]), direction="long", drop = names(pops)[c(2,3,4,5,6,8,17,18,19,20)], v.names = "population")  pops.long$year<-pops.long$time+1999  head(pops.long)

##         sumlev        ctyname cofips time population year  ## 01001.1     50 Autauga County  01001    1      44021 2000  ## 01003.1     50 Baldwin County  01003    1     141342 2000  ## 01005.1     50 Barbour County  01005    1      29015 2000  ## 01007.1     50    Bibb County  01007    1      19913 2000  ## 01009.1     50  Blount County  01009    1      51107 2000  ## 01011.1     50 Bullock County  01011    1      11581 2000

dat.long<-merge(pops.long, df, by.x=c("cofips", "year"), by.y=c("cofips", "YEAR"))  head(dat.long)

##   cofips year sumlev        ctyname time population rate  ## 1  01001 2000     50 Autauga County    1      44021   34  ## 2  01001 2001     50 Autauga County    2      44889   78  ## 3  01001 2002     50 Autauga County    3      45909   83  ## 4  01001 2003     50 Autauga County    4      46800   79  ## 5  01001 2004     50 Autauga County    5      48366   76  ## 6  01001 2005     50 Autauga County    6      49676  124

Get census data using tidycensus

Here I get data from the 2000 decennial census summary file 3

#v00<-load_variables(year=2000, dataset = "sf3", cache = T)  cov_dat<-get_decennial(geography = "county", year = 2000, sumfile = "sf3",                         summary_var = "P001001",                         variables = c("P007003", "P007004","P007010","P053001", "P089001", "P089002" ),                        output = "wide")

## Getting data from the 2000 decennial Census

cov_dat<-cov_dat%>%    mutate(cofips=GEOID,pwhite=P007003/summary_value, pblack=P007004/summary_value, phisp=P007010/summary_value,medhhinc=as.numeric(scale(P053001)), ppov=P089002/P089001)      final.dat<-merge(dat.long, cov_dat, by="cofips")  head(final.dat)

##   cofips year sumlev        ctyname time population rate GEOID           NAME  ## 1  01001 2006     50 Autauga County    7      51328   93 01001 Autauga County  ## 2  01001 2003     50 Autauga County    4      46800   79 01001 Autauga County  ## 3  01001 2004     50 Autauga County    5      48366   76 01001 Autauga County  ## 4  01001 2005     50 Autauga County    6      49676  124 01001 Autauga County  ## 5  01001 2000     50 Autauga County    1      44021   34 01001 Autauga County  ## 6  01001 2007     50 Autauga County    8      52405   83 01001 Autauga County  ##   P007003 P007004 P007010 P053001 P089001 P089002 summary_value    pwhite  ## 1   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 2   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 3   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 4   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 5   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 6   34760    7450     394   42013   43377    4738         43671 0.7959515  ##      pblack       phisp  medhhinc      ppov  ## 1 0.1705938 0.009022005 0.7593459 0.1092284  ## 2 0.1705938 0.009022005 0.7593459 0.1092284  ## 3 0.1705938 0.009022005 0.7593459 0.1092284  ## 4 0.1705938 0.009022005 0.7593459 0.1092284  ## 5 0.1705938 0.009022005 0.7593459 0.1092284  ## 6 0.1705938 0.009022005 0.7593459 0.1092284

Create expected numbers of cases

In count data models, and spatial epidemiology, we have to express the raw counts of events relative to some expected value, or population offset, see this Rpub for a reminder.

#ratesyr<-aggregate(rate~year, final.dat,mean) #in this case, we will standardize to the average IMR for the period  #ratesyr$E<-ratesyr$rate  #final.dat<-merge(final.dat, ratesyr[,-2], by="year")  #rates<-aggregate(rate~1, final.dat, mean)  final.dat$E_d<-mean(final.dat$rate)    final.dat<-final.dat[order(final.dat$cofips, final.dat$year),]  final.dat$id<-1:dim(final.dat)[1]    head(final.dat)

##   cofips year sumlev        ctyname time population rate GEOID           NAME  ## 5  01001 2000     50 Autauga County    1      44021   34 01001 Autauga County  ## 8  01001 2001     50 Autauga County    2      44889   78 01001 Autauga County  ## 7  01001 2002     50 Autauga County    3      45909   83 01001 Autauga County  ## 2  01001 2003     50 Autauga County    4      46800   79 01001 Autauga County  ## 3  01001 2004     50 Autauga County    5      48366   76 01001 Autauga County  ## 4  01001 2005     50 Autauga County    6      49676  124 01001 Autauga County  ##   P007003 P007004 P007010 P053001 P089001 P089002 summary_value    pwhite  ## 5   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 8   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 7   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 2   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 3   34760    7450     394   42013   43377    4738         43671 0.7959515  ## 4   34760    7450     394   42013   43377    4738         43671 0.7959515  ##      pblack       phisp  medhhinc      ppov      E_d id  ## 5 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  1  ## 8 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  2  ## 7 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  3  ## 2 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  4  ## 3 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  5  ## 4 0.1705938 0.009022005 0.7593459 0.1092284 72.33683  6

options(scipen=999)

Next we make the spatial information, we get the polygons from census directly using counties from the tigris package. We drop counties not in the contiguous 48 US states.

us_co<-counties( cb = T)  us_co<-us_co%>%    subset(!STATEFP%in%c("02", "15", "60", "66", "69", "72", "78"))%>%    filter(STATEFP%in%c("01", "05", "12", "13", "21", "22", "28", "37", "45", "47", "48", "51", "40"))

Construction of spatial relationships:

Contiguity based neighbors

In a general sense, we can think of a square grid. Cells that share common elements of their geometry are said to be "neighbors". There are several ways to describe these patterns, and for polygons, we generally use the rules of the chess board.

Rook adjacency Neighbors must share a line segment

Queen adjacency Neighbors must share a vertex or a line segment

If polygons share these boundaries (based on the specific definition: rook or queen), they are said to be "spatial neighbors" of one another. The figure below illustrates this principle.

For an observation of interest, the pink area, the Rood adjacent areas are those in green in the figure, because they share a line segment. For the second part of the figure on the right, the pink area has different sets of neighbors, compared to the Rook rule neighbors, because the area also shares vertices with other polygons, making them Queen neighbors.

Adjacency using Chessboard Rules

Order of adjacency

The figure above also highlights the order of adjacency among observations. By order of adjacency, we simply men that observations are either immediate neighbors (the green areas), or they are neighbors of immediate neighbors. These are referred to as first and second order neighbors.

So, we can see, that the yellow polygons are the neighboring areas for this tract, which allows us to think about what the spatial structure of the area surrounding this part of campus.

For an example, let's consider the case of San Antonio again. If our data are polygons, then there is a function in the spdep library in R, poly2nb that will take a polygon layer and find the neighbors of all areas using either a queen or rook rule. First we form the neighbors using the rook rule for all the tracts in Bexar County.

Distance based association

The queen and rook rules are useful for polygon features, but distance based contiguity is useful for all feature types (points, polygons, lines). The idea is similar to the polygon adjacency rule from above, but the distance rule is based on the calculated distance between areas. There are a variety of distance metrics that are used in statistics, but the most commonly assumed one is the Euclidean distance. The Euclidean distance between any two points is:

\[D^2 = \sqrt{\left (x_1 – x_2 \right)^2 + \left (y_1 – y_2 \right)^2 } \] Where x and y are the coordinates of each of the two areas. For polygons, these coordinates are typically the centroid of the polygon (you may have noticed this above when we were plotting the neighbor lists), while for point features, these are the two dimensional geometry of the feature. The collection of these distances between all features forms what is known as the distance matrix between observations. This summarizes all distances between all features in the data.

K nearest neighbors

A useful way to use distances is to construct a k-nearest neighbors set.
This will find the "k" closest observations for each observation, where k is some integer.
For instance if we find the k=3 nearest neighbors, then each observation will have 3 neighbors, which are the closest observations to it, regardless of the distance between them which is important.
Using the k nearest neighbor rule, two observations could potentially be very far apart and still be considered neighbors.

#In INLA, we don't need FIPS codes, we need a simple numeric index for our counties  us_co$struct<-1:dim(us_co)[1]  nbs<-knearneigh(coordinates(as_Spatial(us_co) ), k = 5, longlat = T) #k=5 nearest neighbors  nbs<-knn2nb(nbs, row.names = us_co$struct, sym = T) #force symmetry!!  mat <- nb2mat(nbs, style="B",zero.policy=TRUE)  colnames(mat) <- rownames(mat)   mat <- as.matrix(mat[1:dim(mat)[1], 1:dim(mat)[1]])      nb2INLA("cl_graph",nbs)  am_adj <-paste(getwd(),"/cl_graph",sep="")  H<-inla.read.graph(filename="cl_graph")  #image(inla.graph2matrix(H), xlab="", ylab="", main="")

Plot geographies

library(sf)  us_co<-st_as_sf(us_co)  us_co$cofips<-paste(us_co$STATEFP, us_co$COUNTYFP, sep="")  us_co%>%    ggplot()+geom_sf()+coord_sf(crs =7603)

final.dat<-merge( us_co,final.dat, by="cofips", all.y=F)  final.dat<-final.dat[order(final.dat$cofips, final.dat$year),]  final.dat%>%    filter(year==2000)%>%    ggplot()+geom_sf(aes(fill=rate, col=rate))+coord_sf(crs =7603)+ggtitle("Infant Mortality Rate - 2000")

Model setup

We have a count outcome (deaths and births), in counties over time, and a set of time-constant covariates.
We have several options in the GLM framework with which to model these data, for example:
Binomial – \[y_{ij} \sim Bin(\pi_{ij}) \text{: } logit(\pi_{ij} ) = \beta_{0}+ x'\beta_k \]
Poisson – \[y_{ij} \sim Pois(\lambda_{ij} E_{ij}) \text{: } log(\lambda_{ij} ) = log(E_{ij}) + \beta_{0}+ x'\beta_k \]
Negative Binomial – \[y_{ij} \sim \text{Neg Bin} (\mu_{ij}, \alpha, E_{ij}) \text{: } log(\mu_{ij} ) = log(E_{ij}) + \beta_{0}+ x'\beta_k \]
In addition to various zero-inflated versions of these data.

ggplot(data = final.dat)+geom_histogram(aes(x =log(rate) , y=0.5*..density..))+facet_wrap(~year)+    ggtitle(label = "Distribution of Infant Mortality Rate by Year", subtitle = "US Counties, 2000-2007")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1849 rows containing non-finite values (stat_bin).

ggplot(data = final.dat)+geom_histogram(aes(x =log(rate/E_d) , y=0.5*..density..))+facet_wrap(~year)+    ggtitle(label = "Distribution of Infant Mortality Relative Risk by Year", subtitle = "US Counties, 2000-2007")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1849 rows containing non-finite values (stat_bin).

final.dat%>%    dplyr::group_by(cofips)%>%    dplyr::summarise(meanrate=mean(rate), meanE=mean(E_d))%>%    dplyr::mutate(rate_e=meanrate/meanE)%>%    #ungroup()    # filter(year%in%c(2000))%>%    # mutate(rate_e=rate/E_d)%>%     mutate(qrr=cut(rate_e, breaks = quantile(rate_e, p=seq(0,1,length.out = 7)), include.lowest = T))%>%    ggplot()+    geom_sf(aes(fill=qrr))+    scale_fill_brewer(type="seq", palette = "RdBu")+    ggtitle(label="Relative Risk Quartile - period average 2000 to 2007")+    coord_sf(crs = 7603)

## `summarise()` ungrouping output (override with `.groups` argument)

We can fit these model using the Bayesian framework with INLA.

First, we consider the basic GLM for the mortality outcome, with out any hierarchical structure. We can write this model as a Negative Binomial model, for instance as:

\[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\mu_{ij} = \text{log(E_d)}_{ij} + X' \beta\]

INLA will use vague Normal priors for the \(\beta\)'s, and we have other parameters in the model to specify priors for. INLA does not require you to specify all priors, as all parameters have a default prior specification. In this example, I will use a \(Gamma(1, .5)\) prior for all hierarchical variance terms.

#Model specification:  f1<-rate~scale(pblack)+scale(phisp)+scale(ppov)+year    #Model fit  mod1<-inla(formula = f1,data = final.dat, #linear predictor - fixed effects             family = "nbinomial", E = E_d,  #marginal distribution for the outcome, expected count             control.compute = list(waic=T), # compute DIC or not?             control.predictor = list(link=1), #estimate predicted values & their marginals or not?             num.threads = 2,                  verbose = F)  #model summary  summary(mod1)

##   ## Call:  ##    c("inla(formula = f1, family = \"nbinomial\", data = final.dat, E =   ##    E_d, ", " verbose = F, control.compute = list(waic = T),   ##    control.predictor = list(link = 1), ", " num.threads = 2)")   ## Time used:  ##     Pre = 0.928, Running = 21.8, Post = 0.722, Total = 23.5   ## Fixed effects:  ##                 mean     sd 0.025quant 0.5quant 0.975quant   mode kld  ## (Intercept)   -5.047 10.723    -26.102   -5.048     15.989 -5.047   0  ## scale(pblack)  0.159  0.015      0.130    0.159      0.188  0.159   0  ## scale(phisp)  -0.025  0.013     -0.050   -0.025      0.001 -0.025   0  ## scale(ppov)    0.041  0.015      0.012    0.041      0.070  0.041   0  ## year           0.003  0.005     -0.008    0.003      0.013  0.003   0  ##   ## Model hyperparameters:  ##                                                         mean    sd 0.025quant  ## size for the nbinomial observations (1/overdispersion) 0.624 0.009      0.608  ##                                                        0.5quant 0.975quant  ## size for the nbinomial observations (1/overdispersion)    0.624      0.641  ##                                                         mode  ## size for the nbinomial observations (1/overdispersion) 0.624  ##   ## Expected number of effective parameters(stdev): 5.04(0.001)  ## Number of equivalent replicates : 2124.92   ##   ## Watanabe-Akaike information criterion (WAIC) ...: 114586.38  ## Effective number of parameters .................: 10.27  ##   ## Marginal log-Likelihood:  -57331.80   ## Posterior marginals for the linear predictor and  ##  the fitted values are computed

Plot our observed vs fitted values

plot(x= mod1$summary.fitted.values$mean, y=final.dat$rate/final.dat$E_d , ylab="Observed", xlab="Estimated" )

Basic county level random intercept model

Now we add basic nesting of rates within counties, with a random intercept term for each county. This would allow there to be heterogeneity in the mortality rate for each county, over and above each county's observed characteristics.

This model would be:

\[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\mu_{ij} = \text{log(E_d)}_{ij} + X' \beta + u_j\] \[u_j \sim \text{Normal} (0 , \tau_u)\]

where \(\tau_u\) here is the precision, not the variance and precision = 1/variance. INLA puts a log-gamma prior on the the precision by default.

f2<-rate~scale(pblack)+scale(phisp)+scale(ppov)+year+ #fixed effects    f(struct, model = "iid",param=c(1,.5))  #random effects    mod2<-inla(formula = f2,data = final.dat,             family = "nbinomial", E = E_d,              control.compute = list(waic=T),              control.predictor = list(link=1),             num.threads = 2,                  verbose = F)    #total model summary  summary(mod2)

##   ## Call:  ##    c("inla(formula = f2, family = \"nbinomial\", data = final.dat, E =   ##    E_d, ", " verbose = F, control.compute = list(waic = T),   ##    control.predictor = list(link = 1), ", " num.threads = 2)")   ## Time used:  ##     Pre = 0.571, Running = 160, Post = 1.36, Total = 162   ## Fixed effects:  ##                 mean     sd 0.025quant 0.5quant 0.975quant   mode kld  ## (Intercept)   -2.824 10.758    -23.945   -2.824     18.279 -2.824   0  ## scale(pblack)  0.158  0.015      0.128    0.158      0.189  0.158   0  ## scale(phisp)  -0.041  0.014     -0.069   -0.041     -0.013 -0.041   0  ## scale(ppov)    0.044  0.015      0.014    0.044      0.074  0.044   0  ## year           0.001  0.005     -0.009    0.001      0.012  0.001   0  ##   ## Random effects:  ##   Name     Model  ##     struct IID model  ##   ## Model hyperparameters:  ##                                                          mean    sd 0.025quant  ## size for the nbinomial observations (1/overdispersion)  0.627 0.009      0.609  ## Precision for struct                                   50.626 7.005     38.292  ##                                                        0.5quant 0.975quant  ## size for the nbinomial observations (1/overdispersion)    0.627      0.644  ## Precision for struct                                     50.138     65.780  ##                                                          mode  ## size for the nbinomial observations (1/overdispersion)  0.626  ## Precision for struct                                   49.174  ##   ## Expected number of effective parameters(stdev): 125.34(15.33)  ## Number of equivalent replicates : 85.47   ##   ## Watanabe-Akaike information criterion (WAIC) ...: 114610.09  ## Effective number of parameters .................: 66.26  ##   ## Marginal log-Likelihood:  -57375.58   ## Posterior marginals for the linear predictor and  ##  the fitted values are computed

Marginal Distributions of hyperparameters

We can plot the posterior marginal of the hyperparameter in this model, in this case \(\sigma_u = 1/\tau_u\)

m2<- inla.tmarginal(          function(x) (1/x), #invert the precision to be on variance scale          mod2$marginals.hyperpar$`Precision for struct`)  #95% credible interval for the variance  inla.hpdmarginal(.95, marginal=m2)

##                   low       high  ## level:0.95 0.01491462 0.02565338

plot(m2, type="l", main=c("Posterior distibution for between county variance", "- IID model -"))

final.dat$fitted_m2<-mod2$summary.fitted.values$mean    p1<-final.dat%>%    filter(year%in%c(2000))%>%    mutate(qrr=cut(fitted_m2, breaks = quantile(final.dat$fitted_m2, p=seq(0,1,length.out = 6)), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2000")+coord_sf(crs = 7603)      p2<-final.dat%>%    filter(year%in%c(2007))%>%    mutate(qrr=cut(fitted_m2, breaks = quantile(final.dat$fitted_m2, p=seq(0,1,length.out = 6)), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2007")+coord_sf(crs = 7603)    library(gridExtra)

##   ## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':  ##   ##     combine

pall<-grid.arrange(p1, p2, nrow=2)

pall

## TableGrob (2 x 1) "arrange": 2 grobs  ##   z     cells    name           grob  ## 1 1 (1-1,1-1) arrange gtable[layout]  ## 2 2 (2-2,1-1) arrange gtable[layout]

# library(mapview)  #   # map1<-final.dat%>%  #   filter(year%in%c(2007))%>%  #   mutate(qrr=cut(fitted_m2, breaks = quantile(fitted_m2, p=seq(0,1,length.out = 8))))  # clrs <- colorRampPalette(brewer.pal(8, "RdBu"))  # mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs)

BYM Model

Model with spatial correlation – Besag, York, and Mollie (1991) model and temporal heterogeneity \[\text{Deaths}_{ij} \sim NB(\mu_{ij}, \gamma)\] \[\mu_{ij} = \text{log(E_d)}_{ij} + X' \beta + u_j + v_j + \gamma_t\]

Which has two random effects, one an IID random effect and the second a spatially correlated random effect, specified as a conditionally auto-regressive prior for the \(v_j\)'s. This is the Besag model:

\[v_j|v_{\neq j},\sim\text{Normal}(\frac{1}{n_i}\sum_{i\sim j}v_j,\frac{1}{n_i\tau})\] and \(u_j\) is an IID normal random effect, \(\gamma_t\) is also given an IID Normal random effect specification, and there are now three hyperparameters, \(\tau_u\) and \(\tau_v\) and \(\tau_{\gamma}\) and each are given log-gamma priors.

For the BYM model we must specify the spatial connectivity matrix in the random effect.

#final.dat$year_c<-final.dat$year - 2004  f3<-rate~scale(pblack)+scale(phisp)+scale(ppov)+    f(struct, model = "bym", constr = T, scale.model = T, graph = H,param=c(1,.5))+    f(year, model="iid",param=c(1,.5)) #temporal random effect    mod3<-inla(formula = f3,data = final.dat,             family = "nbinomial", E = E_d,              control.compute = list(waic=T),              num.threads = 2,                  verbose = F,             control.predictor = list(link=1))    #total model summary  summary(mod3)

##   ## Call:  ##    c("inla(formula = f3, family = \"nbinomial\", data = final.dat, E =   ##    E_d, ", " verbose = F, control.compute = list(waic = T),   ##    control.predictor = list(link = 1), ", " num.threads = 2)")   ## Time used:  ##     Pre = 0.737, Running = 138, Post = 1.26, Total = 140   ## Fixed effects:  ##                 mean    sd 0.025quant 0.5quant 0.975quant   mode kld  ## (Intercept)    0.115 0.129     -0.145    0.115      0.374  0.115   0  ## scale(pblack)  0.157 0.016      0.126    0.158      0.189  0.158   0  ## scale(phisp)  -0.039 0.016     -0.069   -0.039     -0.007 -0.040   0  ## scale(ppov)    0.043 0.016      0.012    0.043      0.075  0.043   0  ##   ## Random effects:  ##   Name     Model  ##     struct BYM model  ##    year IID model  ##   ## Model hyperparameters:  ##                                                            mean       sd  ## size for the nbinomial observations (1/overdispersion)    0.627    0.009  ## Precision for struct (iid component)                     51.094    7.099  ## Precision for struct (spatial component)               1974.289 1903.577  ## Precision for year                                        8.760    4.130  ##                                                        0.025quant 0.5quant  ## size for the nbinomial observations (1/overdispersion)      0.609    0.627  ## Precision for struct (iid component)                       38.602   50.591  ## Precision for struct (spatial component)                  174.447 1425.658  ## Precision for year                                          2.885    8.075  ##                                                        0.975quant    mode  ## size for the nbinomial observations (1/overdispersion)      0.644   0.628  ## Precision for struct (iid component)                       66.447  49.595  ## Precision for struct (spatial component)                 7055.730 496.592  ## Precision for year                                         18.742   6.583  ##   ## Expected number of effective parameters(stdev): 133.75(15.30)  ## Number of equivalent replicates : 80.09   ##   ## Watanabe-Akaike information criterion (WAIC) ...: 114605.76  ## Effective number of parameters .................: 69.81  ##   ## Marginal log-Likelihood:  -56934.15   ## Posterior marginals for the linear predictor and  ##  the fitted values are computed

plot(y=mod3$summary.random$year$mean,x=unique(final.dat$year), type="l")

m3a<- inla.tmarginal(          function(x) (1/x),          mod3$marginals.hyperpar$`Precision for struct (iid component)`)  m3b<- inla.tmarginal(          function(x) (1/x),          mod3$marginals.hyperpar$`Precision for struct (spatial component)`)  m3c<- inla.tmarginal(          function(x) (1/x),          mod3$marginals.hyperpar$`Precision for year`)    plot(m3a, type="l", main=c("Posterior distibution for between county variance", "- BYM model -"), xlim=c(0, .2), ylim=c(0, 300))  lines(m3b, col="red")  lines(m3c, col="green")  legend("topright", legend=c("BYM IID", "BYM Spatial", "Year"), col=c(1, "red", "green"), lty=c(1,1,1))

#HPD intervals  inla.hpdmarginal(.95,m3a)

##                   low       high  ## level:0.95 0.01475866 0.02544088

inla.hpdmarginal(.95,m3b)

##                      low        high  ## level:0.95 0.00005416961 0.003970123

inla.hpdmarginal(.95,m3c)

##                   low      high  ## level:0.95 0.03927999 0.2945931

This indicates very low spatially correlated variance in these data.

Space-time mapping of the fitted values

final.dat$fitted_m3<-mod3$summary.fitted.values$mean    p3<-final.dat%>%    filter(year%in%c(2000))%>%    mutate(qrr=cut(fitted_m3, breaks = quantile(final.dat$fitted_m3, p=seq(0,1,length.out = 6)), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2000")+coord_sf(crs = 7603)      p4<-final.dat%>%    filter(year%in%c(2007))%>%    mutate(qrr=cut(fitted_m3, breaks = quantile(final.dat$fitted_m3, p=seq(0,1,length.out = 6)), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Relative Risk Quartile"))+ggtitle(label="Relative Risk Quartile - IID Model, 2007")+coord_sf(crs = 7603)    pall2<-grid.arrange(p3, p4, nrow=2)

pall2

## TableGrob (2 x 1) "arrange": 2 grobs  ##   z     cells    name           grob  ## 1 1 (1-1,1-1) arrange gtable[layout]  ## 2 2 (2-2,1-1) arrange gtable[layout]

#library(mapview)    #map1<-final.dat%>%  #  filter(year%in%c(2007))%>%  #  mutate(qrr=cut(fitted_m3, breaks = quantile(fitted_m3, p=seq(0,1,length.out = 8))))  #clrs <- colorRampPalette(brewer.pal(8, "RdBu"))  #mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs)

Map of spatial random effects

It is common to map the random effects from the BYM model to look for spatial trends, in this case, there are not strong spatial signals:

us_co$sp_re<-mod3$summary.random$struct$mean[1:length(unique(final.dat$cofips))]  us_co%>%    mutate(qse=cut(sp_re, breaks = quantile(sp_re, p=seq(0,1,length.out = 6)), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qse))+scale_colour_brewer(palette = "RdBu" )+scale_fill_brewer(palette = "RdBu", na.value="grey")+guides(fill=guide_legend(title="Spatial Excess Risk"))+ggtitle(label="Spatial Random Effect - BYM Model")+coord_sf(crs = 7603)

Exceedence probabilities

In Bayesian spatial models that are centered on an epidemiological type of outcome, it is common to examine the data for spatial clustering. One way to do this is to examine the clustering in the relative risk from one of these GLMM models. For instance if \(\theta\) is the relative risk \[\theta = exp(\beta_0 + \beta_1*x_1 + u_j)\] from one of our Negative binomial models above. We can use the posterior marginals of the relative risk to ask \(\theta \gt \theta^*\) where \(\theta^*\) is a specific level of excess risk, say 50% extra or \(\theta > 1.25\). If the density, or \(\text{Pr}(\theta \gt \theta^*)\) is high, then there is evidence that the excess risk is not only high, but significantly high.

To get the exceedence probabilities from one of our models, we can use the inla.pmarginal() function to ask if \(\text{Pr}(\theta \gt \theta^*)\)

thetastar<-1.25#theta*  inlaprob<- unlist(lapply(mod3$marginals.fitted.values, function(X){     1-inla.pmarginal(thetastar, X)  }))  hist(inlaprob)

So, we see lots of occasions where the exceedence probability is greater than .9. We can visualize these in a map.

final.dat$exceedprob<-inlaprob    final.dat%>%    filter(year%in%c(2007))%>%    mutate(qrr=cut(exceedprob, breaks = c(0, .5, .9, .95, .99, 1), include.lowest = T))%>%    ggplot()+geom_sf(aes(fill=qrr))+scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title=""))+ggtitle(label=expression(paste("Exceedence Probability Relative Risk ","Pr( ",theta," >1.25"," )  - 2007") ))+coord_sf(crs = 7603)

#library(mapview)    #map1<-final.dat%>%   # filter(year%in%c(2007))%>%  #  mutate(qrr=cut(exceedprob, breaks = c(0, .5, .9, .95, .99, 1), include.lowest = T))      #clrs <- colorRampPalette(brewer.pal(6, "Blues"))  #mapView(as(map1, "Spatial"), zcol="qrr", legend=T, col.regions=clrs, map.types="OpenStreetMap")

Which shows several areas of the south where risk the infant mortality rate is signficantly higher than the national rate, with high posterior probability.

References

Besag, J., York, J., & Mollie, a. (1991). Bayesian image-restoration, with 2 applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43(1), 1-20. https://doi.org/10.1007/BF00116466

To leave a comment for the author, please follow the link and comment on their blog: Corey Sparks R blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

rfm 0.2.2

Posted: 30 Jul 2020 05:00 PM PDT

[This article was first published on Rsquared Academy Blog - Explore Discover Learn, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We're excited to announce the release of rfm 0.2.2 on CRAN! rfm provides tools for customer segmentation using Recency Frequency Monetary value analysis. It includes a Shiny app for interactive segmentation. You can install rfm with:

install.packages("rfm")

In this blog post, we will summarize the changes implemented in the current (0.2.2) and previous release (0.2.1).

Segmentation

In previous versions, rfm_segment() would overwrite a segment if the intervals used to define the segment was a subset of another segment. It was expected that the end user would be careful to ensure that the intervals for each segment would be unique and not a subset of any other segment. You can see the example here.

We are grateful to @leungi for bringing this to our attention and also for fixing it. Now, rfm_segment() does not overwrite
the segments even if the intervals for one segment is a subset of another.

# analysis date  analysis_date <- lubridate::as_date("2006-12-31")    # rfm score  rfm_result <- rfm_table_order(rfm_data_orders, customer_id, order_date,                                revenue, analysis_date)    rfm_result

## # A tibble: 995 x 9  ##    customer_id date_most_recent recency_days transaction_cou~ amount  ##                                             ##  1 Abbey O'Re~ 2006-06-09                205                6    472  ##  2 Add Senger  2006-08-13                140                3    340  ##  3 Aden Lesch~ 2006-06-20                194                4    405  ##  4 Admiral Se~ 2006-08-21                132                5    448  ##  5 Agness O'K~ 2006-10-02                 90                9    843  ##  6 Aileen Bar~ 2006-10-08                 84                9    763  ##  7 Ailene Her~ 2006-03-25                281                8    699  ##  8 Aiyanna Br~ 2006-04-29                246                4    157  ##  9 Ala Schmid~ 2006-01-16                349                3    363  ## 10 Alannah Bo~ 2005-04-21                619                4    196  ## # ... with 985 more rows, and 4 more variables: recency_score ,  ## #   frequency_score , monetary_score , rfm_score

# segmentation  segment_names <- c(    "Champions", "Loyal Customers", "Potential Loyalist",    "New Customers", "Promising", "Need Attention", "About To Sleep",    "At Risk", "Can't Lose Them", "Lost"  )    recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)  recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)  frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)  frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)  monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)  monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)    segments <-    rfm_segment(      rfm_result,      segment_names,      recency_lower,      recency_upper,      frequency_lower,      frequency_upper,      monetary_lower,      monetary_upper    )    # segment size  segments %>%    count(segment) %>%    arrange(desc(n)) %>%    rename(Segment = segment, Count = n)

## # A tibble: 8 x 2  ##   Segment            Count  ##                   ## 1 Loyal Customers      278  ## 2 Potential Loyalist   229  ## 3 Champions            158  ## 4 Lost                 111  ## 5 At Risk               86  ## 6 About To Sleep        50  ## 7 Others                48  ## 8 Need Attention        35

In the above example, the interval used to define the Champions segment is a subset of Loyal Customers. In the previous versions, those customers who
should have been assigned Champions were reassigned as Loyal Customers if the criteria for Champions was evaluated before Loyal Customers. From version 0.2.0, rfm_segment() will avoid such overwriting.

Visualization

rfm used print all the plots by default instead of returning a plot object. This resulted in difficulties for some end users who wanted to:

further modify the plot
include the plot in a panel of other plots

From version 0.2.1, all plotting functions use an additional argument print_plot. It is set to TRUE by default to avoid any disruption to current work flows. Those users who want a plot object to be returned can set the above argument to FALSE.

# analysis date  analysis_date <- lubridate::as_date('2007-01-01')    # transactions data  rfm_order <- rfm_table_order(rfm_data_orders, customer_id, order_date,                               revenue, analysis_date)    # customer data  rfm_customer <- rfm_table_customer(rfm_data_customer, customer_id,                                     number_of_orders, recency_days, revenue,                                     analysis_date)    # plots  p1 <-     rfm_heatmap(rfm_order,                 plot_title = "Transaction Data",                 print_plot = FALSE)    p2 <-     rfm_heatmap(rfm_customer,                 plot_title = "Customer Data",                 print_plot = FALSE)    # using patchwork  p1 + p2

Custom Threshold for RFM Scores

Lots of users wanted to know the threshold used for generating the RFM scores. From version 0.2.1, rfm_table_* family of functions return the threshold.

analysis_date <- lubridate::as_date('2006-12-31')  result <- rfm_table_order(rfm_data_orders, customer_id, order_date,                            revenue, analysis_date)    # threshold  result$threshold

## # A tibble: 5 x 6  ##   recency_lower recency_upper frequency_lower frequency_upper monetary_lower  ##                                                      ## 1            1           115                1               4            12   ## 2          115           181                4               5           256.  ## 3          181           297.               5               6           382   ## 4          297.          482                6               8           506.  ## 5          482           977                8              15           666   ## # ... with 1 more variable: monetary_upper

Another request (see here) was to be able to use custom or user specific threshold for generating RFM score. rfm uses quantiles to generate the lower and upper thresholds used for generating the scores. Unfortunately, if the data is skewed, using quantiles is not effective. From version 0.2.1, users can specify custom threshold for generating the RFM score and we will learn how to do this using an example.

analysis_date <- lubridate::as_date('2006-12-31')  result <- rfm_table_order(rfm_data_orders, customer_id, order_date, revenue,                            analysis_date)  result$threshold

## # A tibble: 5 x 6  ##   recency_lower recency_upper frequency_lower frequency_upper monetary_lower  ##                                                      ## 1            1           115                1               4            12   ## 2          115           181                4               5           256.  ## 3          181           297.               5               6           382   ## 4          297.          482                6               8           506.  ## 5          482           977                8              15           666   ## # ... with 1 more variable: monetary_upper

If you look at the above output, we have 5 bins/scores and there are six different values. Let us focus on the monetary_* columns in the threshold table. The lower threshold of the first bin and the upper threshold of the last bin are the min and max values form the revenue column of rfm_data_orders and the rest of the values are returned by the quantile() function.

revenue <-     rfm_data_orders %>%     group_by(customer_id) %>%     summarize(total = sum(revenue))

## `summarise()` ungrouping (override with `.groups` argument)

# revenue at customer level  revenue

## # A tibble: 995 x 2  ##    customer_id        total  ##  *                 ##  1 Abbey O'Reilly DVM   472  ##  2 Add Senger           340  ##  3 Aden Lesch Sr.       405  ##  4 Admiral Senger       448  ##  5 Agness O'Keefe       843  ##  6 Aileen Barton        763  ##  7 Ailene Hermann       699  ##  8 Aiyanna Bruen PhD    157  ##  9 Ala Schmidt DDS      363  ## 10 Alannah Borer        196  ## # ... with 985 more rows

# min and max  min(revenue$total)

## [1] 12

max(revenue$total)

## [1] 1488

Let us look at the quantiles used for generating the scores.

quantile(revenue$total, probs = seq(0, 1, length.out = 6))

##     0%    20%    40%    60%    80%   100%   ##   12.0  254.8  381.0  505.4  665.0 1488.0

The intervals are created in the below style:

Left-closed, right-open: [ a , b ) = { x ∣ a ≤ x < b }

Since rfm uses left closed intervals to generate the scores, we add 1 to all values except the minimum value. Now, let us recreate the RFM scores using custom threshold instead of quantiles.

rfm_table_order(rfm_data_orders,                   customer_id,                   order_date,                   revenue,                  analysis_date,                   recency_bins = c(115, 181, 297, 482),                  frequency_bins = c(4, 5, 6, 8),                   monetary_bins = c(256, 382, 506, 666))

## # A tibble: 995 x 9  ##    customer_id date_most_recent recency_days transaction_cou~ amount  ##                                             ##  1 Abbey O'Re~ 2006-06-09                205                6    472  ##  2 Add Senger  2006-08-13                140                3    340  ##  3 Aden Lesch~ 2006-06-20                194                4    405  ##  4 Admiral Se~ 2006-08-21                132                5    448  ##  5 Agness O'K~ 2006-10-02                 90                9    843  ##  6 Aileen Bar~ 2006-10-08                 84                9    763  ##  7 Ailene Her~ 2006-03-25                281                8    699  ##  8 Aiyanna Br~ 2006-04-29                246                4    157  ##  9 Ala Schmid~ 2006-01-16                349                3    363  ## 10 Alannah Bo~ 2005-04-21                619                4    196  ## # ... with 985 more rows, and 4 more variables: recency_score ,  ## #   frequency_score , monetary_score , rfm_score

We have used the values from the threshold table to reproduce the earlier result. If you observe carefully, we have specified 4 values while generating 5 bins/scores. Whenever using custom threshold, values supplied should be one less than the number of bins/scores generated as rfm internally computes the min and max values. In general, if you have n bins/scores, you only specify the upper threshold for n - 1 bins/scores.

We have tried our best to explain how to use custom threshold but completely understand that it can be confusing to implement at beginning. If you have any questions about this method, feel free to write to us at support@rsquaredacademy.com and our team will be happy to help you.

Acknowledgements

We are grateful to @leungi, @gfagherazzi and @DavidGarciaEstaun for their inputs.

Learning More

Feedback

*As the reader of this blog, you are our most important critic and commentator.
We value your opinion and want to know what we are doing right, what we could
do better, what areas you would like to see us publish in, and any other words
of wisdom you are willing to pass our way.

We welcome your comments. You can email to let us know what you did or did not
like about our blog as well as what we can do to make our post better.*

Email: support@rsquaredacademy.com

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog - Explore Discover Learn.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

FNN-VAE for noisy time series forecasting

Posted: 29 Jul 2020 05:00 PM PDT

[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

") training_loop_vae(ds_train)

test_batch <- as_iterator(ds_test) %>% iter_next() encoded <- encoder(test_batch[[1]][1:1000]) test_var <- tf\(math\)reduce_variance(encoded, axis = 0L) print(test_var %>% as.numeric() %>% round(5)) } "`

Experimental setup and data

The idea was to add white noise to a deterministic series. This time, the Roessler system was chosen, mainly for the prettiness of its attractor, apparent even in its two-dimensional projections:

(#fig:unnamed-chunk-1)Roessler attractor, two-dimensional projections.

Like we did for the Lorenz system in the first part of this series, we use deSolve to generate data from the Roessler equations.

Then, noise is added, to the desired degree, by drawing from a normal distribution, centered at zero, with standard deviations varying between 1 and 2.5.

Here you can compare effects of not adding any noise (left), standard deviation-1 (middle), and standard deviation-2.5 Gaussian noise:

(#fig:unnamed-chunk-4)Roessler series with added noise. Top: none. Middle: SD = 1. Bottom: SD = 2.5.

Otherwise, preprocessing proceeds as in the previous posts. In the upcoming results section, we'll compare forecasts not just to the "real", after noise addition, test split of the data, but also to the underlying Roessler system – that is, the thing we're really interested in. (Just that in the real world, we can't do that check.) This second test set is prepared for forecasting just like the other one; to avoid duplication we don't reproduce the code.

Results

The LSTM used for comparison with the VAE described above is identical to the architecture employed in the previous post. While with the VAE, an fnn_multiplier of 1 yielded sufficient regularization for all noise levels, some more experimentation was needed for the LSTM: At noise levels 2 and 2.5, that multiplier was set to 5.

As a result, in all cases, there was one latent variable with high variance and a second one of minor importance. For all others, variance was close to 0.

In all cases here means: In all cases where FNN regularization was used. As already hinted at in the introduction, the main regularizing factor providing robustness to noise here seems to be FNN loss, not KL divergence. So for all noise levels, besides FNN-regularized LSTM and VAE models we also tested their non-constrained counterparts.

Low noise

Seeing how all models did superbly on the original deterministic series, a noise level of 1 can almost be treated as a baseline. Here you see sixteen 120-timestep predictions from both regularized models, FNN-VAE (dark blue), and FNN-LSTM (orange). The noisy test data, both input (x, 120 steps) and output (y, 120 steps) are displayed in (blue-ish) grey. In green, also spanning the whole sequence, we have the original Roessler data, the way they would look had no noise been added.

Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from FNN-LSTM. Dark blue: Predictions from FNN-VAE.

(#fig:unnamed-chunk-6)Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from FNN-LSTM. Dark blue: Predictions from FNN-VAE.

Despite the noise, forecasts from both models look excellent. Is this due to the FNN regularizer?

Looking at forecasts from their unregularized counterparts, we have to admit these do not look any worse. (For better comparability, the sixteen sequences to forecast were initiallly picked at random, but used to test all models and conditions.)

(#fig:unnamed-chunk-7)Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from unregularized LSTM. Dark blue: Predictions from unregularized VAE.

What happens when we start to add noise?

Substantial noise

Between noise levels 1.5 and 2, something changed, or became noticeable from visual inspection. Let's jump directly to the highest-used level though: 2.5.

Here first are predictions obtained from the unregularized models.

Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from unregularized LSTM. Dark blue: Predictions from unregularized VAE.

(#fig:unnamed-chunk-8)Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from unregularized LSTM. Dark blue: Predictions from unregularized VAE.

Both LSTM and VAE get "distracted" a bit too much by the noise, the latter to an even higher degree. This leads to cases where predictions strongly "overshoot" the underlying non-noisy rhythm. This is not surprising, of course: They were trained on the noisy version; predict fluctuations is what they learned.

Do we see the same with the FNN models?

(#fig:unnamed-chunk-9)Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from FNN-LSTM. Dark blue: Predictions from FNN-VAE.

Interestingly, we see a much better fit to the underlying Roessler system now! Especially the VAE model, FNN-VAE, surprises with a whole new smoothness of predictions; but FNN-LSTM turns up much smoother forecasts as well.

"Smooth, fitting the system…" – by now you may be wondering, when are we going to come up with more quantitative assertions? If quantitative implies "mean squared error" (MSE), and if MSE is taken to be some divergence between forecasts and the true target from the test set, the answer is that this MSE doesn't differ much between any of the four architectures. Put differently, it is mostly a function of noise level.

However, we could argue that what we're really interested in is how well a model forecasts the underlying process. And there, we see differences.

In the following plot, we contrast MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). The rows reflect noise levels (1, 1.5, 2, 2.5); the columns represent MSE in relation to the noisy("real") target (left) on the one hand, and in relation to the underlying system on the other (right). For better visibility of the effect, MSEs have been normalized as fractions of the maximum MSE in a category.

So, if we want to predict signal plus noise (left), it is not extremely critical whether we use FNN or not. But if we want to predict the signal only (right), with increasing noise in the data FNN loss becomes increasingly effective. This effect is far stronger for VAE vs. FNN-VAE than for LSTM vs. FNN-LSTM: The distance between the grey line (VAE) and the dark blue one (FNN-VAE) becomes larger and larger as we add more noise.

Normalized MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). Rows are noise levels (1, 1.5, 2, 2.5); columns are MSE as related to the real target (left) and the underlying system (right).

(#fig:unnamed-chunk-10)Normalized MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). Rows are noise levels (1, 1.5, 2, 2.5); columns are MSE as related to the real target (left) and the underlying system (right).

Summing up

Our experiments show that when noise is likely to obscure measurements from an underlying deterministic system, FNN regularization can strongly improve forecasts. This is the case especially for convolutional VAEs, and probably convolutional autoencoders in general. And if an FNN-constrained VAE performs as well, for time series prediction, as an LSTM, there is a strong incentive to use the convolutional model: It trains significantly faster.

With that, we conclude our mini-series on FNN-regularized models. As always, we'd love to hear from you if you were able to make use of this in your own work!

Thanks for reading!

To leave a comment for the author, please follow the link and comment on their blog: RStudio AI Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now