[R-bloggers] EARL London 2019 Conference Recap (and 8 more aRticles)



EARL London 2019 Conference Recap

Posted: 30 Sep 2019 04:12 AM PDT

[This article was first published on r – Appsilon Data Science | End-to-End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Damian Rodziewicz of Appsilon Data Science Consulting at EARL London 2019

I had an awesome time at the Enterprise Applications of the R Language (EARL) Conference held in London in September 2019. EARL reminded me that it is good to keep showing up at conferences. The first thing I heard as I entered was the organisers at the welcome table greeting me: "Damian, is that you? Awesome to see you again!" It feels great to be a part of the amazing R community. Within minutes I had already met a couple of people, and I ran into people who remembered me from other conferences, including Insurance Data Science in Zurich and EARL 2018 in Seattle. I like the vibe at EARLs.

At EARL 2018 in Seattle I presented on analyzing satellite imagery with deep learning networks, but this time I was a pure attendee, which was a different sort of intense and fantastic. The conference kicked off with a keynote by Julia Silge, the Jane Austen-loving astrophysicist and author of "Text Mining with R". Julia shared very interesting insights about Data Scientists all over the world, gathered from a huge Stack Overflow survey.

I have worked on fifty-something Shiny dashboard applications, but I always learn something new at conferences. Christel Swift gave an interesting presentation about using Shiny at the BBC. I also learned quite a few tricks during the workshop on R Markdown and Interactive Dashboards that took place the day before the conference.

I didn't know that it takes 15 years to develop a new drug. I heard a fascinating presentation by Bayer about where data science meets the world of pharmaceuticals. The topic is close to us, as we also face many challenges when working with data in the pharma industry.

I thought this slide was hilarious: 


At the conference you could also learn how Transport for London (TFL) uses Data Science to reduce station overcrowding and closures – take a look at Mark Samuels' blog post about the presentation: https://diginomica.com/how-tfl-using-data-science-reduce-station-overcrowding-and-closures.

Before/between/after the presentations I had so many fascinating conversations, some of which continued into the wee hours. I think everyone, even competitors, recognized that there was so much to gain from sharing information and bouncing ideas off of each other. 

Many people I met were just starting with R and introducing R in their companies. I heard a lot of questions about using R in production – our pitch about a big pharma company that introduced R with our support fit perfectly there.

Note to self: pick up a Catch Box for when we host a conference. You can throw it at people instead of awkwardly leaning over crowds of people trying to hand them the microphone. It was entertaining each time they tossed the Catch Box at an audience member.    


EARL was exciting and well organized. I got to know the Mango Solutions founders, a lot of RStudio folks and plenty of different Data Scientists from various companies. EARL is the conference that you don't want to miss.

We are in the middle of the WhyR Warsaw conference at the moment. We're so excited to host Dr. Kenneth Benoit from the London School of Economics, creator of the quanteda R package. I will co-present with him on the topic of Natural Language Processing for non-programmers. But that is a post for another time! Thanks for stopping by. Questions? Comments? You can find me on Twitter @D_Rodziewicz.

 

Article EARL London 2019 Conference Recap comes from Appsilon Data Science | End-to-End Data Science Solutions.

To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End-to-End Data Science Solutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Meetup Recap: Survey and Measure Development in R

Posted: 30 Sep 2019 03:00 AM PDT

[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Have you ever taken a survey at the doctor or for a job interview and wondered what exactly that data was used for? There is a long-standing series of methodologies, many coming from psychology, on how to reliably measure "latent" traits, such as depression or loyalty, from self-report survey data. 

While measurement is a common method in fields ranging from kinesiology to education, it's usually conducted in proprietary tools like SPSS or MPlus. There is really not much training available online for survey development in R, but R is more than capable of conducting it through packages like psych and lavaan.
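To give a flavor of what those packages do (this example is not from the talk; the data frame survey_df and items q1-q5 are hypothetical), here is a minimal sketch of a reliability check with psych and a one-factor confirmatory model with lavaan:

library(psych)
library(lavaan)

# Internal consistency (Cronbach's alpha) of five hypothetical survey items
psych::alpha(survey_df[, c("q1", "q2", "q3", "q4", "q5")])

# One-factor confirmatory factor analysis for a latent "loyalty" trait
cfa_model <- "loyalty =~ q1 + q2 + q3 + q4 + q5"
fit <- cfa(cfa_model, data = survey_df)
summary(fit, fit.measures = TRUE, standardized = TRUE)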

Over the summer, I presented the following hour-long workshop on survey development in R to the Greater Cleveland R Users meetup group. The video and slides are below. You can also access all code, files and assets used in the presentation on RStudio Cloud.

Slides:

 

To learn more about survey development, check out my DataCamp course, "Survey and Measure Development in R." The first chapter is free. 

Is your meetup looking to learn about survey measurement? Would your organization like to build validated survey instruments? Does your organization already do this but want to move to R? Drop me a line.

To leave a comment for the author, please follow the link and comment on their blog: George J. Mount.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Cleaning Anomalies to Reduce Forecast Error by 9% with anomalize

Posted: 29 Sep 2019 07:45 PM PDT

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this tutorial, we'll show how we used clean_anomalies() from the anomalize package to reduce forecast error by 9%.

R Packages Covered:

  • anomalize – Time series anomaly detection

Cleaning Anomalies to Reduce Forecast Error by 9%

We can often improve forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function from anomalize into your forecast workflow.

Forecast Workflow

Forecast With Anomalize

We'll use the following workflow to remove time series anomalies prior to forecasting.

  1. Identify the anomalies – Decompose the time series with time_decompose() and anomalize() the remainder (residuals)

  2. Clean the anomalies – Use the new clean_anomalies() function to reconstruct the time series, replacing anomalies with the trend and seasonal components

  3. Forecast – Use a forecasting algorithm to predict new observations from a training set, then compare to test set with and without anomalies cleaned

Step 1 – Load Libraries

First, load the following libraries to follow along.

library(tidyverse)  # Core data manipulation and visualization libraries
library(tidyquant)  # Used for business-ready ggplot themes
library(anomalize)  # Identify and clean time series anomalies
library(timetk)     # Time Series Machine Learning Features
library(knitr)      # For kable() function

Step 2 – Get the Data

This tutorial uses the tidyverse_cran_downloads dataset that comes with anomalize. These are the historical downloads of several "tidy" R packages from 2017-01-01 to 2018-03-01.

Let's take one package with some extreme events. We'll hone in on lubridate (but you could pick any).

tidyverse_cran_downloads %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    plot_anomalies(ncol = 3, alpha_dots = 0.3)


We'll filter() downloads of the lubridate R package.

lubridate_tbl <- tidyverse_cran_downloads %>%
    ungroup() %>%
    filter(package == "lubridate")

Here's a visual representation of the forecast experiment setup. Training data will be any data before "2018-01-01".
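For reference, a minimal sketch of that split (not shown in the original post), assuming the lubridate_tbl created above:

train_tbl <- lubridate_tbl %>% filter(date <  as.Date("2018-01-01"))  # training set
test_tbl  <- lubridate_tbl %>% filter(date >= as.Date("2018-01-01"))  # test set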


Step 3 – Workflow for Cleaning Anomalies

The workflow to clean anomalies:

  1. We decompose the "counts" column using time_decompose() – This returns a Seasonal-Trend-Loess (STL) Decomposition in the form of "observed", "season", "trend" and "remainder".

  2. We fix any negative values – If present, they can throw off forecasting transformations (e.g. log and power transformations)

  3. We identify anomalies with anomalize() on the "remainder" column – Returns "remainder_l1" (lower limit), "remainder_l2" (upper limit), and "anomaly" (Yes/No).

  4. We use clean_anomalies() to add a new column called "observed_cleaned" that repairs the anomalous data by replacing all anomalies with the trend + seasonal components from the decompose operation.

lubridate_anomalized_tbl <- lubridate_tbl %>%

    # 1. Decompose download counts and anomalize the STL decomposition remainder
    time_decompose(count) %>%

    # 2. Fix negative values if any in observed
    mutate(observed = ifelse(observed < 0, 0, observed)) %>%

    # 3. Identify anomalies
    anomalize(remainder) %>%

    # 4. Clean & repair anomalous data
    clean_anomalies()

# Show change in observed vs observed_cleaned
lubridate_anomalized_tbl %>%
    filter(anomaly == "Yes") %>%
    select(date, anomaly, observed, observed_cleaned) %>%
    head() %>%
    kable()
date anomaly observed observed_cleaned
2017-01-12 Yes 0 3522.194
2017-04-19 Yes 8549 5201.716
2017-09-01 Yes 0 4136.721
2017-09-07 Yes 9491 4871.176
2017-10-30 Yes 11970 6412.571
2017-11-13 Yes 10267 6640.871

Here's a visual of the "observed" (uncleaned) vs the "observed_cleaned" (cleaned) training sets. We'll see what influence these anomalies have on a forecast regression (next).
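One possible way to build such a comparison plot yourself (a sketch, not the code behind the figure in the post):

lubridate_anomalized_tbl %>%
    select(date, observed, observed_cleaned) %>%
    gather(key = "series", value = "downloads", -date) %>%
    ggplot(aes(date, downloads, color = series)) +
    geom_line(alpha = 0.7) +
    theme_tq()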


Step 4 – Forecasting Downloads of the Lubridate Package

First, we'll make a function, forecast_downloads(), that can train on either the uncleaned or the cleaned data and return the forecasted downloads versus the actual downloads. The modeling function is described in the Appendix – Forecast Downloads Function.

Step 4.1 – Before Cleaning with anomalize

We'll first perform a forecast without cleaning anomalies (high leverage points).

  • The forecast_downloads() function trains on the "observed" (uncleaned) data and returns predictions versus actual.
  • Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
  • The model uses a linear regression of the form sqrt(observed) ~ numeric index + half-year + quarter + month + day of week.
lubridate_forecast_with_anomalies_tbl <- lubridate_anomalized_tbl %>%

    # See Appendix - Forecast Downloads Function
    forecast_downloads(
        col_train = observed,     # First train with anomalies included
        sep       = "2018-01-01", # Separate at 1st of year
        trans     = "sqrt"        # Perform sqrt() transformation
    )

Forecast vs Actual Values

The forecast is overplotted against the actual values.


We can see that the forecast is shifted vertically, an effect of the high leverage points.


Forecast Error Calculation

The mean absolute error (MAE) is 1570, meaning on average the forecast is off by 1570 downloads each day.

lubridate_forecast_with_anomalies_tbl %>%
    summarise(mae = mean(abs(prediction - actual)))
## # A tibble: 1 x 1
##     mae
##   <dbl>
## 1 1570.

Step 4.2 – After Cleaning with anomalize

We'll next perform a forecast this time using the repaired data from clean_anomalies().

  • The forecast_downloads() function trains on the "observed_cleaned" (cleaned) data and returns predictions versus actual.
  • Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
  • The model uses a linear regression of the form sqrt(observed_cleaned) ~ numeric index + half-year + quarter + month + day of week.
lubridate_forecast_without_anomalies_tbl <- lubridate_anomalized_tbl %>%

    # See Appendix - Forecast Downloads Function
    forecast_downloads(
        col_train = observed_cleaned, # Forecast with cleaned anomalies
        sep       = "2018-01-01",     # Separate at 1st of year
        trans     = "sqrt"            # Perform sqrt() transformation
    )

Forecast vs Actual Values

The forecast is overplotted against the actual values. The cleaned data is shown in Yellow.


Zooming in on the forecast region, we can see that the forecast does a better job following the trend in the test data.


Forecast Error Calculation

The mean absolute error (MAE) is 1435, meaning on average the forecast is off by 1435 downloads each day.

lubridate_forecast_without_anomalies_tbl %>%
    summarise(mae = mean(abs(prediction - actual)))
## # A tibble: 1 x 1
##     mae
##   <dbl>
## 1 1435.

8.6% Reduction in Forecast Error

Using the new anomalize function, clean_anomalies(), prior to forecasting results in an 8.6% reduction in forecast error as measured by Mean Absolute Error (MAE).

((1435 - 1570) / 1570)
## [1] -0.08598726

Conclusion

Forecasting with clean anomalies is a good practice that can provide substantial improvement to forecasting accuracy by removing high leverage points. The new clean_anomalies() function in the anomalize package provides an easy workflow for removing anomalies prior to forecasting. Learn more in the anomalize documentation.

Data Science Training

Interested in Learning Anomaly Detection?

Business Science offers two 1-hour labs on Anomaly Detection:

Interested in Improving Your Forecasting?

Business Science offers a 1-hour lab on increasing Forecasting Accuracy:

  • Learning Lab 5 – 5 Strategies to Improve Forecasting Performance by 50% (or more) using arima and glmnet

Interested in Becoming an Expert in Data Science for Business?

Business Science offers a 3-Course Data Science for Business R-Track designed to take students from no experience to expert data scientist (advanced machine learning and web application development) in under 6 months.

Appendix – Forecast Downloads Function

The forecast_downloads() function uses the following procedure:

  • Split the data into training and testing sets at the date specified by the sep argument.
  • Apply a statistical transformation: none, log-1-plus (log1p()), or power (sqrt())
  • Model the daily time series of the training set using either observed (demonstrates no cleaning) or observed_cleaned (demonstrates the improvement from cleaning), as specified by the col_train argument.
  • Compare the predictions to the actual values in the test set.
forecast_downloads <- function(data, col_train,
                               sep   = "2018-01-01",
                               trans = c("none", "log1p", "sqrt")) {

    predict_expr <- enquo(col_train)
    trans <- trans[1]

    # Split into training/testing sets
    train_tbl <- data %>% filter(date <  ymd(sep))
    test_tbl  <- data %>% filter(date >= ymd(sep))

    # Apply transformation
    pred_form <- quo_name(predict_expr)
    if (trans != "none") pred_form <- str_glue("{trans}({pred_form})")

    # Make the model formula
    model_formula <- str_glue("{pred_form} ~ index.num + half
                              + quarter + month.lbl + wday.lbl") %>%
        as.formula()

    # Apply model formula to data that is augmented with time-based features
    model_glm <- train_tbl %>%
        tk_augment_timeseries_signature() %>%
        glm(model_formula, data = .)

    # Make prediction on the test set
    suppressWarnings({
        # Suppress rank-deficient warning
        prediction <- predict(model_glm, newdata = test_tbl %>%
                                  tk_augment_timeseries_signature())
        # Actual values are always the raw observed downloads in the test set
        actual     <- test_tbl %>% pull(observed)
    })

    # Undo the transformation applied to the response
    if (trans == "log1p") prediction <- expm1(prediction)
    if (trans == "sqrt")  prediction <- ifelse(prediction < 0, 0, prediction)^2

    # Return predictions and actual values
    tibble(
        date       = tk_index(test_tbl),
        prediction = prediction,
        actual     = actual
    )
}

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Understanding Bootstrap Confidence Interval Output from the R boot Package

Posted: 29 Sep 2019 05:00 PM PDT

[This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Nuances of Bootstrapping

Most applied statisticians and data scientists understand that bootstrapping is a method that mimics repeated sampling by drawing some number of new samples (with replacement) from the original sample in order to perform inference. However, it can be difficult to understand output from the software that carries out the bootstrapping without a more nuanced understanding of how uncertainty is quantified from bootstrap samples.

To demonstrate the possible sources of confusion, start with the data described in Efron and Tibshirani's (1993) text on bootstrapping (page 19). We have 15 paired observations of student LSAT scores and GPAs. We want to estimate the correlation between LSAT and GPA scores. The data are the following:

student lsat gpa
1 576 3.39
2 635 3.30
3 558 2.81
4 578 3.03
5 666 3.44
6 580 3.07
7 555 3.00
8 661 3.43
9 651 3.36
10 605 3.13
11 653 3.12
12 575 2.74
13 545 2.76
14 572 2.88
15 594 2.96
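To follow along, here is a minimal sketch of entering these data by hand (the object name tbl matches what the boot() call below expects):

library(boot)
library(tidyverse)

# The 15 LSAT/GPA pairs from Efron and Tibshirani (1993), page 19
tbl <- tibble(
    student = 1:15,
    lsat    = c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594),
    gpa     = c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96)
)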

The correlation turns out to be 0.776. For reasons we'll explore, we want to use the nonparametric bootstrap to get a confidence interval around our estimate of \(r\). We do so using the boot package in R. This requires the following steps:

  1. Define a function that returns the statistic we want.
  2. Use the boot function to get R bootstrap replicates of the statistic.
  3. Use the boot.ci function to get the confidence intervals.

For step 1, the following function is created:

get_r <- function(data, indices, x, y) {

    d <- data[indices, ]
    r <- round(as.numeric(cor(d[x], d[y])), 3)

    r

}

Steps 2 and 3 are performed as follows:

set.seed(12345)

boot_out <- boot(
    tbl,
    x = "lsat",
    y = "gpa",
    R = 500,
    statistic = get_r
)

boot.ci(boot_out)
## Warning in boot.ci(boot_out): bootstrap variances needed for studentized
## intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out)
##
## Intervals :
## Level      Normal              Basic
## 95%   ( 0.5247,  1.0368 )   ( 0.5900,  1.0911 )
##
## Level     Percentile            BCa
## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

Looking at the boot.ci output, the following questions come up:

  1. Why are there multiple CIs? How are they calculated?
  2. What are the bootstrap variances needed for studentized intervals?
  3. What does it mean that the calculations and intervals are on the original scale?
  4. Why are some BCa intervals unstable?

To understand this output, let's review statistical inference, confidence intervals, and the bootstrap.

Statistical Inference

The usual test statistic for determining if \(r \neq 0\) is:

\[
t = \frac{r}{SE_r}
\]

where

\[
SE_r = \sqrt{\frac{1-r^2}{n-2}}
\]

In our case:

\[
SE_r = \sqrt{\frac{1-r^2}{n-2}}
= \sqrt{\frac{1-0.776^2}{15-2}}
= 0.175
\]

Dividing \(r\) by \(SE_r\) yields our \(t\) statistic

\[
t = \frac{r}{SE_r} = \frac{0.776}{0.175} = 4.434
\]

We compare this to a \(t\) distribution with \(n-2 = 13\) degrees of freedom and easily find it to be significant.
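A quick check of these hand calculations (a sketch; tbl is the data entered above):

r    <- cor(tbl$lsat, tbl$gpa)             # 0.776
se_r <- sqrt((1 - r^2) / (nrow(tbl) - 2))  # 0.175
t    <- r / se_r                           # about 4.4
2 * pt(-abs(t), df = nrow(tbl) - 2)        # two-sided p-value, well below .05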

In words: If the null hypothesis were true, and we repeatedly draw samples of size \(n\), and we calculate \(r\) each time, then the probability that we would observe an estimate of \(|r| = 0.776\) or larger is less than 5%.

An important caveat. The above formula for the standard error is only correct when \(r = 0\). The closer we get to \(\pm 1\), the less correct it is.

Confidence Intervals

We can see why the standard error formula above becomes less correct the further we get from zero by considering the 95% confidence interval for our estimate. The usual formula you see for a confidence interval is the estimate plus or minus the 97.5th percentile of the normal or \(t\) distribution times the standard error. In this case, the \(t\)-based formula would be:

\[
\text{95% CI} = r \pm t_{df = 13} SE_r
\]

If we were to sample 15 students repeatedly from the population and calculate this confidence interval each time, the interval should include the true population value 95% of the time. So what happens if we use the standard formula for the confidence interval?

\[
\begin{align}
\text{95% CI} &= r \pm t_{df = 13}SE_r \\
&= 0.776 \pm 2.16\times 0.175 \\
&= [0.398, 1.154]
\end{align}
\]

Recall that correlations are bounded in the range \([-1, +1]\), but our 95% confidence interval contains values greater than one!

Alternatives:

  • Use Fisher's \(z\)-transformation. This is what your software will usually do, but it doesn't work for most other statistics.
  • Use the bootstrap. While not necessary for the correlation coefficient, its advantage is that it can be used for almost any statistic.

The next sections review the nonparametric and parametric bootstrap.

Nonparametric Bootstrap

We do not know the true population distribution of LSAT and GPA scores. What we have instead is our sample. Just like we can use our sample mean as an estimate of the population mean, we can use our sample distribution as an estimate of the population distribution.

In the absence of supplementary information about the population (e.g. that it follows a specific distribution like bivariate normal), the empirical distribution from our sample contains as much information about the population distribution as we can get. If statistical inference is typically defined by repeated sampling from a population, and our sample provides a good estimate of the population distribution, we can conduct inferential tasks by repeatedly sampling from our sample.

(Nonparametric) bootstrapping thus works as follows for a sample of size N:

  1. Draw a random sample of size N with replacement from our sample, which is the first bootstrap sample.
  2. Estimate the statistic of interest using the bootstrap sample.
  3. Draw a new random sample of size N with replacement, which is the second bootstrap sample.
  4. Estimate the statistic of interest using the new bootstrap sample.
  5. Repeat \(k\) times.
  6. Use the distribution of estimates across the \(k\) bootstrap samples as the sampling distribution.
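A bare-bones version of this loop in base R (a sketch of the idea; the boot package used below automates it):

set.seed(1)
boot_r <- replicate(500, {
    idx <- sample(nrow(tbl), replace = TRUE)  # resample rows with replacement
    cor(tbl$lsat[idx], tbl$gpa[idx])
})
sd(boot_r)            # bootstrap standard error
mean(boot_r) - 0.776  # rough estimate of the bias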

Note that the sampling is done with replacement. As an aside, most results from traditional statistics are based on the assumption of random sampling with replacement. Usually, the population we sample from is large enough that we do not bother noting the "with replacement" part. If the sample is large relative to the population, and sampling without replacement is used, we would typically be advised to use a finite population correction. This is just to say that the "with replacement" requirement is a standard part of the definition of random sampling.

Let's take our data as an example. We will draw 500 bootstrap samples, each of size \(n = 15\) chosen with replacement from our original data. The distribution across repeated samples is:

Note a few things about this distribution.

  1. The distribution is definitely not normal.
  2. The mean estimate of \(r\) across the 500 bootstrap samples is 0.771. The difference between the mean of the bootstrap estimates \((\mathbb{E}(r_b) = 0.771)\) and the original sample estimate \((r = 0.776)\) is the bias.
  3. The bootstrap standard error is the standard deviation of the bootstrap sampling distribution. Here the value is 0.131, which is much smaller than our earlier estimate of 0.175. This is because 0.175 was based on a formula that is only valid when \(r = 0\) and does not account for the fact that a correlation's values are bounded at -1 and +1.

The non-normality of the sampling distribution means that, if we divide \(r\) by the bootstrap standard error, we will not get a statistic that is distributed standard normal or \(t\). Instead, we decide that it is a better idea to summarize our uncertainty using a confidence interval. Yet we also want to make sure that our confidence intervals are bounded within the \([-1, +1]\) range, so the usual formula will not work.

Before turning to different methods for obtaining bootstrap confidence intervals, for completeness the next section describes the parametric bootstrap.

Parametric Bootstrap

The prior section noted that, in the absence of supplementary information about the population, the empirical distribution from our sample contains as much information about the population distribution as we can get.

An example of supplementary information that may improve our estimates would be that we know the LSAT and GPA scores are distributed bivariate normal. If we are willing to make this assumption, we can use our sample to estimate the distribution parameters. Based on our sample, we find:

\[
\begin{pmatrix}
\text{LSAT} \\
\text{GPA}
\end{pmatrix}\sim N\left(\begin{pmatrix}
600.27 \\
3.09
\end{pmatrix},\begin{pmatrix}
1746.78 & 7.90 \\
7.90 & 0.06
\end{pmatrix}\right).
\]

The distribution looks like the following:

We can draw 500 random samples of size 15 from this specific bivariate normal distribution and calculate the correlation between the two variables for each.

get_cor <- function(iteration, n) {

    dta <- MASS::mvrnorm(
        n,
        mu = c(600, 3),
        Sigma = matrix(c(1747, 7.9, 7.9, .06), 2)
    ) %>%
        as.data.frame()

    tibble(
        iteration = iteration,
        r = cor(dta$V1, dta$V2)
    )

}

par_boot_tbl <- map_dfr(1:500, ~ get_cor(.x, 15))

The distribution of the correlation estimates across the 500 samples represents our parametric bootstrap sampling distribution. It looks like the following:

The average correlation across the 500 samples was 0.767, and the standard deviation (our estimate of the standard error) was 0.111.

This is smaller than our non-parametric bootstrap estimate of the standard error, 0.131, which is reflective of the fact that our knowledge of the population distribution gives us more information. This in turn reduces sampling variability.

Of course, we often will not feel comfortable saying that the population distribution follows a well-defined shape, and hence we will typically default to the non-parametric version of the bootstrap.

Bootstrap Confidence Intervals

Recall that the usual formula for estimating a confidence interval around a statistic \(\theta\) is something like:

\[
\text{95% CI} = \theta \pm t_{df,1-\alpha/2} SE_{\theta}
\]

We saw that using the textbook standard error estimate for a correlation led us astray because we ended up with an interval outside of the range of plausible values. There are a variety of alternative approaches to calculating confidence intervals based on the bootstrap.

Standard Normal Interval

The first approach starts with the usual formula for calculating a confidence interval, using the normal distribution value of 1.96 as the multiplier of the standard error. However, there are two differences. First, we use our bootstrap estimate of the standard error in the formula. Second, we make an adjustment for the estimated bias, -0.005.

In our example, we get

\[
\begin{align}
\text{95% CI} &= r - \text{bias} \pm 1.96 \times SE_r \\
&= 0.776 + .005 \pm 1.96 \times 0.131 \\
&= [0.524, 1.038]
\end{align}
\]
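By hand with the boot_out object from above, that is roughly (a sketch; boot.ci() performs this calculation internally):

bias <- mean(boot_out$t) - boot_out$t0                   # about -0.005
(boot_out$t0 - bias) + c(-1, 1) * 1.96 * sd(boot_out$t)  # normal interval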

This matches R's output (given our hand calculations did some rounding along the way).

boot.ci(boot_out, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = "norm")
##
## Intervals :
## Level      Normal
## 95%   ( 0.5247,  1.0368 )
## Calculations and Intervals on Original Scale

Problems:

  • If a normal approximation were valid, we probably don't need to bootstrap.
  • We still have a CI outside the appropriate range.

We generally won't use this method.

Studentized (t) Intervals

Recall that, when we calculate a \(t\)-statistic, we mean-center the original statistic and divide by the sample estimate of the standard error. That is,

\[
t = \frac{\hat{\theta} - \theta}{\widehat{SE}_{\theta}}
\]

where \(\hat{\theta}\) is the sample estimate of the statistic, \(\theta\) is the "true" population value (which we get from our null hypothesis), and \(\widehat{SE}_{\theta}\) is the sample estimate of the standard error.

There is an analog to this process for bootstrap samples. In the bootstrap world, we can convert each bootstrap sample into a \(t\)-score as follows:

\[
t = \frac{\tilde{\theta} - \hat{\theta}}{\widehat{SE}_{\hat{\theta}}}
\]

Here \(\tilde{\theta}\) is the statistic estimated from a single bootstrap sample, and \(\hat{\theta}\) is the estimate from the original (non-bootstrap) sample.

But where does \(\widehat{SE}_{\hat{\theta}}\) come from?

Just like for a \(t\)-test, where we estimated the standard error using our one sample, we estimate the standard error separately for each bootstrap sample. That is, we need an estimate of the bootstrap sample variance. (Recall the message from the R output above).

If we are lucky enough to have a formula for a sample standard error, we use that in each sample. For the mean, each bootstrap sample would return:

  1. The bootstrap sample mean, \(\frac{1}{n}\sum(s_{bi})\)
  2. The bootstrap sample variance: \(\frac{s^2_b}{n}\).

We don't have such a formula that works for any correlation, so we need another means to estimate the variance. The delta method is one choice. Alternatively, there is the nested bootstrap.

Nested bootstrap algorithm:

  1. Draw a bootstrap sample.
  2. Estimate the statistic.
  3. Bootstrap the bootstrap sample, using the variance of estimates across the bootstrapped estimates as the estimate of the variance.
  4. Save the bootstrap estimate of the statistic and the nested bootstrap estimate of the variance.
  5. For each bootstrap sample, estimate \(t = \frac{\tilde{\theta} - \hat{\theta}}{\widehat{SE}_{\hat{\theta}}}\)

We now have the information we need to calculate the studentized confidence interval. The formula for the studentized bootstrap confidence interval is:

\[
95\% \text{ CI} = [\hat{\theta} - sq_{1-\alpha/2}, \hat{\theta} - sq_{\alpha/2}]
\]

The terms are:

  1. \(\hat{\theta}\): Our sample statistic (without performing the bootstrap)
  2. \(s\): Our bootstrap estimate of the standard error (the standard deviation of bootstrap estimates, not the nested bootstrap part)
  3. \(q_{1-\alpha/2}\): For \(\alpha = .05\), the 97.5th percentile of our bootstrap \(t\) estimates.
  4. \(q_{\alpha/2}\): For \(\alpha = .05\), the 2.5th percentile of our bootstrap \(t\) estimates.

For each bootstrap sample, we calculated a \(t\) statistic. The \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\) are identified by taking the appropriate quantile of these \(t\) estimates. This is akin to creating our own table of \(t\)-statistics, rather than using the typical tables for the \(t\) distribution you'd find in text books.

What does this look like in R? We need a second function for bootstrapping inside our bootstrap. The following will work.

get_r_var <- function(x, y, data, indices, its) {

    d <- data[indices, ]
    r <- cor(d[x], d[y]) %>%
        as.numeric() %>%
        round(3)
    n <- nrow(d)

    v <- boot(
        x = x,
        y = y,
        R = its,
        data = d,
        statistic = get_r
    ) %>%
        pluck("t") %>%
        var(na.rm = TRUE)

    c(r, v)

}

boot_t_out <- boot(
    x = "lsat", y = "gpa", its = 200,
    R = 1000, data = tbl, statistic = get_r_var
)

We now have our bootstrap estimates of \(t\), and we can use the quantiles of the distribution to plug into the formula. We find that \(q_{1-\alpha/2} = 8.137\) and that \(q_{\alpha/2} = -1.6\). Substituting:

\[
\begin{align}
\text{95% CI} &= [\hat{\theta} - sq_{1-\alpha/2}, \hat{\theta} - sq_{\alpha/2}] \\
&= [0.776 - 0.142 \times 8.137, 0.776 - 0.142 \times -1.6] \\
&= [-0.383,1.004]
\end{align}
\]

Checking our by-hand calculations, the studentized confidence interval from boot.ci is:

boot.ci(boot_t_out, type = "stud")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_t_out, type = "stud")
##
## Intervals :
## Level    Studentized
## 95%   (-0.3828,  1.0039 )
## Calculations and Intervals on Original Scale

Problems:

  • The nested bootstrap part is computationally intensive, even for simple problems like this.
  • May still produce estimates outside the range of plausible values.
  • Here our sample was small, so the variance of each bootstrap estimate was large.
  • Has been found to be erratic in practice.

Basic Bootstrap Confidence Interval

Another way of writing a confidence interval:

\[
1-\alpha = P(q_{\alpha/2} \leq \theta \leq q_{1-\alpha/2})
\]

In non-bootstrap confidence intervals, \(\theta\) is a fixed value while the lower and upper limits vary by sample. In the basic bootstrap, we flip what is random in the probability statement. Define \(\tilde{\theta}\) as a statistic estimated from a bootstrap sample. We can write

\[
1-\alpha = P(q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2})
\]

Recall that the bias of a statistic is the difference between its expected value (mean) across many samples and the true population value:

\[
\text{bias} = \mathbb{E}(\hat{\theta}) - \theta
\]

We estimate this using our bootstrap samples, \(\mathbb{E}(\tilde{\theta}) - \hat{\theta}\), where \(\hat{\theta}\) is the estimate from the original sample (before bootstrapping).

We can add in the bias-correction term to each side of our inequality as follows.

\[
\begin{align}
1-\alpha &= P(q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}) \\
&= P(q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta})
\end{align}
\]

Some more algebra eventually leads to:

\[
1-\alpha = P(2\hat{\theta} - q_{1-\alpha/2} \leq \theta \leq 2\hat{\theta} - q_{\alpha/2})
\]

The right-hand side is our formula for the basic bootstrap confidence interval.

Because we started out with \(\tilde{\theta}\) as the random variable, we can use our bootstrap quantiles for the values of \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\). To do so, arrange the estimates in order from lowest to highest, then use a percentile function to find the value at the 2.5th and 97.5th percentiles (given two-tailed \(\alpha = .05\)). We find that \(q_{1-\alpha/2} = 0.962\) and that \(q_{\alpha/2} = 0.461\). Substituting into the inequality:

\[
\begin{align}
1-\alpha &= P(2\hat{r} - q_{1-\alpha/2} \leq r \leq 2\hat{r} - q_{\alpha/2}) \\
&= P(2(0.776) - 0.962 \leq r \leq 2(0.776) - 0.461) \\
&= P(0.59 \leq r \leq 1.091)
\end{align}
\]

The basic bootstrap interval is \([0.59, 1.091]\).
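A by-hand check (a sketch; boot uses its own quantile routine, so the result may differ slightly in the last digit):

q_basic <- quantile(boot_out$t, probs = c(0.975, 0.025), type = 6)
2 * boot_out$t0 - q_basic  # lower limit, then upper limit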

To confirm:

boot.ci(boot_out, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = "basic")
##
## Intervals :
## Level      Basic
## 95%   ( 0.5900,  1.0911 )
## Calculations and Intervals on Original Scale

But we're still outside the range we want.

Percentile Confidence Intervals

Here's an easy solution. Line up the bootstrap estimates from lowest to highest, then take the 2.5th and 97.5th percentile.

quantile(boot_out$t, probs = c(.025, .975), type = 6)
##     2.5%    97.5%
## 0.460775 0.962000

Compare:

boot.ci(boot_out, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.4609,  0.9620 )
## Calculations and Intervals on Original Scale

(The slight difference is due to boot using its own quantile function.)

Looks like we have a winner. Our confidence interval will necessarily be limited to the range of plausible values. But let's look at one other.

Bias Corrected and Accelerated (BCa) Confidence Intervals

BCa intervals require estimating two terms: a bias term and an acceleration term.

Bias is by now a familiar concept, though the calculation for the BCa interval is a little different. For BCa confidence intervals, estimate the bias correction term, \(\hat{z}_0\), as follows:

\[
\hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B}\right)
\]

where \(\#\) is the counting operator. The formula looks complicated but can be thought of as estimating something close to the median bias transformed into normal deviates (\(\Phi^{-1}\) is the inverse standard normal cdf).

The acceleration term is estimated as follows:

\[
\hat{a} = \frac{\sum^n_{i=1}(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)})^3}{6\{\sum^n_{i=1}(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)})^2\}^{3/2}}
\]

where \(\hat{\theta}_{(\cdot)}\) is the mean of the bootstrap estimates and \(\hat{\theta}_{(i)}\) the estimate after deleting the \(i\)th case. The process of estimating a statistic \(n\) times, each time dropping one of the \(i \in N\) observations, is known as the jackknife estimate.

The purpose of the acceleration term is to account for situations in which the standard error of an estimator changes depending on the true population value. This is exactly what happens with the correlation (the SE estimator we provided at the start of the post only works when \(r = 0\)). An equivalent way of thinking about this is that it accounts for skew in the sampling distribution, like what we have seen in the prior histograms.
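For concreteness, here is a sketch of estimating both terms by hand (boot.ci() computes them internally; the object names are illustrative and tbl is the original data):

theta_hat  <- boot_out$t0  # estimate from the original sample
theta_boot <- boot_out$t   # bootstrap estimates

# Bias-correction term: share of bootstrap estimates below the original
# estimate, mapped to a normal deviate
z0 <- qnorm(mean(theta_boot < theta_hat))

# Acceleration term via the jackknife: drop one observation at a time
theta_jack <- sapply(seq_len(nrow(tbl)), function(i) cor(tbl$lsat[-i], tbl$gpa[-i]))
u     <- mean(theta_jack) - theta_jack
a_hat <- sum(u^3) / (6 * sum(u^2)^(3/2))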

Armed with our bias correction and acceleration term, we now estimate the quantiles we will use for establishing the confidence limits.

\[
\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha)}}{1-\hat{a}(\hat{z}_0 + z^{(\alpha)}) } \right)
\]

\[
\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(1 - \alpha)}}{1-\hat{a}(\hat{z}_0 + z^{(1-\alpha)}) } \right)
\]

where \(\alpha\) is our Type-I error rate, usually .05.

Our confidence limits are:

\[
95\% \text{ CI} = [\hat{\theta}^{*(\alpha_1)}, \hat{\theta}^{*(\alpha_2)}]
\]

Based on the formulas above, it should be clear that \(\alpha_1\) and \(\alpha_2\) reduce to the percentile interval endpoints when the bias and acceleration terms are zero. The effect of the bias and acceleration corrections is to change the percentiles we use to establish our limits.

If we perform all of the above calculations, we get the following:

boot.ci(boot_out, type = "bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = "bca")
##
## Intervals :
## Level       BCa
## 95%   ( 0.3948,  0.9443 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

We get a warning message that BCa intervals may be unstable. This is because the accuracy of the bias and acceleration terms requires a large number of bootstrap samples, and, especially when using the jackknife to get the acceleration parameter, this can be computationally intensive. If this is a concern, there is another type of confidence interval known as the ABC interval that provides a satisfactory approximation to BCa intervals and is less computationally demanding. Type ?boot::abc.ci at the command line for how to implement this in R.

Transformations

What does it mean that calculations and intervals are on the original scale?

There are sometimes advantages to transforming a statistic so that it is on a different scale. An example is the correlation coefficient. We mentioned briefly above that the usual way of performing inference is to use the Fisher-\(z\) transformation.

\[
z = \frac{1}{2}\text{ln}\left(\frac{1+r}{1-r} \right)
\]

This transformation is normally distributed with standard error \(\frac{1}{\sqrt{N – 3}}\), so we can construct confidence intervals the usual way and then reverse-transform the limits using the function's inverse. For Fisher-\(z\), the inverse of the transformation function is:

\[
r = \frac{\text{exp}(2z) - 1}{\text{exp}(2z) + 1}
\]

If we prefer to work with the transformed statistic, we can include the transformation function and its inverse in the boot.ci function. Define the transformations:

fisher_z <- function(r) .5 * log((1 + r) / (1 - r))

inv_fisher_z <- function(z) (exp(2 * z) - 1) / (exp(2 * z) + 1)

We can use these functions within a call to boot.ci. What we get in return will depend on which functions we specify.

  • If only the transformation function is applied, the confidence intervals are on the transformed scale.
  • If the transformation and the inverse transformation functions are applied, the confidence intervals are calculated on the transformed scale but returned on the original scale.

Recall that not specifying either function returns:

boot.ci(
    boot_out,
    type = c("norm", "basic", "perc", "bca")
)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc",
##     "bca"))
##
## Intervals :
## Level      Normal              Basic
## 95%   ( 0.5247,  1.0368 )   ( 0.5900,  1.0911 )
##
## Level     Percentile            BCa
## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

Specifying the transformation only returns:

boot.ci(
    boot_out,
    h = fisher_z,
    type = c("norm", "basic", "perc", "bca")
)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc",
##     "bca"), h = fisher_z)
##
## Intervals :
## Level      Normal              Basic
## 95%   ( 0.212,  1.688 )   ( 0.098,  1.572 )
##
## Level     Percentile            BCa
## 95%   ( 0.498,  1.972 )   ( 0.417,  1.777 )
## Calculations and Intervals on  Transformed Scale
## Some BCa intervals may be unstable

Specifying the transformation and its inverse returns the following:

boot.ci(
    boot_out,
    h = fisher_z,
    hinv = inv_fisher_z,
    type = c("norm", "basic", "perc", "bca")
)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc",
##     "bca"), h = fisher_z, hinv = inv_fisher_z)
##
## Intervals :
## Level      Normal              Basic
## 95%   ( 0.2091,  0.9340 )   ( 0.0981,  0.9173 )
##
## Level     Percentile            BCa
## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )
## Calculations on Transformed Scale;  Intervals on Original Scale
## Some BCa intervals may be unstable

Conclusion

It is hoped that this post clarifies the output from boot::boot.ci, and in particular facilitates understanding the messages the function produces. We saw that the percentile and BCa methods were the only ones considered here that were guaranteed to return a confidence interval that respected the statistic's sampling space. It turns out that there are theoretical grounds to prefer BCa in general. It is "second-order accurate", meaning that it converges faster to the correct coverage. Unless you have a reason to do otherwise, make sure to perform a sufficient number of bootstrap replicates (a few thousand is usually not too computationally intensive) and go with reporting BCa intervals.

To leave a comment for the author, please follow the link and comment on their blog: Rstats on pi: predict/infer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


More models, more features: what’s new in ‘parameters’ 0.2.0

Posted: 29 Sep 2019 05:00 PM PDT

[This article was first published on R on easystats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The easystats project continues to grow, expanding its capabilities and features, and the parameters package 0.2.0 update is now on CRAN.

The primary goal of this package is to provide utilities for processing the parameters of various statistical models. It is useful for end-users as well as developers, as it is a lightweight and open-developed package.

The main function, model_parameters(), can be seen as an alternative to broom::tidy(). However, the package also includes many more useful features, some of which are described in our improved documentation.
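For instance, a minimal illustration (a hypothetical model, not taken from the announcement) of what the workflow looks like:

library(parameters)

# Any supported model works the same way; a simple linear model as an example
m <- lm(Sepal.Length ~ Species + Petal.Width, data = iris)
model_parameters(m)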

Improved Support

Besides stabilizing and improving the functions for the most popular models (glm(), glmer(), stan_glm(), psych and lavaan…), the functions p_value(), ci(), standard_error(), standardize() and most importantly model_parameters() now support many more model objects, including mixed models from packages nlme, glmmTMB or GLMMadaptive, zero-inflated models from package pscl, other regression types from packages gam or mgcv, fixed effects regression models from panelr, lfe, feisr or plm, and structural models from FactoMineR.

Improved Printing

For models with special components, in particular zero-inflated models, model_parameters() separates these components for a clearer output.

## # Conditional component
##
## Parameter   | Coefficient |   SE |         95% CI |     z |      p
## ------------------------------------------------------------------
## (Intercept) |       -0.36 | 0.28 | [-0.90,  0.18] | -1.30 | > .1
## spp (PR)    |       -1.27 | 0.24 | [-1.74, -0.80] | -5.27 | < .001
## spp (DM)    |        0.27 | 0.14 | [ 0.00,  0.54] |  1.95 | 0.05
## spp (EC-A)  |       -0.57 | 0.21 | [-0.97, -0.16] | -2.75 | < .01
## spp (EC-L)  |        0.67 | 0.13 | [ 0.41,  0.92] |  5.20 | < .001
## spp (DES-L) |        0.63 | 0.13 | [ 0.38,  0.87] |  4.96 | < .001
## spp (DF)    |        0.12 | 0.15 | [-0.17,  0.40] |  0.78 | > .1
## mined (no)  |        1.27 | 0.27 | [ 0.74,  1.80] |  4.72 | < .001
##
## # Zero-Inflated component
##
## Parameter   | Coefficient |   SE |         95% CI |     z |      p
## ------------------------------------------------------------------
## (Intercept) |        0.79 | 0.27 | [ 0.26,  1.32] |  2.90 | < .01
## mined (no)  |       -1.84 | 0.31 | [-2.46, -1.23] | -5.87 | < .001

Join the team

There is still room for improvement, and some new exciting features are already planned. Feel free to let us know how we could further improve this package!

Note that easystats is a new project in active development, looking for contributors and supporters. Thus, do not hesitate to contact one of us if you want to get involved 🙂

  • Check out our other blog posts here!

To leave a comment for the author, please follow the link and comment on their blog: R on easystats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


bamlss: A Lego Toolbox for Flexible Bayesian Regression

Posted: 29 Sep 2019 03:00 PM PDT

[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Modular R tools for Bayesian regression are provided by bamlss: From classic MCMC-based GLMs and GAMs to distributional models using the lasso or gradient boosting.

Citation

Umlauf N, Klein N, Simon T, Zeileis A (2019). "bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond)." arXiv:1909.11784, arXiv.org E-Print Archive. https://arxiv.org/abs/1909.11784

Abstract

Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing both as an appealing framework for regularizing or penalizing model estimation as well as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that allows to easily combine all three aspects has not yet been available. For filling this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as "Lego bricks" encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.

Software

CRAN package: https://CRAN.R-project.org/package=bamlss
Replication script: bamlss.R
Project web page: http://www.bamlss.org/

Quick overview

To illustrate that bamlss follows the same familiar workflow as other regression packages, such as the basic stats package or the well-established mgcv or gamlss, two quick examples are provided: a Bayesian logit model and a location-scale model where both mean and variance of a normal response depend on a smooth term.

The logit model is a basic labor force participation model, a standard application in microeconometrics. Here, the data are loaded from the AER package and the same model formula is specified that would also be used for glm() (as shown on ?SwissLabor).

data("SwissLabor", package = "AER")  f <- participation ~ income + age + education + youngkids + oldkids + foreign + I(age^2)  

Then, the model can be estimated with bamlss() using essentially the same look-and-feel as for glm(). The default is to use Markov chain Monte Carlo after obtaining initial parameters via backfitting.

library("bamlss")  set.seed(123)  b <- bamlss(f, family = "binomial", data = SwissLabor)  summary(b)    ## Call:  ## bamlss(formula = f, family = "binomial", data = SwissLabor)  ## ---  ## Family: binomial  ## Link function: pi = logit  ## *---  ## Formula pi:  ## ---  ## participation ~ income + age + education + youngkids + oldkids +  ##     foreign + I(age^2)  ## -  ## Parametric coefficients:  ##                 Mean     2.5%      50%    97.5% parameters  ## (Intercept)  6.15503  1.55586  5.99204 11.11051      6.196  ## income      -1.10565 -1.56986 -1.10784 -0.68652     -1.104  ## age          3.45703  2.05897  3.44567  4.79139      3.437  ## education    0.03354 -0.02175  0.03284  0.09223      0.033  ## youngkids   -1.17906 -1.51099 -1.17683 -0.83047     -1.186  ## oldkids     -0.24122 -0.41231 -0.24099 -0.08054     -0.241  ## foreignyes   1.16749  0.76276  1.17035  1.55624      1.168  ## I(age^2)    -0.48990 -0.65660 -0.49205 -0.31968     -0.488  ## alpha        0.87585  0.32301  0.99408  1.00000         NA  ## ---  ## Sampler summary:  ## -  ## DIC = 1033.325 logLik = -512.7258 pd = 7.8734  ## runtime = 1.417  ## ---  ## Optimizer summary:  ## -  ## AICc = 1033.737 converged = 1 edf = 8  ## logLik = -508.7851 logPost = -571.3986 nobs = 872  ## runtime = 0.012  ## ---  

The summary is based on the MCMC samples, which suggest "significant" effects for all covariates, except for variable education, since the 95% credible interval contains zero. In addition, the acceptance probabilities alpha are reported and indicate proper behavior of the MCMC algorithm. The column parameters shows respective posterior mode estimates of the regression coefficients, which are calculated by the upstream backfitting algorithm.

To show a more flexible regression model we fit a distributional scale-location model to the well-known simulated motorcycle accident data, provided as mcycle in the MASS package.

Here, the relationship between head acceleration and time after impact is captured by smooth relationships in both mean and variance. See also ?gaulss in the mgcv package for the same type of model estimated with REML rather than MCMC. We load the data, set up a list of two formulas with smooth terms (and an increased number of knots k for more flexibility), fit the model almost as usual, and then visualize the fitted terms along with 95% credible intervals.

data("mcycle", package = "MASS")  f <- list(accel ~ s(times, k = 20), sigma ~ s(times, k = 20))  set.seed(456)  b <- bamlss(f, data = mcycle, family = "gaussian")  plot(b, model = c("mu", "sigma"))  

mcycle distributional regression

Flexible count regression for lightning reanalysis

Finally, we show a more challenging case study. Here, emphasis is given to the illustration of the workflow. For more details on the background for the data and interpretation of the model, see Section 5 in the full paper linked above. The goal is to establish a probabilistic model linking positive counts of cloud-to-ground lightning discharges in the European Eastern Alps to atmospheric quantities from a reanalysis dataset.

The lightning measurements form the response variable and regressors are taken from the atmospheric quantities in ECMWF's ERA5 reanalysis data. Both have a temporal resolution of 1 hour for the years 2010-2018 and a spatial mesh size of approximately 32 km. The subset of the data analyzed, along with the fitted bamlss model, is provided in the FlashAustria data package on R-Forge, which can be installed by:

install.packages("FlashAustria", repos = "http://R-Forge.R-project.org")  

To model only those counts with at least one lightning discharge, we employ a negative binomial count distribution truncated at zero. The data can be loaded and the regression formula set up as follows:

data("FlashAustria", package = "FlashAustria")  f <- list(    counts ~ s(d2m, bs = "ps") + s(q_prof_PC1, bs = "ps") +      s(cswc_prof_PC4, bs = "ps") + s(t_prof_PC1, bs = "ps") +      s(v_prof_PC2, bs = "ps") + s(sqrt_cape, bs = "ps"),    theta ~ s(sqrt_lsp, bs = "ps")  )  

The expectation mu of the underlying untruncated negative binomial model is modeled by various smooth terms for the atmospheric variables, while the overdispersion parameter theta depends on only one smooth regressor. To fit this challenging model, gradient boosting is employed in a first step to obtain initial values for the subsequent MCMC sampler. Fitting takes about 30 minutes on a well-equipped standard PC, so in order to move quickly through the example we load the pre-computed model from the FlashAustria package:

data("FlashAustriaModel", package = "FlashAustria")
b <- FlashAustriaModel

But, of course, the model can also be refitted:

set.seed(111)
b <- bamlss(f, family = "ztnbinom", data = FlashAustriaTrain,
  optimizer = boost, maxit = 1000,        ## Boosting arguments.
  thin = 5, burnin = 1000, n.iter = 6000) ## Sampler arguments.

To explore this model in some more detail, we show a couple of visualizations. First, the contribution of the individual model terms to the log-likelihood during gradient boosting is depicted.

pathplot(b, which = "loglik.contrib", intercept = FALSE)  

flash model, likelihood contributions

Subsequently, we show traceplots of the MCMC samples (left) along with the corresponding autocorrelations for two spline coefficients of the term s(sqrt_cape) in the model for mu.

plot(b, model = "mu", term = "s(sqrt_cape)", which = "samples")  

flash model, MCMC samples

Next, the effects of the terms s(sqrt_cape) and s(q_prof_PC1) from the model for mu and term s(sqrt_lsp) from the model for theta are shown along with 95% credible intervals derived from the MCMC samples.

plot(b, term = c("s(sqrt_cape)", "s(q_prof_PC1)", "s(sqrt_lsp)"),    rug = TRUE, col.rug = "#39393919")  

flash model, fitted smooth effects

Finally, estimated probabilities for observing 10 or more lightning counts (within one grid box) are computed and visualized. The reconstructions for four time points on September 15-16, 2001 are shown.

fit <- predict(b, newdata = FlashAustriaCase, type = "parameter")
fam <- family(b)
FlashAustriaCase$P10 <- 1 - fam$p(9, fit)
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")

library("ggplot2")
ggplot() + geom_sf(aes(fill = P10), data = FlashAustriaCase) +
  colorspace::scale_fill_continuous_sequential("Oslo", rev = TRUE) +
  geom_sf(data = world, col = "white", fill = NA) +
  coord_sf(xlim = c(7.95, 17), ylim = c(45.45, 50), expand = FALSE) +
  facet_wrap(~time, nrow = 2) + theme_minimal() +
  theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0))

flash model, spatial maps

To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis.


Getting started with {golem}

Posted: 29 Sep 2019 11:30 AM PDT

[This article was first published on Rtask, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A little blog post about where to look if you want to get started with {golem}, and an invitation to code with us in October. go-what? If you've never heard of it before, {golem} is a tool for building production-grade Shiny applications. With {golem}, Shiny developers have a toolkit for making stable, easy-to-maintain, and robust web applications that are ready for production.

The article Getting started with {golem} appeared first on Rtask.

To leave a comment for the author, please follow the link and comment on their blog: Rtask.


Tidy forecasting in R

Posted: 28 Sep 2019 05:00 PM PDT

[This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The fable package for doing tidy forecasting in R is now on CRAN. Like tsibble and feasts, it is also part of the tidyverts family of packages for analysing, modelling and forecasting many related time series (stored as tsibbles).

For a brief introduction to tsibbles, see this post from last month.

Here we will forecast Australian tourism data by state/region and purpose. This data is stored in the tourism tsibble where Trips contains domestic visitor nights in thousands.

library(tidyverse)
library(tsibble)
library(lubridate)
library(fable)
tourism
## # A tsibble: 24,320 x 5 [1Q]  ## # Key:       Region, State, Purpose [304]  ##    Quarter Region   State           Purpose  Trips  ##                            ##  1 1998 Q1 Adelaide South Australia Business  135.  ##  2 1998 Q2 Adelaide South Australia Business  110.  ##  3 1998 Q3 Adelaide South Australia Business  166.  ##  4 1998 Q4 Adelaide South Australia Business  127.  ##  5 1999 Q1 Adelaide South Australia Business  137.  ##  6 1999 Q2 Adelaide South Australia Business  200.  ##  7 1999 Q3 Adelaide South Australia Business  169.  ##  8 1999 Q4 Adelaide South Australia Business  134.  ##  9 2000 Q1 Adelaide South Australia Business  154.  ## 10 2000 Q2 Adelaide South Australia Business  169.  ## # … with 24,310 more rows

There are 304 combinations of Region, State and Purpose, each one defining a time series of 80 observations.
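
As a quick aside (an added check, not in the original post), this can be verified directly from the tsibble's key structure:

n_keys(tourism)                   # 304 key combinations
tourism %>%
  as_tibble() %>%
  count(Region, State, Purpose)   # n = 80 quarterly observations for every series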

To simplify the outputs, we will abbreviate the state names.

tourism <- tourism %>%    mutate(      State = recode(State,        "Australian Capital Territory" = "ACT",        "New South Wales" = "NSW",        "Northern Territory" = "NT",        "Queensland" = "QLD",        "South Australia" = "SA",        "Tasmania" = "TAS",        "Victoria" = "VIC",        "Western Australia" = "WA"      )    )

Forecasting a single time series

Although the fable package is designed to handle many time series, we will begin by demonstrating its use on a single time series. For this purpose, we will extract the tourism data for holidays in the Snowy Mountains region of NSW.

snowy <- tourism %>%
  filter(
    Region == "Snowy Mountains",
    Purpose == "Holiday"
  )
snowy
## # A tsibble: 80 x 5 [1Q]  ## # Key:       Region, State, Purpose [1]  ##    Quarter Region          State Purpose Trips  ##                        ##  1 1998 Q1 Snowy Mountains NSW   Holiday 101.   ##  2 1998 Q2 Snowy Mountains NSW   Holiday 112.   ##  3 1998 Q3 Snowy Mountains NSW   Holiday 310.   ##  4 1998 Q4 Snowy Mountains NSW   Holiday  89.8  ##  5 1999 Q1 Snowy Mountains NSW   Holiday 112.   ##  6 1999 Q2 Snowy Mountains NSW   Holiday 103.   ##  7 1999 Q3 Snowy Mountains NSW   Holiday 254.   ##  8 1999 Q4 Snowy Mountains NSW   Holiday  74.9  ##  9 2000 Q1 Snowy Mountains NSW   Holiday 118.   ## 10 2000 Q2 Snowy Mountains NSW   Holiday 114.   ## # … with 70 more rows
snowy %>% autoplot(Trips)

For this data set, a reasonable benchmark forecast method is the seasonal naive method, where forecasts are set to be equal to the last observed value from the same quarter. Alternative models for this series are ETS and ARIMA models. All these can be included in a single call to the model() function like this.

fit <- snowy %>%
  model(
    snaive = SNAIVE(Trips ~ lag("year")),
    ets = ETS(Trips),
    arima = ARIMA(Trips)
  )
fit
## # A mable: 1 x 6  ## # Key:     Region, State, Purpose [1]  ##   Region          State Purpose snaive   ets         arima                   ##                                          ## 1 Snowy Mountains NSW   Holiday  

The returned object is called a "mable" or model table, where each cell corresponds to a fitted model. Because we have only fitted models to one time series, this mable has only one row.
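
If I'm not mistaken, the usual fabletools helpers also work on a mable for inspecting the fitted models (a quick sketch, not shown in the original post):

glance(fit)   # one row per model with fit statistics such as AICc
tidy(fit)     # estimated coefficients of each fitted model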

To forecast all models, we pass the object to the forecast function.

fc <- fit %>%
  forecast(h = 12)
fc
## # A fable: 36 x 7 [1Q]  ## # Key:     Region, State, Purpose, .model [3]  ##    Region          State Purpose .model Quarter Trips .distribution  ##                                  ##  1 Snowy Mountains NSW   Holiday snaive 2018 Q1 119.  N(119,  666)   ##  2 Snowy Mountains NSW   Holiday snaive 2018 Q2 124.  N(124,  666)   ##  3 Snowy Mountains NSW   Holiday snaive 2018 Q3 378.  N(378,  666)   ##  4 Snowy Mountains NSW   Holiday snaive 2018 Q4  84.7 N( 85,  666)   ##  5 Snowy Mountains NSW   Holiday snaive 2019 Q1 119.  N(119, 1331)   ##  6 Snowy Mountains NSW   Holiday snaive 2019 Q2 124.  N(124, 1331)   ##  7 Snowy Mountains NSW   Holiday snaive 2019 Q3 378.  N(378, 1331)   ##  8 Snowy Mountains NSW   Holiday snaive 2019 Q4  84.7 N( 85, 1331)   ##  9 Snowy Mountains NSW   Holiday snaive 2020 Q1 119.  N(119, 1997)   ## 10 Snowy Mountains NSW   Holiday snaive 2020 Q2 124.  N(124, 1997)   ## # … with 26 more rows

The returned object is a "fable" or forecast table with the following characteristics:

  • the .model column becomes an additional key, so individual models can be pulled out of the fable by ordinary filtering (see the short sketch after this list);
  • the .distribution column contains the estimated probability distribution of the response variable in future time periods;
  • the Trips column contains the point forecasts equal to the mean of the probability distribution.
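
Because .model is part of the key, standard dplyr verbs slice the fable as expected; a couple of quick examples (added here for illustration, not from the post):

fc %>% filter(.model == "ets")                                       # keep only the ETS forecasts
fc %>% filter(.model == "snaive", Quarter >= yearquarter("2019 Q1"))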

The autoplot() function will produce a plot of all forecasts. By default, level=c(80,95) so 80% and 95% prediction intervals are shown. But to avoid clutter, we will set level=NULL to show no prediction intervals.

fc %>%    autoplot(snowy, level = NULL) +    ggtitle("Forecasts for Snowy Mountains holidays") +    xlab("Year") +    guides(colour = guide_legend(title = "Forecast"))

If you want to compute the prediction intervals, the hilo() function can be used:

hilo(fc, level = 95)
## # A tsibble: 36 x 7 [1Q]  ## # Key:       Region, State, Purpose, .model [3]  ##    Region          State Purpose .model Quarter Trips          `95%`  ##                                   ##  1 Snowy Mountains NSW   Holiday snaive 2018 Q1 119.  [ 68.5, 170]95  ##  2 Snowy Mountains NSW   Holiday snaive 2018 Q2 124.  [ 73.9, 175]95  ##  3 Snowy Mountains NSW   Holiday snaive 2018 Q3 378.  [327.6, 429]95  ##  4 Snowy Mountains NSW   Holiday snaive 2018 Q4  84.7 [ 34.1, 135]95  ##  5 Snowy Mountains NSW   Holiday snaive 2019 Q1 119.  [ 47.5, 191]95  ##  6 Snowy Mountains NSW   Holiday snaive 2019 Q2 124.  [ 53.0, 196]95  ##  7 Snowy Mountains NSW   Holiday snaive 2019 Q3 378.  [306.6, 450]95  ##  8 Snowy Mountains NSW   Holiday snaive 2019 Q4  84.7 [ 13.1, 156]95  ##  9 Snowy Mountains NSW   Holiday snaive 2020 Q1 119.  [ 31.4, 207]95  ## 10 Snowy Mountains NSW   Holiday snaive 2020 Q2 124.  [ 36.9, 212]95  ## # … with 26 more rows

Forecasting many series

To scale this up to include all series in the tourism data set requires no more work — we use exactly the same code.

fit <- tourism %>%
  model(
    snaive = SNAIVE(Trips ~ lag("year")),
    ets = ETS(Trips),
    arima = ARIMA(Trips)
  )
fit
## # A mable: 304 x 6  ## # Key:     Region, State, Purpose [304]  ##    Region       State Purpose  snaive   ets        arima                     ##                                          ##  1 Adelaide     SA    Business       ##  4 Adelaide     SA    Visiting       ##  6 Adelaide Hi… SA    Holiday               ##  7 Adelaide Hi… SA    Other       ##  8 Adelaide Hi… SA    Visiting              ##  9 Alice Sprin… NT    Business    ## 10 Alice Sprin… NT    Holiday     ## # … with 294 more rows

Now the mable includes models for every combination of keys in the tourism data set.

We can extract information about some specific model using the filter, select and report functions.

fit %>%    filter(Region == "Snowy Mountains", Purpose == "Holiday") %>%    select(arima) %>%    report()
## Series: Trips   ## Model: ARIMA(1,0,0)(0,1,2)[4]   ##   ## Coefficients:  ##         ar1    sma1    sma2  ##       0.216  -0.371  -0.190  ## s.e.  0.116   0.128   0.116  ##   ## sigma^2 estimated as 592.9:  log likelihood=-350  ## AIC=707   AICc=708   BIC=716

When the mable is passed to the forecast() function, forecasts are computed for every model and every key combination.

fc <- fit %>%
  forecast(h = "3 years")
fc
## # A fable: 10,944 x 7 [1Q]  ## # Key:     Region, State, Purpose, .model [912]  ##    Region   State Purpose  .model Quarter Trips .distribution  ##                            ##  1 Adelaide SA    Business snaive 2018 Q1  129. N(129, 2018)   ##  2 Adelaide SA    Business snaive 2018 Q2  174. N(174, 2018)   ##  3 Adelaide SA    Business snaive 2018 Q3  185. N(185, 2018)   ##  4 Adelaide SA    Business snaive 2018 Q4  197. N(197, 2018)   ##  5 Adelaide SA    Business snaive 2019 Q1  129. N(129, 4036)   ##  6 Adelaide SA    Business snaive 2019 Q2  174. N(174, 4036)   ##  7 Adelaide SA    Business snaive 2019 Q3  185. N(185, 4036)   ##  8 Adelaide SA    Business snaive 2019 Q4  197. N(197, 4036)   ##  9 Adelaide SA    Business snaive 2020 Q1  129. N(129, 6054)   ## 10 Adelaide SA    Business snaive 2020 Q2  174. N(174, 6054)   ## # … with 10,934 more rows

Note the use of natural language to specify the forecast horizon. The forecast() function is able to interpret many different time specifications. For quarterly data, h = "3 years" is equivalent to setting h = 12.
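
For example, these two calls should be equivalent for this quarterly series (a quick illustration, not in the original post):

fit %>% forecast(h = "3 years")
fit %>% forecast(h = 12)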

Plots of individual forecasts can also be produced, although filtering is helpful to avoid plotting too many series at once.

fc %>%    filter(Region == "Snowy Mountains") %>%    autoplot(tourism, level = NULL) +    xlab("Year") + ylab("Overnight trips (thousands)")

Forecast accuracy calculations

To compare the forecast accuracy of these models, we will create a training data set containing all data up to 2014. We will then forecast the remaining years in the data set and compare the results with the actual values.

train <- tourism %>%
  filter(year(Quarter) <= 2014)
fit <- train %>%
  model(
    ets = ETS(Trips),
    arima = ARIMA(Trips),
    snaive = SNAIVE(Trips)
  ) %>%
  mutate(mixed = (ets + arima + snaive) / 3)

Here we have introduced an ensemble forecast (mixed) which is a simple average of the three fitted models. Note that forecast() will produce distributional forecasts from the ensemble as well, taking into account the correlations between the forecast errors of the component models.

fc <- fit %>% forecast(h = "3 years")
fc %>%    filter(Region == "Snowy Mountains") %>%    autoplot(tourism, level = NULL)

Now to check the accuracy, we use the accuracy() function. By default it computes several point forecasting accuracy measures such as MAE, RMSE, MAPE and MASE for every key combination.

accuracy(fc, tourism)
## # A tibble: 1,216 x 12  ##    .model Region State Purpose .type    ME  RMSE   MAE      MPE  MAPE  MASE  ##                       ##  1 arima  Adela… SA    Busine… Test  22.5  28.5  25.3    11.9    14.0 0.765  ##  2 arima  Adela… SA    Holiday Test  21.9  34.8  28.0     9.93   14.8 1.31   ##  3 arima  Adela… SA    Other   Test   4.71 17.5  14.6     0.529  20.2 1.11   ##  4 arima  Adela… SA    Visiti… Test  32.8  37.1  32.8    13.7    13.7 1.05   ##  5 arima  Adela… SA    Busine… Test   1.31  5.58  3.57 -Inf     Inf   1.09   ##  6 arima  Adela… SA    Holiday Test   6.46  7.43  6.46   37.4    37.4 1.14   ##  7 arima  Adela… SA    Other   Test   1.35  2.79  1.93  -31.0    99.4 1.71   ##  8 arima  Adela… SA    Visiti… Test   8.37 12.6  10.4    -3.98   72.3 1.35   ##  9 arima  Alice… NT    Busine… Test   9.85 12.2  10.7    34.4    44.3 1.74   ## 10 arima  Alice… NT    Holiday Test   4.80 11.3   9.30    4.46   35.2 1.00   ## # … with 1,206 more rows, and 1 more variable: ACF1 

But because we have generated distributional forecasts, it is also interesting to look at the accuracy using CRPS (Continuous Rank Probability Scores) and Winkler Scores (for 95% prediction intervals).
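
As a reminder of what these measure (a sketch of the standard definitions, not from the original post): for a forecast distribution with CDF $F$ and realized value $y$, $\text{CRPS}(F, y) = \int \left(F(x) - \mathbf{1}\{y \le x\}\right)^2 dx$, while the Winkler score of a central $(1-\alpha)$ prediction interval $[\ell, u]$ is its width plus a penalty of $\tfrac{2}{\alpha}(\ell - y)$ if $y < \ell$ or $\tfrac{2}{\alpha}(y - u)$ if $y > u$. Lower values are better for both.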

fc_accuracy <- accuracy(fc, tourism,    measures = list(      point_accuracy_measures,      interval_accuracy_measures,      distribution_accuracy_measures    )  )
fc_accuracy %>%    group_by(.model) %>%    summarise(      RMSE = mean(RMSE),      MAE = mean(MAE),      MASE = mean(MASE),      Winkler = mean(winkler),      CRPS = mean(CRPS)    ) %>%    arrange(RMSE)
## # A tibble: 4 x 6  ##   .model  RMSE   MAE  MASE Winkler  CRPS  ##             ## 1 mixed   19.8  16.0 0.997    104.  11.4  ## 2 ets     20.2  16.4 1.00     128.  11.9  ## 3 snaive  21.5  17.3 1.17     121.  12.8  ## 4 arima   21.9  17.8 1.07     140.  13.0

In this case, the mixed model is doing best on all accuracy measures.

Moving from forecast to fable

Many readers will be familiar with the forecast package and will wonder about the differences between forecast and fable. Here are some of the main differences, followed by a short side-by-side sketch.

  • fable is designed for tsibble objects, forecast is designed for ts objects.
  • fable handles many time series at a time, forecast handles one time series at a time.
  • fable can fit multiple models at once, forecast fits one model at a time.
  • forecast produces point forecasts and prediction intervals. fable produces point forecasts and distribution forecasts. In fable, you can get prediction intervals from the forecast object using hilo() and in plots using autoplot().
  • fable handles ensemble forecasting easily whereas forecast has no facilities for ensembles.
  • fable has a more consistent interface with every model specified as a formula.
  • Automated modelling in fable is obtained by simply not specifying the right hand side of the formula. This was shown in the ARIMA() and ETS() functions here.
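
To make the contrast concrete, here is a rough side-by-side sketch of the same kind of workflow in both packages (an added illustration on a built-in series, not code from the post):

# forecast (ts-based):
fit_old <- forecast::auto.arima(AirPassengers)
autoplot(forecast::forecast(fit_old, h = 24))

# fable (tsibble-based), using the packages already loaded above:
as_tsibble(AirPassengers) %>%
  model(arima = ARIMA(value)) %>%
  forecast(h = 24) %>%
  autoplot(as_tsibble(AirPassengers))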

Subsequent posts will explore other features of the fable package.

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.


Does news coverage boost support for presidential candidates in the Democratic primary?

Posted: 28 Sep 2019 05:00 PM PDT

[This article was first published on R on Jacob Long, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Matt Grossmann noted the close relationship between
the amount of news coverage candidates in the Democratic primary have been
receiving and their polling numbers.

This got me thinking about what the available data can bring to bear on this
question. I have ongoing interest in longitudinal data and the
software for analyzing it, so this seemed like a fun,
quick project. Luckily, there are several great resources to take the pain out
of data collection in this case.

The GDELT project
offers a TV API
that allows anyone to look at how much cable news channels mention the
candidates (specifically, the number of 15-second windows of coverage that
mention the candidate by name). Media Cloud also
lets you look at how often candidates are mentioned in online news articles.
Helpfully, the fine folks at FiveThirtyEight have already compiled these data, as well as the polls.

Now I'm going to walk through how to get these data into R. Skip to the
analysis by clicking here.

Getting the data

As mentioned, FiveThirtyEight has compiled most of the data we're interested in,
albeit in different places. We will read them into R as separate data frames and
join them later. There are some warnings from the CSV parser but they aren't
important for our purposes.

library(tidyverse)
library(jtools)
library(tsibble)

cable_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/cable_weekly.csv")
online_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/online_weekly.csv")
# `end_date` is converted to Date class in the cleaning step below
polls <- read_csv("https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv")

Now we have the data, but we still have to get it in shape. First, we deal with
the polls.

Polls

These data are formatted such that every row is a unique combination
of candidate and poll. So if a poll included 20 candidates, there would be 20
rows to cover the results of that single poll. This is actually a good thing
for our purposes.

I first create two vectors of candidate names. The first is the
candidates who will be retained for analysis, in the format they are named in
the polling data. The second is the same set of candidates, but with their less
formal names that are used in the media data.

candidates <- c("Amy Klobuchar", "Andrew Yang", "Bernard Sanders",
                "Beto O'Rourke", "Bill de Blasio", "Cory A. Booker",
                "Elizabeth Warren", "Eric Swalwell", "Jay Robert Inslee",
                "Joe Sestak", "John Hickenlooper", "John K. Delaney",
                "Joseph R. Biden Jr.", "Julián Castro", "Kamala D. Harris",
                "Kirsten E. Gillibrand", "Marianne Williamson",
                "Michael F. Bennet", "Pete Buttigieg", "Seth Moulton",
                "Steve Bullock", "Tim Ryan", "Tom Steyer", "Tulsi Gabbard",
                "Wayne Messam")

candidates_clean <- c("Amy Klobuchar", "Andrew Yang", "Bernie Sanders",
                      "Beto O'Rourke", "Bill de Blasio", "Cory Booker",
                      "Elizabeth Warren", "Eric Swalwell", "Jay Inslee",
                      "Joe Sestak", "John Hickenlooper", "John Delaney",
                      "Joe Biden", "Julian Castro", "Kamala Harris",
                      "Kirsten Gillibrand", "Marianne Williamson",
                      "Michael Bennet", "Pete Buttigieg", "Seth Moulton",
                      "Steve Bullock", "Tim Ryan", "Tom Steyer",
                      "Tulsi Gabbard", "Wayne Messam")

Now we do some filtering and data cleaning for polls. See the inline comments
for some explanations, but basically we're using only polls of known quality,
that cover the time period for which we have media data, and only national
polls.

polls <- polls %>%
  # Convert date to date format
  mutate(end_date = as.Date(end_date, format = "%m/%d/%y")) %>%
  filter(
    # include only polls of at least modest quality
    fte_grade %in% c("C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"),
    # only include polls ending on or after 12/30/2018
    end_date >= as.Date("12/30/2018", "%m/%d/%Y"),
    # only include *Democratic* primary polls
    party == "DEM",
    # only include the selected candidates
    candidate_name %in% candidates,
    # only national polls
    is.na(state),
    # Exclude some head-to-head results, etc.
    notes %nin% c("head-to-head poll",
                  "HarrisX/SR Democrat LV, definite voter",
                  "open-ended question")
  ) %>%
  mutate(
    # Have to add 1 to the date to accommodate tsibble's yearweek()
    # starting on Monday rather than Sunday like our other data sources
    week = as.Date(yearweek(end_date + 1)) - 1,
    # Convert candidate names to factor so I can relabel them
    candidate_name = factor(candidate_name, levels = candidates, labels = candidates_clean)
  )

Now we aggregate by week, forming a weekly polling average by candidate. If we
were trying to build a forecast, we would do this in a better way that wouldn't
have so much variation. For now, all I do is weight the results by
(logged) sample size. Note that pct refers to the percentage of the "votes"
the candidate received in the poll.

polls_agg <- polls %>%    group_by(week, candidate_name) %>%    summarize(      pct_polls = weighted.mean(pct, log(sample_size))    )  

For a quick sanity check, let's plot these data to see if things line up (I omit the relatively lower-polling candidates for simplicity).

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(polls_agg, candidate_name %in% top_candidates),
       aes(x = week, y = pct_polls, color = candidate_name)) +
  geom_line() +
  theme_nice()


Okay, it's a bit more variable than other aggregators, but it's showing us the same basic trends.

Media

We have two data frames with media coverage info, cable_mentions and
online_mentions. These are in much better shape to begin with, but we do need
to combine them and make a couple of changes. Each row in these data represents
one candidate in one week, so there are weeks × candidates rows.

This is a good example of a time to use an inner join. Note that our key variables
are each candidate's share of all the news clips/articles that mention any
candidate. In other words, we're ignoring variation in how much the primary
gets discussed in the news and instead focusing on how big each candidate's
share of the coverage is.

media <-
  inner_join(cable_mentions, online_mentions, by = c("date", "name")) %>%
  mutate(
    # Create new variables that put the media coverage variables on
    # same scale as poll numbers
    pct_cable = 100 * pct_of_all_candidate_clips,
    pct_online = 100 * pct_of_all_candidate_stories
  )

Let's look at the trends for cable news…

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(media, name %in% top_candidates),
       aes(x = date, y = pct_cable, color = name)) +
  geom_line() +
  theme_nice()


This looks a bit similar to the polling trends, although more variable over
time.

And now online news…

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(media, name %in% top_candidates),
       aes(x = date, y = pct_online, color = name)) +
  geom_line() +
  theme_nice()


This one's a bit more all over the place, with the minor candidates especially
having higher highs.

Combine data

Now we just need to get all this information in the same place for analysis.
More inner joins!

joined <- inner_join(polls_agg, media,                        by = c("candidate_name" = "name", "week" = "date"))  

Now we have everything in a single data frame where each row represents one
week and one candidate. To make things work for statistical analysis, I'm going
to do a couple conversions — one to the panel_data format, from my
panelr package, and another to pdata.frame format, from the plm package.
We'll be using both packages for analysis.

library(panelr)
# panel_data needs a number or ordered factor as wave variable
joined$wave <- as.ordered(joined$week)
joined_panel <- panel_data(ungroup(joined), id = candidate_name, wave = wave)
joined_pdata <- as_pdata.frame(joined_panel)

Analysis

Okay, so we have multiple time series for each candidate: their status in the
polls, how much of the cable news coverage they're getting, and how
much of the online news coverage they're getting. We'd like to know whether
any of these are causing the others. Most interesting is whether the news
coverage drives better results in the polls.

The kind of analyses we can do all have in common the idea of comparing each
candidate to himself or herself in the past. If Elizabeth Warren's share of
news coverage goes from 10% to 12%, up 2 percentage points, where do we expect
her share in the polls to go? If her polling share then moves from 15% to 17%, that is
likewise a 2-percentage-point change. This is treated equivalently to Andrew Yang going
from 0% to 2% of news coverage and then seeing his polls go from 1% to 3%.

Of course, this still doesn't sort out the problem of reverse causality. If we
see that news coverage and polls change at the same time, it's not obvious
which caused the other (and we'll ignore the possibility that something else
caused both for the time being). There are several methods for dealing with
this and I'll focus on ones that use past values of polls to predict future
ones.

Fixed effects models

Fixed effects models are a common way to remove the influence of certain kinds
of confounding variables, like a candidate's pre-existing popularity. They
don't fix the problem of confounders that change over time (like a change in
the candidate's campaign strategy or a new scandal), but it's a workhorse
model for longitudinal data.

The process we're looking at is dynamic, meaning candidates' support
in the past affects the present; people don't pick their favorite candidate
every week, they have a favorite candidate who will remain in that position
unless something changes. We model this statistically by using last week's
polling average as a predictor of this week's polling average.
In the panel data literature, using so-called fixed effects models with a
lagged value of the dependent variable in the model is a big no-no. This is
because something called
Nickell bias,
which basically means that models like this give you wrong results in a
predictable way.

Luckily, these data are not quite the same as the kind that the Nickell bias
affects the most. We have 24 candidates with up to 38 weeks of data for each.
The Nickell bias tends to be most problematic when you have relatively few
time points and relatively many people (in this case candidates). So we'll
start with fixed effects models and assume the Nickell bias isn't too serious.

I'm going to use the wbm() function from my panelr package to do this
analysis.
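
In rough notation (a sketch of the specification; wbm()'s exact parameterization may differ slightly), the within (fixed effects) model being estimated is something like $\text{polls}_{it} = \alpha_i + \rho\,\text{polls}_{i,t-1} + \beta_1\,\text{cable}_{it} + \beta_2\,\text{cable}_{i,t-1} + \beta_3\,\text{online}_{it} + \beta_4\,\text{online}_{i,t-1} + \varepsilon_{it}$, where $\alpha_i$ is a candidate-specific intercept absorbing each candidate's stable popularity and $t$ indexes weeks.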

fix_mod <- wbm(pct_polls ~ lag(pct_polls) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online),
               data = joined_panel, model = "fixed")
summary(fix_mod)
MODEL INFO:  Entities: 24  Time periods: 2019-01-13-2019-09-15  Dependent variable: pct_polls  Model type: Linear mixed effects  Specification: within    MODEL FIT:  AIC = 2233.15, BIC = 2269.64  Pseudo-R² (fixed effects) = 0.03  Pseudo-R² (total) = 0.98  Entity ICC = 0.98    -------------------------------------------------------------                           Est.   S.E.   t val.     d.f.      p  --------------------- ------- ------ -------- -------- ------  (Intercept)              3.85   1.45     2.65    23.01   0.01  lag(pct_polls)           0.64   0.03    24.74   678.01   0.00  pct_cable                0.08   0.01     6.29   678.01   0.00  lag(pct_cable)           0.05   0.01     4.05   678.01   0.00  pct_online              -0.03   0.01    -2.29   678.01   0.02  lag(pct_online)         -0.02   0.01    -1.39   678.01   0.17  -------------------------------------------------------------    p values calculated using Satterthwaite d.f.     RANDOM EFFECTS:  ------------------------------------------       Group         Parameter    Std. Dev.   ---------------- ------------- -----------   candidate_name   (Intercept)     7.108         Residual                      1.007     ------------------------------------------  

Here's what the output is saying:

  • First of all, there's evidence of momentum. If your poll numbers went up
    last week, all else being equal they'll probably be up this week too.
  • Gains in cable news coverage both this week and last week are associated with
    gains in the polls this week.
  • Gains in online news coverage this week are associated (very weakly) with
    declines in the polls this week, assuming no change in cable news coverage.

I will note that as far as the online coverage is concerned, if I drop cable
news coverage from the model then suddenly online coverage appears to have a
positive effect. I think what's going on there is both online and cable news
cover candidates in a way that helps them, but online coverage is sometimes
harmful in a way that is not true of cable coverage. Either that or there's
just a lot more noise in the online data.

This was the simplest analysis I could do. I can also try to remove any trends
in the data, to account for something that isn't in the model that drives
some candidates up or down over time. Basically, for each candidate we subtract
their over-time trend from each week's polling numbers and news coverage and see
if deviations from their trend predict each other.

The risk with this approach is that, if it really is news that has most of the
influence, you're modeling away
some of the "real" effects along with the stuff you don't want around.

fix_mod <- wbm(pct_polls ~ lag(pct_polls) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online),
               data = joined_panel, model = "fixed",
               detrend = TRUE)
summary(fix_mod)
MODEL INFO:  Entities: 24  Time periods: 2019-01-13-2019-09-15  Dependent variable: pct_polls  Model type: Linear mixed effects  Specification: within    MODEL FIT:  AIC = 2169.99, BIC = 2206.48  Pseudo-R² (fixed effects) = 0.91  Pseudo-R² (total) = 0.97  Entity ICC = 0.7    -------------------------------------------------------------                           Est.   S.E.   t val.     d.f.      p  --------------------- ------- ------ -------- -------- ------  (Intercept)              1.01   0.34     2.95    16.39   0.01  lag(pct_polls)           0.69   0.02    29.60   339.36   0.00  pct_cable                0.08   0.01     6.43   683.75   0.00  lag(pct_cable)           0.04   0.01     3.52   688.28   0.00  pct_online              -0.03   0.01    -2.11   692.90   0.04  lag(pct_online)         -0.01   0.01    -1.00   690.93   0.32  -------------------------------------------------------------    p values calculated using Satterthwaite d.f.     RANDOM EFFECTS:  ------------------------------------------       Group         Parameter    Std. Dev.   ---------------- ------------- -----------   candidate_name   (Intercept)     1.559         Residual                      1.012     ------------------------------------------  

Okay, same story here. Some good evidence of cable news helping and some very
weak evidence of online news possibly hurting.

Driven by minor candidates?

Responding to Grossmann's tweet, Jonathan Ladd raised an interesting question: is this effect driven mostly by the lower-polling, minor candidates rather than by frontrunners like Biden?

There are a couple of ways to look at this. First of all, let's think about
this as less of a Biden vs. all others phenomenon and more about whether this
effect of news on candidate support is concentrated among those with relatively
low support.

We can deal with this via an interaction effect, seeing whether the effects
are stronger or weaker among candidates with higher/lower absolute levels of
support. I need to fit a slightly different model here to accommodate the
inclusion of the lagged dependent variable without subtracting its mean (as is
done for the conventional fixed effects analysis). Our focus will be on the
"within" effects and cross-level interactions in the output below.

int_mod <- wbm(pct_polls ~
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online) | lag(pct_polls) |
                 lag(pct_polls) * pct_cable +
                 lag(pct_polls) * lag(pct_cable) +
                 lag(pct_polls) * pct_online +
                 lag(pct_polls) * lag(pct_online),
               data = joined_panel, model = "w-b")
summary(int_mod)
MODEL INFO:  Entities: 24  Time periods: 2019-01-13-2019-09-15  Dependent variable: pct_polls  Model type: Linear mixed effects  Specification: within-between    MODEL FIT:  AIC = 2109.28, BIC = 2173.14  Pseudo-R² (fixed effects) = 0.98  Pseudo-R² (total) = 0.98  Entity ICC = 0.21    WITHIN EFFECTS:  -------------------------------------------------------------                           Est.   S.E.   t val.     d.f.      p  --------------------- ------- ------ -------- -------- ------  pct_cable                0.09   0.02     4.68   672.05   0.00  lag(pct_cable)           0.02   0.02     1.20   671.39   0.23  pct_online              -0.01   0.02    -0.54   673.16   0.59  lag(pct_online)          0.04   0.02     2.55   672.97   0.01  -------------------------------------------------------------    BETWEEN EFFECTS:  ---------------------------------------------------------------                             Est.   S.E.   t val.     d.f.      p  ----------------------- ------- ------ -------- -------- ------  (Intercept)               -0.16   0.17    -0.92    18.51   0.37  imean(pct_cable)           0.31   0.03     9.27    38.40   0.00  imean(pct_online)          0.00   0.02     0.03    17.29   0.98  lag(pct_polls)             0.63   0.02    26.58   589.12   0.00  ---------------------------------------------------------------    CROSS-LEVEL INTERACTIONS:  ----------------------------------------------------------------------------                                          Est.   S.E.   t val.     d.f.      p  ------------------------------------ ------- ------ -------- -------- ------  pct_cable:lag(pct_polls)                0.00   0.00     0.43   671.50   0.67  lag(pct_cable):lag(pct_polls)           0.00   0.00     4.05   674.01   0.00  pct_online:lag(pct_polls)              -0.00   0.00    -2.20   671.51   0.03  lag(pct_online):lag(pct_polls)         -0.01   0.00    -5.66   674.15   0.00  ----------------------------------------------------------------------------    p values calculated using Satterthwaite d.f.     RANDOM EFFECTS:  ------------------------------------------       Group         Parameter    Std. Dev.   ---------------- ------------- -----------   candidate_name   (Intercept)    0.4972         Residual                      0.955     ------------------------------------------  

Okay so there's a lot going on here. First of all, we see that the instantaneous
effect of changes in cable news coverage does not appear to depend on the
candidate's previous standing in the polls. For the other interaction terms,
we have some evidence of the effects changing depending on the candidate's
standing in the polls.

Let's examine them one by one, with help from
my interactions package. I'll show predicted values of poll numbers depending
on different values of news coverage to give a gist of what's going on.

Last week's cable news coverage

Each line shows the predicted standing in this week's polls as a function of last
week's cable news coverage, for a given level of last week's standing in the polls.
What we really care about is the slope of the lines.

library(interactions)
interact_plot(int_mod, `lag(pct_cable)`, `lag(pct_polls)`,
              modx.values = c(2, 10, 20),
              x.label = "Last week's % change in cable news coverage",
              y.label = "This week's polling average",
              legend.main = "Last week's polling average")


So what we see here is that the higher a candidate's standing in the polls,
the more they benefit from news coverage! This stands somewhat in
contradiction to Ladd's speculation. Another way to think about it is that
these changes in news coverage tend to have more staying power for candidates
with more support.

Last week's online coverage

For last week's online coverage, we see in the model output that for a candidate
with hypothetical zero polling support, increases in online news coverage are
good for future polling, but there's a negative interaction term. Let's look
at how that plays out.

interact_plot(int_mod, `lag(pct_online)`, `lag(pct_polls)`,                modx.values = c(2, 10, 20),                 x.label = "Last week's % change in online news coverage",                y.label = "This week's polling average",                legend.main = "Last week's polling average")  


Here we see that for higher polling candidates, the lagged changes in
online coverage are a detriment while for lower polling candidates, such changes
are a much-needed (small) boost.

This week's online coverage

Let's do the same test with the effect of this week's online coverage on
this week's polls.

interact_plot(int_mod, pct_online, `lag(pct_polls)`,                modx.values = c(2, 10, 20),                 x.label = "This week's % change in online news coverage",                y.label = "This week's polling average",                legend.main = "Last week's polling average")  


Quite similar to last week's online coverage, except not even the low-polling
candidates seem to benefit.

Just drop Biden from the analysis

Another thing we can do is just drop Biden, who for most of the campaign cycle
has dominated the polls and news coverage.

no_biden <- wbm(pct_polls ~ lag(pct_polls) +
                  pct_cable + lag(pct_cable) +
                  pct_online + lag(pct_online),
                data = filter(joined_panel, candidate_name != "Joe Biden"),
                model = "fixed")
summary(no_biden)
MODEL INFO:  Entities: 23  Time periods: 2019-01-13-2019-09-15  Dependent variable: pct_polls  Model type: Linear mixed effects  Specification: within    MODEL FIT:  AIC = 1879.42, BIC = 1915.49  Pseudo-R² (fixed effects) = 0.07  Pseudo-R² (total) = 0.97  Entity ICC = 0.96    -------------------------------------------------------------                           Est.   S.E.   t val.     d.f.      p  --------------------- ------- ------ -------- -------- ------  (Intercept)              2.70   0.92     2.93    22.01   0.01  lag(pct_polls)           0.65   0.02    26.80   643.01   0.00  pct_cable                0.10   0.01     7.91   643.02   0.00  lag(pct_cable)           0.04   0.01     2.98   643.01   0.00  pct_online              -0.02   0.01    -1.92   643.02   0.05  lag(pct_online)          0.01   0.01     1.30   643.01   0.20  -------------------------------------------------------------    p values calculated using Satterthwaite d.f.     RANDOM EFFECTS:  ------------------------------------------       Group         Parameter    Std. Dev.   ---------------- ------------- -----------   candidate_name   (Intercept)     4.414         Residual                     0.8483     ------------------------------------------  

And in this case, the results are basically the same, although the benefits of
news coverage are perhaps a bit stronger.

A more advanced model

Let's push a bit further to make sure we're not making a mistake on the basic
claim that (cable) news coverage appears to be beneficial. A more robust
approach is to use an analysis that more deliberately addresses these issues
of reverse causality and endogeneity.

Normally, I'd reach for the dynamic panel
models featured in my dpm package, but these can't handle data with so many
time points and so few people. Instead, I'll use the
Arellano-Bond estimator [1],
which the models in dpm were meant to replace — they are both unbiased,
but Arellano-Bond models tend to be inefficient. In other words, this method
is more conservative.

For this, I need the plm package and its pgmm() function. I'll skip the
technicalities and just say the interpretations will be similar to what I just
did, but the underlying algorithm is more rigorous at ruling out reverse
causality.

library(plm)
ab_mod <- pgmm(pct_polls ~ lag(pct_polls, 1) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online) |
                 lag(pct_polls, 2:15),
               data = joined_pdata, effect = "individual", model = "twosteps",
               transformation = "ld")
summary(ab_mod)
Oneway (individual) effect Two steps model    Call:  pgmm(formula = pct_polls ~ lag(pct_polls, 1) + pct_cable + lag(pct_cable) +       pct_online + lag(pct_online) | lag(pct_polls, 2:15), data = joined_pdata,       effect = "individual", model = "twosteps", transformation = "ld")    Unbalanced Panel: n = 24, T = 11-37, N = 731    Number of Observations Used: 1361    Residuals:       Min.   1st Qu.    Median      Mean   3rd Qu.      Max.   -11.44106  -0.27460   0.00000  -0.00144   0.25889   9.50637     Coefficients:                      Estimate Std. Error z-value  Pr(>|z|)      lag(pct_polls, 1)  0.8953903  0.0196164 45.6449 < 2.2e-16 ***  pct_cable          0.0741411  0.0165264  4.4862  7.25e-06 ***  lag(pct_cable)     0.0109705  0.0065706  1.6696   0.09499 .    pct_online        -0.0108026  0.0120645 -0.8954   0.37057      lag(pct_online)    0.0095829  0.0140126  0.6839   0.49405      ---  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    Sargan test: chisq(437) = 18.55703 (p-value = 1)  Autocorrelation test (1): normal = -2.282824 (p-value = 0.022441)  Autocorrelation test (2): normal = -1.182395 (p-value = 0.23705)  Wald test for coefficients: chisq(5) = 146865.5 (p-value = < 2.22e-16)  

Okay, so what does this all mean? Basically, it's the same story we saw with the
other, simpler analyses: contemporaneous changes in cable news coverage predict
gains in the polls, while the online coverage terms are not clearly distinguishable
from zero.

Conclusions

Does news coverage help candidates in the Democratic primary race? Probably.
There are some limitations of the analyses at hand. It is possible, for
instance, that there is something else driving both the news coverage and the polls. In fact,
that is likely — early on, it appeared Elizabeth Warren drove news coverage
by releasing new policy proposals on a fairly frequent schedule. Did the
policy proposals themselves boost her support rather than the news coverage of
them? That's hard to separate, especially given the kind of bird's-eye view
we're taking here. We're not examining what is actually in the news coverage.

Matt Grossmann suggested sentiment analysis as a next step, and that's probably a wise choice. Maybe once I'm off the job market! 😄


  1. Actually, I'll use the Blundell-Bond estimator, which is a tweaked version that is a bit more efficient.

To leave a comment for the author, please follow the link and comment on their blog: R on Jacob Long.

