[R-bloggers] EARL London 2019 Conference Recap (and 8 more aRticles)
- EARL London 2019 Conference Recap
- Meetup Recap: Survey and Measure Development in R
- Cleaning Anomalies to Reduce Forecast Error by 9% with anomalize
- Understanding Bootstrap Confidence Interval Output from the R boot Package
- More models, more features: what’s new in ‘parameters’ 0.2.0
- bamlss: A Lego Toolbox for Flexible Bayesian Regression
- Getting started with {golem}
- Tidy forecasting in R
- Does news coverage boost support for presidential candidates in the Democratic primary?
EARL London 2019 Conference Recap Posted: 30 Sep 2019 04:12 AM PDT [This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers]. I had an awesome time at the Enterprise Applications of the R Language (EARL) Conference held in London in September 2019. EARL reminded me that it is good to keep showing up at conferences. The first thing I heard as I entered was the organisers at the welcome table greeting me: "Damian, is that you? Awesome to see you again!" It feels great to be part of the amazing R community. Within minutes I had already met a couple of people. I like the vibe at EARLs. I ran into people from other conferences who remembered me from Insurance Data Science in Zurich and from EARL 2018 in Seattle.
At EARL 2018 in Seattle I presented on analyzing satellite imagery with deep learning networks, but this time I was a pure attendee, which was a different sort of intense and fantastic. The conference kicked off with a keynote by Julia Silge, the Jane Austen-loving astrophysicist and author of "Text Mining with R." Julia shared very interesting insights about data scientists all over the world, gathered from a huge Stack Overflow survey.
I have worked on 50-something Shiny dashboard applications, but I always learn something new at conferences. Christel Swift gave an interesting presentation about using Shiny at the BBC. I also picked up quite a few tricks during the workshop on R Markdown and interactive dashboards that took place the day before the conference.
I didn't know that it can take 15 years to develop a new drug. I heard a fascinating presentation by Bayer on where data science meets the world of pharmaceuticals. The topic is close to home for us, as we also face many challenges when working with data in the pharma industry.
I thought this slide was hilarious:
Before, between, and after the presentations I had many fascinating conversations, some of which continued into the wee hours. I think everyone, even competitors, recognized that there was much to gain from sharing information and bouncing ideas off each other. Many people I met were just starting with R and introducing it in their companies. I heard a lot of questions about using R in production – our story about a big pharma company that introduced R with our support fit in perfectly. Note to self: pick up a Catch Box for when we host a conference. You can throw it to people instead of awkwardly leaning over crowds trying to hand them a microphone. It was entertaining every time the Catch Box was tossed to an audience member.
We are in the middle of the WhyR Warsaw conference at the moment. We're excited to host Dr. Kenneth Benoit from the London School of Economics, creator of the quanteda R package. I will co-present with him on the topic of Natural Language Processing for non-programmers. But that is a post for another time! Thanks for stopping by. Questions? Comments? You can find me on Twitter @D_Rodziewicz.
Article EARL London 2019 Conference Recap comes from Appsilon Data Science | End to End Data Science Solutions.
Meetup Recap: Survey and Measure Development in R Posted: 30 Sep 2019 03:00 AM PDT [This article was first published on George J. Mount, and kindly contributed to R-bloggers]. Have you ever taken a survey at the doctor's office or for a job interview and wondered what exactly that data was used for? There is a long-standing set of methodologies, many coming from psychology, for reliably measuring "latent" traits, such as depression or loyalty, from self-report survey data. While measurement is a common method in fields ranging from kinesiology to education, it is usually conducted in proprietary tools like SPSS or MPlus. There is really not much training available online for survey development in R, but R is more than capable of conducting it through dedicated packages. Over the summer, I presented the following hour-long workshop on survey development in R to the Greater Cleveland R Users meetup group. The video and slides are below, and you can also access all code, files, and assets used in the presentation on RStudio Cloud. To learn more about survey development, check out my DataCamp course, "Survey and Measure Development in R." The first chapter is free. Is your meetup looking to learn about survey measurement? Would your organization like to build validated survey instruments? Does your organization already do this work, but want to move to R? Drop me a line.
Cleaning Anomalies to Reduce Forecast Error by 9% with anomalize Posted: 29 Sep 2019 07:45 PM PDT [This article was first published on business-science.io, and kindly contributed to R-bloggers]. In this tutorial, we'll show how cleaning anomalies in the training data can reduce forecast error by about 9%. The main R package covered is anomalize, which provides tools for detecting and cleaning time series anomalies.
Cleaning Anomalies to Reduce Forecast Error by 9%
We can often improve forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for the anomaly-cleaning functionality in anomalize.

Forecast Workflow
We'll use the following workflow to remove time series anomalies prior to forecasting.
Step 1 – Load Libraries
First, load the required libraries – anomalize and the tidyverse – to follow along.

Step 2 – Get the Data
This tutorial uses daily CRAN download counts. Let's take one package with some extreme events: we'll hone in on downloads of the lubridate package. Here's the setup of the forecast experiment: training data will be any data before "2018-01-01", and the remaining data is held out for testing.

Step 3 – Workflow for Cleaning Anomalies
The workflow to clean anomalies is to decompose the series, flag anomalies in the remainder, and replace the anomalous observations with cleaned values, as sketched below.
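The original code chunks did not survive the email formatting; as a rough sketch (not the post's exact code), the cleaning step with anomalize looks something like the following. The tibble name downloads_tbl and its date/count columns are assumptions.

```r
library(tidyverse)
library(anomalize)  # provides time_decompose(), anomalize(), clean_anomalies()

# `downloads_tbl` stands in for a tibble of daily download counts with
# columns `date` and `count` (names assumed, not from the original post).
downloads_cleaned <- downloads_tbl %>%
  time_decompose(count, method = "stl") %>%  # split into trend/season/remainder
  anomalize(remainder, method = "iqr") %>%   # flag anomalous remainder values
  clean_anomalies()                          # adds an "observed_cleaned" column
```

The "observed" and "observed_cleaned" columns referred to below come from this last step.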
Here's a visual of the "observed" (uncleaned) vs. the "observed_cleaned" (cleaned) training sets. We'll see what influence these anomalies have on a forecast regression (next).

Step 4 – Forecasting Downloads of the Lubridate Package
First, we'll make a helper function that trains on the training window, forecasts the test window, and reports the forecast error (see the Appendix for a sketch).

Step 4.1 – Before Cleaning with anomalize
We'll first perform a forecast without cleaning anomalies (high leverage points).
Forecast vs Actual Values
The forecast is overplotted against the actual values. We can see that the forecast is shifted vertically, an effect of the high leverage points.

Forecast Error Calculation
The mean absolute error (MAE) is 1570, meaning that on average the forecast is off by 1570 downloads each day.

Step 4.2 – After Cleaning with anomalize
We'll next perform the same forecast, this time using the repaired training data (the "observed_cleaned" column).
Forecast vs Actual Values
The forecast is overplotted against the actual values, with the cleaned data shown in yellow. Zooming in on the forecast region, we can see that the forecast does a better job of following the trend in the test data.

Forecast Error Calculation
The mean absolute error (MAE) is 1435, meaning that on average the forecast is off by 1435 downloads each day.

8.6% Reduction in Forecast Error
Using the cleaned training data reduces the MAE from 1570 to 1435, an 8.6% reduction in forecast error.

Conclusion
Forecasting with cleaned anomalies is a good practice that can substantially improve forecasting accuracy by removing high leverage points.

Data Science Training
Interested in Learning Anomaly Detection?
Business Science offers two 1-hour labs on Anomaly Detection.
Interested in Improving Your Forecasting?
Business Science offers a 1-hour lab on increasing forecasting accuracy.
Interested in Becoming an Expert in Data Science for Business?
Business Science offers a 3-course Data Science for Business R-Track designed to take students from no experience to expert data scientist (advanced machine learning and web application development) in under 6 months.
Appendix – Forecast Downloads Function
The helper function used in Step 4 handles the train/test split, fits the forecast model, and produces the predictions used for the MAE calculations above; a rough stand-in (not the original implementation) is sketched below.
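This sketch assumes the forecast model is a simple linear regression on trend, month, and weekday; the function name `forecast_mae()` and the column names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in: train on data before `split_date`, forecast the test
# window with a trend + month + weekday regression, and return the MAE.
forecast_mae <- function(data, target, split_date = "2018-01-01") {
  data <- data %>%
    mutate(
      value   = {{ target }},
      trend   = as.numeric(date),
      month   = factor(month(date)),
      weekday = factor(wday(date))
    )
  train <- filter(data, date <  as.Date(split_date))
  test  <- filter(data, date >= as.Date(split_date))

  fit  <- lm(value ~ trend + month + weekday, data = train)
  pred <- predict(fit, newdata = test)

  mean(abs(test$value - pred))  # mean absolute error on the test window
}

# forecast_mae(downloads_cleaned, observed)          # before cleaning
# forecast_mae(downloads_cleaned, observed_cleaned)  # after cleaning
```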
Understanding Bootstrap Confidence Interval Output from the R boot Package Posted: 29 Sep 2019 05:00 PM PDT [This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers].

Nuances of Bootstrapping
Most applied statisticians and data scientists understand that bootstrapping is a method that mimics repeated sampling by drawing some number of new samples (with replacement) from the original sample in order to perform inference. However, it can be difficult to understand output from the software that carries out the bootstrapping without a more nuanced understanding of how uncertainty is quantified from bootstrap samples. To demonstrate the possible sources of confusion, start with the data described in Efron and Tibshirani's (1993) text on bootstrapping (page 19): 15 paired observations of student LSAT scores and GPAs. We want to estimate the correlation between LSAT and GPA scores.
The correlation turns out to be 0.776. For reasons we'll explore, we want to use the nonparametric bootstrap to get a confidence interval around our estimate of \(r\). We do so using the boot package, which involves three steps: (1) write a function that computes the statistic of interest from a resampled version of the data, (2) pass that function to boot() to generate the bootstrap replicates, and (3) pass the resulting object to boot.ci() to obtain confidence intervals.
For step 1, a function of the kind sketched below is created, and steps 2 and 3 are carried out with boot() and boot.ci(). Looking at the boot.ci() output, we see several different intervals (normal, basic, percentile, and BCa) along with a note that studentized intervals require bootstrap variances.
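The post's code was stripped in the email version; a minimal sketch of the three steps looks like this. Here `law` stands for a data frame holding the 15 paired observations in columns `LSAT` and `GPA` (an assumption, since the original data table did not survive).

```r
library(boot)

# Step 1: a function that computes the statistic from a resampled data set.
cor_fun <- function(data, indices) {
  cor(data$LSAT[indices], data$GPA[indices])
}

# Steps 2 and 3: generate bootstrap replicates, then request intervals.
set.seed(2019)
boot_out <- boot(data = law, statistic = cor_fun, R = 500)
boot.ci(boot_out)  # normal, basic, percentile, and BCa intervals
```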
To understand this output, let's review statistical inference, confidence intervals, and the bootstrap.

Statistical Inference
The usual test statistic for determining if \(r \neq 0\) is:

\[ t = \frac{r}{SE_r} \]

where

\[ SE_r = \sqrt{\frac{1 - r^2}{n - 2}} \]

In our case:

\[ SE_r = \sqrt{\frac{1 - 0.776^2}{15 - 2}} \approx 0.175 \]

Dividing \(r\) by \(SE_r\) yields our \(t\) statistic,

\[ t = \frac{0.776}{0.175} \approx 4.4 \]

We compare this to a \(t\) distribution with \(n-2 = 13\) degrees of freedom and easily find it to be significant. In words: If the null hypothesis were true, and we repeatedly draw samples of size \(n\), and we calculate \(r\) each time, then the probability that we would observe an estimate of \(|r| = 0.776\) or larger is less than 5%. An important caveat: the above formula for the standard error is only correct when \(r = 0\). The closer we get to \(\pm 1\), the less correct it is.

Confidence Intervals
We can see why the standard error formula above becomes less correct the further we get from zero by considering the 95% confidence interval for our estimate. The usual formula you see for a confidence interval is the estimate plus or minus the 97.5th percentile of the normal or \(t\) distribution times the standard error. In this case, the \(t\)-based formula would be:

\[ r \pm t_{0.975,\,13} \times SE_r \]

If we were to sample 15 students repeatedly from the population and calculate this confidence interval each time, the interval should include the true population value 95% of the time. So what happens if we use the standard formula for the confidence interval?

\[ 0.776 \pm 2.16 \times 0.175 = [0.398,\ 1.154] \]

Recall that correlations are bounded in the range \([-1, +1]\), but our 95% confidence interval contains values greater than one! Alternatives include transforming the correlation to an unbounded scale (the Fisher-\(z\) transformation discussed later in this post) or using the bootstrap.
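To make the hand calculations above concrete, here is a quick sketch using the same assumed `law` data frame:

```r
# Textbook inference for the correlation (standard error valid only near r = 0).
r  <- cor(law$LSAT, law$GPA)       # about 0.776
n  <- nrow(law)                    # 15
se <- sqrt((1 - r^2) / (n - 2))    # null-hypothesis standard error

r / se                                       # t statistic on n - 2 df
r + c(-1, 1) * qt(0.975, df = n - 2) * se    # naive 95% CI; upper limit > 1
```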
The next sections review the nonparametric and parametric bootstrap.

Nonparametric Bootstrap
We do not know the true population distribution of LSAT and GPA scores. What we have instead is our sample. Just like we can use our sample mean as an estimate of the population mean, we can use our sample distribution as an estimate of the population distribution. In the absence of supplementary information about the population (e.g. that it follows a specific distribution like the bivariate normal), the empirical distribution from our sample contains as much information about the population distribution as we can get. If statistical inference is typically defined by repeated sampling from a population, and our sample provides a good estimate of the population distribution, we can conduct inferential tasks by repeatedly sampling from our sample. (Nonparametric) bootstrapping thus works as follows for a sample of size N:

1. Draw a sample of size N, with replacement, from the original sample.
2. Calculate the statistic of interest on this bootstrap sample.
3. Repeat steps 1 and 2 many times. The distribution of the statistic across the bootstrap samples serves as an estimate of its sampling distribution.
Note that the sampling is done with replacement. As an aside, most results from traditional statistics are based on the assumption of random sampling with replacement. Usually, the population we sample from is large enough that we do not bother noting the "with replacement" part. If the sample is large relative to the population, and sampling without replacement is used, we would typically be advised to use a finite population correction. This is just to say that the "with replacement" requirement is a standard part of the definition of random sampling. Let's take our data as an example. We will draw 500 bootstrap samples, each of size \(n = 15\), chosen with replacement from our original data. Note a few things about the distribution of the correlation across these repeated samples: it is skewed to the left rather than normal (the correlation cannot exceed +1), its mean differs slightly from the original sample estimate (that difference is the bootstrap estimate of the bias), and its standard deviation serves as the bootstrap estimate of the standard error.
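A by-hand version of the resampling scheme just described (again using the assumed `law` data frame) makes the bias and standard error estimates explicit:

```r
set.seed(2019)
boot_cors <- replicate(500, {
  idx <- sample(nrow(law), replace = TRUE)   # resample rows with replacement
  cor(law$LSAT[idx], law$GPA[idx])
})

hist(boot_cors, breaks = 30)                 # left-skewed, bounded above by +1
mean(boot_cors) - cor(law$LSAT, law$GPA)     # bootstrap estimate of bias
sd(boot_cors)                                # bootstrap estimate of the SE
```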
The non-normality of the sampling distribution means that, if we divide \(r\) by the bootstrap standard error, we will not get a statistic that is distributed standard normal or \(t\). Instead, we decide that it is a better idea to summarize our uncertainty using a confidence interval. Yet we also want to make sure that our confidence intervals are bounded within the \([-1, +1]\) range, so the usual formula will not work. Before turning to different methods for obtaining bootstrap confidence intervals, for completeness the next section describes the parametric bootstrap.

Parametric Bootstrap
The prior section noted that, in the absence of supplementary information about the population, the empirical distribution from our sample contains as much information about the population distribution as we can get. An example of supplementary information that may improve our estimates would be knowing that the LSAT and GPA scores are distributed bivariate normal. If we are willing to make this assumption, we can use our sample to estimate the distribution parameters – the means, standard deviations, and correlation of LSAT and GPA. We can then draw 500 random samples of size 15 from this specific bivariate normal distribution and calculate the correlation between the two variables for each. The distribution of the correlation estimates across the 500 samples represents our parametric bootstrap sampling distribution. The average correlation across the 500 samples was 0.767, and the standard deviation (our estimate of the standard error) was 0.111. This is smaller than our nonparametric bootstrap estimate of the standard error, 0.131, reflecting the fact that our assumed knowledge of the population distribution gives us more information, which in turn reduces sampling variability. Of course, we often will not feel comfortable saying that the population distribution follows a well-defined shape, and hence we will typically default to the nonparametric version of the bootstrap.

Bootstrap Confidence Intervals
Recall that the usual formula for estimating a confidence interval around a statistic \(\theta\) is something like:

\[ \hat{\theta} \pm q_{1-\alpha/2} \times \widehat{SE}_{\hat{\theta}} \]

We saw that using the textbook standard error estimate for a correlation led us astray because we ended up with an interval outside the range of plausible values. There are a variety of alternative approaches to calculating confidence intervals based on the bootstrap.

Standard Normal Interval
The first approach starts with the usual formula for calculating a confidence interval, using the normal distribution value of 1.96 as the multiplier of the standard error. However, there are two differences. First, we use our bootstrap estimate of the standard error in the formula. Second, we make an adjustment for the estimated bias, -0.005. In our example, we get:

\[ \big(0.776 - (-0.005)\big) \pm 1.96 \times 0.131 = [0.524,\ 1.038] \]

This matches R's output (given that our hand calculations did some rounding along the way). Problems: this interval assumes the bootstrap distribution is approximately normal, which ours is not, and it can again produce limits outside the \([-1, +1]\) range, as it does here.
We generally won't use this method.

Studentized (t) Intervals
Recall that, when we calculate a \(t\)-statistic, we mean-center the original statistic and divide by the sample estimate of the standard error. That is,

\[ t = \frac{\hat{\theta} - \theta}{\widehat{SE}_{\hat{\theta}}} \]

where \(\hat{\theta}\) is the sample estimate of the statistic, \(\theta\) is the "true" population value (which we get from our null hypothesis), and \(\widehat{SE}_{\hat{\theta}}\) is the sample estimate of the standard error. There is an analog to this process for bootstrap samples. In the bootstrap world, we can convert each bootstrap sample into a \(t\)-score as follows:

\[ t^* = \frac{\tilde{\theta} - \hat{\theta}}{\widehat{SE}_{\tilde{\theta}}} \]

Here \(\tilde{\theta}\) is the statistic estimated from a single bootstrap sample, and \(\hat{\theta}\) is the estimate from the original (non-bootstrap) sample. But where does \(\widehat{SE}_{\tilde{\theta}}\) come from? Just like for a \(t\)-test, where we estimated the standard error using our one sample, we estimate the standard error separately for each bootstrap sample. That is, we need an estimate of each bootstrap sample's variance. (Recall the message from the R output above.) If we are lucky enough to have a formula for a sample standard error, we use that in each sample. For the mean, each bootstrap sample would simply return its own \(s^*/\sqrt{n}\), the standard deviation within that bootstrap sample divided by the square root of the sample size.
We don't have such a formula that works for any correlation, so we need another means of estimating the variance. The delta method is one choice. Alternatively, there is the nested bootstrap. The nested bootstrap algorithm is:

1. Draw a bootstrap sample and calculate \(\tilde{\theta}\).
2. From that bootstrap sample, draw a set of second-level bootstrap samples and calculate the statistic on each; the variance of these second-level estimates is the variance estimate for \(\tilde{\theta}\).
3. Use \(\tilde{\theta}\) and its estimated standard error to form the \(t\)-score for that bootstrap sample.
4. Repeat for each first-level bootstrap sample.
We now have the information we need to calculate the studentized confidence interval. The formula for the studentized bootstrap confidence interval is:

\[ \left[\hat{\theta} - q_{1-\alpha/2} \times \widehat{SE}_{\hat{\theta}},\ \ \hat{\theta} - q_{\alpha/2} \times \widehat{SE}_{\hat{\theta}}\right] \]

The terms are: \(\hat{\theta}\), the estimate from the original sample; \(\widehat{SE}_{\hat{\theta}}\), the bootstrap estimate of its standard error; and \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\), quantiles of the bootstrap distribution of \(t\)-scores.
For each bootstrap sample, we calculated a \(t\) statistic. The \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\) values are identified by taking the appropriate quantiles of these \(t\) estimates. This is akin to creating our own table of \(t\)-statistics, rather than using the typical tables for the \(t\) distribution you'd find in textbooks. What does this look like in R? We need a second function for bootstrapping inside our bootstrap, one that returns both the correlation and its estimated variance (a sketch follows below). We now have our bootstrap estimates of \(t\), and we can plug the quantiles of their distribution into the formula. We find that \(q_{1-\alpha/2} = 8.137\) and that \(q_{\alpha/2} = -1.6\). Substituting:

\[ \left[0.776 - 8.137 \times 0.131,\ \ 0.776 - (-1.6) \times 0.131\right] = [-0.29,\ 0.99] \]

We can check these by-hand calculations against the studentized interval reported by boot.ci(). Problems: the nested bootstrap is computationally expensive, the resulting intervals can be erratic for statistics like the correlation, and they are still not guaranteed to respect the \([-1, +1]\) bounds.
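A sketch of the "bootstrap inside the bootstrap" described above: the statistic function returns both the correlation and a variance estimate from an inner resampling loop, which is what boot.ci() needs for type = "stud". This is an illustration of the approach, not the post's exact code.

```r
cor_fun_nested <- function(data, indices) {
  d <- data[indices, ]
  r <- cor(d$LSAT, d$GPA)
  inner <- replicate(100, {                  # inner (nested) bootstrap
    idx <- sample(nrow(d), replace = TRUE)
    cor(d$LSAT[idx], d$GPA[idx])
  })
  c(r, var(inner))                           # estimate and its variance
}

set.seed(2019)
boot_out_t <- boot(data = law, statistic = cor_fun_nested, R = 500)
boot.ci(boot_out_t, type = "stud")           # studentized (bootstrap-t) interval
```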
Basic Bootstrap Confidence Interval
Another way of writing a confidence interval is as a probability statement about the parameter falling between two quantiles:

\[ P\left(q_{\alpha/2} \leq \theta \leq q_{1-\alpha/2}\right) = 1 - \alpha \]

In non-bootstrap confidence intervals, \(\theta\) is a fixed value while the lower and upper limits vary by sample. In the basic bootstrap, we flip what is random in the probability statement. Define \(\tilde{\theta}\) as a statistic estimated from a bootstrap sample. We can write

\[ P\left(q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}\right) = 1 - \alpha \]

Recall that the bias of a statistic is the difference between its expected value (mean) across many samples and the true population value:

\[ \text{bias} = \mathbb{E}(\hat{\theta}) - \theta \]

We estimate this using our bootstrap samples, \(\mathbb{E}(\tilde{\theta}) - \hat{\theta}\), where \(\hat{\theta}\) is the estimate from the original sample (before bootstrapping). We can add the bias correction to the inequality, treating \(\tilde{\theta} - \hat{\theta}\) as a stand-in for \(\hat{\theta} - \theta\):

\[ P\left(q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta \leq q_{1-\alpha/2} - \hat{\theta}\right) = 1 - \alpha \]

Some more algebra eventually leads to:

\[ P\left(2\hat{\theta} - q_{1-\alpha/2} \leq \theta \leq 2\hat{\theta} - q_{\alpha/2}\right) = 1 - \alpha \]

The limits in this last statement give our formula for the basic bootstrap confidence interval. Because we started out with \(\tilde{\theta}\) as the random variable, we can use our bootstrap quantiles for the values of \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\). To do so, arrange the estimates in order from lowest to highest, then use a percentile function to find the values at the 2.5th and 97.5th percentiles (given two-tailed \(\alpha = .05\)). We find that \(q_{1-\alpha/2} = 0.962\) and that \(q_{\alpha/2} = 0.461\). Substituting into the interval:

\[ \left[2(0.776) - 0.962,\ \ 2(0.776) - 0.461\right] = [0.59,\ 1.091] \]

The basic bootstrap interval is \([0.59, 1.091]\), which we can confirm against the boot.ci() output. But we're still outside the range we want.

Percentile Confidence Intervals
Here's an easy solution. Line up the bootstrap estimates from lowest to highest, then take the 2.5th and 97.5th percentiles directly: approximately \([0.461, 0.962]\). (The slight difference from the boot.ci() percentile interval is due to how boot interpolates quantiles.) Looks like we have a winner: this interval will necessarily be limited to the range of plausible values. But let's look at one other.

Bias Corrected and Accelerated (BCa) Confidence Intervals
BCa intervals require estimating two terms: a bias term and an acceleration term. Bias is by now a familiar concept, though the calculation for the BCa interval is a little different. For BCa confidence intervals, estimate the bias correction term, \(\hat{z}_0\), as follows:

\[ \hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\tilde{\theta} < \hat{\theta}\}}{B}\right) \]

where \(\#\) is the counting operator and \(B\) is the number of bootstrap samples. The formula looks complicated but can be thought of as estimating something close to the median bias transformed into normal deviates (\(\Phi^{-1}\) is the inverse standard normal cdf). The acceleration term is estimated as follows:

\[ \hat{a} = \frac{\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^3}{6\left[\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^2\right]^{3/2}} \]

where \(\hat{\theta}_{(i)}\) is the estimate after deleting the \(i\)th case and \(\hat{\theta}_{(\cdot)}\) is the mean of these leave-one-out estimates. The process of estimating a statistic \(n\) times, each time dropping one of the \(n\) observations, is known as the jackknife. The purpose of the acceleration term is to account for situations in which the standard error of an estimator changes depending on the true population value. This is exactly what happens with the correlation (the SE estimator we provided at the start of the post only works when \(r = 0\)). An equivalent way of thinking about this is that it accounts for skew in the sampling distribution, like what we have seen in the prior histograms. Armed with our bias correction and acceleration terms, we now estimate the quantiles we will use for establishing the confidence limits:

\[ a_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{\alpha/2}\right)}\right) \]

\[ a_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{1-\alpha/2}\right)}\right) \]

where \(\alpha\) is our Type-I error rate, usually .05. Our confidence limits are then the \(a_1\) and \(a_2\) quantiles of the ordered bootstrap estimates:

\[ \left[\tilde{\theta}_{(a_1)},\ \tilde{\theta}_{(a_2)}\right] \]

Based on the formulas above, it should be clear that \(a_1\) and \(a_2\) reduce to the percentile interval's quantiles when the bias and acceleration terms are zero.
The effect of the bias and acceleration corrections is to change the percentiles we use to establish our limits. If we perform all of the above calculations, we get a BCa interval that stays within the plausible range, along with a warning message that BCa intervals may be unstable. This is because the accuracy of the bias and acceleration terms requires a large number of bootstrap samples, and estimating the acceleration parameter via the jackknife can be computationally intensive. If that is a concern, there is another type of confidence interval, known as the ABC interval, that provides a satisfactory approximation to the BCa interval and is less computationally demanding.
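Specific interval types can be requested from boot.ci(), and (anticipating the next section) a transformation and its inverse can be supplied through the h and hinv arguments. A minimal sketch, reusing the boot_out object from before:

```r
boot.ci(boot_out, type = c("perc", "bca"))   # percentile and BCa intervals

# Normal-theory interval computed on the Fisher-z scale, limits mapped back:
boot.ci(boot_out, type = "norm", h = atanh, hinv = tanh)
```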
Recall that not specifying either function returns: Specifying the transformation only returns: Specifying the transformation and its inverse returns the following: ConclusionIt is hoped that this post clarifies the output from To leave a comment for the author, please follow the link and comment on their blog: Rstats on pi: predict/infer. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now | ||||||||||||||||||||||||||||||||||||||||||||||||
More models, more features: what’s new in ‘parameters’ 0.2.0 Posted: 29 Sep 2019 05:00 PM PDT [This article was first published on R on easystats, and kindly contributed to R-bloggers]. The easystats project continues to grow, expanding its capabilities and features, and the parameters package has just been updated to version 0.2.0. The primary goal of this package is to provide utilities for processing the parameters of various statistical models. It is useful for end-users as well as developers, as it is a lightweight and open-developed package. The main function, model_parameters(), returns a data frame of a fitted model's parameters together with their estimates, confidence intervals, and p-values.

Improved Documentation
The package's documentation has also been improved and extended.
Improved Support
Besides stabilizing and improving the functions for the most popular models, support for several additional model classes has been added.

Improved Printing
For models with special components, in particular zero-inflated models, the printed output has been improved.

Join the team
There is still room for improvement, and some new exciting features are already planned. Feel free to let us know how we could further improve this package! Note that easystats is a new project in active development, looking for contributors and supporters. Thus, do not hesitate to contact one of us if you want to get involved.
bamlss: A Lego Toolbox for Flexible Bayesian Regression Posted: 29 Sep 2019 03:00 PM PDT [This article was first published on Achim Zeileis, and kindly contributed to R-bloggers]. Modular R tools for Bayesian regression are provided by bamlss: from classic MCMC-based GLMs and GAMs to distributional models using the lasso or gradient boosting.

Citation
Umlauf N, Klein N, Simon T, Zeileis A (2019). "bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond)." arXiv:1909.11784, arXiv.org E-Print Archive. https://arxiv.org/abs/1909.11784

Abstract
Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing both as an appealing framework for regularizing or penalizing model estimation as well as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that allows to easily combine all three aspects has not yet been available. For filling this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as "Lego bricks" encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.

Software
CRAN package: https://CRAN.R-project.org/package=bamlss

Quick overview
To illustrate the basic workflow, we start with a Bayesian logit model: a basic labor force participation model, a standard application in microeconometrics. The data are loaded, the model is estimated with the bamlss() function, and the summary is based on the MCMC samples, which suggest "significant" effects for all but one of the covariates. To show a more flexible regression model, we fit a distributional location-scale model to the well-known simulated motorcycle accident data (mcycle in the MASS package). Here, the relationship between head acceleration and time after impact is captured by smooth relationships in both the mean and the variance; a sketch of this model is shown below, and further details are in the paper linked above.

Flexible count regression for lightning reanalysis
Finally, we show a more challenging case study. Here, emphasis is given to the illustration of the workflow; for more details on the background of the data and the interpretation of the model, see Section 5 of the full paper linked above. The goal is to establish a probabilistic model linking positive counts of cloud-to-ground lightning discharges in the European Eastern Alps to atmospheric quantities from a reanalysis dataset.
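Before turning to the lightning case study in detail, here is a minimal sketch of the location-scale model for the mcycle data described in the quick overview above (default samplers and settings; the original post's code may differ):

```r
library("bamlss")
data("mcycle", package = "MASS")

# Smooth effects of time on both the mean and the (log-)standard deviation
# of head acceleration.
f <- list(
  accel ~ s(times),
  sigma ~ s(times)
)

set.seed(123)
b <- bamlss(f, data = mcycle, family = "gaussian")

summary(b)   # MCMC-based summaries of model terms
plot(b)      # estimated smooth effects with credible intervals
```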
The lightning measurements form the response variable, and regressors are taken from atmospheric quantities in ECMWF's ERA5 reanalysis data. Both have a temporal resolution of 1 hour for the years 2010-2018 and a spatial mesh size of approximately 32 km. To model only the lightning counts with at least one lightning discharge, we employ a negative binomial count distribution truncated at zero. After the data are loaded and the regression formula is set up, the model is estimated by gradient boosting and can, of course, also be refitted with MCMC. To explore the model in some more detail, the post shows a couple of visualizations: the contribution of individual terms to the log-likelihood during gradient boosting, traceplots and autocorrelations of the MCMC samples for two of the spline terms, the estimated effects of the model terms, and, finally, estimated probabilities for observing 10 or more lightning counts (within one grid box), reconstructed for four time points on September 15-16.
Getting started with {golem} Posted: 29 Sep 2019 11:30 AM PDT [This article was first published on Rtask, and kindly contributed to R-bloggers]. A little blog post about where to look if you want to get started with {golem}, and an invitation to code with us in October. go-what? If you've never heard of it before, {golem} is a tool for building production-grade Shiny applications. With {golem}, Shiny developers have a toolkit for making stable, easy-to-maintain, and robust web applications that are ready for production. The article Getting started with {golem} first appeared on Rtask.
Tidy forecasting in R Posted: 28 Sep 2019 05:00 PM PDT [This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers]. The fable package for doing tidy forecasting in R is now on CRAN. Like tsibble and feasts, it is also part of the tidyverts family of packages for analysing, modelling and forecasting many related time series (stored as tsibbles). For a brief introduction to tsibbles, see this post from last month. Here we will forecast Australian tourism data by state/region and purpose. This data is stored in the tourism tsibble, which records quarterly trips (the Trips column) for each combination of Region, State and Purpose. There are 304 combinations of Region, State and Purpose, each one defining a time series of 80 observations. To simplify the outputs, we will abbreviate the state names.

Forecasting a single time series
Although the fable package is designed to handle many time series, we will begin by demonstrating its use on a single time series. For this purpose, we will extract the tourism data for holidays in the Snowy Mountains region of NSW. For this data set, a reasonable benchmark forecast method is the seasonal naive method, where forecasts are set to be equal to the last observed value from the same quarter. Alternative models for this series are ETS and ARIMA models. All these can be included in a single call to the model() function, as sketched below. The returned object is called a "mable" or model table, where each cell corresponds to a fitted model. Because we have only fitted models to one time series, this mable has only one row. To forecast all models, we pass the mable to the forecast() function. The returned object is a "fable" or forecast table, containing both point forecasts and forecast distributions for each model and each forecast horizon.
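A sketch of the single-series workflow just described (column and key names are taken from the tourism tsibble; this is an illustration rather than the post's exact code):

```r
library(fable)
library(tsibble)
library(dplyr)

# Holidays in the Snowy Mountains region of NSW.
snowy <- tourism %>%
  filter(Region == "Snowy Mountains", Purpose == "Holiday")

# Fit three models in one call; the result is a mable with one row.
fit <- snowy %>%
  model(
    snaive = SNAIVE(Trips),
    ets    = ETS(Trips),
    arima  = ARIMA(Trips)
  )

fc <- fit %>% forecast(h = "3 years")   # a fable of forecasts
fc %>% autoplot(snowy)                  # forecasts with prediction intervals
```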
If you want to compute prediction intervals, the hilo() function can be applied to the fable.

Forecasting many series
To scale this up to include all series in the tourism data, we use exactly the same model() call on the full tsibble. Now the mable includes models for every combination of the keys (Region, State and Purpose). We can extract information about a specific model with functions such as report() or glance(), and when the mable is passed to forecast(), forecasts are produced for every series and every model. Note the use of natural language to specify the forecast horizon (e.g. h = "3 years"). Plots of individual forecasts can also be produced, although filtering is helpful to avoid plotting too many series at once.

Forecast accuracy calculations
To compare the forecast accuracy of these models, we will create a training data set containing all data up to 2014. We will then forecast the remaining years in the data set and compare the results with the actual values. Here we have also introduced an ensemble forecast, a simple average of the ETS and ARIMA forecasts. To check the accuracy, we use the accuracy() function; a sketch of this workflow follows below. Because we have generated distributional forecasts, it is also interesting to look at accuracy using CRPS (Continuous Ranked Probability Scores) and Winkler scores (for 95% prediction intervals).

Moving from forecast to fable
Many readers will be familiar with the forecast package and will wonder about the differences between forecast and fable. Here are some of the main differences: fable works with tsibbles and can fit many models to many series at once, it produces full forecast distributions rather than just point forecasts with intervals, and its output objects are designed to work with tidyverse tools.
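A sketch of the scaled-up accuracy comparison, continuing the names from the previous block (the ensemble is a simple average of ETS and ARIMA; again an illustration, not the post's exact code):

```r
# Training data up to the end of 2014.
train <- tourism %>%
  filter_index(~ "2014 Q4")

fit_all <- train %>%
  model(
    snaive = SNAIVE(Trips),
    ets    = ETS(Trips),
    arima  = ARIMA(Trips)
  ) %>%
  mutate(ensemble = (ets + arima) / 2)   # simple combination forecast

fc_all <- fit_all %>% forecast(h = "3 years")

# Average accuracy by model over the hold-out period.
fc_all %>%
  accuracy(tourism) %>%
  group_by(.model) %>%
  summarise(MASE = mean(MASE), RMSE = mean(RMSE))
```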
Subsequent posts will explore other features of the fable package.
Does news coverage boost support for presidential candidates in the Democratic primary? Posted: 28 Sep 2019 05:00 PM PDT [This article was first published on R on Jacob Long, and kindly contributed to R-bloggers]. Matt Grossmann noted the close relationship between the amount of news coverage the Democratic primary candidates receive and their standing in the polls.
This got me thinking about what the available data can bring to bear on this question. The GDELT project makes large-scale news coverage data available, and FiveThirtyEight has compiled most of the data we're interested in. Now I'm going to walk through how to get these data into R; skip ahead to the analysis section if the data wrangling doesn't interest you.

Getting the data
As mentioned, FiveThirtyEight has compiled most of the data we're interested in: polling data plus measures of cable news and online news coverage for each candidate. Once we have the data, we still have to get it into shape.

Polls
These data are formatted such that every row is a unique combination of poll and candidate. I first create two vectors of candidate names, then do some filtering and data cleaning, and finally aggregate by week, forming a weekly polling average by candidate. For a quick sanity check, let's plot these data to see if things line up with expectations. Okay, it's a bit more variable than the polished polling averages you may be used to, but the overall picture looks right.

Media
We have two data frames with media coverage info, one for cable news and one for online news. The trends for cable news look a bit similar to the polling trends, although more variable over time. The online news trends are a bit more all over the place, with the minor candidates especially volatile.

Combine data
Now we just need to get all this information in the same place for analysis, so we end up with a single data frame where each row represents one candidate in one week.

Analysis
Okay, so we have multiple time series for each candidate: their standing in the polls, their cable news coverage, and their online news coverage. The kinds of analyses we can do all have in common the idea of comparing each candidate's changes in coverage with their changes in polling support. Of course, this still doesn't sort out the problem of reverse causality: coverage may follow the polls just as much as the polls follow coverage.

Fixed effects models
Fixed effects models are a common way to remove the influence of certain kinds of confounding, here the stable differences between candidates. The process we're looking at is dynamic, meaning candidates' support depends in part on its own recent past, which is the setting in which the Nickell bias can arise. Luckily, these data are not quite the same as the kind for which that bias is most severe, so I fit a model predicting each candidate's standing in this week's polls from their media coverage this week and last week, with candidate fixed effects.
I will note that, as far as the online coverage is concerned, the results are sensitive to whether cable news coverage is also included in the model.

Adjusting for trends
This was the simplest analysis I could do. I can also try to remove any trends in the data before estimating the effects of coverage; the risk with this approach is that detrending can soak up some of the real signal along with the confounding. Okay, same story here: some good evidence of cable news helping and some very weak evidence for online coverage.

Driven by minor candidates?
Responding to Grossmann's tweet, Jonathan Ladd raises an interesting question: could the apparent effect of coverage be driven mostly by the minor candidates?
There are a couple of ways to look at this. We can deal with it via an interaction effect, seeing whether the effects of coverage depend on a candidate's current standing in the polls. There's a lot going on in that model, with both lagged and same-week coverage terms, so let's examine them one by one with the help of some plots of predicted values.

Last week's cable news coverage
Each line represents the predicted standing in this week's polls at different levels of last week's cable news coverage, shown separately for candidates at different polling levels. What we see here is that the apparent effect of cable coverage depends on the candidate's standing in the polls.

Last week's online coverage
For last week's online coverage, the pattern again differs between lower- and higher-polling candidates.

This week's online coverage
Let's do the same test with the effect of this week's online coverage on this week's polls. The picture is quite similar to last week's online coverage, except that not even the low-polling candidates show much of an association here.

Just drop Biden from the analysis
Another thing we can do is just drop Biden, who for most of the campaign cycle has been both the best-polling and most-covered candidate. In this case, the results are basically the same, although the estimated benefits of coverage shift somewhat.

A more advanced model
Let's push a bit further to make sure we're not making a mistake on the basics. Normally, I'd reach for a dynamic panel model for data like these. So what does this all mean? Basically, the same story we saw with the simpler models.

Conclusions
Does news coverage help candidates in the Democratic primary race? Probably. Matt Grossmann suggested sentiment analysis as a next step,
and that's probably a wise choice. Maybe once I'm off the job market!