[R-bloggers] Impressions from e-Rum2020 (and 4 more aRticles)


Impressions from e-Rum2020

Posted: 30 Jun 2020 03:10 AM PDT

[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers].

The e-Rum2020 conference connected many hundreds of R enthusiasts in virtual space!

With the Covid-19 pandemic turning plans upside down on a global scale, what should have taken place at the end of May in Milano as eRum2020 had to be converted into a virtual event held June 17-20, and was accordingly re-branded as e-Rum2020. Our first mention therefore goes to the organizing committee: a big thank you for not giving up and making it happen.

The videoconferencing tool selected for e-Rum2020 was Hopin, which contributed a lot to the great success of the event. Everything worked out well, without major issues, and the feeling was not too far from that of a physical event. As the first large event of this kind for the R community, e-Rum2020 has, we believe, set a high standard for future virtual conferences. Obviously, the main drawback of a virtual format is the lack of the valuable social aspects that are part of an in-person conference. On the positive side, however, a virtual event has a much broader reach, opening it up to attendees who would not have had the time, chance, resources or interest to travel to Milan.

e-Rum2020

Wednesday 17/06

Besides the reception and institutional talks, much of the focus was on Life Sciences applications. It was interesting to see how R is used to improve human health, especially in the keynote by Stephanie Hicks. Among the more technical R novelties, we would highlight Henrik Bengtsson's talk on progressr: an inclusive, unifying API for progress updates.

Thursday 18/06

The afternoon keynote by Sharon Machlis about data journalism and R was excellent. She showed many useful packages and tools she uses for her journalism work and the talk was very entertaining. In the parallel sessions, Dmytro Perepolkin stood out with his talk about polite web scraping. In the afternoon invited session, Colin Fay's talk about what, why and how to test Shiny projects was the most noteworthy one and a must-watch for those who do Shiny app development.

e-Rum2020

Friday 19/06

One of the highlights was Tomas Kalibera's keynote about R 4.0, especially with the Q&A afterwards. It is always enriching to see and understand all the work done by the R core group behind the scenes, and to grasp the fine balance between the core team wanting to be more strict and package developers being eager for more flexibility and access to R internals. Later on, Colin Gillespie's invited talk jokingly complained about CRAN not being harsh enough, ultimately highlighting the importance of enforcing strict paradigms and standards.

Saturday 20/06 (Workshops day)

This was a special day for Mirai. First, we supported the morning workshop "Is R ready for Production? Let's develop a Professional Shiny Application!" by Andrea Melloncelli. Then, we had our own hands-on tutorial in the afternoon, "Bring your R Application Safely to Production. Collaborate, Deploy, Automate." Miraiers Riccardo and Peter showcased Git(Hub)-based application development workflows and CI/CD pipelines with a focus on collaboration, automation and best practices. In response to the astonishingly good feedback we received, we promise a follow-up post specifically about the workshop, with more details and pointers to all workshop-related materials, including the event's recordings. Stay tuned!

e-Rum2020

Hot topics

Overall, two recurrent topics came up in almost all sessions and in some of the keynote speeches, which gives us some hints about the future.

  • Shiny is gaining more and more weight in the R world. The need for nice visualizations and reactivity will stay with us in the years to come. Particularly interesting were the talks from Andrie De Vries on "Creating drag-and-drop Shiny applications using sortable" and Alex Gold on "Design Patterns For Big Shiny Apps". Colin Fay's talk mentioned above also reflected this trend.

  • Tools and techniques to put R in production. R can no longer be treated as standalone software. The talk from Mat Bannert, R alongside Airflow, Docker and GitlabCI, was a nice illustration of this. Our workshop follows the same direction and aims to establish a set of patterns and best practices for running R in production.

We hope you enjoyed the new experience and learned as much as we did. Stay tuned because the post about our workshop is coming soon.

To leave a comment for the author, please follow the link and comment on their blog: Mirai Solutions.



Time Series Analysis: Forecasting Sales Data with Autoregressive (AR) Models

Posted: 30 Jun 2020 12:00 AM PDT

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].


Forecasting the future has always been one of humankind's biggest desires, and many approaches have been tried over the centuries. In this post we will look at a simple statistical method for time series analysis, the autoregressive (AR) model. We will use this method to predict future sales data and will then rebuild it from scratch to get a deeper understanding of how it works, so read on!

Let us dive directly into the matter and build an AR model out of the box. We will use the built-in BJsales dataset, which contains 150 observations of sales data (for more information consult the R documentation). Conveniently enough, AR models can be fitted directly in base R with the ar.ols() function (OLS stands for ordinary least squares, the method used to fit the model). Have a look at the following code:

  data <- BJsales
  head(data)
  ## [1] 200.1 199.5 199.4 198.9 199.0 200.2

  N <- 3        # how many periods lookback
  n_ahead <- 10 # how many periods forecast

  # build autoregressive model with ar.ols()
  model_ar <- ar.ols(data, order.max = N) # ar-model
  pred_ar <- predict(model_ar, n.ahead = n_ahead)
  pred_ar$pred
  ## Time Series:
  ## Start = 151
  ## End = 160
  ## Frequency = 1
  ##  [1] 263.0299 263.3366 263.6017 263.8507 264.0863 264.3145 264.5372
  ##  [8] 264.7563 264.9727 265.1868

  plot(data, xlim = c(1, length(data) + 15),
       ylim = c(min(data), max(data) + 10))
  lines(pred_ar$pred, col = "blue", lwd = 5)

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on, we will now rebuild the model. Basically, everything is in the name already: auto-regressive, i.e. a (linear) regression on (a delayed copy of) itself (auto is Ancient Greek for self)!

So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the lm() function from base R for that (see also Learning Data Science: Modelling Basics). Have a look at the following code:

  # reproduce with lm()
  df_data <- data.frame(embed(data, N + 1) - mean(data))
  head(df_data)
  ##        X1      X2      X3      X4
  ## 1 -31.078 -30.578 -30.478 -29.878
  ## 2 -30.978 -31.078 -30.578 -30.478
  ## 3 -29.778 -30.978 -31.078 -30.578
  ## 4 -31.378 -29.778 -30.978 -31.078
  ## 5 -29.978 -31.378 -29.778 -30.978
  ## 6 -29.678 -29.978 -31.378 -29.778

  model_lm <- lm(X1 ~ ., data = df_data) # lm-model
  coeffs <- cbind(c(model_ar$x.intercept, model_ar$ar), coef(model_lm))
  coeffs <- cbind(coeffs, coeffs[ , 1] - coeffs[ , 2])
  round(coeffs, 12)
  ##                   [,1]       [,2] [,3]
  ## (Intercept)  0.2390796  0.2390796    0
  ## X2           1.2460868  1.2460868    0
  ## X3          -0.0453811 -0.0453811    0
  ## X4          -0.2042412 -0.2042412    0

  data_pred <- df_data[nrow(df_data), 1:N]
  colnames(data_pred) <- names(model_lm$coefficients)[-1]
  pred_lm <- numeric()
  for (i in 1:n_ahead) {
    data_pred <- cbind(predict(model_lm, data_pred), data_pred)
    pred_lm <- cbind(pred_lm, data_pred[ , 1])
    data_pred <- data_pred[ , 1:N]
    colnames(data_pred) <- names(model_lm$coefficients)[-1]
  }

  preds <- cbind(pred_ar$pred, as.numeric(pred_lm) + mean(data))
  preds <- cbind(preds, preds[ , 1] - preds[ , 2])
  colnames(preds) <- NULL
  round(preds, 9)
  ## Time Series:
  ## Start = 151
  ## End = 160
  ## Frequency = 1
  ##         [,1]     [,2] [,3]
  ## 151 263.0299 263.0299    0
  ## 152 263.3366 263.3366    0
  ## 153 263.6017 263.6017    0
  ## 154 263.8507 263.8507    0
  ## 155 264.0863 264.0863    0
  ## 156 264.3145 264.3145    0
  ## 157 264.5372 264.5372    0
  ## 158 264.7563 264.7563    0
  ## 159 264.9727 264.9727    0
  ## 160 265.1868 265.1868    0

As you can see, the coefficients and predicted values are the same (except for some negligible rounding errors)!

A few things warrant further attention: in the linear model, the dependent variable is X1, the first column produced by embed(), which holds the current value; the remaining N columns hold the lagged copies and enter the model as regressors via the formula X1 ~ . . To be precise, this is therefore not a simple linear regression but a multiple regression, because each of those columns (each representing a different time delay) goes into the model as a separate independent variable. Additionally, the regression is performed on the demeaned data, i.e. the mean is subtracted first and added back to the forecasts at the end.
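
To see what embed() actually produces (a small illustration added here, not part of the original post): column 1 holds the current value and the remaining columns hold the lagged copies, so with N = 3 we get four columns.

  embed(1:6, 4)
  ##      [,1] [,2] [,3] [,4]
  ## [1,]    4    3    2    1
  ## [2,]    5    4    3    2
  ## [3,]    6    5    4    3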

So, under the hood, what sounds so impressive ("autoregressive model"… wow!) is nothing but good ol' linear regression. For this method to work, there must therefore be some autocorrelation in the data, i.e. some repeating linear pattern.

As you can imagine there are instances where this will not work. For example, in financial time series there is next to no autocorrelation (otherwise it would be too easy, right! – see also my question and answers on Quant.SE here: Why aren't econometric models used more in Quant Finance?).
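
A quick diagnostic before fitting an AR model (not from the original post, but a natural companion to it) is to look at the sample autocorrelation function: if the autocorrelations die out immediately, an AR model has little to work with.

  acf(BJsales)     # strong, slowly decaying autocorrelation: an AR model is promising
  acf(rnorm(150))  # white noise for comparison: no autocorrelation to exploit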

To predict n_ahead periods ahead, the predict function first uses the last N observed periods and then feeds each newly predicted value back in as input for the next prediction, and so forth, n_ahead times. After that, the mean is added back. Obviously, the farther we predict into the future, the more uncertain the forecast becomes, because the basis of the prediction consists of more and more values that were themselves predicted. The values of both parameters were chosen here for demonstration purposes only. A realistic scenario would use more lookback periods than forecast periods, and you would, of course, take domain knowledge into account, e.g. with monthly data take at least twelve periods as your N.
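
As a sanity check (a small sketch of ours, not from the original post), the first forecast can be reproduced by hand from the fitted AR coefficients: take the last N observations, subtract the mean, weight them with the AR coefficients (most recent lag first), and add the intercept and the mean back.

  last_obs <- rev(tail(as.numeric(data), N)) - model_ar$x.mean  # most recent lag first
  model_ar$x.mean + model_ar$x.intercept +
    sum(as.numeric(model_ar$ar) * last_obs)
  ## should match pred_ar$pred[1], i.e. about 263.03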

This post only barely scratched the surface of forecasting time series data. Basically, many of the standard approaches of statistics and machine learning can be modified so that they can be used on time series data. Yet, even the most sophisticated method is not able to foresee external shocks (like the current COVID-19 pandemic) or to account for feedback loops, where the forecasts themselves change people's actual behaviour.

So, all methods have to be taken with a grain of salt because of those systematic challenges. You should always keep that in mind when you get the latest sales forecast!

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.



Introducing Polished.tech

Posted: 29 Jun 2020 05:00 PM PDT

[This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers].

Polished.tech is our new software service that makes it easier than ever to add modern authentication to your Shiny apps.



Implementing authentication from scratch is inefficient and increases the probability of security vulnerabilities. Hand rolling custom logic to encrypt credentials, reset passwords, verify email addresses, etc. is a tedious, error-prone process. Wouldn't it be nice if an R package handled this boilerplate code for you?

Yea, we thought so too. That's why we created the polished R package. Polished provides sign in and registration pages with all the accompanying bells and whistles your users expect from a modern web app. Polished is secure, customizable to your brand, allows social sign in (with Google, Microsoft, and Facebook), and more. Check out all available features at polished.tech and try out a demo Shiny app using polished.tech here.

Polished has been available for installation from GitHub for about a year now. Over the past year, the biggest drawback of using polished was that it required a substantial level of effort and domain experience to set up and maintain.

With the introduction of polished.tech, polished is now much easier to set up, maintain, and update. Before polished.tech, you had to provision a PostgreSQL database and a plumber API to use polished. With polished.tech, we host the database and API for you. Enabling polished user authentication is now as easy as installing the polished R package, creating a polished.tech account, and copying and pasting a few lines of code. Check out the official getting started docs for details.
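
To give a flavour of what that looks like, here is a minimal sketch (for illustration only, not the official example) of wrapping an existing Shiny app with polished's secure_ui() and secure_server() wrappers; the polished.tech account and API-key configuration, done as described in the getting started docs, is assumed and omitted here.

  library(shiny)
  library(polished)

  # NOTE: polished must first be configured with your polished.tech app name and
  # API key (see the getting started docs); that step is omitted in this sketch.

  ui <- secure_ui(
    fluidPage(
      h1("My Shiny app, now behind a sign-in page")
    )
  )

  server <- secure_server(function(input, output, session) {
    # regular server logic goes here
  })

  shinyApp(ui, server)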

I personally am extremely excited about polished.tech. It has been a long time in the making. It is free to create a polished.tech account, and there is a free tier for basic usage. If you are looking for a modern authentication solution for your Shiny apps, I would be thrilled if you try out polished.tech.



Register for polished.tech here!

To leave a comment for the author, please follow the link and comment on their blog: Posts on Tychobra.



Future-Proofing Your Data Science Team

Posted: 29 Jun 2020 05:00 PM PDT

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

Tomorrowland

Photo by Brian McGowan on Unsplash

This is a guest post from RStudio's partner, Mango Solutions

As RStudio's Carl Howe recently discussed in his blog post on equipping remote data science teams, companies have increasingly been forced to adopt work-from-home policies as the COVID-19 crisis rapidly evolves. Our technology and digital infrastructure have never been more important. Newly formed remote data science teams need to maintain productivity and continue to drive effective stakeholder communication and business value, and the only way to achieve this is through appropriate infrastructure and well-defined ways of working.

Whether your workforce works remotely or not, centralizing platforms and enabling a cloud-based infrastructure for data science will lead to more opportunities for collaboration. It may even reduce IT spend on equipment and maintenance overhead, thus future-proofing your data science infrastructure for the long run.

So when it comes to implementing a long-lived platform, here are some things to keep in mind:

Collaboration Through a Centralized Data and Analytics Platform

A centralized platform, such as RStudio Server Pro, means all your data scientists have access to appropriate compute resources and work within the same environment. Working in this way means that a package written by one developer will work with a minimum of effort in all your developers' environments, allowing simpler collaboration. There are other ways of achieving this with technologies such as virtualenv for Python, but this requires each project to set up its own environment, thereby increasing overhead. Centralizing this effort ensures that there is a well-understood way of creating projects and that each developer is working in the same way.

When using a centralized platform, some significant best practices are:

  • Version control. If you are writing code of any kind, even just scripts, it should be versioned religiously and have clear commit messages. This ensures that users can see each change made in scripts if anything breaks and can reproduce your results on their own.
  • Packages. Whether you are working in Python or R, code should be packaged and treated like the valuable commodity it is. At Mango Solutions, a frequent challenge we address with our clients is to debug legacy code where a single 'expert' in a particular technology has written some piece of process which has become mission critical and then left the business. There is then no way to support, develop, or otherwise change this process without the whole business grinding to a halt. Packaging code and workflows helps to document and enforce dependencies, which can make legacy code easier to manage. These packages can then be maintained by RStudio Package Manager or Artifactory.
  • Reusability. By putting your code in packages and managing your environments with renv, you're able to make your data science reusable (a minimal renv sketch follows this list). Creating this institutional knowledge means that you can avoid a data scientist becoming a single point of failure, and, when a data scientist does leave, you won't be left with a model that nobody understands or can run. As Lou Bajuk explained in his blog post, Does your Data Science Team Deliver Durable Value?, durable code is a significant criterion for future-proofing your data science organization.
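
To make the environment-management point concrete, here is a minimal renv sketch (ours, for illustration; see the renv documentation for details):

  # install.packages("renv")   # one-off
  renv::init()       # set up a project-local library and create renv.lock
  # ...develop as usual, installing or updating packages, then record the state:
  renv::snapshot()   # write the exact package versions to renv.lock
  # On a colleague's machine (or in CI), after cloning the project:
  renv::restore()    # reinstall exactly the versions recorded in renv.lock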

Enabling a Cloud-based Environment

In addition to this institutional-knowledge benefit, running the data science platform on a cloud instance allows you to scale it up easily. With the ability to deploy to Kubernetes, scaling your deployment as your data science team grows becomes straightforward, while you only pay for what you need, when you need it.

This move to the cloud also comes with some tangential benefits which are often overlooked. Providing your data science team with a cloud-based environment means that:

  1. The cost of hardware for your data science staff can be reduced to low-cost laptops rather than costly, high-end, on-premises workstations.
  2. By providing a centralized development platform, you allow remote and mobile work which is a key discriminator for hiring the best talent.
  3. By enhancing flexibility, you are better positioned to remain productive in unforeseen circumstances.

This last point cannot be overstated. At the beginning of the Covid-19 lockdown, a nationwide company whose data team was tied to desktops found themselves struggling to provide enough equipment to continue working through the lockdown. As a result, their data science team could not function and were unable to provide insights that would have been invaluable through these changing times. By contrast, here at Mango, our data science platform strategy allowed us to switch seamlessly to remote working, add value to our partners, and deliver insights when they were needed most.

Building agility into your basic ways of working means that you are well placed to adapt to unexpected events and adopt new platforms which are easier to update as technology moves on.

Once you have a centralized analytics platform and cloud-based infrastructure in place, how are you going to convince the business to use it? This is where the worlds of Business Intelligence and software dev-ops come to the rescue.

Analytics-backed dashboards built with technologies like Shiny (or Dash for Python) and published to RStudio Connect mean you can quickly and easily create front ends for business users to access results from your models. You can also easily expose APIs that allow your websites to be backed by scalable models, potentially creating new ways for customers to engage with your business.
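
For example, a model can be exposed as an HTTP API in just a few lines. The sketch below uses the plumber package (our choice of illustration, not something prescribed by the post); the file name and model object are hypothetical, and the file would be served with plumber::plumb("plumber.R")$run(port = 8000).

  # plumber.R -- minimal sketch of exposing a model as an API endpoint

  #* Return a prediction for a numeric input
  #* @param x A numeric value
  #* @get /predict
  function(x) {
    x <- as.numeric(x)
    # in a real deployment a pre-trained model would be loaded at startup,
    # e.g. model <- readRDS("model.rds"), and used via predict(model, ...)
    list(prediction = 2 * x + 1)  # placeholder calculation
  }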

A word of caution here: doing this without considering how you are going to maintain and update what have now become software products can be dangerous. Models may go out of date, functionality can become irrelevant, and the business can become disillusioned. Fortunately, these are solved problems in the web world, and solutions such as containers and Kubernetes, alongside CI/CD tools, make this a simpler challenge. As a consultancy, we have tried-and-tested solutions that expose APIs from R or Python to back high-throughput websites across a number of sectors for our customers.

Collaborative Forms of Communications

The last piece of the puzzle for your data science team to be productive has nothing to do with data science but is instead about communication. Your data science team may create insights from your data, but they are like a rudderless ship without input from the business. Understanding business problems and what has value to the wider enterprise requires good communication. This means that your data scientists have to partner with people who understand the sales and marketing strategy. And if you are to embrace the ethos of flexibility as protection against the future, then good video-conferencing and other technological communications are essential.


About Dean Wood and Mango Solutions

Dean Wood is a Data Science Leader at Mango Solutions. Mango Solutions provides complex analysis solutions, consulting, training, and application development for some of the largest companies in the world. Founded and based in the UK in 2002, the company offers a number of bespoke services for data analysis including validation of open-source software for regulated industries.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.



one bridge further

Posted: 29 Jun 2020 03:20 PM PDT

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

Jackie Wong, Jon Forster (Warwick) and Peter Smith have just published a paper in Statistics & Computing on bridge sampling bias and improvement by splitting.

"… known to be asymptotically unbiased, bridge sampling technique produces biased estimates in practical usage for small to moderate sample sizes (…) the estimator yields positive bias that worsens with increasing distance between the two distributions. The second type of bias arises when the approximation density is determined from the posterior samples using the method of moments, resulting in a systematic underestimation of the normalizing constant."

Recall that bridge sampling is based on a double trick with two samples x and y from two (unnormalised) densities f and g that are swapped in the ratio

\[
\frac{m \sum_{i=1}^n g(x_i)\,\omega(x_i)}{n \sum_{i=1}^m f(y_i)\,\omega(y_i)}
\]

of unbiased estimators of the inverse normalising constants. Hence the ratio is biased, and the more so the less similar the two densities are. Special cases for ω include importance sampling [unbiased] and reciprocal importance sampling. Since the optimal version of the bridge weight ω is the inverse of the mixture of f and g, it makes me wonder about the performance of using both samples top and bottom since, as an aggregated sample, they also come from the mixture, as in Owen & Zhou's (2000) multiple importance sampler. However, a quick try with a positive Normal versus an Exponential with rate 2 does not show an improvement from using both samples top and bottom (even when using the perfectly normalised versions)

  morc = (sum(f(y) / (nx * dnorm(y) + ny * dexp(y, 2))) +
          sum(f(x) / (nx * dnorm(x) + ny * dexp(x, 2)))) /
         (sum(g(x) / (nx * dnorm(x) + ny * dexp(x, 2))) +
          sum(g(y) / (nx * dnorm(y) + ny * dexp(y, 2))))

at least in terms of bias… Surprisingly (!) the bias almost vanishes for very different sample sizes, either in favour of f or in favour of g. This may be a form of genuine defensive sampling, who knows?! At the very least, this ensures a finite variance for all weights. (The splitting approach introduced in the paper is a natural solution to create independence between the first sample and the second density. This reminded me of our two parallel chains in AMIS.)
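
For readers who want to play with the basic bridge ratio displayed above, here is a self-contained toy version (an illustration only, not the code used for the experiment reported here): f is an unnormalised positive Normal, g an unnormalised Exponential with rate 2, and ω = 1/(f+g) a simple admissible bridge function; the ratio of normalising constants it targets, that of g over f, is (1/2)/√(π/2) ≈ 0.399.

  set.seed(1)
  n <- 1e5; m <- 1e5                       # sample sizes for x and y
  f <- function(t) exp(-t^2/2) * (t > 0)   # unnormalised positive Normal, Z_f = sqrt(pi/2)
  g <- function(t) exp(-2*t) * (t > 0)     # unnormalised Exponential(2),  Z_g = 1/2
  omega <- function(t) 1 / (f(t) + g(t))   # one admissible bridge function
  x <- abs(rnorm(n))                       # draws from f / Z_f
  y <- rexp(m, rate = 2)                   # draws from g / Z_g
  (m * sum(g(x) * omega(x))) / (n * sum(f(y) * omega(y)))  # targets Z_g / Z_f, about 0.399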

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

