[R-bloggers] foreach 1.5.0 now available on CRAN (and 11 more aRticles)


foreach 1.5.0 now available on CRAN

Posted: 31 Mar 2020 09:32 AM PDT

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

This post is to announce that version 1.5.0 of the foreach package is now on CRAN. Foreach is an idiom that allows for iterating over elements in a collection, without the use of an explicit loop counter. The foreach package is now more than 10 years old, and is used by nearly 700 packages across CRAN and Bioconductor. (If you're interested in an overview of the foreach package and its history, this RStudio::conf talk by Bryan Lewis is worth watching.)

The main change in 1.5.0 is intended to help reduce errors associated with modifying global variables. Now, %dopar% loops with a sequential backend will evaluate the loop body inside a local environment. Why make that change? Let's illustrate with an example:

library(foreach)
registerDoSEQ()

a <- 0
foreach(i=1:10) %dopar% {a <- a + 1}
a
## [1] 0

Here, the assignment inside the %dopar% is to a local copy of a, so the global variable a remains unchanged. The reason for this change is that %dopar% is intended for use in a parallel context, where modifying the global environment doesn't make sense: the work takes place in different R processes, sometimes on different physical machines, possibly in the cloud. In that context there is no shared global environment to manipulate, unlike the case of a sequential backend.

Because of this, it's almost always a mistake to modify global variables from %dopar%, even if it used to succeed with a sequential backend. This change should reduce the incidence of programming errors where people prototype their code with a sequential backend, only to have it fail when they switch to a real (parallel) backend.
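One way to avoid depending on global state, sketched below (this example is mine, not from the announcement), is to have each iteration return a value and let foreach combine the results via its .combine argument; the same code works unchanged with a sequential or a parallel backend:

library(foreach)
registerDoSEQ()

## each iteration returns a value; foreach combines them with c()
res <- foreach(i = 1:10, .combine = c) %dopar% { i^2 }
res
## [1]   1   4   9  16  25  36  49  64  81 100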

Note that the behaviour of the %do% operator, which is intended for a sequential backend, remains the same. It matches that of a regular for loop:

a <- 0
foreach(i=1:10) %do% {a <- a + 1}
a
## [1] 10

If you have any questions or comments, please email me or open an issue at the GitHub repo.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Blogging A to Z: The A to Z of tidyverse

Posted: 31 Mar 2020 08:06 AM PDT

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]

Announcing my theme for this year's blogging A to Z!

The tidyverse is a set of R packages for data science. The big thing about the tidyverse is making sure your data are tidy. What does that mean?

  1. Each row is an observation
  2. Each column is a variable
  3. Each cell contains only one value
When I first learned about the tidy approach, I thought, "Why is this special? Isn't that what we should be doing?" But thinking about keeping your data tidy has really changed the way I approach my job, and has helped me solve some tricky data wrangling issues. When you really embrace this approach, merging data, creating new variables, and summarizing cases become much easier. And the syntax used in the tidyverse is much more intuitive than a lot of other R code, making many of the functions easier to memorize; they follow a predictable grammar, so you don't need to constantly look things up.
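As a tiny illustration of what "tidy" means in practice (a sketch of mine using tidyr, not part of the original post):

library(tidyr)

## Untidy: one column per year, so each row holds several observations
untidy <- data.frame(country = c("A", "B"),
                     `2019` = c(10, 20),
                     `2020` = c(12, 25),
                     check.names = FALSE)

## Tidy: one row per country-year observation, one value per cell
pivot_longer(untidy, cols = c(`2019`, `2020`),
             names_to = "year", values_to = "cases")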
See you tomorrow for the first post – A is for arrange!

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


Visualizing decision tree partition and decision boundaries

Posted: 31 Mar 2020 06:00 AM PDT

[This article was first published on r – paulvanderlaken.com, and kindly contributed to R-bloggers.]

Grant McDermott developed this new R package I had thought of: parttree

parttree includes a set of simple functions for visualizing decision tree partitions in R with ggplot2. The package is not yet on CRAN, but can be installed from GitHub using:

# install.packages("remotes")  remotes::install_github("grantmcdermott/parttree")

Using the familiar ggplot2 syntax, we can simply add decision tree boundaries to a plot of our data.

In this example from his GitHub page, Grant trains a decision tree on the famous Titanic data using the parsnip package, and then visualizes the resulting partition / decision boundaries using the simple function geom_parttree().

library(parsnip)
library(titanic)  ## Just for a different data set
library(ggplot2)  ## Needed for the plotting code below
library(parttree) ## Provides geom_parttree()
set.seed(123)     ## For consistent jitter

titanic_train$Survived = as.factor(titanic_train$Survived)

## Build our tree using parsnip (but with rpart as the model engine)
ti_tree =
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  fit(Survived ~ Pclass + Age, data = titanic_train)

## Plot the data and model partitions
titanic_train %>%
  ggplot(aes(x = Pclass, y = Age)) +
  geom_jitter(aes(col = Survived), alpha = 0.7) +
  geom_parttree(data = ti_tree, aes(fill = Survived), alpha = 0.1) +
  theme_minimal()

Super awesome!

This visualization shows precisely where the trained decision tree predicts that passengers of the Titanic would have survived (blue regions) or not (red regions), based on their age and passenger class (Pclass).

This will be super helpful if you need to explain to yourself, your team, or your stakeholders how your model works. Currently, only rpart decision trees are supported, but I am very much hoping that Grant continues building this functionality!

To leave a comment for the author, please follow the link and comment on their blog: r – paulvanderlaken.com.


Updates to R GUIs: BlueSky, jamovi, JASP, & RKWard

Posted: 31 Mar 2020 03:03 AM PDT

[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers.]

Graphical User Interfaces (GUIs) for the R language help beginners get started learning R, help non-programmers get their work done, and help teams of programmers and non-programmers work together by turning code into menus and dialog boxes. There has been quite a lot of progress on R GUIs since my last post on this topic. Below I describe some of the features added to several R GUIs.

BlueSky Statistics

BlueSky Statistics has added mixed-effects linear models. Its dialog shows an improved model builder that will be rolled out to the other modeling dialogs in future releases. Other new statistical methods include quantile regression, survival analysis using both Kaplan-Meier and Cox Proportional Hazards models, Bland-Altman plots, Cohen's Kappa, Intraclass Correlation, odds ratios and relative risk for M by 2 tables, and sixteen diagnostic measures such as sensitivity, specificity, PPV, NPV, Youden's Index, and the like. The ability to create complex tables of statistics was added via the powerful arsenal package. Some examples of the types of tables you can create with it are shown here.

Several new dialogs have been added to the Data menu. The Compute Dummy Variables dialog creates dummy (aka indicator) variables from factors for use in modeling. That approach offers greater control over how the dummies are created than you would have when including factors directly in models.

A new Factor Levels menu item leads to many of the functions from the forcats package. They allow you to reorder factor levels by count, by order of occurrence in the dataset, or by functions of another variable; to lump low-frequency levels into a single "Other" category; and so on. These are all helpful in setting the order and nature of, for example, bars in a plot or entries in a table.

The BlueSky Data Grid now has icons that show the type of each variable, i.e. factor, ordered factor, string, numeric, date, or logical. The Output Viewer adds icons that let you add notes to the output (not full R Markdown yet), and a trash can icon that lets you delete blocks of output.

A comprehensive list of the changes to this release is located here and my updated review of it is here.

jamovi

New modules expand jamovi's capabilities to include time-based survival analysis, Bland-Altman analysis & plots, behavioral change analysis, advanced mediation analysis, differential item analysis, and quantiles & probabilities from various continuous distributions.

jamovi's new Flexplot module greatly expands the types of graphs it can create, letting you take a single graph type and repeat it in rows and/or columns, making it easy to visualize how the data change across groups (such plots are called facet, panel, or lattice plots).

You can read more about Flexplot here, and my recently-updated review of jamovi is here.

JASP

The JASP package has added two major modules: machine learning and network analysis. The machine learning module includes boosting, K-nearest neighbors, and random forests for both regression and classification problems. For regression, it also adds regularized linear regression. For clustering, it covers hierarchical, K-means, random forest, density-based, and fuzzy C-means methods. It can generate models and add predictions to your dataset, but it still cannot save models for future use. The main method it is missing is a single decision tree model. While a less accurate predictor, a simple tree model can often provide insight that is lacking from other methods.

Another major addition to JASP is Network Analysis. It helps you to study the strengths of interactions among people, cell phones, etc. With so many people working from home during the Coronavirus pandemic, it would be interesting to see what this would reveal about how our patterns of working together have changed.

A really useful feature in JASP is its Data Library. It greatly speeds your ability to try out a new feature by offering a completely worked-out example including data. When trying out the network analysis feature, all I had to do was open the prepared example to see what type of data it would use. With most other data science software, you're left to dig about in a collection of datasets looking for a good one to test a particular analysis. Nicely done!

I've updated my full review of JASP, which you can read here.

RKWard

The main improvement to the RKWard GUI for R is adding support for R Markdown. That makes it the second GUI to support R Markdown after R Commander. Both the jamovi and BlueSky teams are headed that way. RKWard's new live preview feature lets you see text, graphics, and markdown as you work. A comprehensive list of new features is available here, and my full review of it is here.

Conclusion

R GUIs are gaining features at a rapid pace, quickly closing in on the capabilities of commercial data science packages such as SAS, SPSS, and Stata. I encourage R GUI users to contribute their own additions to the menus and dialog boxes of their favorite(s). The development teams are always happy to help with such contributions. To follow the progress of these and other R GUIs, subscribe to my blog, or follow me on Twitter.

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.


Contagiousness of COVID-19 Part I: Improvements of Mathematical Fitting (Guest Post)

Posted: 31 Mar 2020 12:00 AM PDT

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]


Learning Machines proudly presents a guest post by Martijn Weterings from the Food and Natural Products research group of the Institute of Life Technologies at the University of Applied Sciences of Western Switzerland in Sion.


The topic of this post is fitting with the R package optim. Food? That sounds like a rather unlikely match for a post on a blog about quantitative analysis; however, there is a surprising overlap between these disciplinary fields. For example, whether you model the transport of a flavour molecule or the transport of a virus, the types of mathematical equations and the ways of treating the data are very similar.

This contribution will be split into two parts. In the first part, we pick up on the earlier fitting described in a previous blog post here (see Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). These fits are sometimes difficult to perform. How can we analyse that difficult behaviour, and how can we make further improvements? In the second part, we will see that all this effort to make the fitting algorithm perform well is actually not very useful in the current case. Just because we use a mathematical model, which sounds rigorous, does not mean that our conclusions/predictions are trustworthy.

These two parts will be accompanied by the R-script covid.r.

COVID-19: a data epidemic

With the outbreak of COVID-19, one thing is certain: never before has a virus gone so viral on the internet. In particular, a lot of data about the spread of the virus is going around. A large amount of it is available in the form of fancy coronavirus trackers that look like weather forecasts or overviews of sports results. Many people have started to try to predict the evolution of the epidemiological curve, and along with it the reproduction number R_0, but can this be done with this type of data?

In this blog-post, we describe the fitting of the data with the SIR model and explain the tricky parts of the fitting methodology and how we can mitigate some of the problems that we encounter.
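For readers who did not see the previous post, that fit is built along the following lines. This is a rough sketch rather than the exact code: the case counts Infected, the population size N, and a named initial state init are assumed to be available from the data.

  library(deSolve)

  ## classic SIR model: susceptible, infectious, recovered
  SIR <- function(time, state, parameters) {
    with(as.list(c(state, parameters)), {
      dS <- -beta * S * I / N
      dI <-  beta * S * I / N - gamma * I
      dR <-  gamma * I
      list(c(dS, dI, dR))
    })
  }

  ## residual sum of squares between the modelled and observed infections
  RSS <- function(parameters) {
    names(parameters) <- c("beta", "gamma")
    out <- ode(y = init, times = seq_along(Infected), func = SIR, parms = parameters)
    sum((Infected - out[, "I"])^2)
  }

  opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0))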

The general problem is that the fitting algorithm does not always find its way to the best solution. Below is a graph that shows an out-of-the-box fit of the data with the optim package (it's the one from the previous blog post Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). Next to it, we show a result that is more optimal. Why did we not find this result directly with the optim package?

Why doesn't the algorithm converge to the optimal solution?

There are two main reasons why the model is not converging well.

Early stopping of the algorithm

The first reason is that the optim algorithm (which is updating model parameters starting from an initial guess and moving towards the optimal solution) is stopping too early before it has found the right solution.

How does the optim package find a solution? The gradient methods used by the optim package find the optimum by repeatedly improving the current estimate, finding a new solution with a lower residual sum of squares (RSS) each time. They do this by computing, for a small change in the parameters, the direction in which the RSS decreases the fastest; then, in the case of the BFGS method used by the optim package, a line search determines where along that direction the RSS is lowest. This is repeated until no further improvement can be made, or until the improvement falls below some desired/sufficient minimal level.

In the two images below we see how the algorithm steps towards the fitted solution, for a SIR model that uses the parameters K = \beta - \gamma and R_0 = \beta/\gamma (these parameters were explained in the previous blog post and are repeated below). The images are a contour plot (lines) and a surface plot (colours) of the RSS as a function of the model parameters. The minimum, where the algorithm should eventually end up, is around K = 0.4 and R_0 = 1.004.

These images show effects that make it difficult for the algorithm to approach the optimum quickly, in a few steps; it may even get blocked before reaching that point. (It may also end up in a local optimum, which is a slightly different problem, although we have that here as well: there is a local optimum with a value of R_0 < 1.)

Computation of the gradient

If the function that we use for the optimization does not provide an expression for its gradient (which is needed to find the direction of movement), then the optim package will compute the gradient numerically by evaluating the function at nearby points.

How close do these points need to be? The optim package uses the scale of the parameters to decide. This scale does not always work out of the box, and when it is too large the algorithm does not compute the gradient accurately.

In the image below, the path taken by the algorithm is shown by the red and black arrows. The red arrows show the path when we do not fine-tune the optimization; the black path shows the path when we reduce the scale of the parameters manually. This is done with the control parameter. In the code of the file covid.R you see this in the function:

  OptTemp <- optim(new, RSS2,
                   method = "L-BFGS-B",
                   lower = c(0, 1.00001),
                   hessian = TRUE,
                   control = list(parscale = c(10^-4, 10^-4),
                                  factr = 1))

By using parscale = c(10^-4,10^-4) we let the algorithm compute the gradient at a smaller scale (we could actually also use the ndeps parameter). In addition, we used factr = 1, which is a factor that determines the point when the algorithm stops (in this case when the improvement is less than one times the machine precision).

So by changing the parameter parscale we can often push the algorithm to get closer to the optimal solution.

A zigzag path towards the optimum may occur when the surface plot of the RSS has a long, stretched, valley-like shape. The algorithm then traces a path towards the optimum like a snowboarder on a half-pipe, making lots of movements back and forth across the curvature of the half-pipe and much less movement along the axis that runs downhill towards the bottom.

In the case above we had let the algorithm start at K = 0.4 and R_0 = 1.004, which was chosen on purpose for the illustration. But we do not always make such a good initial guess. In the image below we see what happens when we instead choose K = 0.4 and R_0 = 2 as the starting condition (note that the image should be stretched out along the y-axis because of the very different ranges, 1.80 < R_0 < 2.00 versus 0.34920 < K < 0.34924, so the change of the RSS is much faster/stronger in the left-right direction than in the up-down direction).

The red curve, which shows the result of the algorithm without the fine-tuning, stops after only one step, around K = 0.35, where it hits the bottom of the curvature of the valley/half-pipe and does not accurately detect that there is still a long path/gradient in the other direction. We can improve the situation by changing the parscale parameter, in which case the algorithm will determine the slope more precisely and continue its path (see the black arrows). But in the direction of the y-axis it only takes small steps, so it converges only slowly to the optimal solution.

We can often improve this situation by changing the relative scale of the parameters; however, in this particular case it is not easy because of the L-shape of the 'valley' (see the image above). We could change the relative scales of the parameters to improve convergence in the beginning, but then the convergence at the end becomes more difficult.

Ill-conditioned problem

The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned. That means that a small change of the data will have a large influence on the outcome of the parameter estimates.

In that case, the data are not very useful for differentiating between different parameter values of the model: a large range of parameter values can explain the same data more or less equally well.

An example of this is shown in the image below, where we see that for different values of R_0 we can still fit the data without much difference in the residual sum of squares (RSS). Every time we get a value for K around 0.35 to 0.40 (and the shape of the curve does not depend much on the value of R_0).

This value for K relates to the initial growth rate. Let's look at the differential equations to see why variations in R_0 have so little effect on the beginning of the curve. In terms of the parameters K = \beta - \gamma and R_0 = \beta/\gamma the equations are now:

    \[\begin{array}{rclc} \frac{dS}{dt}  &=& \frac{-S/N R_0}{R_0-1} K I &\\ \\ \frac{dI}{dt}  &=&  \frac{S/N R_0 - 1}{R_0-1} K I &\underbrace{\approx  \,\,  KI}_{\llap{\scriptsize\text{if $S/N \approx 1$}}} \\ \\ \frac{dR}{dt}  &=&  \frac{1}{R_0-1} K I & \end{array}\]

Here we see that, when S/N is approximately equal to 1 (which is the case in the beginning), then we get approximately dI/dt = KI and the beginning of the curve will be approximately exponential.

Thus, for a large range of values of R_0, the beginning of the epidemiological curve will resemble exponential growth that is independent of the value of R_0. In the opposite direction: when we (initially) observe exponential growth, we cannot use this observation to derive a value for R_0.
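Writing out that approximation makes the point concrete: under it, the number of infected grows as

    \[I(t) \approx I(0) \, e^{K t},\]

so the early part of the curve only pins down K (for example, a doubling time of three days corresponds to K = \log(2)/3 \approx 0.23 per day), while R_0 barely enters.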

With these ill-conditioned problems, it is often difficult to get the algorithm to converge to the minimum. This is because changes in some parameter (in our case R_0) result in only a small improvement of the RSS, and a large range of parameter values has more or less the same RSS.

The error/variance of the parameter estimates

So if small variations in the data occur, due to measurement errors, how much impact will this have on the estimates of the parameters? Here we show the results for two different ways to determine this. The execution of the methods is explained in more detail in the file covid.R.

Using an estimate of the Fisher information. We can determine an estimate for (a lower bound on) the variance of the parameters by considering the Cramér-Rao bound, which states that the variance of (unbiased) parameter estimates is equal to or larger than the inverse of the Fisher information matrix. The Fisher information is a matrix built from the second-order partial derivatives of the log-likelihood function, evaluated at the true parameter values.

The log-likelihood function is this thing:
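For independent Gaussian errors with variance \sigma^2 (the assumption that corresponds to a least-squares fit), it can be written as

    \[\log L(K, R_0, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i(K, R_0)\right)^2,\]

where \hat{y}_i(K, R_0) are the model predictions, so the sum is just the RSS.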

We do not know this log-likelihood function and its dependence on the parameters K and R_0, because we do not have the true parameter values, and we also do not know the variance of the random error of the data points (the \sigma term in the likelihood function). But we can estimate it based on the Hessian, the matrix of second-order partial derivatives of our objective function evaluated at our final estimate.

  #####################
  ##
  ## computing variance with Hessian
  ##
  ###################

  ### The output of optim will store values for RSS and the hessian
  mod <- optim(c(0.3, 1.04), RSS2, method = "L-BFGS-B", hessian = TRUE,
               control = list(parscale = c(10^-4,10^-4), factr = 1))

  # unbiased estimate of standard deviation
  # we divide by n-p
  # where n is the number of data points
  # and p is the number of estimated parameters
  sigma_estimate <- sqrt(mod$value/(length(Infected)-2))

  # compute the inverse of the hessian
  # The hessian = the second order partial derivative of the objective function
  # in our case this is the RSS
  # we multiply by 1/(2 * sigma^2)
  covpar <- solve(1/(2*sigma_estimate^2)*mod$hessian)
  covpar
  #              [,1]          [,2]
  #[1,]  1.236666e-05 -2.349611e-07
  #[2,] -2.349611e-07  9.175736e-09

  ## the standard error of R0 is then approximately
  ##    covpar[2,2]^0.5
  #[1] 9.579006e-05

Using a Monte Carlo estimation. An exact formula for how errors/variance in the data propagate to errors/variance in the parameter estimates is often very complex. The Hessian only gives us a lower bound (I personally find it more useful for spotting potential strong correlations between parameters), and it is not so easy to implement. There is, however, a very blunt but effective way to get an idea of the propagation of errors, and that is to perform a random simulation.

The full details of this method are explained in the covid.R file. Here we will show just the results of the simulation:

In this simulation, we generated new data 100 times from a true model with parameter values K = 0.4020 and R_0 = 1.0053, with the variance of the data points corresponding to the observed RSS of our fit. The right graph shows how the parameters \beta and \gamma are distributed for the same simulation. The parameters \beta and \gamma are strongly correlated. As a result, they each have a large marginal/individual error, but the values \beta - \gamma and \beta/\gamma have much less relative variation (which is why we changed the fitting parameters from \beta and \gamma to K and R_0).
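The structure of such a simulation is roughly as follows. This is only a sketch: model_curve and fit_K_R0 are hypothetical placeholders for the fitted SIR curve and the refitting step in covid.R, and sigma_estimate is the noise level estimated above.

  set.seed(1)
  K_true  <- 0.4020
  R0_true <- 1.0053
  mu <- model_curve(K_true, R0_true)      # modelled number of infected per day

  sims <- replicate(100, {
    y_new <- mu + rnorm(length(mu), sd = sigma_estimate)  # new data at the observed noise level
    fit_K_R0(y_new)                                       # refit, returning c(K, R0)
  })

  apply(sims, 1, sd)   # spread of the 100 estimates = Monte Carlo standard errors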

Fitting by using the latest data and by using the analytical solution of the model

Now we are almost at the end of this post, and we will make a new attempt to fit the epidemiological curve, this time based on more recent data.

What we do this time is make some small adaptations:

  • The data are the total number of people that have gotten sick. This is different from the I (infectious) and R (recovered) output of the model, so we compare the modelled I+R with the data (the total number that have gotten sick).
  • In this comparison, we use a scaling factor because the reported number of infected/infectious people is an underestimate of the true value, and the true value is what the model computes. We use two scaling factors: one for before and one for after February 12 (because at that time the definition for reporting cases was changed).
  • We make the population size a fitting variable. This corrects for the two assumptions that there is homogeneous mixing among the entire population of China and that 100% of the population is susceptible. In addition, we make the number of infected people at the start, I(0), a fitting variable. In this model, we fit I+R. There are data for I separately, but it is not such an accurate variable (because the recovered and infectious phases are not easy to define/measure/determine).

Because estimating all these parameters in a single optim call is too difficult, we solve for the parameters separately, in a nested way. In the innermost loop we solve for the scaling parameters (which can be done with a simple linear model); in the middle loop we solve for K and R_0 with the optim function; in the outer loop we do a brute-force search for the optimal starting value I(0).
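Schematically, the nesting looks something like the sketch below. The names are placeholders rather than the actual covid.R code: run_sir is assumed to compute the modelled I+R for given parameters, and period is the before/after-February-12 indicator used for the two scaling factors.

  rss_for_I0 <- function(I0) {
    optim(c(K = 0.3, R0 = 1.1), function(par) {
      pred <- run_sir(par["K"], par["R0"], I0)   # middle loop: modelled I + R
      fit  <- lm(Infected ~ 0 + pred:period)     # inner loop: scaling factors per period
      sum(residuals(fit)^2)                      # RSS handed back to optim
    }, method = "L-BFGS-B", lower = c(0, 1.00001))$value
  }

  I0_grid <- 1:50                                # outer loop: brute force over I(0)
  best_I0 <- I0_grid[which.min(sapply(I0_grid, rss_for_I0))]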

To obtain a starting condition we use a result from Harko, Lobo and Mak 2014 (Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates) who derived expressions for S, I and R in terms of a single differential equation. The equation below is based on their equations but expressed in slightly different terms:

    \[\frac{dS}{dt} = \beta \left(\frac{R(0)}{N}-1\right) S \; - \; \gamma \, S \log(S/N) \; + \; \frac{\beta S^2}{N}\]

We can solve this equation as a linear regression, which gives us a good starting condition (small side note: using some form of the differential equation is a general way of getting starting conditions, but dS/dt might be noisy; in that case one could integrate the expression instead).
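To illustrate the idea (a rough sketch only, not the covid.R code): S is the susceptible series implied by the data and N the population size, both assumed to be available, and the regression coefficients map back to the model parameters.

  dSdt <- diff(S)                                  # crude numerical derivative per day
  Ss   <- S[-length(S)]                            # align S with the differences
  fit  <- lm(dSdt ~ 0 + Ss + I(Ss * log(Ss / N)) + I(Ss^2))
  gamma_start <- -coef(fit)[2]                     # coefficient of S*log(S/N) is -gamma
  beta_start  <-  coef(fit)[3] * N                 # coefficient of S^2 is beta/N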

The further details of the computation can be found in the covid.R script. Below you see a result of the outer loop, where we did a brute-force search (which gives an optimum around I(0) = 10.8 for N = 10^5), and next to it a fitted curve for the parameters N = 4.155 \cdot 10^5, I(0) = 45, K = 0.218 and R_0 = 1.087.

In this new fit, we again get a low reproduction number, R_0 = 1.087. One potential reason is that, due to the measures that have been taken, the Chinese have been able to reduce the rate of spread of the virus. The model is unaware of this and interprets the slowdown as a reduction due to immunity (a decrease in susceptible people). However, only a very small fraction of the people have gained immunity (about 20% of the population got sick if we take N = 4.155 \cdot 10^5). For the virus to stop spreading at such a low fraction of sick people, the fitted R_0 must be very low.

Thus, an estimation of the parameters based on this type of data is difficult. When we see a decrease in the growth rate, one or more of the following four effects play a role: (1) The number of susceptible people has decreased sufficiently to overcome the reproduction rate R_0. This relative decrease in susceptible people happens faster when the total number of people is smaller. (2) Something has changed about the conditions; the reproduction rate is not constant in time. For instance, with respiratory infections it is common that the transfer rates depend on weather and are higher during winter. (3) The measures that are being taken against the disease are taking effect. (4) The model is too simple, with several assumptions that overestimate the effect of the initial growth rate. This growth rate is very high (>25% per day), and we observe a doubling every three days. This means that the time between generations is very short, something that is not believed to be true. It may be that the increase in numbers is partially due to variable time delays in the occurrence of the symptoms, as well as sampling bias.

For statisticians, it is difficult to estimate what causes the changes in the epidemic curves. We would need more detailed information to fill in the gaps, which do not seem to go away just by having more data (and this coronavirus creates a lot of data, possibly too much). But as human beings under threat of a nasty disease, we can at least consider ourselves lucky that there are many ways in which the disease can fade away, and that we see a seemingly/effective reproduction rate that is very low, with only a fraction of the population susceptible.

Wrap up

So now we have done all this nice mathematics, and we can accurately draw a modelled curve through all our data points. But is this useful when we model the wrong data with the wrong model? The difference between statistics and mathematics is that statisticians need to look beyond the computations.

  • We need to consider what the data actually represent, how they are sampled, whether there are biases, and how strongly those biases will influence our analysis. Ideally we should do this before we start throwing computations at the data; otherwise such computations are at most exploratory analysis and should not start to live a life of their own, detached from the data.
  • And we need to consider how good a representation our models are. We can derive expressions based on the variance in the data, but the error is also determined by the bias in our models.

At the present time, COVID-19 is having an enormous impact on our lives, with unclear effects for the future (we do not even know when the measures are going to stop: end of April, end of May, maybe even June?). Only time will tell what the economic aftermath of this coronavirus will be, and how great its impact will be on our health and quality of life. But one thing we can assure ourselves of is that the ominous view of unlimited exponential growth (currently going around on social media) is not data-driven.

In this post, I have explained some mathematics about fitting. However, I would like to warn against the blunt use of these mathematical formulas. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises, which are the underlying data and models. So in the next post, "Contagiousness of COVID-19 Part 2: Why the Result of Part 1 is Useless", I hope to explain what sort of considerations about the data and the models one should take into account, and to make some connections with other cases where statistics went in the wrong direction.

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


What is a dgCMatrix object made of? (sparse matrix format in R)

Posted: 30 Mar 2020 10:44 PM PDT

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

I've been working with sparse matrices in R recently (those created using Matrix::Matrix with the option sparse=TRUE) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots in an object of class dgCMatrix.

(Click here for full documentation of the Matrix package (and it is a lot–like, 215 pages a lot).)

Background

It turns out that there is some documentation on dgCMatrix objects within the Matrix package. One can access it using the following code:

  library(Matrix)
  ?`dgCMatrix-class`

According to the documentation, the dgCMatrix class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the "standard" class for sparse numeric matrices in the Matrix package.

An example

We'll use a small matrix as a running example in this post:

  library(Matrix)
  M <- Matrix(c(0, 0,  0, 2,
                6, 0, -1, 5,
                0, 4,  3, 0,
                0, 0,  5, 0),
              byrow = TRUE, nrow = 4, sparse = TRUE)
  rownames(M) <- paste0("r", 1:4)
  colnames(M) <- paste0("c", 1:4)
  M
  # 4 x 4 sparse Matrix of class "dgCMatrix"
  #    c1 c2 c3 c4
  # r1  .  .  .  2
  # r2  6  . -1  5
  # r3  .  4  3  .
  # r4  .  .  5  .

Running str on M tells us that the dgCMatrix object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham's Advanced R.)

  str(M)
  # Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  # ..@ i       : int [1:7] 1 2 1 2 3 0 1
  # ..@ p       : int [1:5] 0 1 2 5 7
  # ..@ Dim     : int [1:2] 4 4
  # ..@ Dimnames:List of 2
  # .. ..$ : chr [1:4] "r1" "r2" "r3" "r4"
  # .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
  # ..@ x       : num [1:7] 6 4 -1 3 5 2 5
  # ..@ factors : list()

x, i and p

If a matrix M has n non-zero entries, then its x slot is a vector of length n containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by column 2, 3 and so on.

  M
  # 4 x 4 sparse Matrix of class "dgCMatrix"
  #    c1 c2 c3 c4
  # r1  .  .  .  2
  # r2  6  . -1  5
  # r3  .  4  3  .
  # r4  .  .  5  .

  M@x
  # [1]  6  4 -1  3  5  2  5

  as.numeric(M)[as.numeric(M) != 0]
  # [1]  6  4 -1  3  5  2  5

The i slot is a vector of length n. The kth element of M@i is the row index of the kth non-zero element (as listed in M@x). One big thing to note here is that the first row has index ZERO, unlike R's indexing convention. In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of M@i is 1.

  M
  # 4 x 4 sparse Matrix of class "dgCMatrix"
  #    c1 c2 c3 c4
  # r1  .  .  .  2
  # r2  6  . -1  5
  # r3  .  4  3  .
  # r4  .  .  5  .

  M@i
  # [1] 1 2 1 2 3 0 1

If the matrix has nvars columns, then the p slot is a vector of length nvars + 1. If we index the columns such that the first column has index ZERO, then M@p[1] = 0, and M@p[j+2] - M@p[j+1] gives us the number of non-zero elements in column j.

In our example, when j = 2, M@p[2+2] - M@p[2+1] = 5 - 2 = 3, so there are 3 non-zero elements in column index 2 (i.e. the third column).

  M
  # 4 x 4 sparse Matrix of class "dgCMatrix"
  #    c1 c2 c3 c4
  # r1  .  .  .  2
  # r2  6  . -1  5
  # r3  .  4  3  .
  # r4  .  .  5  .

  M@p
  # [1] 0 1 2 5 7

With the x, i and p slots, one can reconstruct the entries of the matrix.
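As a quick check (my addition, not from the original post), we can rebuild M from its slots with Matrix::sparseMatrix(), passing index1 = FALSE because the i slot is 0-based, and count the non-zero entries per column with diff():

  M2 <- sparseMatrix(i = M@i, p = M@p, x = M@x, dims = M@Dim,
                     dimnames = M@Dimnames, index1 = FALSE)
  all(M2 == M)
  # [1] TRUE

  diff(M@p)  # number of non-zero entries in each column
  # [1] 1 1 3 2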

Dim and Dimnames

These two slots are fairly obvious. Dim is a vector of length 2, with the first and second entries denoting the number of rows and columns the matrix has respectively. Dimnames is a list of length 2: the first element being a vector of row names (if present) and the second being a vector of column names (if present).

factors

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like factors is

… [an] Object of class "list" – a list of factorizations of the matrix. Note that this is typically empty, i.e., list(), initially and is updated automagically whenever a matrix factorization is computed.

My understanding is that if we perform any matrix factorizations or decompositions on a dgCMatrix object, it stores the factorization under factors so that, if asked for the factorization again, it can return the cached value instead of recomputing it. Here is an example:

  M@factors
  # list()

  Mlu <- lu(M)  # perform triangular decomposition
  str(M@factors)
  # List of 1
  # $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots
  # .. ..@ L  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
  # .. .. .. ..@ i       : int [1:4] 0 1 2 3
  # .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
  # .. .. .. ..@ Dim     : int [1:2] 4 4
  # .. .. .. ..@ Dimnames:List of 2
  # .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
  # .. .. .. .. ..$ : NULL
  # .. .. .. ..@ x       : num [1:4] 1 1 1 1
  # .. .. .. ..@ uplo    : chr "U"
  # .. .. .. ..@ diag    : chr "N"
  # .. ..@ U  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
  # .. .. .. ..@ i       : int [1:7] 0 1 0 1 2 0 3
  # .. .. .. ..@ p       : int [1:5] 0 1 2 5 7
  # .. .. .. ..@ Dim     : int [1:2] 4 4
  # .. .. .. ..@ Dimnames:List of 2
  # .. .. .. .. ..$ : NULL
  # .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
  # .. .. .. ..@ x       : num [1:7] 6 4 -1 3 5 5 2
  # .. .. .. ..@ uplo    : chr "U"
  # .. .. .. ..@ diag    : chr "N"
  # .. ..@ p  : int [1:4] 1 2 3 0
  # .. ..@ q  : int [1:4] 0 1 2 3
  # .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

  set.seed(1)
  M <- runif(9e6)
  M[sample.int(9e6, size = 8e6)] <- 0
  M <- Matrix(M, nrow = 3e3, sparse = TRUE)

  system.time(lu(M))
  #   user  system elapsed
  # 13.527   0.161  13.701

  system.time(lu(M))
  #   user  system elapsed
  #      0       0       0

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.


Close Encounters of the R Kind

Posted: 30 Mar 2020 05:00 PM PDT

[This article was first published on R Views, and kindly contributed to R-bloggers.]

Affiliation

Harrison – Center for Strategic and Budgetary Analysis, Washington DC

Cara – Department of the Air Force (Studies, Analyses, and Assessments – AF/A9), Washington DC

Disclaimer

The views expressed in this article represent the personal views of the author and are not necessarily the views of the Department of Defense (DoD) or the Department of the Air Force.

This post is an effort to condense the 'buzz' surrounding the explosion of open source solutions in all facets of analysis – to include those done by Military Operations Research Society (MORS) members and those they support – by describing our experiences with the R programming language.

The impact of R in the statistical world (and by extension, data science) has been big: worthy of an entire issue of SIGNIFICANCE Magazine (RSS). Surprisingly, R is not a new computing language; modeled on S and Scheme, the technology at the core of R is over forty years old. This longevity is, in itself, noteworthy. Additionally, fee-based and for-profit companies have begun to incorporate R with their products. While statistics is the focus of R, with the right packages – and know-how – it can also be used for a much broader spectrum of tasks, including machine learning, optimization, and interactive web tools.

In this post, we go back and forth discussing our individual experiences.

Getting Started with R

Harrison: I got started with R in earnest shortly before retiring from the U.S. Navy in 2016. I knew that I was going to need a programming language to take with me into my next career. The reason I chose R was not particularly analytical; the languages that I had done the most work in during grad school – MATLAB and Java – were not attractive in that the first required licensing fees and the second was – to me at the time – too 'low level' for the type of analysis I wanted to perform. I had used SPlus in my statistics track, but never really 'took' to it while in school. Several tools to 'bridge' the gap between Excel and R had been recommended to me by a friend, including RStudio and Rcommander.

Onboard ship many years ago, I learned to eat with chopsticks by requesting that the wardroom staff stop providing me with utensils, substituting a bag of disposable chopsticks I purchased in Singapore. Turns out when deprived of other options, you can learn very fast. Learning R basics was the same; instead of silverware, it was removing the shortcuts to my usual tools on my home laptop (Excel). I simply did every task that I could from the mundane to the elegant in R.

Cara: I started dabbling with R in 2017 when I had about a year and a half left in my PhD journey, after I decided to pursue a post-doctoral government career. Sitting comfortably in academia with abundant software licenses for almost a decade, I had no reason until that point to consider abandoning my SAS discipleship (other than abhorrent graphics capability, which I bolstered with use of SigmaPlot). My military background taught me not to expect access to expensive software in a government gig, and I had a number of friends and colleagues already using R and associated tools, so I installed it on my home computer. Other than being mildly intrigued by the software version naming convention, I stubbornly clung to SAS to finish my doctoral research, however.

It was not until I became a government OR analyst in late 2018 that I really started my R journey (mostly out of necessity at this point). R does not require a license, and furthermore, is approved for use on government computers. I found R to be an easier and more natural programming language to learn coming from a SAS background than Python, Javascript, or C++ would have been.

How has using R shaped your practice?

Harrison: There is a lot of talk about how various tools perform in the sense of runtime, precision, graphics, etc. These are considerations, but they are completely eclipsed by the following: we don't talk enough as a community about how much the tools we use shape our thinking. I frequently tell my colleagues that the fundamental unit in Excel is called a cell because it is your mind prison. There's actually some truth to that. R is vectorized, so for most functions, passing an array gives an appropriate array output. When you work day-in and day-out with vectors, you stop thinking about individual operations and start to think in terms of sentences. The magrittr %>% operator, which takes the expression on the left as the first argument to the function on the right, makes this possible. Analysis begins to feel more like writing sentences – or even short poems – than writing computing code.
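A tiny illustration of that idea (my example, not from the original post): any function whose first argument is the data can be chained this way.

library(magrittr)

## the pipe passes the left-hand side as the first argument of the
## function on the right, so these two lines are equivalent
round(mean(c(2, 4, 7)), 1)
c(2, 4, 7) %>% mean() %>% round(1)
## [1] 4.3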

Early in my work with R, I was told by a colleague that "R might be good but the graphics are terrible". This was a bit of a shock, as graphics has been one of the main selling points of the language, and I didn't want to be making seedy graphs. From that point on, I made it a point to make the best graphics I possibly could, usually – but not always – using methods and extensions found in the ggplot2 package. It is no exaggeration to say that I spend roughly 20% of my analysis time picking colors and other aesthetics for plots. If you are willing to take the time, you can get the graphics to sing; there are color schemes based on The Simpsons and Futurama, and fonts based on xkcd comics.

Cara: When I began teaching myself R – and using it daily – I thought I was merely learning the syntax of a new programming language. With the analytic capability inherent with R and the flexibility of development environments, however, it is really more of a way of thinking. Fold in the powerful (and mostly free!) resources and passionate following of analysts and data scientists, and you get an R community that I truly enjoy being a part of.

The R environment, even as a novice user, can have positive impacts on your workflow. For example, beyond syntax, my earliest explorations in R taught me that if you are going to do something more than once, write a function. I had never truly internalized that idea, even after a decade of using SAS. Another thing I learned relatively early on – get to know the dplyr package, and use it! I had been coding in R for about 6 months before I was really introduced to functions like dplyr::filter(), dplyr::select(), and dplyr::mutate(); these are powerful functions that can save a ton of code. I've been analyzing data for over a decade and I've never come across a dataset that was already in the form I needed. Prior to using the dplyr package, however, I was spending a lot of time manipulating data using no functions and a lot of lines of code. Beyond time savings, dplyr helps you think about your data more creatively. As a very basic example, dplyr::summarise() is a more powerful option than mean() used alone, especially for multiple calculations in a single data table. And once you master the Wonder Twin-esque combination of using group_by() with summarise(), you'll be amazed at what you can (quickly) reveal through exploratory analysis. Data wrangling is (and always will be) a fact of life. The more efficiently you manipulate data, however, the more time you have to spend on the seemingly more exciting aspects of any project.
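As a small sketch of that kind of pipeline (using the built-in iris data rather than anything from our work):

library(dplyr)

iris %>%
  group_by(Species) %>%                        # one group per species
  summarise(mean_petal_length = mean(Petal.Length),
            n = n())                           # several summaries in one call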

Disadvantages of R

Harrison: This piece is not a 'sales pitch' for R, but rather a sober consideration of the tradeoffs an organization needs to weigh when choosing an analytic platform writ large:

  1. Compatibility and Editing. Because R is a computing language, graphics built in R are not editable by non-R users, as opposed to Excel graphs. This can be a challenge in the frequent case where the reviewers are not the same people that created the plots. If you made the plot, you are going to have to be the one who does the editing, unless there is another R user who understands your particular technique in the office.

  2. No license cost does not mean that it's free: I frequently like to say that I haven't spent a dime on analytics software since I retired from the Navy; this is strictly true, but also misleading. I have spent considerable time learning best practices in R over the past 4 years. An organization that is looking to make this choice needs to realize upfront that the savings in fees will be largely eaten up by the extra manpower needed to learn how to make it work. The reward for investing the time in increasing the ability of your people to code is twofold: first, it puts them in closer touch with the actual analysis, and second, it allows for bespoke applications.

Cara: I work in a pretty dynamic world as a government operations research analyst (ORSA); we don't typically have dedicated statisticians, programmers, data scientists, modelers, or data viz specialists. Most of the time, we are all functioning in some or all of those capacities. As a former engineer with a dynamic background, this suits me well. However, it also means that things change from day to day, from project to project, and as the government analytic world changes (rapidly). I do not have the flexibility to use one software package exclusively. Further, I face challenges within DoD related to systems, software, classification, and computing infrastructure that most people in academia or industry do not. In my organization, there has been a relatively recent and rapid shift in the analytic environment. We formerly leaned heavily on Excel-based regressions and descriptive statistics, usually created by a single analyst to answer a single question, and in many cases these models were not particularly dynamic or scalable. We now focus on using open-source tools in a team construct, sometimes with industry partners, to create robust models that are designed to answer multiple questions from a variety of perspectives; scale easily to mirror operational requirements; fit with other models; and transition well to high performance computing environments.

The two open-source tools we (i.e., my division) currently use most for programming are R and Python. We have had success combining data analysis, statistics, and graphical models to create robust tools coded as RShiny apps. Recently, we chose to code in Python for a project that involved machine learning and high performance computing. I do not propose to debate the strengths and weaknesses of either R or Python in this forum; rather, I challenge you to consider carefully the implications of programming language choice for any project with a cradle to grave perspective.

References

Getting started with R can be daunting. We recommend the following references.

  1. Stack Overflow. This invaluable resource is a bulletin board exchange of programming ideas and tips. The real skill required to use it effectively is knowing how to write an effective question. "I hate ggplot" or "My R code doesn't work" are not useful; try "Could not subset closure" or "ggplot axis font size" instead.

  2. Vignettes. Well-developed R packages have vignettes, which are very useful in seeing both an explanation of the code as well as an example. Two very good references are the ggplot2 gallery and the dplyr vignette. Finally, the R Views blog is a great way to keep up to date with practice.

  3. Books. Although I tend to acquire books with reckless abandon, the ones I actually keep and use have withstood careful consideration and have generally pegged the daily utility meter. Try R for Data Science by Wickham and Grolemund (O'Reilly Publishing 2017) and Elegant Graphics for Data Analysis by Wickham (Springer 2016); available both as print copies or electronic editions.

  4. Podcasts. For those moments in your life when you need some data science-related enrichment, the producers of DataCamp host an excellent podcast called DataFramed. Fifty-nine episodes have been recorded so far; find them on soundcloud, Spotify, YouTube, or VFR direct from the creator's listening notes.

  5. RStudio Cheatsheets. Sometimes you need densely constructed (read: compact yet surprisingly in-depth), straightforward references. RStudio creates (and updates) these two-pagers for the most popular and versatile R packages to be great portable references for programmers – think of them as a combined dictionary and thesaurus for learning R. Fun fact: they can be downloaded in multiple languages.

  6. Forums. (1) Data Science Center of Education (DSCOE) is a CAC-enabled collaboration site that hosts data science tutorials developed by Army analysts, mostly using R, and supports a week-long R immersion course offered at the Center for Army Analysis (CAA) twice a year. The DSCOE forum is managed collaboratively by the CAA, U.S. Army Cyber Command (ARCYBER), the Naval Postgraduate School (NPS), and the United States Military Academy (USMA). Contributions are both welcome and encouraged. (2) R-bloggers, created in 2015, is an R-centric forum designed to foster connection, collaboration, and resource sharing within the R community. The utility of this forum lies in its array of technical resources that will benefit both new and practiced users. (3) Data Science DC, for those in the NCR, was formed via the concatenation of numerous meetup groups – including RMeetup DC – and is a major proponent of a number of events, including hackathons and the DCR conference (held annually in the fall).

To leave a comment for the author, please follow the link and comment on their blog: R Views.


COVID-19 in Belgium

Posted: 30 Mar 2020 05:00 PM PDT

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]

Photo by Markus Spiske


Introduction

The novel COVID-19 coronavirus is still spreading quickly in several countries, and it does not seem like it is going to stop anytime soon, as the peak has not yet been reached in many of them.

Since the beginning of its expansion, a large number of scientists across the world have been analyzing this Coronavirus from different perspectives and with different technologies with the hope of coming up with a cure in order to stop its expansion and limit its impact on citizens.

Top R resources on Coronavirus

In the meantime, epidemiologists, statisticians and data scientists are working towards a better understanding of the spread of the virus in order to help governments and health agencies take the most optimal decisions. This has led to the publication of a great many online resources about the virus, which I collected and organized in an article covering the top R resources on Coronavirus. That article is a collection of the best resources I've had the chance to discover, with a brief summary of each. It includes Shiny apps, dashboards, R packages, blog posts and datasets.

Publishing this collection led many readers to submit their piece of work, which made the article even more complete and more insightful for anyone interested in analyzing the virus from a quantitative perspective. Thanks to everyone who contributed and who helped me in collecting and summarizing these R resources about COVID-19!

Given my field of expertise, I am not able to help in this fight against the virus from a medical point of view. However, I still wanted to contribute as much as I could. From understanding better the disease to bringing scientists and doctors together to build something bigger and more impactful, I truly hope that this collection will, to a small extent, help to fight the pandemic.

Coronavirus dashboard for your own country

Besides receiving analyses, blog posts, R code and Shiny apps from people across the world, I realized that many people were trying to create a dashboard tracking the spread of the Coronavirus for their own country. So in addition to the collection of top R resources, I also published an article detailing the steps to follow to create a dashboard specific to a country. See how to create such dashboard in this article and an example with Belgium.

The code has been made available on GitHub and is open source so everyone can copy it and adapt it to their own country. The dashboard was intentionally kept simple so anyone with a minimum knowledge in R could easily replicate it, and advanced users could enhance it according to their needs.

Motivations, limitations and structure of the article

By seeing and organizing many R resources about COVID-19, I am fortunate enough to have read a lot of excellent analyses on the disease outbreak, the impact of different health measures, forecasts of the number of cases, projections about the length of the pandemic, hospitals capacity, etc.

Furthermore, I must admit that some countries such as China, South Korea, Italy, Spain, UK and Germany received a lot of attention as shown by the number of analyses done on these countries. However, to my knowledge and at the date of publication of this article, I am not aware of any analysis of the spread of the Coronavirus specifically for Belgium.1 The present article aims at filling that gap.

Throughout my PhD thesis in statistics, my main research interest has been survival analysis applied to cancer patients (more information in the research section of my personal website). I am not an epidemiologist, and I have no extensive knowledge of modelling disease outbreaks via epidemiological models.

I usually write articles only about things I consider myself familiar with, mainly statistics and its applications in R. At the time of writing this article, however, I was curious about where Belgium stood regarding the spread of this virus, and I wanted to play with this kind of data in R (which is new to me) and see what came out.

In order to satisfy my curiosity while not being an expert, in this article I am going to replicate analyses done by more knowledgeable people and apply them to my country, that is, Belgium. From all the analyses I have read so far, I decided to replicate the analyses done by Tim Churches and Prof. Dr. Holger K. von Jouanne-Diedrich. This article is based on a mix of their articles which can be found here and here. They both present a very informative analysis on how to model the outbreak of the Coronavirus and show how contagious it is. Their articles also allowed me to gain an understanding of the topic and in particular an understanding of the most common epidemiological model. I strongly advise interested readers to also read their more recent articles for more advanced analyses and for an even deeper understanding of the spread of the COVID-19 pandemic.

Other more complex analyses are possible and even preferable, but I leave this to experts in this field. Note also that the following analyses take into account only the data until the date of publication of this article, so the results should not be viewed, by default, as current findings.

In the remainder of the article, we first introduce the model which will be used to analyze the Coronavirus outbreak in Belgium. We also briefly discuss and show how to compute an important epidemiological measure, the reproduction number. We then use our model to analyze the outbreak of the disease in the case where there would be no public health intervention. We conclude the article by summarizing more advanced tools and techniques that could be used to further model COVID-19 in Belgium.

Analysis of Coronavirus in Belgium

A classic epidemiological model: the SIR model

Before diving into the real-life application, we first introduce the model that will be used.

There are many epidemiological models but we will use one of the simplest, the SIR model. Tim Churches' explanation of this model and how to fit it using R is so nice, I will reproduce it here with a few minor changes.

The basic idea behind the SIR model (Susceptible – Infectious – Recovered) of communicable disease outbreaks is that there are three groups (also called compartments) of people:

  • those who are healthy but susceptible to the disease: S
  • the infectious (and thus, infected) people: I
  • people who have recovered: R
SIR model. Source: Wikipedia

To model the dynamics of the outbreak we need three differential equations to describe the rates of change in each group, parameterised by:

  • \(\beta\) which controls the transition between S and I
  • \(\gamma\) which controls the transition between I and R

Formally, this gives:

\[\frac{dS}{dt} = -\frac{\beta I S}{N}\]

\[\frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

Before fitting the SIR model to the data, the first step is to express these differential equations as an R function, with respect to time t.

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta * I * S / N
    dI <- beta * I * S / N - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

Fitting a SIR model to the Belgium data

To fit the model to the data we need two things:

  1. a solver for these differential equations
  2. an optimiser to find the optimal values for our two unknown parameters, \(\beta\) and \(\gamma\)

The function ode() (for ordinary differential equations) from the {deSolve} R package makes solving the system of equations easy, and to find the optimal values for the parameters we wish to estimate, we can just use the optim() function built into base R.

Specifically, what we need to do is minimise the sum of the squared differences between \(I(t)\), which is the number of people in the infectious compartment \(I\) at time \(t\), and the corresponding number of cases as predicted by our model \(\hat{I}(t)\). This quantity is known as the residual sum of squares (RSS):

\[RSS(\beta, \gamma) = \sum_t \big(I(t) - \hat{I}(t) \big)^2\]

In order to fit a model to the incidence data for Belgium, we need a value N for the initial uninfected population. The population of Belgium in November 2019 was 11,515,793 people, according to Wikipedia. We will thus use N = 11515793 as the initial uninfected population.

Next, we need to create a vector with the daily cumulative incidence for Belgium, from February 4 (when our daily incidence data starts), through to March 30 (last available date at the time of publication of this article). We will then compare the predicted incidence from the SIR model fitted to these data with the actual incidence since February 4. We also need to initialise the values for N, S, I and R. Note that the daily cumulative incidence for Belgium is extracted from the {coronavirus} R package developed by Rami Krispin.

# devtools::install_github("RamiKrispin/coronavirus")
library(coronavirus)
data(coronavirus)

`%>%` <- magrittr::`%>%`

# extract the cumulative incidence
df <- coronavirus %>%
  dplyr::filter(Country.Region == "Belgium") %>%
  dplyr::group_by(date, type) %>%
  dplyr::summarise(total = sum(cases, na.rm = TRUE)) %>%
  tidyr::pivot_wider(
    names_from = type,
    values_from = total
  ) %>%
  dplyr::arrange(date) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(active = confirmed - death - recovered) %>%
  dplyr::mutate(
    confirmed_cum = cumsum(confirmed),
    death_cum = cumsum(death),
    recovered_cum = cumsum(recovered),
    active_cum = cumsum(active)
  )

# put the daily cumulative incidence numbers for Belgium from
# Feb 4 to March 30 into a vector called Infected
library(lubridate)
Infected <- subset(df, date >= ymd("2020-02-04"))$confirmed_cum

# Create an incrementing Day vector the same length as our
# cases vector
Day <- 1:(length(Infected))

# now specify initial values for N, S, I and R
N <- 11515793
init <- c(
  S = N - Infected[1],
  I = Infected[1],
  R = 0
)

Then we need to define a function to calculate the RSS, given a set of values for \(\beta\) and \(\gamma\).

# define a function to calculate the residual sum of squares
# (RSS), passing in parameters beta and gamma that are to be
# optimised for the best fit to the incidence data
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[, 3]
  sum((Infected - fit)^2)
}

Finally, we can fit the SIR model to our data by finding the values for \(\beta\) and \(\gamma\) that minimise the residual sum of squares between the observed cumulative incidence (observed in Belgium) and the predicted cumulative incidence (predicted by our model). We also need to check that our model has converged, as indicated by the message shown below:

# now find the values of beta and gamma that give the
# smallest RSS, which represents the best fit to the data.
# Start with values of 0.5 for each, and constrain them to
# the interval 0 to 1.0

# install.packages("deSolve")
library(deSolve)

Opt <- optim(c(0.5, 0.5),
  RSS,
  method = "L-BFGS-B",
  lower = c(0, 0),
  upper = c(1, 1)
)

# check for convergence
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Convergence is confirmed. Now we can examine the fitted values for \(\beta\) and \(\gamma\):

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par
##      beta     gamma 
## 0.5986056 0.4271395

Remember that \(\beta\) controls the transition between S and I (i.e., susceptible and infectious) and \(\gamma\) controls the transition between I and R (i.e., infectious and recovered). However, those values do not mean a lot on their own; we use them to get the fitted numbers of people in each compartment of our SIR model for the dates up to March 30 that were used to fit the model, and to compare those fitted values with the observed (real) data.

sir_start_date <- "2020-02-04"

# time in days for predictions
t <- 1:as.integer(today() - ymd(sir_start_date))

# get the fitted values from our SIR model
fitted_cumulative_incidence <- data.frame(ode(
  y = init, times = t,
  func = SIR, parms = Opt_par
))

# add a Date column and the observed incidence data
library(dplyr)
fitted_cumulative_incidence <- fitted_cumulative_incidence %>%
  mutate(
    Date = ymd(sir_start_date) + days(t - 1),
    Country = "Belgium",
    cumulative_incident_cases = Infected
  )

# plot the data
library(ggplot2)
fitted_cumulative_incidence %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = I), colour = "red") +
  geom_point(aes(y = cumulative_incident_cases), colour = "blue") +
  labs(
    y = "Cumulative incidence",
    title = "COVID-19 fitted vs observed cumulative incidence, Belgium",
    subtitle = "(Red = fitted incidence from SIR model, blue = observed incidence)"
  ) +
  theme_minimal()

From the above graph we see that the number of observed confirmed cases follows, unfortunately, the number of confirmed cases expected by our model. The fact that both trends are overlapping indicates that the pandemic is clearly in an exponential phase in Belgium. More data would be needed to see whether this trend is confirmed in the long term.

The following graph is similar to the previous one, except that the y-axis is measured on a log scale. This kind of plot is called a semi-log plot or, more precisely, a log-linear plot, because only the y-axis is transformed with a logarithmic scale. Transforming the scale to log has the advantage that the difference between the observed and expected numbers of confirmed cases is more easily readable, and it also shows how the number of observed confirmed cases differs from an exponential trend.

fitted_cumulative_incidence %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = I), colour = "red") +
  geom_point(aes(y = cumulative_incident_cases), colour = "blue") +
  labs(
    y = "Cumulative incidence (log scale)",
    title = "COVID-19 fitted vs observed cumulative incidence, Belgium",
    subtitle = "(Red = fitted incidence from SIR model, blue = observed incidence)"
  ) +
  theme_minimal() +
  scale_y_log10()

The plot indicates that, at the beginning of the pandemic and until March 12, the number of confirmed cases stayed below what would be expected in an exponential phase. In particular, the number of confirmed cases stayed constant at 1 case from February 4 to February 29. From March 13 and until March 30, the number of confirmed cases kept increasing at a rate close to an exponential rate.

We also notice a small jump between March 12 and March 13, which may potentially indicate an error in the data collection, or a change in the testing/screening methods.

Reproduction number \(R_0\)

Our SIR model looks like a good fit to the observed cumulative incidence data in Belgium, so we can now use our fitted model to calculate the basic reproduction number \(R_0\), also referred to as the basic reproduction ratio.

The reproduction number gives the average number of susceptible people who are infected by each infectious person. In other words, it is the number of healthy people that each sick person infects, on average. When \(R_0 > 1\) the disease starts spreading in a population, but not if \(R_0 < 1\). Usually, the larger the value of \(R_0\), the harder it is to control the epidemic.

Formally, we have:

\[R_0 = \frac{\beta}{\gamma}\]

We can compute it in R:

Opt_par
##      beta     gamma 
## 0.5986056 0.4271395

R0 <- as.numeric(Opt_par[1] / Opt_par[2])
R0
## [1] 1.401429

An \(R_0\) of 1.4 is slightly below the values calculated by others for COVID-19 and the \(R_0\) for SARS and MERS, which are similar diseases also caused by coronavirus. This is mainly due to the fact that the number of confirmed cases stayed constant and equal to 1 at the beginning of the pandemic.

An \(R_0\) of 1.4 means that, on average in Belgium, each infected person infects 1.4 other people.

Using our model to analyze the outbreak if there was no intervention

It is instructive to use our model fitted to the first 56 days of available data on confirmed cases in Belgium, to see what would happen if the outbreak were left to run its course, without public health intervention.

# time in days for predictions
t <- 1:120

# get the fitted values from our SIR model
fitted_cumulative_incidence <- data.frame(ode(
  y = init, times = t,
  func = SIR, parms = Opt_par
))

# add a Date column and join the observed incidence data
fitted_cumulative_incidence <- fitted_cumulative_incidence %>%
  mutate(
    Date = ymd(sir_start_date) + days(t - 1),
    Country = "Belgium",
    cumulative_incident_cases = I
  )

# plot the data
fitted_cumulative_incidence %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = I), colour = "red") +
  geom_line(aes(y = S), colour = "black") +
  geom_line(aes(y = R), colour = "green") +
  geom_point(aes(y = c(Infected, rep(NA, length(t) - length(Infected)))),
    colour = "blue"
  ) +
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Persons", title = "COVID-19 fitted vs observed cumulative incidence, Belgium") +
  scale_colour_manual(name = "", values = c(
    red = "red", black = "black",
    green = "green", blue = "blue"
  ), labels = c(
    "Susceptible",
    "Recovered", "Observed incidence", "Infectious"
  )) +
  theme_minimal()

The same graph in log scale for the y-axis and with a legend for better readability:

# plot the data
fitted_cumulative_incidence %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = I, colour = "red")) +
  geom_line(aes(y = S, colour = "black")) +
  geom_line(aes(y = R, colour = "green")) +
  geom_point(aes(
    y = c(Infected, rep(NA, length(t) - length(Infected))),
    colour = "blue"
  )) +
  scale_y_log10(labels = scales::comma) +
  labs(
    y = "Persons (log scale)",
    title = "COVID-19 fitted vs observed cumulative incidence, Belgium"
  ) +
  scale_colour_manual(
    name = "",
    values = c(red = "red", black = "black", green = "green", blue = "blue"),
    labels = c("Susceptible", "Observed incidence", "Recovered", "Infectious")
  ) +
  theme_minimal()

More summary statistics

Other interesting statistics can be computed from the fit of our model. For example:

  • the date of the peak of the pandemic
  • the number of severe cases
  • the number of people in need of intensive care
  • the number of deaths
fit <- fitted_cumulative_incidence

# peak of pandemic
fit[fit$I == max(fit$I), c("Date", "I")]
##          Date        I
## 87 2020-04-30 525315.7

# severe cases
max_infected <- max(fit$I)
max_infected / 5
## [1] 105063.1

# cases with need for intensive care
max_infected * 0.06
## [1] 31518.94

# deaths with supposed 0.7% fatality rate
max_infected * 0.007
## [1] 3677.21

Given these predictions, with the exact same settings and no intervention at all to limit the spread of the pandemic, the peak in Belgium is expected to be reached by the end of April. About 525,000 people would be infected by then, which translates to about 105,000 severe cases, about 32,000 persons in need of intensive care and up to 3,700 deaths (assuming a 0.7% fatality rate, as suggested by this source).

At this point, we understand why such strict containment measures and regulations are taken in Belgium!

Note that those predictions should be taken with a lot of caution. On the one hand, as mentioned above, they are based on rather unrealistic assumptions (for example, no public health interventions, fixed reproduction number \(R_0\), etc.). More advanced projections are possible with the {projections} package, among others (see this section for more information on this matter). On the other hand, we still have to be careful and strictly follow public health interventions because previous pandemics such as H1N1 and Spanish flu have shown that incredibly high numbers are not impossible!

The purpose of this article was to give an illustration of how such analyses are done in R with a simple epidemiological model. Those are the numbers our simple model produces and we hope they are wrong.

Additional considerations

As previously mentioned, the SIR model and the analyses done above are rather simplistic and may not give a true representation of reality. In the following sections, we highlight five improvements that could be made to enhance these analyses and lead to a better overview of the spread of the Coronavirus in Belgium.

Ascertainment rates

In the previous analyses and graphs, it is assumed that the number of confirmed cases represents all the cases that are infectious. This is far from reality, as only a proportion of all cases are screened, detected and counted in the official figures. This proportion is known as the ascertainment rate.

The ascertainment rate is likely to vary during the course of an outbreak, in particular if testing and screening efforts are increased, or if detection methods are changed. Such changing ascertainment rates can be easily incorporated into the model by using a weighting function for the incidence cases.

In his first article, Tim Churches demonstrates that a fixed ascertainment rate of 20% makes little difference to the modelled outbreak with no intervention, except that it all happens a bit more quickly.
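As a rough sketch of how such an adjustment could look (this is not Tim Churches' code), one could inflate the observed cumulative counts by an assumed fixed ascertainment rate and refit the SIR model from the previous sections; the 20% rate and the simple rescaling below are assumptions made purely for illustration.

# sketch only: refit the SIR model assuming a fixed 20% ascertainment rate
# (SIR, Infected, Day, N and the use of ode()/optim() come from the sections
# above; the rate itself is an assumption for illustration)
ascertainment_rate <- 0.20
Infected_adj <- Infected / ascertainment_rate
init_adj <- c(S = N - Infected_adj[1], I = Infected_adj[1], R = 0)

RSS_adj <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init_adj, times = Day, func = SIR, parms = parameters)
  sum((Infected_adj - out[, 3])^2)
}

Opt_adj <- optim(c(0.5, 0.5), RSS_adj, method = "L-BFGS-B",
                 lower = c(0, 0), upper = c(1, 1))
setNames(Opt_adj$par, c("beta", "gamma"))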

More sophisticated models

More sophisticated models could also be used to better reflect real-life transmission processes. For instance, another classical model of disease outbreaks is the SEIR model. This extended model is similar to the SIR model, where S stands for Susceptible and R stands for Recovered, but the infected people are divided into two compartments:

  1. E for the Exposed/infected but asymptomatic
  2. I for the Infected and symptomatic

These models belong to the continuous-time dynamic models that assume fixed transition rates. There are other stochastic models that allow for varying transition rates depending on attributes of individuals, social networking, etc.
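For illustration only (this is not part of the original analysis), the SEIR system can be written in the same deSolve style as the SIR() function defined earlier, with an extra parameter sigma giving the rate at which exposed people become infectious.

# sketch of an SEIR system in the same style as the SIR() function above;
# beta and gamma keep their earlier meanings, sigma is the rate at which
# exposed people become infectious (an illustration, not the original code)
SEIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta * I * S / N
    dE <- beta * I * S / N - sigma * E
    dI <- sigma * E - gamma * I
    dR <- gamma * I
    list(c(dS, dE, dI, dR))
  })
}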

Modelling the epidemic trajectory using log-linear models

As noted above, the initial exponential phase of an outbreak, when shown in a log-linear plot (the y-axis on a log scale and the x-axis without transformation), appears (somewhat) linear. This suggests that we can model epidemic growth, and decay, using a simple log-linear model of the form:

\[\log(y) = rt + b\]

where y is the incidence, r is the growth rate, t is the number of days since a specific point in time (typically the start of the outbreak), and b is the intercept. In this context, two log-linear models, one for the growth phase before the peak and one for the decay phase after the peak, are fitted to the epidemic (incidence cases) curve.

The doubling and halving times which you very often hear about in the news can be estimated from these log-linear models. Furthermore, these log-linear models can also be used on the epidemic trajectory to estimate the reproduction number \(R_0\) in the growth and decay phases of the epidemic.
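As a minimal sketch of this idea in base R (the {incidence} package mentioned below does this properly), the growth-phase model can be fitted with lm() and the doubling time obtained as log(2)/r; the choice of the growth window below is an assumption made purely for illustration.

# sketch only: log-linear fit to the growth phase and implied doubling time
# (df and ymd() come from the code above; the March 1 cut-off is an assumption)
growth <- subset(df, date >= ymd("2020-03-01") & confirmed > 0)
t_days <- as.numeric(growth$date - min(growth$date))
log_lin <- lm(log(growth$confirmed) ~ t_days)
r <- coef(log_lin)[["t_days"]]   # estimated daily growth rate
log(2) / r                       # estimated doubling time, in days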

The {incidence} package in R, part of the R Epidemics Consortium (RECON) suite of packages for epidemic modelling and control, makes fitting this kind of model very convenient.

Estimating changes in the effective reproduction number \(R_e\)

In our model, we set a reproduction number \(R_0\) and kept it constant. It would nonetheless be useful to estimate the current effective reproduction number \(R_e\) on a day-by-day basis so as to track the effectiveness of public health interventions, and possibly predict when an incidence curve will start to decrease.

The {EpiEstim} package in R can be used to estimate \(R_e\), and it allows one to take into consideration human travel from other geographical regions in addition to local transmission (Cori et al. 2013; Thompson et al. 2019).
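A minimal sketch of what such an estimation might look like is given below; it is not a validated analysis, and the serial-interval mean and standard deviation are placeholder values rather than estimates taken from the literature.

# sketch only: time-varying R_e from daily incident cases with {EpiEstim}
# install.packages("EpiEstim")
library(EpiEstim)

incid <- subset(df, date >= ymd("2020-02-04"))$confirmed
incid[incid < 0] <- 0   # guard against negative corrections in the raw counts

res_Re <- estimate_R(incid,
                     method = "parametric_si",
                     config = make_config(list(mean_si = 5, std_si = 3)))
plot(res_Re)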

More sophisticated projections

In addition to naïve predictions based on a simple SIR model, more advanced and complex projections are also possible, notably with the {projections} package. This package uses data on daily incidence, the serial interval and the reproduction number to simulate plausible epidemic trajectories and project future incidence.

Conclusion

This article started with (i) a description of a couple of R resources on the Coronavirus pandemic (i.e., a collection and a dashboard) that can be used as background materials and (ii) the motivations behind this article. We then detailed the most common epidemiological model, i.e. the SIR model, before actually applying it to Belgian incidence data.

This resulted in a visual comparison of the fitted and observed cumulative incidence in Belgium. It showed that the COVID-19 pandemic is clearly in an exponential phase in Belgium in terms of number of confirmed cases.

We then explained what the reproduction number is and how to compute it in R. Finally, our model was used to analyze the outbreak of the Coronavirus if there were no public health intervention at all.

Under this (probably too) simplistic scenario, the peak of COVID-19 in Belgium is expected to be reached by the end of April 2020, with around 525,000 infected people and about 3,700 deaths. These very alarmist naïve predictions highlight the importance of the restrictive public health actions taken by governments, and the urgency for citizens to follow these health measures in order to mitigate the spread of the virus in Belgium (or at least slow it enough to allow health care systems to cope with it).

We concluded this article by describing five improvements that could be implemented to further analyze the disease outbreak.

Thanks for reading. I hope this article gave you a good understanding of the spread of the COVID-19 Coronavirus in Belgium. Feel free to use this article as a starting point for analyzing the outbreak of this disease in your own country. See also a collection of top R resources on Coronavirus to gain even further knowledge.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

Related articles:

References

Cori, Anne, Neil M Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. "A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics." American Journal of Epidemiology 178 (9). Oxford University Press: 1505–12.

Thompson, RN, JE Stockwin, RD van Gaalen, JA Polonsky, ZN Kamvar, PA Demarsh, E Dahlqwist, et al. 2019. "Improved Inference of Time-Varying Reproduction Numbers During Infectious Disease Outbreaks." Epidemics 29. Elsevier: 100356.


  1. Feel free to let me know in the comments or by contacting me if you performed some analyses specifically for Belgium and which I could include in my article covering the top R resources on the Coronavirus.↩

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Can unbalanced randomization improve power?

Posted: 30 Mar 2020 05:00 PM PDT

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Of course, we're all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn't contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.

I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes.

Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to try to get possible treatments to as many people as possible, even in a randomized control trial. One question that came up in the design of this study was whether there would be efficiency gains by using a 1:2 randomization scheme? That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only lose efficiency if we use anything other than a 1:1 randomization.

Brief public service announcement: simstudy update

When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the simstudy package, which supports much of the simulations on this blog, could not readily handle anything other than 1:1 randomization. I had to quickly rectify that shortcoming. There is a new argument ratio in the trtAssign function where you can now specify any scheme for any number of treatment arms. This is available in version 1.16, which for the moment can be found only on github (kgoldfeld/simstudy).

Here is an example based on a 1:2:3 allocation. I'm not sure if that would ever be appropriate, but it shows the flexibility of the new argument. One counter-intuitive aspect of this implementation is that the balanced argument is set to TRUE, indicating that the allocation to the groups will be perfect, or as close as possible to the specified ratios. If balanced is FALSE, the ratios are used as relative probabilities instead.

library(simstudy)
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(16)

dx <- genData(600)
dx <- trtAssign(dx, nTrt = 3, balanced = TRUE,
                ratio = c(1, 2, 3), grpName = "rx")

dx[, table(rx)]
## rx
##   1   2   3 
## 100 200 300

Unbalanced designs with a binary outcome

The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events has to occur within 30 days to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed in the usual care arm 40% of the patients would have a bad outcome and that the drug treatment would reduce the bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome).

If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters.

estCoef <- function(n, formula, ratio) {
  
  def <- defDataAdd(varname = "y", formula = formula, dist = "binary")
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
  dx <- addColumns(def, dx)
  
  coef(summary(glm(y ~ rx, family = binomial, data = dx)))
}

estCoef(n = 244*2, formula = "0.4 - 0.3 * 0.4 * rx", ratio = c(1, 1))
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -0.4328641  0.1310474 -3.303111 0.0009561867
## rx          -0.4577476  0.1924533 -2.378487 0.0173838304

The probabilities of a bad outcome for the usual care group and drug treatment group are

c(usual = 1/(1 + exp(0.433)), drug = 1/(1+exp(0.433 + 0.458))) 
##     usual      drug 
## 0.3934102 0.2909035

Assessing power

In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05.

Here is the analytic solution for a 1:1 ratio.

power.prop.test(p1 = .4, p2 = .7*.4, power = .80)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 243.4411
##              p1 = 0.4
##              p2 = 0.28
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The sample size estimate based on 80% power suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated \(n\) in a simulation for power (using 1000 datasets), we should be close to 80%:

est.power <- function(n, ratio, p1, reduction) {
  
  formula <- paste(p1, "* (1 -", reduction, "* rx)")
  p.val <- estCoef(n, formula, ratio)["rx", 4]
  return(p.val)
  
}

pvals <- unlist(mclapply(1:1000,
  function(x) est.power(488, c(1, 1), 0.4, 0.3)))

mean(pvals < 0.05)
## [1] 0.814

The power experiment

Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Here is the code that will allow us to explore these questions:

library(data.table)   # for data.table() and rbindlist() below

res <- list()

for (p1 in c(.2, .3, .4, .5)) {
  for (r in c(.2, .25, .3)) {
    
    p2 <- (1 - r) * p1
    n <- ceiling(power.prop.test(p1 = p1, p2 = p2, power = .80)$n) * 2
    
    for (i in c(1:5)) {
      
      pvals <- mclapply(1:1000, function(x) est.power(n, c(1, i), p1, r))
      pvals <- unlist(pvals)
      
      dres <- data.table(n, control_p = p1, pct_reduction = r,
                         control = 1, rx = i, power = mean(pvals < .05))
      
      res <- append(res, list(dres))
      
    }
  }
}

res <- rbindlist(res)

Repeating the power simulation for a variety of assumptions indicates that, at least in the case of a binary outcome, using an unbalanced design does not improve the quality of the research even though it might get more patients the drug treatment:

library(ggplot2)   # for the plot below

ggplot(data = res, aes(x = rx, y = power)) +
  geom_line(color = "blue") +
  facet_grid(control_p ~ pct_reduction, labeller = label_both) +
  theme(panel.grid = element_blank()) +
  scale_x_continuous(name = "ratio of treatment to control",
                     breaks = c(1:5), labels = paste0(c(1:5), ":1")) +
  scale_y_continuous(limits = c(.5, .9), breaks = c(.6, .7, .8))

Continuous outcomes

In the case of binary outcomes, reducing sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I'll show some simulations that indicate this might be the case.
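As a quick base-R sketch of that intuition (these are not the simulations planned for the next post, and the effect size and standard deviations are made-up values), we can compare the power of a Welch t-test under 1:1 and 1:2 allocation when the treatment arm is three times as variable as the control arm:

# sketch only: power of a Welch t-test under different allocation ratios when
# the treatment arm is more variable (delta, sd0 and sd1 are made-up values)
sim_power <- function(n_total, ratio, delta = 0.5, sd0 = 1, sd1 = 3) {
  n0 <- round(n_total / (1 + ratio))   # control arm size
  n1 <- n_total - n0                   # treatment arm size
  pvals <- replicate(1000, {
    y0 <- rnorm(n0, 0, sd0)
    y1 <- rnorm(n1, delta, sd1)
    t.test(y1, y0)$p.value
  })
  mean(pvals < 0.05)
}

set.seed(271)
c(balanced = sim_power(300, ratio = 1), unbalanced = sim_power(300, ratio = 2))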

Stay well.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Predicting 1,000,000 of COVID-19 done with R before March 18 – call for help

Posted: 30 Mar 2020 10:48 AM PDT

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Due to the unprecedented technological progress of recent years, we are able to monitor this pandemic better than any pandemic before it. We will argue, however, that R was instrumental in predicting when the 1,000,000th case of COVID-19 would occur, as demonstrated in our collaboration spread across three continents:

https://www.sciencedirect.com/science/article/pii/S2590113320300079

Since India is currently in lockdown and the paper's correction process is being handled in India, that process has not been concluded as of this writing.

The first draft of our paper was prepared on March 18 and can be accessed here: http://www.cs.laurentian.ca/wkoczkodaj/p/?C=M;O=D
Open this link and click twice on "last modified" to see the date (the computing was done a few days earlier).
The heuristic we developed for the prediction could not have been implemented so quickly had it not been for our use of R. The function nls() is crucial for modelling only the front (incline) part of the Gaussian curve. Should this pandemic not stop, or at the very least slow down, one billion cases could occur by the end of May 2020.

The entire world waits for the inflection point (https://en.wikipedia.org/wiki/Inflection_point) and if you help us, we may be able to reach this point sooner.

A few crucial R commands are: 
# fit an exponential model to the front incline of the cumulative case curve
modE <- nls(dfcov$all ~ a * exp(b * dfcov$report), data = dfcov,
            start = list(a = 100, b = 0.1))
a <- summary(modE)$parameters[1]
b <- summary(modE)$parameters[2]
summary(modE)

# project m days beyond the last reported day
x <- 1:m + dfcov$report[length(dfcov$report)]
modEPr <- a * exp(b * x)

Waldemar W. Koczkodaj (email: wkoczkodaj [AT] cs.laurentian.ca)
(for the entire Collaboration)

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Analyzing Remote Sensing Data using Image Segmentation

Posted: 30 Mar 2020 10:45 AM PDT

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post summarizes material posted as an Additional Topic to accompany the book Spatial Data Analysis in Ecology and Agriculture using R, Second Edition. The full post, together with R code and data, can be found in the Additional Topics section of the book's website, http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm

1. Introduction
The idea is best described with images. Here is a display in mapview of an agricultural region in California's Central Valley. The boundary of the region is a rectangle developed from UTM coordinates, and it is interesting that the boundary of the region is slightly tilted with respect to the network of roads and field boundaries. There are several possible explanations for this, but it does not affect our presentation, so we will ignore it.
There are clearly several different cover types in the region, including fallow fields, planted fields, natural regions, and open water. Our objective is to develop a land classification of the region based on Landsat data. A common method of doing this is based on the so-called false color image. In a false color image, radiation in the green wavelengths is displayed as blue, radiation in the red wavelengths is displayed as green, and radiation in the infrared wavelengths is displayed as red. Here is a false color image of the region shown in the black rectangle.  

Because living vegetation strongly reflects radiation in the infrared region and absorbs it in the red region, the amount of vegetation contained in a pixel can be estimated by using the normalized difference vegetation index, or NDVI, defined as

NDVI = (IR - R) / (IR + R),

where IR is the pixel intensity of the infrared band and R is the pixel intensity of the red band. Here is the NDVI of the region in the false color image above.
If you compare the images you can see that the areas in the black rectangle above that appear green have a high NDVI, the areas that appear brown have a low NDVI, and the open water in the southeast corner of the image actually has a negative NDVI, because water strongly absorbs infrared radiation. Our objective is to generate a classification of the land surface into one of five classes: dense vegetation, sparse vegetation, scrub, open land, and open water. Here is the classification generated using the method I am going to describe.

There are a number of image analysis packages available in R. I elected to use the OpenImageR and SuperpixelImageSegmentation packages, both of which were developed by L. Mouselimis (2018, 2019a). One reason for this choice is that the objects created by these packages have a very simple and transparent data structure that makes them ideal for data analysis and, in particular, for integration with the raster package. The packages are both based on the idea of image segmentation using superpixels. According to Stutz et al. (2018), superpixels were introduced by Ren and Malik (2003). Stutz et al. define a superpixel as "a group of pixels that are similar to each other in color and other low level properties." Here is the false color image above divided into superpixels.

The superpixels can then be represented by groups of pixels all having the same pixel value, as shown here. This is what we will call image segmentation.

In the next section we will introduce the concepts of superpixel image segmentation and the application of these packages. In Section 3 we will discuss the use of these concepts in data analysis. The intent of image segmentation as implemented in these and other software packages is to provide support for human visualization. This can cause problems because sometimes a feature that improves the process of visualization detracts from the purpose of data analysis. In Section 4 we will see how the simple structure of the objects created by Mouselimis's two packages can be used to advantage to make data analysis more powerful and flexible. This post is a shortened version of the Additional Topic that I described at the start of the post.

2. Introducing superpixel image segmentation
The discussion in this section is adapted from the R vignette Image Segmentation Based on Superpixels and Clustering (Mouselimis, 2019b). This vignette describes the R packages OpenImageR and SuperpixelImageSegmentation. In this post I will only discuss the OpenImageR package. It uses simple linear iterative clustering (SLIC, Achanta et al., 2010), and a modified version of this algorithm called SLICO (Yassine et al., 2018), and again for brevity I will only discuss the former. To understand how the SLIC algorithm works, and how it relates to our data analysis objective, we need a bit of review about the concept of color space. Humans perceive color as a mixture of the primary colors red, green, and blue (RGB), and thus a "model" of any given color can in principle be described as a vector in a three-dimensional vector space. The simplest such color space is one in which each component of the vector represents the intensity of one of the three primary colors. Landsat data used in land classification conforms to this model. Each band intensity is represented as an integer value between 0 and 255. The reason is that this covers the complete range of values that can be represented by a sequence of eight binary (0,1) values, because 2^8 - 1 = 255. For this reason it is called the eight-bit RGB color model.
There are other color spaces beyond RGB, each representing a transformation for some particular purpose of the basic RGB color space. A common such color space is the CIELAB color space, defined by the International Commission of Illumination (CIE). You can read about this color space here. In brief, quoting from this Wikipedia article, the CIELAB color space "expresses color as three values: L* for the lightness from black (0) to white (100), a* from green (−) to red (+), and b* from blue (−) to yellow (+). CIELAB was designed so that the same amount of numerical change in these values corresponds to roughly the same amount of visually perceived change."
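As a quick illustration of the relationship between the two color spaces (this snippet is not from the original post), base R can convert an RGB triple, scaled to [0,1], into CIELAB coordinates; the particular color below is arbitrary.

# an arbitrary pixel color, as (R, G, B) values scaled to [0,1]
rgb_pixel <- matrix(c(0.2, 0.6, 0.3), nrow = 1)
# convert it to CIELAB (L*, a*, b*) coordinates
grDevices::convertColor(rgb_pixel, from = "sRGB", to = "Lab")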
Each pixel in an image contains two types of information: its color, represented by a vector in a three dimensional space such as the RGB or CIELAB color space, and its location, represented by a vector in two dimensional space (its x and y coordinates). Thus the totality of information in a pixel is a vector in a five dimensional space. The SLIC algorithm performs a clustering operation similar to K-means clustering on the collection of pixels as represented in this five-dimensional space.
The SLIC algorithm measures total distance between pixels as the sum of two components, d_lab, the distance in the CIELAB color space, and d_xy, the distance in pixel (x,y) coordinates. Specifically, the distance D_s is computed as

D_s = d_lab + (m / S) d_xy,

where S is the grid interval of the pixels and m is a compactness parameter that allows one to emphasize or de-emphasize the spatial compactness of the superpixels. The algorithm begins with a set of K cluster centers regularly arranged in the (x,y) grid and performs K-means clustering of these centers until a predefined convergence measure is attained.
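To make the formula concrete, here is a tiny helper (an illustration only, not part of the OpenImageR code) that computes D_s for a pair of pixels from their CIELAB values, their (x,y) positions, the grid interval S, and the compactness parameter m; the numbers in the example call are made up.

# combined SLIC distance: color distance plus weighted spatial distance
slic_distance <- function(lab1, lab2, xy1, xy2, S, m) {
  d_lab <- sqrt(sum((lab1 - lab2)^2))   # distance in CIELAB space
  d_xy  <- sqrt(sum((xy1 - xy2)^2))     # distance in pixel coordinates
  d_lab + (m / S) * d_xy
}

# example with made-up values
slic_distance(lab1 = c(50, 10, -5), lab2 = c(55, 8, -2),
              xy1 = c(10, 12), xy2 = c(14, 18), S = 20, m = 30)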
The following discussion assumes some basic familiarity with the raster package. If you are not familiar with this package, an excellent introduction is given here. After downloading the data files, you can use them to first construct RasterLayer objects and then combine them into a RasterBrick.

> library(raster)
> b2 <- raster("region_blue.grd")
> b3 <- raster("region_green.grd")
> b4 <- raster("region_red.grd")
> b5 <- raster("region_IR.grd")
> region.brick <- brick(b2, b3, b4, b5)

We will need to save the number of rows and columns for present and future use.
> print(nrows <- region.brick@nrows)
[1] 187
> print(ncols <- region.brick@ncols)
[1] 329
Next we plot the region.brick object (shown above) and write it to disk.
> # False color image
> plotRGB(region.brick, r = 4, g = 3, b = 2,
+   stretch = "lin")

> # Write the image to disk
> jpeg("FalseColor.jpg", width = ncols,
+   height = nrows)

> plotRGB(region.brick, r = 4, g = 3, b = 2,
+   stretch = "lin")

> dev.off()

We will now use SLIC for image segmentation. The code here is adapted directly from the superpixels vignette.

> library(OpenImageR)
> False.Color <- readImage("FalseColor.jpg")
> Region.slic = superpixels(input_image =
+   False.Color, method = "slic", superpixel = 80,
+   compactness = 30, return_slic_data = TRUE,
+   return_labels = TRUE, write_slic = "",
+   verbose = FALSE)
Warning message:
The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255!
> imageShow(Region.slic$slic_data)


 
We will discuss the warning message in Section 3. The function imageShow() is part of the OpenImageR package.
The OpenImageR function superpixels() creates superpixels, but it does not actually create an image segmentation in the sense in which we are using the term, as defined in Section 1. This is done by functions in the SuperPixelImageSegmentation package (Mouselimis, 2018), whose discussion is also contained in the superpixels vignette. This package is also described in the Additional Topic. As pointed out there, however, some aspects that make the function well suited for image visualization detract from its usefulness for data analysis, so we won't discuss it further in this post. Instead, we will move directly to data analysis using the OpenImageR package.

3. Data analysis using superpixels
To use the functions in the OpenImageR package for data analysis, we must first understand the structure of objects created by these packages. The simple, open structure of the objects generated by the function superpixels() makes it an easy matter to use these objects for our purposes. Before we begin, however, we must discuss a preliminary matter: the difference between a Raster* object and a raster object. In Section 2 we used the function raster() to generate the blue, green, red and IR band objects b2, b3, b4, and b5, and then we used the function brick() to generate the object region.brick. Let's look at the classes of these objects.

> class(b3)
[1] "RasterLayer"
attr(,"package")
[1] "raster"
> class(region.brick)
[1] "RasterBrick"
attr(,"package")
[1] "raster"

Objects created by the raster packages are called Raster* objects, with a capital "R". This may seem like a trivial detail, but it is important to keep in mind because the objects generated by the functions in the OpenImageR and SuperpixelImageSegmentation packages are raster objects that are not Raster* objects (note the deliberate lack of a courier font for the word "raster").
In Section 2 we used the OpenImageR function readImage() to read the image data into the object False.Color for analysis. Let's take a look at the structure of this object.

> str(False.Color)
num [1:187, 1:329, 1:3] 0.373 0.416 0.341 0.204 0.165 …

It is a three-dimensional array, which can be thought of as a box, analogous to a rectangular Rubik's cube. The "height" and "width" of the box are the number of rows and columns, respectively, in the image, and at each value of the height and width the "depth" is a vector whose three components are the red, green, and blue color values of that cell in the image, scaled to [0,1] by dividing the 8-bit RGB values, which range from 0 to 255, by 255. This is the source of the warning message that accompanied the output of the function superpixels(), which said that the values will be multiplied by 255. For our purposes, however, it means that we can easily analyze images consisting of transformations of these values, such as the NDVI. Since the NDVI is a ratio, it doesn't matter whether the RGB values are normalized or not. For our case the RasterLayer object b5 of the RasterBrick region.brick is the IR band and the object b4 is the R band. Therefore we can compute the NDVI as

> NDVI.region <- (b5 - b4) / (b5 + b4)

Since NDVI is a ratio scale quantity, the theoretically best practice is to plot it using a monochromatic representation in which the brightness of the color (i.e., the color value) represents the value (Tufte, 1983). We can accomplish this using the RColorBrewer function brewer.pal().

> library(RColorBrewer)
> plot(NDVI.region, col = brewer.pal(9, "Greens"),
+   axes = TRUE, main = "Region NDVI")
 

This generates the NDVI map shown above. The darkest areas have the highest NDVI. Let's take a look at the structure of the RasterLayer object NDVI.region.

> str(NDVI.region)
Formal class 'RasterLayer' [package "raster"] with 12 slots
              *                 *                   *
..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
              *                 *                   *
.. .. ..@ values    : num [1:61523] 0.1214 0.1138 0.1043 0.0973 0.0883 …
              *                 *                   *
..@ ncols   : int 329
..@ nrows   : int 187

Only the relevant parts are shown, but we see that we can convert the raster data to a matrix that can be imported into our image segmentation machinery as follows. Remember that by default R constructs matrices by columns. The data in Raster* objects such as NDVI.region are stored by rows, so in converting these data to a matrix we must specify byrow = TRUE.

> NDVI.mat <- matrix(NDVI.region@data@values,
+   nrow = NDVI.region@nrows,
+   ncol = NDVI.region@ncols, byrow = TRUE)

The function imageShow() works with data that are either in the eight bit 0 – 255 range or in the [0,1] range (i.e., the range of x between and including 0 and 1). It does not, however, work with NDVI values if these values are negative. Therefore, we will scale NDVI values to [0,1].

> m0 <- min(NDVI.mat)
> m1 <- max(NDVI.mat)
> NDVI.mat1 <- (NDVI.mat - m0) / (m1 - m0)
> imageShow(NDVI.mat1)

 Here is the resulting image.

The function imageShow() deals with a single layer object by producing a grayscale image. Unlike the green NDVI plot above, in this plot the lower NDVI regions are darker.
We are now ready to carry out the segmentation of the NDVI data. Since the structure of the NDVI image to be analyzed is the same as that of the false color image, we can simply create a copy of this image, fill it with the NDVI data, and run it through the superpixel image segmentation function. If an image has not been created on disk, it is also possible (see the Additional Topic) to create the input object directly from the data.

> NDVI.data <- False.Color
> NDVI.data[,,1] <- NDVI.mat1
> NDVI.data[,,2] <- NDVI.mat1
> NDVI.data[,,3] <- NDVI.mat1

In the next section we will describe how to create an image segmentation of the NDVI data and how to use cluster analysis to create a land classification.

4. Expanding the data analysis capacity of superpixels
Here is an application of the function superpixels() to the NDVI data generated in Section 3.

> NDVI.80 = superpixels(input_image = NDVI.data,
+   method = "slic", superpixel = 80,
+   compactness = 30, return_slic_data = TRUE,
+   return_labels = TRUE, write_slic = "",
+   verbose = FALSE)
> imageShow(NDVI.80$slic_data)

Here is the result.
Here is the structure of the object NDVI.80.

> str(NDVI.80)
List of 2
$ slic_data: num [1:187, 1:329, 1:3] 95 106 87 52 42 50 63 79 71 57 …
$ labels   : num [1:187, 1:329] 0 0 0 0 0 0 0 0 0 0 …

It is a list with two elements, NDVI.80$slic_data, a three dimensional array of pixel color data (not normalized), and NDVI.80$labels, a matrix whose elements correspond to the pixels of the image. The second element's name hints that it may contain values identifying the superpixel to which each pixel belongs. Let's see if this is true.

> sort(unique(as.vector(NDVI.80$labels)))
[1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[25] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[49] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

There are 72 unique labels. Although the call to superpixels() specified 80 superpixels, the function generated 72. We can see which pixels have the label value 0 by setting the values of all of the other pixels to [255, 255, 255], which will plot as white.

> R0 <- NDVI.80
> for (i in 1:nrow(R0$label))
+    for (j in 1:ncol(R0$label))
+       if (R0$label[i,j] != 0)
+          R0$slic_data[i,j,] <- c(255,255,255)
> imageShow(R0$slic_data)


This produces the plot shown here.
The operation isolates the superpixel in the upper right corner of the image, together with the corresponding portion of the boundary. We can easily use this approach to figure out which value of NDVI.80$label corresponds to which superpixel. Now let's deal with the boundary. A little exploration of the NDVI.80 object suggests that pixels on the boundary have all three components equal to zero. Let's isolate and plot all such pixels by coloring all other pixels white.

> Bdry <- NDVI.80
> for (i in 1:nrow(Bdry$label))
+    for (j in 1:ncol(Bdry$label))
+       if (!(Bdry$slic_data[i,j,1] == 0 &
+          Bdry$slic_data[i,j,2] == 0 &
+          Bdry$slic_data[i,j,3] == 0))
+          Bdry$slic_data[i,j,] <- c(255,255,255)
> Bdry.norm <- NormalizeObject(Bdry$slic_data)
> imageShow(Bdry$slic_data)

This shows that we have indeed identified the boundary pixels. Note that the function imageShow() displays these pixels as white with a black edge, rather than pure black.

Having done a preliminary analysis, we can organize our segmentation process into two steps. The first step will be to replace each of the superpixels generated by the OpenImageR function superpixels() with one in which each pixel has the same value, corresponding to a measure of central tendency (e.g., the mean, median, or mode) of the original superpixel. The second step will be to use the K-means unsupervised clustering procedure to organize the superpixels from step 1 into a set of clusters and give each cluster a value corresponding to a measure of central tendency of the cluster. I used the code developed to identify the labels and to generate the boundary to construct a function make.segments() to carry out the segmentation. The first argument of make.segments() is the superpixels object, and the second is the functional measurement of central tendency. Although in this case each of the three colors of the object NDVI.80 has the same values, this may not be true for every application, so the function analyzes each color separately. Because the function is rather long, I have not included it in this post. If you want to see it, you can go to the Additional Topic linked to the book's website; a rough sketch of the idea behind the first step is shown below.
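For readers who just want a feel for that first step, here is a rough sketch (it is not the author's make.segments() function, and unlike that function it does not treat the boundary pixels separately) that replaces each color channel, within each superpixel label, by a chosen measure of central tendency.

# rough sketch only -- not the author's make.segments()
segment_sketch <- function(sp, fun = mean) {
  out <- sp
  for (k in 1:dim(sp$slic_data)[3]) {
    channel <- sp$slic_data[, , k]
    central <- tapply(channel, sp$labels, fun)          # one value per label
    out$slic_data[, , k] <- central[as.character(sp$labels)]
  }
  out
}
# e.g. imageShow(segment_sketch(NDVI.80, mean)$slic_data)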

Here is the application of the actual make.segments() function to the object NDVI.80, with the second argument set to mean.

> NDVI.means <- make.segments(NDVI.80, mean)
> imageShow(NDVI.means$slic_data)

Here is the result.

The next step is to develop clusters that represent identifiable land cover types. In a real project the procedure would be to collect a set of ground truth data from the site, but that option is not available to us. Instead we will work with the true color rendition of the Landsat scene, shown here.
The land cover is subdivided using K-means into five types: dense crop, medium crop, scrub, open, and water.

> set.seed(123)
> NDVI.clus <-
+   kmeans(as.vector(NDVI.means$slic_data[,,1]), 5)
> vege.class <- matrix(NDVI.clus$cluster,
+    nrow = NDVI.region@nrows,
+    ncol = NDVI.region@ncols, byrow = FALSE)
> class.ras <- raster(vege.class, xmn = W,
+    xmx = E, ymn = S, ymx = N, crs =
+    CRS("+proj=utm +zone=10 +ellps=WGS84"))

Next I used the raster function ratify() to assign descriptive factor levels to the clusters.

> class.ras <- ratify(class.ras)
> rat.class <- levels(class.ras)[[1]]
> rat.class$landcover <- c("Water", "Open",
+   "Scrub", "Med. Crop", "Dense Crop")
> levels(class.ras) <- rat.class
> library(rasterVis)  # provides levelplot() for Raster* objects
> levelplot(class.ras, margin = FALSE,
+   col.regions = c("blue", "tan",
+   "lightgreen", "green", "darkgreen"),
+   main = "Land Cover Types")

The result is shown in the figure at the start of the post. We can also overlay the original superpixel boundaries on top of the image. This is more easily done with plot() than with levelplot(), because plot() allows a figure to be built up in a series of statements, while levelplot() does not.

> NDVI.rasmns <- raster(NDVI.means$slic_data[,,1],
+    xmn = W, xmx = E, ymn = S, ymx = N,
+    crs = CRS("+proj=utm +zone=10 +ellps=WGS84"))
> NDVI.polymns <- rasterToPolygons(NDVI.rasmns,
+    dissolve = TRUE)
> plot(class.ras, col = c("blue", "tan",
+    "lightgreen", "green", "darkgreen"),
+    main = "Land Cover Types", legend = FALSE)
> legend("bottom", legend = c("Water", "Open",
+    "Scrub", "Med. Crop", "Dense Crop"),
+    fill = c("blue", "tan", "lightgreen",
+    "green", "darkgreen"))
> plot(NDVI.polymns, add = TRUE)

The application of image segmentation algorithms to remotely sensed image classification is a rapidly growing field, with numerous studies appearing every year. At this point, however, there is little in the way of theory on which to base an organization of the topic. If you are interested in following up on the subject, I encourage you to explore it on the Internet.

References
Achanta, R., A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk (2010). SLIC Superpixels. École Polytechnique Fédérale de Lausanne Technical Report 149300.

Appelhans, T., F. Detsch, C. Reudenbach and S. Woellauer (2017). mapview: Interactive Viewing of Spatial Data in R. R package version 2.2.0. https://CRAN.R-project.org/package=mapview

Bivand, R., E. Pebesma, and V. Gómez-Rubio. (2008). Applied Spatial Data Analysis with R. Springer, New York, NY.

Frey, B.J. and D. Dueck (2006). Mixture modeling by affinity propagation. Advances in Neural Information Processing Systems 18:379.

Hijmans, R. J. (2016). raster: Geographic Data Analysis and Modeling. R package version 2.5-8. https://CRAN.R-project.org/package=raster

Mouselimis, L. (2018). SuperpixelImageSegmentation: Superpixel Image Segmentation. R package version 1.0.0. https://CRAN.R-project.org/package=SuperpixelImageSegmentation

Mouselimis, L. (2019a). OpenImageR: An Image Processing Toolkit. R package version 1.1.5. https://CRAN.R-project.org/package=OpenImageR

Mouselimis, L. (2019b). Image Segmentation Based on Superpixel Images and Clustering. https://cran.r-project.org/web/packages/OpenImageR/vignettes/Image_segmentation_superpixels_clustering.html.

Ren, X., and J. Malik (2003). Learning a classification model for segmentation. International Conference on Computer Vision, 2003, 10-17.

Stutz, D., A. Hermans, and B. Leibe (2018). Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166:1-27.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Conn.

Yassine, B., P. Taylor, and A. Story (2018). Fully automated lung segmentation from chest radiographs using SLICO superpixels. Analog Integrated Circuits and Signal Processing 95:423-428.

Zhou, B. (2013). Image segmentation using SLIC superpixels and affinity propagation clustering. International Journal of Science and Research. https://pdfs.semanticscholar.org/6533/654973054b742e725fd433265700c07b48a2.pdf

 


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

R Tip: How To Look Up Matrix Values Quickly

Posted: 30 Mar 2020 09:41 AM PDT

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.
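As a trivial illustration (ours, not from the original post), a single expression operates on every element of a vector at once:

x <- c(1, 2, 3, 4, 5)
x * 2  # one vectorized multiplication touches every element
## [1]  2  4  6  8 10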

Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.

In R we can specify operations over vectors. For arithmetic this is easy, but for some more complex operations you "need to know the trick."

Patrick Freeman (@PTFreeman) recently asked: what is the idiomatic way to look up a bunch of values from a matrix by row and column keys? This is actually easy to do if we first expand the data matrix into RDF-triples. If our data were in this format, we could merge/join it against our table of desired row and column keys.

Let's start with an example data matrix.

# example matrix data
m <- matrix(1:9, nrow = 3)
row.names(m) <- c('R1', 'R2', 'R3')
colnames(m) <- c('C1', 'C2', 'C3')
knitr::kable(m)
   C1 C2 C3
R1  1  4  7
R2  2  5  8
R3  3  6  9

And our data-frame containing the indices we want to look-up.

# row/columns we want
w <- data.frame(
  i = c('R1', 'R2', 'R2'),
  j = c('C2', 'C3', 'C2'))
knitr::kable(w)
i j
R1 C2
R2 C3
R2 C2

That is: we want to know the matrix values from [R1, C2], [R2, C3], and [R2, C2].

The trick is: how do we convert the matrix into triples? digEmAll has a great solution to that here.

# unpack into 3-column format from:
# https://stackoverflow.com/a/9913601
triples <- data.frame(
  i = rep(row.names(m), ncol(m)),
  j = rep(colnames(m), each = nrow(m)),
  v = as.vector(m))
knitr::kable(triples)
i j v
R1 C1 1
R2 C1 2
R3 C1 3
R1 C2 4
R2 C2 5
R3 C2 6
R1 C3 7
R2 C3 8
R3 C3 9

What the above code does is write each entry of the original matrix as a separate row, with the original row and column ids landed as new columns. This data format is very useful.

The above code is worth saving as a re-usable snippet, as getting it right is a clever step.
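For instance, it could be wrapped in a small helper like the one below; the function name is ours, not from the post, but the body is exactly the snippet above.

# Hypothetical helper: convert a matrix with row and column names
# into a data frame of (i, j, v) triples
matrix_to_triples <- function(m) {
  data.frame(
    i = rep(row.names(m), ncol(m)),
    j = rep(colnames(m), each = nrow(m)),
    v = as.vector(m))
}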

Now we can vectorize our lookup using the merge command, which produces a new joined table where the desired values have been landed as a new column.

res <- merge(w, triples, by = c('i', 'j'), sort = FALSE)
knitr::kable(res)
i j v
R1 C2 4
R2 C3 8
R2 C2 5

And that is it: we have used vectorized and relational concepts to look up many values from a matrix very quickly.
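As an aside (our note, not part of the original post), this particular lookup can also be done in base R by indexing the matrix with a two-column character matrix of row and column names; the triples/merge approach shown above remains the more general relational tool.

# Direct base-R lookup by (row name, column name) pairs; returns 4 8 5
m[as.matrix(w)]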

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now
