[R-bloggers] foreach 1.5.0 now available on CRAN (and 11 more aRticles)
- foreach 1.5.0 now available on CRAN
- Blogging A to Z: The A to Z of tidyverse
- Visualizing decision tree partition and decision boundaries
- Updates to R GUIs: BlueSky, jamovi, JASP, & RKWard
- Contagiousness of COVID-19 Part I: Improvements of Mathematical Fitting (Guest Post)
- What is a dgCMatrix object made of? (sparse matrix format in R)
- Close Encounters of the R Kind
- COVID-19 in Belgium
- Can unbalanced randomization improve power?
- Predicting 1,000,000 of COVID-19 done with R before March 18 – call for help
- Analyzing Remote Sensing Data using Image Segmentation
- R Tip: How To Look Up Matrix Values Quickly
foreach 1.5.0 now available on CRAN
Posted: 31 Mar 2020 09:32 AM PDT [This article was first published on Revolutions, and kindly contributed to R-bloggers].
This post is to announce that version 1.5.0 of the foreach package is now on CRAN. Foreach is an idiom that allows for iterating over elements in a collection, without the use of an explicit loop counter. The foreach package is now more than 10 years old, and is used by nearly 700 packages across CRAN and Bioconductor. (If you're interested in an overview of the foreach package and its history, this RStudio::conf talk by Bryan Lewis is worth watching.) The main change in 1.5.0 is intended to help reduce errors associated with modifying global variables: the loop body is now evaluated in a local environment, so an assignment made inside the loop does not modify the corresponding global variable. Because of this, it's almost always a mistake to try to modify global variables from inside a foreach loop; results should instead be returned as the value of the loop. Note that the behaviour of parallel backends is unchanged in this respect: they have never been able to reliably modify global variables. If you have any questions or comments, please email me or open an issue at the GitHub repo.
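For readers new to foreach, here is a minimal sketch (my own example, not taken from the announcement) of the idiomatic pattern: each iteration returns a value and the results are combined with .combine, rather than relying on side effects to a global variable.

library(foreach)

# Return a value from each iteration and let foreach combine the results.
squares <- foreach(i = 1:5, .combine = c) %do% {
  i^2
}
squares
# [1]  1  4  9 16 25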
Blogging A to Z: The A to Z of tidyverse
Posted: 31 Mar 2020 08:06 AM PDT [This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
Announcing my theme for this year's blogging A to Z! The tidyverse is a set of R packages for data science. The big thing about the tidyverse is making sure your data are tidy. What does that mean? In short, each variable forms a column, each observation forms a row, and each value has its own cell.
When I first learned about the tidy approach, I thought, "Why is this special? Isn't that what we should be doing?" But thinking about keeping your data tidy has really changed the way I approach my job, and has helped me solve some tricky data wrangling issues. When you really embrace this approach, merging data, creating new variables, and summarizing cases become much easier. And the syntax used in the tidyverse is much more intuitive than much of the code in R, making it easier to memorize many of the functions; they follow a predictable grammar, so you don't need to constantly look things up. See you tomorrow for the first post – A is for arrange!
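As a small taste of that predictable grammar, here is a generic dplyr pipeline (my own illustration, not taken from the post) that creates a new variable, summarizes cases by group, and then sorts the result:

library(dplyr)

mtcars %>%
  mutate(kpl = mpg * 0.425) %>%                   # create a new variable (km per litre)
  group_by(cyl) %>%                               # define the groups to summarize over
  summarize(mean_kpl = mean(kpl), n = n()) %>%
  arrange(desc(mean_kpl))                         # A is for arrange!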
Visualizing decision tree partition and decision boundaries
Posted: 31 Mar 2020 06:00 AM PDT [This article was first published on r – paulvanderlaken.com, and kindly contributed to R-bloggers].
Grant McDermott developed a new R package I wish I had thought of: parttree.
Using the familiar ggplot2 syntax, we can simply add decision tree boundaries to a plot of our data. In this example from his Github page, Grant trains a decision tree on the famous Titanic data and overlays the fitted partitions on a scatter plot of the passengers. Super awesome! This visualization precisely shows where the trained decision tree thinks it should predict that the passengers of the Titanic would have survived (blue regions) or not (red), based on the predictors included in the model. This will be super helpful if you need to explain to yourself, your team, or your stakeholders how your model works. Currently, only a limited set of tree-fitting engines is supported.
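Here is a minimal sketch of the workflow (my own example using the iris data rather than the Titanic data, and assuming parttree's geom_parttree() interface as described in its README):

library(ggplot2)
library(parttree)
library(rpart)

# Fit a small decision tree on two numeric predictors.
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Overlay the tree's rectangular partitions on a standard ggplot2 scatter plot.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = fit, aes(fill = Species), alpha = 0.2) +
  geom_point(aes(colour = Species))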
Updates to R GUIs: BlueSky, jamovi, JASP, & RKWard
Posted: 31 Mar 2020 03:03 AM PDT [This article was first published on R – r4stats.com, and kindly contributed to R-bloggers].
Graphical User Interfaces (GUIs) for the R language help beginners get started learning R, help non-programmers get their work done, and help teams of programmers and non-programmers work together by turning code into menus and dialog boxes. There has been quite a lot of progress on R GUIs since my last post on this topic. Below I describe some of the features added to several R GUIs.

BlueSky Statistics
BlueSky Statistics has added mixed-effects linear models. Its dialog shows an improved model builder that will be rolled out to the other modeling dialogs in future releases. Other new statistical methods include quantile regression, survival analysis using both Kaplan-Meier and Cox Proportional Hazards models, Bland-Altman plots, Cohen's Kappa, Intraclass Correlation, odds ratios and relative risk for M by 2 tables, and sixteen diagnostic measures such as sensitivity, specificity, PPV, NPV, Youden's Index, and the like. The ability to create complex tables of statistics was added via the powerful arsenal package. Some examples of the types of tables you can create with it are shown here. Several new dialogs have been added to the Data menu. The Compute Dummy Variables dialog creates dummy (aka indicator) variables from factors for use in modeling. That approach offers greater control over how the dummies are created than you would have when including factors directly in models. A new Factor Levels menu item leads to many of the functions from the forcats package. They allow you to reorder factor levels by count, by occurrence in the dataset, or by functions of another variable, and let you lump low-frequency levels into a single "Other" category, and so on. These are all helpful in setting the order and nature of, for example, bars in a plot or entries in a table. The BlueSky Data Grid now has icons that show the type of each variable, i.e. factor, ordered factor, string, numeric, date or logical. The Output Viewer adds icons to let you add notes to the output (not full R Markdown yet), and a trash can icon lets you delete blocks of output. A comprehensive list of the changes to this release is located here, and my updated review of it is here.

jamovi
New modules expand jamovi's capabilities to include time-based survival analysis, Bland-Altman analysis & plots, behavioral change analysis, advanced mediation analysis, differential item analysis, and quantiles & probabilities from various continuous distributions. jamovi's new Flexplot module greatly expands the types of graphs it can create, letting you take a single graph type and repeat it in rows and/or columns, making it easy to visualize how the data are changing across groups (called facet, panel, or lattice plots). You can read more about Flexplot here, and my recently-updated review of jamovi is here.

JASP
The JASP package has added two major modules: machine learning and network analysis. The machine learning module includes boosting, K-nearest neighbors, and random forests for both regression and classification problems. For regression, it also adds regularized linear regression. For clustering, it covers hierarchical, K-means, random forest, density-based, and fuzzy C-means methods.
It can generate models and add predictions to your dataset, but it still cannot save models for future use. The main method it is missing is a single decision tree model. While a less accurate predictor, a simple tree model can often provide insight that is lacking from other methods. Another major addition to JASP is Network Analysis. It helps you to study the strengths of interactions among people, cell phones, etc. With so many people working from home during the Coronavirus pandemic, it would be interesting to see what this would reveal about how our patterns of working together have changed. A really useful feature in JASP is its Data Library. It greatly speeds your ability to try out a new feature by offering a completely worked-out example including data. When trying out the network analysis feature, all I had to do was open the prepared example to see what type of data it would use. With most other data science software, you're left to dig about in a collection of datasets looking for a good one to test a particular analysis. Nicely done! I've updated my full review of JASP, which you can read here.

RKWard
The main improvement to the RKWard GUI for R is the addition of support for R Markdown. That makes it the second GUI to support R Markdown, after R Commander. Both the jamovi and BlueSky teams are headed that way. RKWard's new live preview feature lets you see text, graphics, and markdown as you work. A comprehensive list of new features is available here, and my full review of it is here.

Conclusion
R GUIs are gaining features at a rapid pace, quickly closing in on the capabilities of commercial data science packages such as SAS, SPSS, and Stata. I encourage R GUI users to contribute their own additions to the menus and dialog boxes of their favorite(s). The development teams are always happy to help with such contributions. To follow the progress of these and other R GUIs, subscribe to my blog, or follow me on twitter.
Contagiousness of COVID-19 Part I: Improvements of Mathematical Fitting (Guest Post)
Posted: 31 Mar 2020 12:00 AM PDT [This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].
The topic of this post will be the fitting of an epidemiological model to COVID-19 data in R. This contribution will be split into two parts. In the first part, we pick up on the earlier fitting described in a previous blog-post (see Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). These fits are sometimes difficult to perform. How can we analyse that difficult behaviour and how can we make further improvements? In the second part, we will see that all these efforts to make a nicely performing fitting algorithm are actually not of much use for the current case. Just because we use a mathematical model, which sounds rigorous, does not mean that our conclusions/predictions are trustworthy. These two parts will be accompanied by the R-script covid.r.

COVID-19: a data epidemic
With the outbreak of COVID-19, one thing that is certain is that never before has a virus gone so viral on the internet. Especially, a lot of data about the spread of the virus is going around. A large amount of data is available in the form of fancy coronavirus-trackers that look like weather forecasts or overviews of sports results. Many people have started to try predicting the evolution of the epidemiological curve, and along with that the reproduction number \(R_0\), but can this be done with this type of data? In this blog-post, we describe the fitting of the data with the SIR model and explain the tricky parts of the fitting methodology and how we can mitigate some of the problems that we encounter. The general problem is that the fitting algorithm is not always finding its way to the best solution. Below is a graph that shows an out-of-the-box fit of the data (using the default settings of the optimization).

Why doesn't the algorithm converge to the optimal solution?
There are two main reasons why the model is not converging well.

Early stopping of the algorithm
The first reason is that the algorithm may stop before it has truly reached the optimum. How does the algorithm make its way towards the optimum? In the two images below we see how the algorithm solves the fit stepwise, for a SIR model that uses the parameters K and R0 (these parameters had been explained in the previous blog post and are repeated in this post below). The images are a contour plot (lines) and a surface plot (colours) for the value of the RSS as a function of the model parameters. The minimum is the point where eventually the algorithm should end. We see in these images effects that make it difficult for the algorithm to approach the optimum quickly in few steps, or it may even get blocked before that point (it may also end up in a local optimum, which is a slightly different case, although we have that here as well).

Computation of the gradient
If the function that we use for the optimization does not provide an expression for the gradient of the function (which is needed to find the direction of movement), then the algorithm has to approximate the gradient numerically by evaluating the function at nearby points. How close do these nearby points need to be? That depends on the scale of the parameters. In the image below the path taken by the algorithm is shown by the red and black arrows. The red arrows show the path when we do not fine-tune the optimization; the black path shows the path when we reduce the scale of the parameters manually. This is done with the parscale control parameter. In the code of the file covid.R you see this in the function call:

OptTemp <- optim(new, RSS2, method = "L-BFGS-B",
                 lower = c(0, 1.00001),
                 hessian = TRUE,
                 control = list(parscale = c(10^-4, 10^-4), factr = 1))

By using factr = 1 we also make the stopping criterion much stricter than the default (factr = 1e7), so the algorithm does not stop too early. So by changing the parscale parameter we change the size of the steps that the algorithm takes relative to each parameter. A zigzag path towards the optimum may occur when the surface plot of the RSS has a sort of long stretched valley shape.
Then the algorithm is computing a path that moves towards the optimum like a sort of snowboarder on a half-pipe, taking lots of movements along the axis in the direction of the curvature of the half-pipe, and much less movement along the axis downhill towards the bottom. In the case above we had let the algorithm start close to the optimum, and this was chosen on purpose for the illustration. But we do not always make such a good initial guess. In the image below we see what happens when we had chosen a starting condition further away (note that the image should be stretched out along the y-axis due to the different ranges of the two parameters, in which case the change of the RSS is much faster/stronger in the left-right direction than in the up-down direction). The red curve, which shows the result of the algorithm without the fine-tuning, stops already after one step, around the point where it hits the bottom of the curvature of the valley/half-pipe, and does not accurately find out that there is still a long path/gradient in the other direction. We can often improve this situation by changing the relative scale of the parameters; however, in this particular case it is not easy, because of the L-shape of the 'valley' (see the above image). We could change the relative scales of the parameters to improve convergence in the beginning, but then the convergence at the end becomes more difficult.

Ill-conditioned problem
The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned. That means that a small change of the data will have a large influence on the outcome of the parameter estimates. In that case, the data is not very useful to differentiate between different parameters of the model. A large range of variation in the parameters can more or less explain the same data. An example of this is in the image below, where we see that for different values of R0 we can still fit the data without much difference in the residual sum of squares (RSS). We get more or less the same value for K every time (and the shape of the curve is not much dependent on the value of R0). This value of K relates to the initial growth rate. Let's look at the differential equations to see why variations in R0 have so little effect on the beginning of the curve. In terms of the parameters \(K = \beta - \gamma\) and \(R_0 = \beta / \gamma\) the equations are now:

\[\frac{dS}{dt} = -\frac{R_0}{R_0 - 1} K \frac{S}{N} I, \qquad \frac{dI}{dt} = K \left(\frac{R_0}{R_0 - 1}\frac{S}{N} - \frac{1}{R_0 - 1}\right) I\]
Here we see that, when \(S\) is approximately equal to \(N\) (which is the case in the beginning of the outbreak), we get approximately \(dI/dt \approx K I\), and the beginning of the curve will be approximately exponential. Thus, for a large range of values of \(R_0\), the beginning of the epidemiological curve will resemble an exponential growth that is independent of the value of \(R_0\). In the opposite direction: when we observe exponential growth (initially), then we cannot use this observation to derive a value for \(R_0\). With these ill-conditioned problems, it is often difficult to get the algorithm to converge to the minimum. This is because changes in some parameter (in our case \(R_0\)) will result in only a small improvement of the RSS, and a large range of the parameters have more or less the same RSS.

The error/variance of the parameter estimates
So if small variations in the data occur, due to measurement errors, how much impact will this have on the estimates of the parameters? Here we show the results for two different ways to determine this. In the file covid.R the execution of the methods is explained in more detail.

Using an estimate of the Fisher information. We can determine an estimate for the (lower bound of the) variance of the parameters by considering the Cramér-Rao bound, which states that the variance of (unbiased) parameter estimates is equal to or larger than the inverse of the Fisher information matrix. The Fisher information is a matrix with the second-order partial derivatives of the log-likelihood function, evaluated at the true parameter values. For independent Gaussian errors the log-likelihood function is:

\[\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - \hat{y}_i(K, R_0)\big)^2\]

We do not know this log-likelihood function and its dependence on the parameters \(K\) and \(R_0\), because we do not have the true parameter values, and we also do not know the variance of the random error of the data points (the term \(\sigma\) in the likelihood function). But we can estimate it based on the Hessian, a matrix with the second-order partial derivatives of our objective function evaluated at our final estimate.

#####################
##
## computing variance with Hessian
##
###################

### The output of optim will store values for RSS and the hessian
mod <- optim(c(0.3, 1.04), RSS2, method = "L-BFGS-B", hessian = TRUE,
             control = list(parscale = c(10^-4, 10^-4), factr = 1))

# unbiased estimate of standard deviation
# we divide by n-p
# where n is the number of data points
# and p is the number of estimated parameters
sigma_estimate <- sqrt(mod$value/(length(Infected)-2))

# compute the inverse of the hessian
# The hessian = the second order partial derivative of the objective function
# in our case this is the RSS
# we multiply by 1/(2 * sigma^2)
covpar <- solve(1/(2*sigma_estimate^2)*mod$hessian)
covpar
#              [,1]          [,2]
#[1,]  1.236666e-05 -2.349611e-07
#[2,] -2.349611e-07  9.175736e-09

## the standard error of R0 is then approximately
covpar[2,2]^0.5
#[1] 9.579006e-05

Using a Monte Carlo estimation. A formula to compute exactly the propagation of errors/variance in the data to the errors/variance in the estimates of the parameters is often very complex. The Hessian will only give us a lower bound (I personally find it more useful to see any potential strong correlation between parameters), and it is not so easy to implement. There is however a very blunt but effective way to get an idea of the propagation of errors, and that is by performing a random simulation. The full details of this method are explained in the covid.R file.
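To make the idea concrete, here is a small self-contained toy version of such a Monte Carlo error propagation (my own sketch using a simple exponential-growth model, not the SIR code from covid.R): new data sets are simulated from the fitted curve plus noise, the model is refitted each time, and the spread of the refitted parameters shows how measurement error propagates into the estimates.

set.seed(1)
t <- 1:30
y <- 10 * exp(0.2 * t) + rnorm(30, sd = 20)          # "observed" data

rss <- function(par, data) sum((data - par[1] * exp(par[2] * t))^2)
fit <- optim(c(5, 0.1), rss, data = y, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6))
sigma_hat <- sqrt(fit$value / (length(y) - 2))       # residual standard deviation

est <- t(replicate(200, {
  y_sim <- fit$par[1] * exp(fit$par[2] * t) + rnorm(length(t), sd = sigma_hat)
  optim(fit$par, rss, data = y_sim, method = "L-BFGS-B",
        lower = c(1e-6, 1e-6))$par
}))
apply(est, 2, sd)        # Monte Carlo standard errors of the two parameters
cor(est[, 1], est[, 2])  # and the correlation between their estimates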
Here we will show just the results of the simulation. In this simulation, we repeatedly simulated new data based on a true model with fixed parameter values, with the variance of the data points corresponding to the observed RSS of our fit. We also show in the right graph how the parameters \(\beta\) and \(\gamma\) are distributed for the same simulation. The parameters \(\beta\) and \(\gamma\) are strongly correlated. This results in them having a large marginal/individual error, but the values \(K\) and \(R_0\) have much less relative variation (this is why we changed the fitting parameters from \(\beta\) and \(\gamma\) to \(K\) and \(R_0\)).

Fitting by using the latest data and by using the analytical solution of the model
Now, we are almost at the end of this post, and we will make a new attempt to fit the epidemiological curve again, but now based on more recent data. What we do this time is make some small adaptations, described below.
Because the computation of all these parameters in a single optimization is too difficult, we split the problem into an outer brute-force search over \(R_0\) and an inner fit of the remaining parameters. To obtain a starting condition we use a result from Harko, Lobo and Mak 2014 (Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates), who derived expressions for \(S\), \(I\) and \(R\) in terms of a single differential equation. The starting condition is based on their equations, expressed in slightly different terms.
Their relation can be solved as a linear equation, which gives us a good starting condition (small sidenote: using some form of differential equation is a general way of getting starting conditions, but the derivative estimates might be noisy; in that case, one could integrate the expression instead). The further details of the computation can be found in the covid.R script. Below you see a result of the outer loop where we did a brute force search over \(R_0\) (which gives an optimum at a low value of \(R_0\)), and next to it a fitted curve for the estimated parameters. In this new fit, we get again a low reproduction number \(R_0\). One potential reason for this is that, due to the measures that have been taken, the Chinese have been able to reduce the rate of the spread of the virus. The model is unaware of this and interprets this as a reduction that is due to immunity (a decrease of susceptible people). However, only a very small fraction of the people have gained immunity (only a small percentage of the population got sick). For the virus to stop spreading at such a low fraction of sick people, it must mean that the estimated \(R_0\) is very low. Thus, an estimation of the parameters, based on this type of data, is difficult. When we see a decrease in the growth rate, then one or more of the following four effects play a role:
(1) The number of susceptible people has decreased sufficiently to overcome the reproduction rate \(R_0\). This relative decrease in susceptible people happens faster when the total number of people is smaller.
(2) Something has changed about the conditions; the reproduction rate is not constant in time. For instance, with respiratory infections, it is common that the transfer rates depend on weather and are higher during winter.
(3) The measures that are being taken against the disease are taking effect.
(4) The model is too simple, with several assumptions that overestimate the effect of the initial growth rate. This growth rate is very high (more than 25% per day), and we observe a doubling every three days. This means that the time between generations is very short, something that is not believed to be true. It may be likely that the increase in numbers is partially due to variable time delay in the occurrence of the symptoms as well as sampling bias.
For statisticians, it is difficult to estimate what causes the changes in the epidemic curves. We would need more detailed information in order to fill in the gaps, which do not seem to go away by having just more data (and this coronavirus creates a lot of data, possibly too much). But as human beings under threat of a nasty disease, we can at least consider ourselves lucky that we have a lot of options for how the disease can fade away. And we can be lucky that we see a seemingly/effective reproduction rate that is very low, and also that only a fraction of the population is susceptible.

Wrap up
So now we have done all this nice mathematics and we can accurately draw a modelled curve through all our data points. But is this useful when we model the wrong data with the wrong model? The difference between statistics and mathematics is that statisticians need to look beyond the computations.
At the present time, COVID-19 is making an enormous impact on our lives, with an unclear effect on the future (we do not even know when the measures are going to stop: end of April, end of May, maybe even June?). Only time will tell what the economic aftermath of this coronavirus is going to be, and how great its impact will be on our health and quality of life. But one thing that we can assure ourselves of is that the ominous view of unlimited exponential growth (currently going around on social media) is not data-driven. In this post, I have explained some mathematics about fitting. However, I would like to warn against the blunt use of these mathematical formulas. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises, which are the underlying data and models. So in the next post, "Contagiousness of COVID-19 Part 2: Why the Result of Part 1 is Useless", I hope to explain what sort of considerations about the data and the models one should take into account, and to make some connections with other cases where statistics went in the wrong direction.
What is a dgCMatrix object made of? (sparse matrix format in R)
Posted: 30 Mar 2020 10:44 PM PDT [This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers].
I've been working with sparse matrices in R recently (those created using the Matrix package) and wanted to understand what these objects actually contain. (Click here for full documentation of the Matrix package.)

Background
It turns out that there is some documentation on the dgCMatrix class itself:

library(Matrix)
?`dgCMatrix-class`

According to the documentation, the dgCMatrix class is a class of sparse numeric matrices stored in compressed, sparse, column-oriented format; it is the standard class for sparse numeric matrices in the Matrix package.
An example
We'll use a small matrix as a running example in this post:

library(Matrix)
M <- Matrix(c(0, 0, 0, 2,
              6, 0, -1, 5,
              0, 4, 3, 0,
              0, 0, 5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)
rownames(M) <- paste0("r", 1:4)
colnames(M) <- paste0("c", 1:4)
M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

Running str(M) shows that the object has 6 slots:

str(M)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#   ..@ i       : int [1:7] 1 2 1 2 3 0 1
#   ..@ p       : int [1:5] 0 1 2 5 7
#   ..@ Dim     : int [1:2] 4 4
#   ..@ Dimnames:List of 2
#   .. ..$ : chr [1:4] "r1" "r2" "r3" "r4"
#   .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
#   ..@ x       : num [1:7] 6 4 -1 3 5 2 5
#   ..@ factors : list()
If a matrix M is stored as a dgCMatrix object, the x slot contains its non-zero entries, ordered column by column:

M@x
# [1]  6  4 -1  3  5  2  5

as.numeric(M)[as.numeric(M) != 0]
# [1]  6  4 -1  3  5  2  5

The i slot contains the row indices of the non-zero entries, except that the indices are 0-based rather than 1-based:

M@i
# [1] 1 2 1 2 3 0 1

If the matrix has n non-zero entries, then both x and i have length n, with one element per non-zero entry. The p slot records which of those entries belong to which column: it has length ncol(M) + 1, it always starts with 0, and its (j+1)-th element gives the cumulative number of non-zero entries in columns 1 through j. In our example, when we look at M@p, we see that column 1 holds one non-zero entry, column 2 one, column 3 three (5 - 2), and column 4 two (7 - 5):

M@p
# [1] 0 1 2 5 7

With the x, i and p slots together, the whole matrix can be reconstructed, as the short check below demonstrates.
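Here is that check (my own addition, not from the original post): rebuild a dense matrix from the three slots and compare it with M.

M_dense <- matrix(0, nrow = M@Dim[1], ncol = M@Dim[2], dimnames = M@Dimnames)
for (j in seq_len(M@Dim[2])) {
  if (M@p[j + 1] > M@p[j]) {               # column j has non-zero entries
    idx <- (M@p[j] + 1):M@p[j + 1]         # their positions within x and i
    M_dense[M@i[idx] + 1, j] <- M@x[idx]   # i is 0-based, R rows are 1-based
  }
}
all(M_dense == as.matrix(M))
# [1] TRUE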
The Dim and Dimnames slots are fairly obvious: they contain the dimensions and the row and column names of the matrix.
The factors slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like it is a list that caches factorizations of the matrix that have already been computed.
My understanding is that if we perform any matrix factorizations or decompositions on a dgCMatrix object, the results are stored in the factors slot so that they can be reused without being recomputed:

M@factors
# list()

Mlu <- lu(M)  # perform triangular decomposition
str(M@factors)
# List of 1
#  $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots
#   .. ..@ L :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
#   .. .. .. ..@ i       : int [1:4] 0 1 2 3
#   .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
#   .. .. .. ..@ Dim     : int [1:2] 4 4
#   .. .. .. ..@ Dimnames:List of 2
#   .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
#   .. .. .. .. ..$ : NULL
#   .. .. .. ..@ x       : num [1:4] 1 1 1 1
#   .. .. .. ..@ uplo    : chr "U"
#   .. .. .. ..@ diag    : chr "N"
#   .. ..@ U :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
#   .. .. .. ..@ i       : int [1:7] 0 1 0 1 2 0 3
#   .. .. .. ..@ p       : int [1:5] 0 1 2 5 7
#   .. .. .. ..@ Dim     : int [1:2] 4 4
#   .. .. .. ..@ Dimnames:List of 2
#   .. .. .. .. ..$ : NULL
#   .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
#   .. .. .. ..@ x       : num [1:7] 6 4 -1 3 5 5 2
#   .. .. .. ..@ uplo    : chr "U"
#   .. .. .. ..@ diag    : chr "N"
#   .. ..@ p  : int [1:4] 1 2 3 0
#   .. ..@ q  : int [1:4] 0 1 2 3
#   .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

set.seed(1)
M <- runif(9e6)
M[sample.int(9e6, size = 8e6)] <- 0
M <- Matrix(M, nrow = 3e3, sparse = TRUE)

system.time(lu(M))
#    user  system elapsed
#  13.527   0.161  13.701

system.time(lu(M))
#    user  system elapsed
#       0       0       0
Close Encounters of the R Kind
Posted: 30 Mar 2020 05:00 PM PDT [This article was first published on R Views, and kindly contributed to R-bloggers].
Affiliation
Harrison – Center for Strategic and Budgetary Analysis, Washington DC
Cara – Department of the Air Force (Studies, Analyses, and Assessments – AF/A9), Washington DC
Disclaimer
The views expressed in this article represent the personal views of the author and are not necessarily the views of the Department of Defense (DoD) or the Department of the Air Force.
This post is an effort to condense the 'buzz' surrounding the explosion of open source solutions in all facets of analysis – to include those done by Military Operations Research Society (MORS) members and those they support – by describing our experiences with the R programming language. The impact of R in the statistical world (and by extension, data science) has been big: worthy of an entire issue of SIGNIFICANCE Magazine (RSS). Surprisingly, R is not a new computing language; modeled on S and Scheme, the technology at the core of R is over forty years old. This longevity is, in itself, noteworthy. Additionally, fee-based and for-profit companies have begun to incorporate R with their products. While statistics is the focus of R, with the right packages – and know-how – it can also be used for a much broader spectrum of work, to include machine learning, optimization, and interactive web tools. In this post, we go back and forth discussing our individual experiences.

Getting Started with R
Harrison: I got started with R in earnest shortly before retiring from the U.S. Navy in 2016. I knew that I was going to need a programming language to take with me into my next career. The reason I chose R was not particularly analytical; the languages that I had done the most work in during grad school – MATLAB and Java – were not attractive in that the first required licensing fees and the second was – to me at the time – too 'low level' for the type of analysis I wanted to perform. I had used SPlus in my statistics track, but never really 'took' to it while in school. Several tools to 'bridge' the gap between Excel and R had been recommended to me by a friend, including RStudio and Rcommander. Onboard ship many years ago, I learned to eat with chopsticks by requesting that the wardroom staff stop providing me with utensils, substituting a bag of disposable chopsticks I purchased in Singapore. Turns out when deprived of other options, you can learn very fast. Learning R basics was the same; instead of silverware, it was removing the shortcuts to my usual tools on my home laptop (Excel). I simply did every task that I could, from the mundane to the elegant, in R.
Cara: I started dabbling with R in 2017 when I had about a year and a half left in my PhD journey, after I decided to pursue a post-doctoral government career. Sitting comfortably in academia with abundant software licenses for almost a decade, I had no reason until that point to consider abandoning my SAS discipleship (other than abhorrent graphics capability, which I bolstered with use of SigmaPlot). My military background taught me not to expect access to expensive software in a government gig, and I had a number of friends and colleagues already using R and associated tools, so I installed it on my home computer.
Other than being mildly intrigued by the software version naming convention, I stubbornly clung to SAS to finish my doctoral research, however. It was not until I became a government OR analyst in late 2018 that I really started my R journey (mostly out of necessity at this point). R does not require a license, and furthermore, is approved for use on government computers. I found R to be an easier and more natural programming language to learn coming from a SAS background than Python, Javascript, or C++ would have been.

How has using R shaped your practice?
Harrison: There is a lot of talk about how various tools perform in the sense of runtime, precision, graphics, etc. These are considerations, but they are completely eclipsed by the following: we don't talk as a community about how much the tools we use shape our thinking. I frequently tell my colleagues that the fundamental unit in Excel is called a cell because it is your mind prison. There's actually some truth to that. R is vectorized, so for most functions, passing an array gives an appropriate array output. When you work day-in and day-out with vectors, you stop thinking about individual operations and start to think in terms of sentences; the magrittr pipe encourages exactly this way of working. Early in my work with R, I was told by a colleague that "R might be good but the graphics are terrible". This was a bit of a shock, as graphics has been one of the main selling points of the language, and I didn't want to be making seedy graphs. From that point on, I made it a point to make the best graphics I possibly could, usually – but not always – using methods and extensions found in the ggplot2 ecosystem.
Cara: When I began teaching myself R – and using it daily – I thought I was merely learning the syntax of a new programming language. With the analytic capability inherent in R and the flexibility of development environments, however, it is really more of a way of thinking. Fold in the powerful (and mostly free!) resources and passionate following of analysts and data scientists, and you get an R community that I truly enjoy being a part of. The R environment, even as a novice user, can have positive impacts on your workflow. For example, beyond syntax, my earliest explorations in R taught me that if you are going to do something more than once, write a function. I had never truly internalized that idea, even after a decade of using SAS. Another thing I learned relatively early on: get to know the R community and the resources it offers.

Disadvantages of R
Harrison: This piece is not a 'sales pitch' for R, but rather a sober consideration of the tradeoffs an organization needs to weigh when choosing an analytic platform writ large.
Cara: I work in a pretty dynamic world as a government operations research analyst (ORSA); we don't typically have dedicated statisticians, programmers, data scientists, modelers, or data viz specialists. Most of the time, we are all functioning in some or all of those capacities. As a former engineer with a dynamic background, this suits me well. However, it also means that things change from day to day, from project to project, and as the government analytic world changes (rapidly). I do not have the flexibility to use one software package exclusively. Further, I face challenges within DoD related to systems, software, classification, and computing infrastructure that most people in academia or industry do not. In my organization, there has been a relatively recent and rapid shift in the analytic environment. We formerly leaned heavily on Excel-based regressions and descriptive statistics, usually created by a single analyst to answer a single question, and in many cases these models were not particularly dynamic or scalable. We now focus on using open-source tools in a team construct, sometimes with industry partners, to create robust models that are designed to answer multiple questions from a variety of perspectives; scale easily to mirror operational requirements; fit with other models; and transition well to high performance computing environments. The two open-source tools we (i.e., my division) currently use most for programming are R and Python. We have had success combining data analysis, statistics, and graphical models to create robust tools coded as RShiny apps. Recently, we chose to code in Python for a project that involved machine learning and high performance computing. I do not propose to debate the strengths and weaknesses of either R or Python in this forum; rather, I challenge you to consider carefully the implications of programming language choice for any project with a cradle-to-grave perspective.
COVID-19 in Belgium
Posted: 30 Mar 2020 05:00 PM PDT [This article was first published on R on Stats and R, and kindly contributed to R-bloggers].

Introduction
The Novel COVID-19 Coronavirus is still spreading quickly in several countries and it does not seem like it is going to stop anytime soon, as the peak has not yet been reached in many countries. Since the beginning of its expansion, a large number of scientists across the world have been analyzing this Coronavirus from different perspectives and with different technologies, with the hope of coming up with a cure in order to stop its expansion and limit its impact on citizens.

Motivations, limitations and structure of the article
By seeing and organizing many R resources about COVID-19, I am fortunate enough to have read a lot of excellent analyses on the disease outbreak, the impact of different health measures, forecasts of the number of cases, projections about the length of the pandemic, hospital capacity, etc. Furthermore, I must admit that some countries such as China, South Korea, Italy, Spain, UK and Germany received a lot of attention, as shown by the number of analyses done on these countries. However, to my knowledge and at the date of publication of this article, I am not aware of any analysis of the spread of the Coronavirus specifically for Belgium. The present article aims at filling that gap. Throughout my PhD thesis in statistics, my main research interest has been survival analysis applied to cancer patients (more information in the research section of my personal website). I am not an epidemiologist and I have no extensive knowledge in modelling disease outbreaks via epidemiological models. I usually write articles only about things I consider myself familiar with, mainly statistics and its applications in R. At the time of writing this article, I was however curious where Belgium stands regarding the spread of this virus, I wanted to play with this kind of data in R (which is new to me) and see what comes out. In order to satisfy my curiosity while not being an expert, in this article I am going to replicate analyses done by more knowledgeable people and apply them to my country, that is, Belgium. From all the analyses I have read so far, I decided to replicate the analyses done by Tim Churches and Prof. Dr. Holger K. von Jouanne-Diedrich. This article is based on a mix of their articles, which can be found here and here. They both present a very informative analysis on how to model the outbreak of the Coronavirus and show how contagious it is. Their articles also allowed me to gain an understanding of the topic and in particular an understanding of the most common epidemiological model. I strongly advise interested readers to also read their more recent articles for more advanced analyses and for an even deeper understanding of the spread of the COVID-19 pandemic. Other more complex analyses are possible and even preferable, but I leave this to experts in this field. Note also that the following analyses take into account only the data until the date of publication of this article, so the results should not be viewed, by default, as current findings. In the remainder of the article, we first introduce the model which will be used to analyze the Coronavirus outbreak in Belgium. We also briefly discuss and show how to compute an important epidemiological measure, the reproduction number.
We then use our model to analyze the outbreak of the disease in the case where there would be no public health intervention. We conclude the article by summarizing more advanced tools and techniques that could be used to further model COVID-19 in Belgium.

Additional considerations
As previously mentioned, the SIR model and the analyses done above are rather simplistic and may not give a true representation of the reality. In the following sections, we highlight five improvements that could be made to enhance these analyses and lead to a better overview of the spread of the Coronavirus in Belgium.

Ascertainment rates
In the previous analyses and graphs, it is assumed that the number of confirmed cases represents all the cases that are infectious. This is far from reality, as only a proportion of all cases are screened, detected and counted in the official figures. This proportion is known as the ascertainment rate. The ascertainment rate is likely to vary during the course of an outbreak, in particular if testing and screening efforts are increased, or if detection methods are changed. Such changing ascertainment rates can be easily incorporated into the model by using a weighting function for the incidence cases. In his first article, Tim Churches demonstrates that a fixed ascertainment rate of 20% makes little difference to the modelled outbreak with no intervention, except that it all happens a bit more quickly.

More sophisticated models
More sophisticated models could also be used to better reflect real-life transmission processes. For instance, another classical model in disease outbreak is the SEIR model. This extended model is similar to the SIR model, where S stands for Susceptible and R stands for Recovered, but the infected people are divided into two compartments: E for those who are exposed (infected but not yet infectious) and I for those who are infectious.
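For readers who want to see what working with such compartmental models involves in practice, here is a minimal sketch (my own illustration, not code from the article) of how the basic SIR model is typically solved in R with the deSolve package; the parameter values are purely illustrative.

library(deSolve)

sir <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

N     <- 11.5e6                               # approximate population of Belgium
init  <- c(S = N - 1, I = 1, R = 0)
parms <- c(beta = 0.5, gamma = 0.25, N = N)   # implies R0 = beta / gamma = 2
times <- seq(0, 150, by = 1)

out <- as.data.frame(ode(y = init, times = times, func = sir, parms = parms))
head(out)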
These models belong to the class of continuous-time dynamic models that assume fixed transition rates. There are other, stochastic models that allow for varying transition rates depending on attributes of individuals, social networking, etc.

Modelling the epidemic trajectory using log-linear models
As noted above, the initial exponential phase of an outbreak, when shown in a log-linear plot (the y-axis on a log scale and the x-axis without transformation), appears (somewhat) linear. This suggests that we can model epidemic growth, and decay, using a simple log-linear model of the form:

\[\log(y) = rt + b\]

where y is the incidence, r is the growth rate, t is the number of days since a specific point in time (typically the start of the outbreak), and b is the intercept. In this context, two log-linear models, one for the growth phase before the peak and one for the decay phase after the peak, are fitted to the epidemic (incidence cases) curve. The doubling and halving time estimates which you very often hear in the news can be estimated from these log-linear models. Furthermore, these log-linear models can also be used on the epidemic trajectory to estimate the reproduction number \(R_0\) in the growth and decay phases of the epidemic. The incidence package, for example, implements this kind of log-linear fit to incidence data.

Estimating changes in the effective reproduction number \(R_e\)
In our model, we set a reproduction number \(R_0\) and kept it constant. It would nonetheless be useful to estimate the current effective reproduction number \(R_e\) on a day-by-day basis so as to track the effectiveness of public health interventions, and possibly predict when an incidence curve will start to decrease. The EpiEstim package (Cori et al. 2013; Thompson et al. 2019) provides methods for this kind of day-by-day estimation of \(R_e\).

More sophisticated projections
In addition to naïve predictions based on a simple SIR model, more advanced and complex projections are also possible, notably with dedicated packages such as projections.

Conclusion
This article started with (i) a description of a couple of R resources on the Coronavirus pandemic (i.e., a collection and a dashboard) that can be used as background materials and (ii) the motivations behind this article. We then detailed the most common epidemiological model, i.e. the SIR model, before actually applying it on Belgium incidence data. This resulted in a visual comparison of the fitted and observed cumulative incidence in Belgium. It showed that the COVID-19 pandemic is clearly in an exponential phase in Belgium in terms of number of confirmed cases. We then explained what the reproduction number is and how to compute it in R. Finally, our model was used to analyze the outbreak of the Coronavirus if there was no public health intervention at all. Under this (probably too) simplistic scenario, the peak of the COVID-19 in Belgium is expected to be reached by the end of April, 2020, with around 525,000 infected people and about 3,700 deaths. These very alarmist naïve predictions highlight the importance of restrictive public health actions taken by governments, and the urgency for citizens to follow these health actions in order to mitigate the spread of the virus in Belgium (or at least slow it enough to allow health care systems to cope with it). We concluded this article by describing five improvements that could be implemented to further analyze the disease outbreak. Thanks for reading. I hope this article gave you a good understanding of the spread of the COVID-19 Coronavirus in Belgium. Feel free to use this article as a starting point for analyzing the outbreak of this disease in your own country. See also a collection of top R resources on Coronavirus to gain even further knowledge.
As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here. Get updates every time a new article is published by subscribing to this blog.
References
Cori, Anne, Neil M Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. "A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics." American Journal of Epidemiology 178 (9). Oxford University Press: 1505–12.
Thompson, RN, JE Stockwin, RD van Gaalen, JA Polonsky, ZN Kamvar, PA Demarsh, E Dahlqwist, et al. 2019. "Improved Inference of Time-Varying Reproduction Numbers During Infectious Disease Outbreaks." Epidemics 29. Elsevier: 100356.
Can unbalanced randomization improve power?
Posted: 30 Mar 2020 05:00 PM PDT [This article was first published on ouR data generation, and kindly contributed to R-bloggers].
Of course, we're all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn't contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks. I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes. Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to try to get possible treatments to as many people as possible, even in a randomized control trial. One question that came up in the design of this study was whether there would be efficiency gains by using a 1:2 randomization scheme. That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only lose efficiency if we use anything other than a 1:1 randomization.

Brief public service announcement: simstudy update
When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the simstudy package needed to support unbalanced group assignment, so I added that capability. Here is an example based on a 1:2:3 allocation. I'm not sure if that would ever be appropriate, but it shows the flexibility of the new argument. There is one counter-intuitive aspect of the implementation to be aware of.

Unbalanced designs with a binary outcome
The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events has to occur within 30 days to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed that in the usual care arm 40% of the patients would have a bad outcome, and that the drug treatment would reduce the bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome). If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters; the estimated probabilities of a bad outcome for the usual care group and drug treatment group should be close to 40% and 28%, respectively.

Assessing power
In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05. The analytic solution for a 1:1 ratio, based on 80% power, suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated \(n\) in a simulation for power (using 1000 datasets), we should be close to 80%; a sketch of such a simulation is shown below.
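Here is a minimal sketch of such a power simulation for the 1:1 design (my own illustration, not the author's code), using the 40% vs. 28% failure rates and 244 patients per arm given above:

set.seed(135)
n_per_arm <- 244

pvals <- replicate(1000, {
  rx <- rep(0:1, each = n_per_arm)                 # 0 = usual care, 1 = drug treatment
  y  <- rbinom(2 * n_per_arm, 1, ifelse(rx == 1, 0.28, 0.40))
  summary(glm(y ~ rx, family = binomial))$coefficients["rx", "Pr(>|z|)"]
})

mean(pvals < 0.05)   # estimated power; should be in the neighborhood of 0.80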
The power experiment
Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Repeating the power simulation for a variety of allocation ratios and assumptions indicates that, at least in the case of a binary outcome, using an unbalanced design does not improve the quality of the research, even though it might get the drug treatment to more patients.

Continuous outcomes
In the case of binary outcomes, reducing the sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I'll show some simulations that indicate this might be the case. Stay well.
Predicting 1,000,000 of COVID-19 done with R before March 18 – call for help
Posted: 30 Mar 2020 10:48 AM PDT [This article was first published on R-posts.com, and kindly contributed to R-bloggers].
Due to the unprecedented technological progress made in years past, the current climate allows us to monitor this pandemic better than any other pandemic in the past. We will argue, however, that R was instrumental in predicting when the 1,000,000th case of COVID-19 will occur, as demonstrated here in our collaboration spread out over three continents: https://www.sciencedirect.com/science/article/pii/S2590113320300079 Since India is currently in lockdown and the correction process takes place in India, it has not been concluded as of this writing. The first draft of our paper was prepared on March 18 and can be accessed here: http://www.cs.laurentian.ca/wkoczkodaj/p/?C=M;O=D The entire world waits for the inflection point (https://en.wikipedia.org/wiki/Inflection_point) and if you help us, we may be able to reach this point sooner. Only a few crucial R commands are needed for this kind of analysis.
Waldemar W. Koczkodaj (email: wkoczkodaj [AT] cs.laurentian.ca)
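For readers unfamiliar with the idea of an inflection point in an epidemic curve, here is a small self-contained illustration (my own sketch, not the authors' code) of estimating it by fitting a logistic growth curve to cumulative counts with nls():

set.seed(42)
day   <- 1:60
trend <- 1e6 / (1 + exp(-0.15 * (day - 40)))     # logistic curve with inflection at day 40
cases <- trend * exp(rnorm(60, sd = 0.05))       # simulated cumulative case counts

fit <- nls(cases ~ K / (1 + exp(-r * (day - t0))),
           start = list(K = 8e5, r = 0.1, t0 = 35))
coef(fit)["t0"]   # estimated inflection point: the day at which growth starts to slow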
Analyzing Remote Sensing Data using Image Segmentation Posted: 30 Mar 2020 10:45 AM PDT [This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post summarizes material posted as an Additional Topic to accompany the book Spatial Data Analysis in Ecology and Agriculture using R, Second Edition. The full post, together with R code and data, can be found in the Additional Topics section of the book's website, http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm

1. Introduction

Because living vegetation strongly reflects radiation in the infrared region and absorbs it in the red region, the amount of vegetation contained in a pixel can be estimated using the normalized difference vegetation index, or NDVI, defined as \(NDVI = (IR - R) / (IR + R)\), where IR is the pixel intensity of the infrared band and R is the pixel intensity of the red band. Here is the NDVI of the region in the false color image above.

There are a number of image analysis packages available in R. I elected to use the OpenImageR and SuperpixelImageSegmentation packages, both of which were developed by L. Mouselimis (2018, 2019a). One reason for this choice is that the objects created by these packages have a very simple and transparent data structure that makes them ideal for data analysis and, in particular, for integration with the raster package. Both packages are based on the idea of image segmentation using superpixels. According to Stutz et al. (2018), superpixels were introduced by Ren and Malik (2003). Stutz et al. define a superpixel as "a group of pixels that are similar to each other in color and other low level properties." Here is the false color image above divided into superpixels. The superpixels can then be represented by groups of pixels all having the same pixel value, as shown here. This is what we will call image segmentation.

In the next section we will introduce the concepts of superpixel image segmentation and the application of these packages. In Section 3 we will discuss the use of these concepts in data analysis. The intent of image segmentation as implemented in these and other software packages is to provide support for human visualization. This can cause problems because sometimes a feature that improves the process of visualization detracts from the purpose of data analysis. In Section 4 we will see how the simple structure of the objects created by Mouselimis's two packages can be used to advantage to make data analysis more powerful and flexible. This post is a shortened version of the Additional Topic that I described at the start of the post.

2. Introducing superpixel image segmentation

The SLIC algorithm assigns pixels to superpixels by clustering them according to a distance measure that combines color similarity and spatial proximity, \(D = \sqrt{d_c^2 + (d_s/S)^2 \, m^2}\), where \(d_c\) and \(d_s\) are the color and spatial distances between a pixel and a cluster center, S is the grid interval of the pixels, and m is a compactness parameter that allows one to emphasize or de-emphasize the spatial compactness of the superpixels. The algorithm begins with a set of K cluster centers regularly arranged in the (x,y) grid and performs K-means clustering of these centers until a predefined convergence measure is attained.

The analysis begins by loading the raster package (library(raster)) and reading in the image; we will need to save the number of rows and columns for present and future use. We will now use SLIC for image segmentation. The code here is adapted directly from the superpixels vignette of the OpenImageR package (library(OpenImageR)); a sketch of these first steps follows.
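The code blocks themselves were stripped from this digest, so the sketch below is patterned on the OpenImageR superpixels vignette rather than copied from the post. The file name is a placeholder (the actual imagery is on the book's website), and the settings of 200 superpixels with a compactness of 20 are the vignette's values, not necessarily the author's.

```r
library(OpenImageR)   # readImage(), superpixels(), imageShow()

# Placeholder file name; the imagery used in the post is on the book's website.
False.Color <- readImage("false_color_image.png")   # RGB array, values in [0, 1]
nrow.region <- dim(False.Color)[1]   # save the image dimensions for later use
ncol.region <- dim(False.Color)[2]

# Segment the image with SLIC, following the OpenImageR superpixels vignette.
Region.slic <- superpixels(input_image = False.Color, method = "slic",
                           superpixel = 200, compactness = 20,
                           return_slic_data = TRUE, return_labels = TRUE,
                           write_slic = "", verbose = TRUE)
imageShow(Region.slic$slic_data)   # the image with superpixel boundaries drawn in
```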
3. Data analysis using superpixels

Objects created by the raster package are called Raster* objects, with a capital "R"; checking one of the image bands with class(b3) confirms this. This may seem like a trivial detail, but it is important to keep in mind because the objects generated by the functions in the OpenImageR and SuperpixelImageSegmentation packages are raster objects that are not Raster* objects (note the deliberate lack of a courier font for the word "raster"). Applying str(False.Color) shows that it is a three-dimensional array, which can be thought of as a box, analogous to a rectangular Rubik's cube. The "height" and "width" of the box are the number of rows and columns, respectively, in the image, and at each value of the height and width the "depth" is a vector whose three components are the red, green, and blue color values of that cell in the image, scaled to [0,1] by dividing the 8-bit RGB values, which range from 0 to 255, by 255. This is the source of the warning message that accompanied the output of the function superpixels(), which said that the values will be multiplied by 255. For our purposes, however, it means that we can easily analyze images consisting of transformations of these values, such as the NDVI. Since the NDVI is a ratio, it doesn't matter whether the RGB values are normalized or not.

For our case the RasterLayer object b5 of the RasterBrick False.Color is the IR band and the object b4 is the R band. Therefore we can compute the NDVI as NDVI.region <- (b5 - b4) / (b5 + b4). Since NDVI is a ratio scale quantity, the theoretically best practice is to plot it using a monochromatic representation in which the brightness of the color (i.e., the color value) represents the value (Tufte, 1983). We can accomplish this using the RColorBrewer function brewer.pal(), which generates the NDVI map shown above; the darkest areas have the highest NDVI.

Let's take a look at the structure of the RasterLayer object NDVI.region with str(NDVI.region). Only the relevant parts are shown, but we see that we can convert the raster data to a matrix that can be imported into our image segmentation machinery. Remember that by default R constructs matrices by columns; the data in Raster* objects such as NDVI.region are stored by rows, so in converting these data to a matrix we must specify byrow = TRUE (NDVI.mat <- matrix(NDVI.region@data@values, ...)).

The function imageShow() works with data that are either in the eight-bit 0 to 255 range or in the [0,1] range (i.e., values of x between and including 0 and 1). It does not, however, work with NDVI values if these values are negative. Therefore, we rescale the NDVI values to [0,1]. Here is the resulting image. The function imageShow() deals with a single-layer object by producing a grayscale image; unlike the green NDVI plot above, in this plot the lower NDVI regions are darker. Finally, we build a three-band array NDVI.data holding the rescaled NDVI values in each band, which will serve as the input to the segmentation. In the next section we will describe how to create an image segmentation of the NDVI data and how to use cluster analysis to create a land classification.

4. Expanding the data analysis capacity of superpixels

We again call superpixels(), this time on the NDVI array and requesting 80 superpixels: NDVI.80 = superpixels(input_image = NDVI.data, ...). Here is the result. The sketch below pulls these steps together.
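The following sketch reconstructs the steps just described. How the red (b4) and infrared (b5) bands were read in is not shown in the excerpt, so those two lines, the band file names, and the compactness setting are placeholders; the object names otherwise follow the text.

```r
library(raster)
library(RColorBrewer)
library(OpenImageR)

# Placeholders: the red (b4) and infrared (b5) bands as RasterLayer objects.
b4 <- raster("band4_red.tif")
b5 <- raster("band5_infrared.tif")

# NDVI, plotted with a monochromatic palette (darker green = higher NDVI).
NDVI.region <- (b5 - b4) / (b5 + b4)
plot(NDVI.region, col = brewer.pal(9, "Greens"))

# Convert to a matrix; Raster* values are stored by row, hence byrow = TRUE.
NDVI.mat <- matrix(NDVI.region@data@values,
                   nrow = nrow.region, ncol = ncol.region, byrow = TRUE)

# Rescale to [0, 1] so that imageShow() can display the values.
m0 <- min(NDVI.mat)
m1 <- max(NDVI.mat)
NDVI.mat <- (NDVI.mat - m0) / (m1 - m0)
imageShow(NDVI.mat)

# Build a three-band array with the rescaled NDVI in every band and segment it.
NDVI.data <- array(NDVI.mat, dim = c(nrow.region, ncol.region, 3))
NDVI.80 <- superpixels(input_image = NDVI.data, method = "slic",
                       superpixel = 80, compactness = 20,
                       return_slic_data = TRUE, return_labels = TRUE,
                       write_slic = "", verbose = TRUE)
```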
Applying str(NDVI.80) shows that it is a list with two elements: NDVI.80$slic_data, a three-dimensional array of pixel color data (not normalized), and NDVI.80$labels, a matrix whose elements correspond to the pixels of the image. The second element's name hints that it may contain values identifying the superpixel to which each pixel belongs. Let's see if this is true. Sorting the unique values with sort(unique(as.vector(NDVI.80$labels))) shows that there are 72 unique labels; although the call to superpixels() specified 80 superpixels, the function generated 72. We can see which pixels have the label value 0 by setting the values of all of the other pixels to [255, 255, 255], which will plot as white, and the boundary pixels drawn by superpixels() can be isolated in the same way. A sketch of both checks follows.
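Here is a sketch of those two checks; the original R0 and Bdry code was dropped from the digest, and the assumption that the boundary pixels are the ones superpixels() colors black (all three color values equal to 0) is mine.

```r
library(OpenImageR)

# Pixels belonging to the superpixel with label 0: blank out everything else.
R0 <- NDVI.80
R0$slic_data[, , 1][R0$labels != 0] <- 255
R0$slic_data[, , 2][R0$labels != 0] <- 255
R0$slic_data[, , 3][R0$labels != 0] <- 255
imageShow(R0$slic_data)

# Boundary pixels: assumed to be the pixels that superpixels() drew in black,
# i.e., those whose red, green, and blue values are all 0.
Bdry <- NDVI.80
is.bdry <- NDVI.80$slic_data[, , 1] == 0 &
           NDVI.80$slic_data[, , 2] == 0 &
           NDVI.80$slic_data[, , 3] == 0
Bdry$slic_data[, , 1][!is.bdry] <- 255
Bdry$slic_data[, , 2][!is.bdry] <- 255
Bdry$slic_data[, , 3][!is.bdry] <- 255
imageShow(Bdry$slic_data)
```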
This shows that we have indeed identified the boundary pixels. Note that the function imageShow() displays these pixels as white with a black edge, rather than pure black.

Having done a preliminary analysis, we can organize our segmentation process into two steps. The first step will be to replace each of the superpixels generated by the OpenImageR function superpixels() with one in which each pixel has the same value, corresponding to a measure of central tendency (e.g., the mean, median, or mode) of the original superpixel. The second step will be to use the K-means unsupervised clustering procedure to organize the superpixels from step 1 into a set of clusters and give each cluster a value corresponding to a measure of central tendency of the cluster.

I used the code developed to identify the labels and to generate the boundary to construct a function make.segments() to carry out the segmentation. The first argument of make.segments() is the superpixels object, and the second is the function measuring central tendency. Although in this case each of the three colors of the object NDVI.80 has the same values, this may not be true for every application, so the function analyzes each color separately. Because the function is rather long, I have not included it in this post. If you want to see it, you can go to the Additional Topic linked to the book's website. Here is the application of the function to the object NDVI.80 with the second argument set to mean: NDVI.means <- make.segments(NDVI.80, mean). Here is the result.

For the second step, after setting the random number seed with set.seed(123), the superpixel values are organized into clusters with K-means. Next I used the raster function ratify() to assign descriptive factor levels to the clusters (class.ras <- ratify(class.ras)); the result is shown in the figure at the start of the post. We can also overlay the original boundaries on top of the image. This is more easily done using plot() rather than levelplot(): the function plot() allows a figure to be built up in a series of statements, while levelplot() does not. The segmented means are first converted back into a Raster* object (NDVI.rasmns <- raster(NDVI.means$slic_data[,,1], ...)) and the boundaries are then added on top. A sketch of this second step, from the clustering through the overlay plot, appears below.
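The following sketch fills in the clustering and plotting code that was dropped from the digest. The make.segments() function itself lives on the book's website and is not reproduced here; the number of clusters (8), the placeholder class names, the raster extents, and the use of rasterVis::levelplot() are my assumptions, not necessarily the author's choices.

```r
library(raster)
library(rasterVis)   # levelplot() for Raster* objects

# Step 1 (described above): NDVI.means <- make.segments(NDVI.80, mean)

# Step 2: K-means clustering of the superpixel means.
set.seed(123)
mean.vals <- NDVI.means$slic_data[, , 1]         # identical in all three bands
km <- kmeans(as.vector(mean.vals), centers = 8)  # 8 clusters is an assumption
cl.mat <- mean.vals
cl.mat[] <- km$cluster                           # cluster id for every pixel

# Turn the cluster assignments into a categorical raster and label the classes.
class.ras <- raster(cl.mat, xmn = 0, xmx = ncol.region,
                    ymn = 0, ymx = nrow.region)
class.ras <- ratify(class.ras)
rat <- levels(class.ras)[[1]]
rat$landcover <- paste("Class", rat$ID)          # placeholder class names
levels(class.ras) <- rat
levelplot(class.ras, att = "landcover")

# Overlay the original superpixel boundaries on the segmented-means image,
# building the figure up with successive calls to plot().
NDVI.rasmns <- raster(NDVI.means$slic_data[, , 1],
                      xmn = 0, xmx = ncol.region, ymn = 0, ymx = nrow.region)
bdry.mat <- ifelse(NDVI.80$slic_data[, , 1] == 0 &
                   NDVI.80$slic_data[, , 2] == 0 &
                   NDVI.80$slic_data[, , 3] == 0, 1, NA)
bdry.ras <- raster(bdry.mat, xmn = 0, xmx = ncol.region,
                   ymn = 0, ymx = nrow.region)
plot(NDVI.rasmns, col = gray.colors(20))
plot(bdry.ras, col = "black", legend = FALSE, add = TRUE)
```

If the goal is the land classification itself, the number of clusters and their labels are the substantive choices; the rest of the code is plumbing to get the results back into Raster* form for mapping.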
The application of image segmentation algorithms to remotely sensed image classification is a rapidly growing field, with numerous studies appearing every year. At this point, however, there is little in the way of theory on which to base an organization of the topic. If you are interested in following up on the subject, I encourage you to explore it on the Internet.

References

Appelhans, T., F. Detsch, C. Reudenbach, and S. Woellauer (2017). mapview: Interactive Viewing of Spatial Data in R. R package version 2.2.0. https://CRAN.R-project.org/package=mapview

Bivand, R., E. Pebesma, and V. Gómez-Rubio (2008). Applied Spatial Data Analysis with R. Springer, New York, NY.

Frey, B. J., and D. Dueck (2006). Mixture modeling by affinity propagation. Advances in Neural Information Processing Systems 18:379.

Hijmans, R. J. (2016). raster: Geographic Data Analysis and Modeling. R package version 2.5-8. https://CRAN.R-project.org/package=raster

Mouselimis, L. (2018). SuperpixelImageSegmentation: Superpixel Image Segmentation. R package version 1.0.0. https://CRAN.R-project.org/package=SuperpixelImageSegmentation

Mouselimis, L. (2019a). OpenImageR: An Image Processing Toolkit. R package version 1.1.5. https://CRAN.R-project.org/package=OpenImageR

Mouselimis, L. (2019b). Image Segmentation Based on Superpixel Images and Clustering. https://cran.r-project.org/web/packages/OpenImageR/vignettes/Image_segmentation_superpixels_clustering.html

Ren, X., and J. Malik (2003). Learning a classification model for segmentation. International Conference on Computer Vision, 2003, 10-17.

Stutz, D., A. Hermans, and B. Leibe (2018). Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166:1-27.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Conn.

Yassine, B., P. Taylor, and A. Story (2018). Fully automated lung segmentation from chest radiographs using SLICO superpixels. Analog Integrated Circuits and Signal Processing 95:423-428.

Zhou, B. (2013). Image segmentation using SLIC superpixels and affinity propagation clustering. International Journal of Science and Research. https://pdfs.semanticscholar.org/6533/654973054b742e725fd433265700c07b48a2.pdf
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now
R Tip: How To Look Up Matrix Values Quickly Posted: 30 Mar 2020 09:41 AM PDT [This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values. Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.
In R we can specify operations over vectors. For arithmetic this is easy, but for some more complex operations you "need to know the trick." Patrick Freeman (@PTFreeman) recently asked: what is the idiomatic way to look up a bunch of values from a matrix by row and column keys? This is actually easy to do if we first expand the data matrix into RDF-style (row, column, value) triples. If our data were in this format, we could merge/join it against our desired row and column indices. Let's start with an example data matrix.
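The original example matrix did not carry over into this digest, so here is a small stand-in with made-up values; the row and column names are chosen to match the lookups discussed below.

```r
# Example data matrix with named rows and columns (values are made up).
m <- matrix(seq(0.1, 0.6, by = 0.1), nrow = 2, byrow = TRUE,
            dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))
m
#     C1  C2  C3
# R1 0.1 0.2 0.3
# R2 0.4 0.5 0.6
```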
And our data frame containing the indices we want to look up.
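Again, the original data frame is not shown in the digest; this sketch builds one holding the three lookups named in the next sentence.

```r
# Row/column pairs we want to look up.
lookup <- data.frame(row = c("R1", "R2", "R2"),
                     col = c("C2", "C3", "C2"),
                     stringsAsFactors = FALSE)
lookup
```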
That is: we want to know the matrix values from [R1, C2], [R2, C3], and [R2, C2]. The trick is: how do we convert the matrix into triples? digEmAll has a great solution to that here.
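The snippet below is one common way to write a matrix as (row, col, value) triples; it follows the spirit of digEmAll's answer, though it may not match the original post character for character.

```r
# Expand the matrix into RDF-style triples: one row per matrix cell, with the
# original row and column names landed as columns.
triples <- data.frame(row = rownames(m)[row(m)],
                      col = colnames(m)[col(m)],
                      value = as.vector(m),
                      stringsAsFactors = FALSE)
triples
```

An equivalent route is as.data.frame(as.table(m)), which also produces one row per cell.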
What the above code has done is write each entry of the original matrix as a separate row, with the original row and column ids landed as new columns. This data format is very useful, and the snippet is worth saving for re-use, as getting it right is a clever step. Now we can vectorize our lookup using the merge() command, which produces a new joined table where the desired values are landed as a new column.
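A sketch of the vectorized lookup via merge(); note that merge() does not promise to preserve the row order of lookup, so an index column is carried along to restore it.

```r
# Join the lookup keys against the triples; each requested (row, col) pair
# picks up its matrix value as a new column.
lookup$order <- seq_len(nrow(lookup))   # remember the original request order
res <- merge(lookup, triples, by = c("row", "col"))
res <- res[order(res$order), c("row", "col", "value")]
res
#  row col value     (row labels omitted)
#   R1  C2   0.2
#   R2  C3   0.6
#   R2  C2   0.5
```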
And that is it: we have used vectorized and relational concepts to look up many values from a matrix very quickly. To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now