[R-bloggers] Building Big Shiny Apps — A Workflow (1/2) (and 4 more aRticles)
- Building Big Shiny Apps — A Workflow (1/2)
- 10 Tips for Choosing the Optimal Number of Clusters
- R tips and tricks – higher-order functions
- Statistics Sunday: Creating a Stacked Bar Chart for Rank Data
- littler 0.3.6: Two neat enhancements
Building Big Shiny Apps — A Workflow (1/2) Posted: 27 Jan 2019 11:00 AM PST (This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

As this is a rather long topic, I've divided this post into two parts: this first post will talk about the background and motivation, and the second post will present a step-by-step workflow and the necessary tools.

Motivation
The idea behind this poster (and now this blog post) is not to talk about how to deploy and scale, but about the process of building the app. Why? A lot of blog posts and books talk about putting apps in production. Very few talk about working on building these apps. This is why I've chosen to talk about the process, workflow, and tools we use at ThinkR when building big Shiny Apps. So, we won't talk about what to do when the app is ready; we'll talk about how to make it ready.

About Big Shiny Apps
If you're reading this page, chances are you already know what a Shiny App is — a web application that communicates with R, built in R, and working with R. With Shiny, almost anybody can create a prototype for a small data product in a matter of hours, and no knowledge of HTML, CSS or JavaScript is required, making it easy to rapidly create a POC. But what to do now that you want to build a big Shiny App? What's a big Shiny App?
Building a Big Shiny App: the Challenges

Finding a good UI (and sticking with it)
Choosing a UI is hard — we have a natural tendency, as coders, to focus on the backend, i.e. the algorithmic part of the application. But let's state the truth: no matter how complex and innovative your backend is, if your UI is bad, your application is bad. That's the hard truth. If people can't understand how to use your application, your application doesn't work, no matter how incredible the backend is. So try to find a simple and efficient UI (I'm personally a big fan of minimalistic UIs), one that people can understand and use in a matter of seconds. Don't implement features or visual elements that are not actually needed, just "in case". And spend time working on that UI, really thinking about which visual elements you are implementing. This part is crucial, as it will influence the rest of the work — a big app means numerous features, and it can be hard to find a way to organise all these features in an understandable, easy-to-use interface.

Working as a team
Big Shiny Apps usually mean that several people will work on the application. From the tools point of view:
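One tool-level practice for working as a team on a big app is splitting the UI and server logic into Shiny modules, which the next paragraph illustrates. Here is a minimal sketch of what a module looks like in code — the module, input and dataset names are made up for illustration, and callModule() is the module-server API that was current when this post was written:

library(shiny)

# UI part of a (made-up) "viewer" module: every input/output id is namespaced with NS()
viewerUI <- function(id) {
  ns <- NS(id)
  tagList(
    selectInput(ns("var"), "Variable", choices = names(mtcars)),
    plotOutput(ns("hist"))
  )
}

# Server part of the module, invoked with callModule()
viewer <- function(input, output, session, data) {
  output$hist <- renderPlot(hist(data[[input$var]], main = input$var))
}

# The top-level app only assembles modules, which keeps it small and easy to review
ui <- fluidPage(
  viewerUI("left"),
  viewerUI("right")
)
server <- function(input, output, session) {
  callModule(viewer, "left",  data = mtcars)
  callModule(viewer, "right", data = mtcars)
}
shinyApp(ui, server)

Because each module is namespaced, two developers can work on two modules without their input ids colliding.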
Here is, for example, the division of an app into modules and submodules:

From the organisational point of view
Making the app production-ready
This includes two things: scaling and maintaining. As said in the disclaimer, I won't expand on the topic of scaling, as many have written about that, but here is one piece of advice: make the R process running the app do as little as possible, and in particular prevent it from doing what it's not supposed to do. This includes: using JavaScript so that the client browser renders things instead of making R do the work (and don't worry, basic JS is easy to learn), using parallelization and async, and, if possible, having the heavy lifting done outside the R session running the app. Maintenance, on the other hand, is something to think about from the beginning. It includes being able to ensure that the application will work
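As an illustration of the "parallelization and async" advice, here is a minimal sketch of offloading a slow computation to another R process with the future and promises packages. The output name and the fake five-second computation are made up, and this is just one way to do it, not necessarily the setup used at ThinkR:

library(shiny)
library(future)
library(promises)
plan(multisession)  # heavy work runs in a separate R process

ui <- fluidPage(textOutput("result"))

server <- function(input, output, session) {
  output$result <- renderText({
    # The future runs outside the Shiny R session, so other users are not blocked
    future({
      Sys.sleep(5)          # stand-in for a heavy computation
      sum(rnorm(1e6))
    }) %...>%               # promise pipe: runs once the future resolves
      format(digits = 3)
  })
}

shinyApp(ui, server)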
That's it for today! Stay tuned for the second part of this post, which will describe the workflow in detail and talk about the tools to use. The post Building Big Shiny Apps — A Workflow (1/2) appeared first on (en) The R Task Force.
To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.
10 Tips for Choosing the Optimal Number of Clusters Posted: 27 Jan 2019 08:59 AM PST (This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers) Clustering is one of the most common unsupervised machine learning problems. Similarity between observations is defined using some inter-observation distance measures or correlation-based distance measures. There are 5 classes of clustering methods:
My desire to write this post came mainly from reading about the clustree package, the dendextend documentation, and the Practical Guide to Cluster Analysis in R book written by Alboukadel Kassambara, author of the factoextra package.

Data Set
I will be using a lesser-known data set from the cluster package: all.mammals.milk.1956, one which I haven't looked at before. This small dataset contains a list of 25 mammals and the constituents of their milk (water, protein, fat, lactose, ash percentages) from John Hartigan, Clustering Algorithms, Wiley, 1975.

First let's load the required packages.
library(tidyverse)
Now load the data.
data("all.mammals.milk.1956")
# subset dataset
Let's explore and visualize the data.
# Glimpse the data set
Observations: 25
All the variables are expressed as numeric. What about the statistical distribution?
# Summary of data set
# Histogram for each attribute
What's the relationship between the different attributes? Use corrplot() to create a correlation matrix.
corrplot(cor(mammals), type = "upper", method = "ellipse", tl.cex = 0.9)
When you have variables which are measured on different scales it is useful to scale the data.
mammals_scaled <- scale(mammals)
Dimensionality reduction can help with data visualization (e.g. the PCA method).
res.pca <- PCA(mammals_scaled, graph = FALSE)
# Visualize eigenvalues/variances
These are the 5 PCs that capture 80% of the variance. The scree plot shows that PC1 captures ~75% of the variance.
# Extract the results for variables
# Contributions of variables to PC1
# Contributions of variables to PC2
# Control variable colors using their contributions to the principal axes
From these visualizations it's apparent that water and lactose tend to increase together and that protein, ash and fat increase together, the two groups being inversely related.

Naïve K-means clustering
Partitioning clustering methods, like k-means and Partitioning Around Medoids (PAM), require that you specify the number of clusters to be generated. k-means is probably one of the most well-known partitioning methods. The idea behind k-means clustering is to define clusters so that the total within-cluster variation, which measures the compactness of the clusters, is minimized. We can compute k-means in R with the kmeans() function:
km2 <- kmeans(mammals_scaled, centers = 2, nstart = 30)
The above example groups the data into two clusters, centers = 2, and attempts multiple initial configurations, reporting on the best one. Because this algorithm is sensitive to the initial positions of the cluster centroids, adding nstart = 30 generates 30 initial configurations and keeps the best one (the solution with the lowest total within-cluster sum of squares). Because the number of clusters (k) needs to be set before we start, it can be advantageous to examine several different values of k.
kmean_calc <- function(df, ...){
km2 <- kmean_calc(mammals_scaled, 2)
p1 <- fviz_cluster(km2, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 2")
plot_grid(p1, p2, p3, p4, p5, p6, labels = c("k2", "k3", "k4", "k5", "k6", "k7"))
Although this visual assessment tells us where delineations occur between clusters, it does not tell us what the optimal number of clusters is.

Determining the Optimal Number of Clusters
A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm.
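The body of kmean_calc() and the p2–p6 plot objects are truncated in this excerpt. A minimal sketch of what they plausibly look like, assuming the mammals_scaled matrix built above and the factoextra and cowplot packages (the exact body of the original helper may differ):

library(factoextra)  # fviz_cluster()
library(cowplot)     # plot_grid()

# Wrap kmeans() so that centers (and any other argument) can be passed through
kmean_calc <- function(df, ...) {
  kmeans(df, nstart = 30, ...)
}

# One clustering plot per candidate k
plots <- lapply(2:7, function(k) {
  km <- kmean_calc(mammals_scaled, k)
  fviz_cluster(km, data = mammals_scaled, ellipse.type = "convex") +
    theme_minimal() + ggtitle(paste("k =", k))
})

plot_grid(plotlist = plots, labels = paste0("k", 2:7))

Note that ellipse.type replaces the older frame.type argument used in the post; both draw a convex hull around each cluster.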
There are more than thirty indices and methods for identifying the optimal number of clusters, so I'll just focus on a few here, including the very neat clustree package.

The "Elbow" Method
Probably the most well-known method is the elbow method, in which the sum of squares at each number of clusters is calculated and graphed, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
set.seed(31)
The elbow curve is helpful because it shows how increasing the number of clusters contributes to separating the clusters in a meaningful way, not in a marginal way. The bend indicates that additional clusters beyond the third have little value. (See [here] for a more mathematically rigorous interpretation and implementation of this method.)

The Gap Statistic
The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters will be the value that maximizes the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a random uniform distribution of points.
gap_stat <- clusGap(mammals_scaled, FUN = kmeans, nstart = 30, K.max = 24, B = 50)
fviz_gap_stat(gap_stat) + theme_minimal() + ggtitle("fviz_gap_stat: Gap Statistic")
The gap statistic plot shows the statistic by number of clusters (k), with standard errors drawn as vertical segments and the optimal value of k marked with a vertical dashed blue line. According to this observation, k = 2 is the optimal number of clusters in the data.

The Silhouette Method
Another visualization that can help determine the optimal number of clusters is the silhouette method. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
fviz_nbclust(mammals_scaled, kmeans, method = "silhouette", k.max = 24) + theme_minimal() + ggtitle("The Silhouette Plot")
This also suggests an optimum of 2 clusters.

The Sum of Squares Method
Another clustering validation method is to choose the optimal number of clusters by minimizing the within-cluster sum of squares (a measure of how tight each cluster is) and maximizing the between-cluster sum of squares (a measure of how separated each cluster is from the others).
ssc <- data.frame(
ssc %<>% gather(., key = "measurement", value = value, -kmeans)
#ssc$value <- log10(ssc$value)
ssc %>% ggplot(., aes(x=kmeans, y=log10(value), fill = measurement)) + geom_bar(stat = "identity", position = "dodge") + ggtitle("Cluster Model Comparison") + xlab("Number of Clusters") + ylab("Log10 Total Sum of Squares") + scale_x_discrete(name = "Number of Clusters", limits = c("0", "2", "3", "4", "5", "6", "7", "8"))
From this measurement it appears 7 clusters would be the appropriate choice.

NbClust
The NbClust package provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
res.nbclust <- NbClust(mammals_scaled, distance = "euclidean",
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() + ggtitle("NbClust's optimal number of clusters")
This suggests the optimal number of clusters is 3.
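Three code chunks in this section are truncated: the elbow-method plot that follows set.seed(31), the ssc data frame, and the NbClust() call. A minimal sketch of one way to fill them in, again assuming mammals_scaled from above (the exact arguments in the original post may differ):

library(factoextra)
library(NbClust)

# Elbow method: total within-cluster sum of squares for each candidate k
fviz_nbclust(mammals_scaled, kmeans, method = "wss", k.max = 24) +
  theme_minimal() + ggtitle("The Elbow Method")

# Within- and between-cluster sums of squares for a range of k,
# in the long-friendly shape the gather()/ggplot() code above expects
ks   <- 2:8
fits <- lapply(ks, function(k) kmeans(mammals_scaled, centers = k, nstart = 30))
ssc  <- data.frame(
  kmeans     = ks,
  within_ss  = sapply(fits, function(f) f$tot.withinss),
  between_ss = sapply(fits, function(f) f$betweenss)
)

# NbClust over its full battery of indices
res.nbclust <- NbClust(mammals_scaled, distance = "euclidean",
                       min.nc = 2, max.nc = 9,
                       method = "kmeans", index = "all")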
Clustree
The statistical methods above produce a single score that only considers a single set of clusters at a time. The clustree R package takes an alternative approach by considering how samples change groupings as the number of clusters increases. This is useful for showing which clusters are distinct and which are unstable. It doesn't explicitly tell you which choice of clusters is optimal, but it is useful for exploring possible choices. Let's take a look at 1 to 11 clusters.
tmp <- NULL
df <- data.frame(tmp)
# add a prefix to the column names
# get individual PCA
ind.coord <- df.pca$x
df <- bind_cols(as.data.frame(df), as.data.frame(ind.coord))
clustree(df, prefix = "k")
In this figure the size of each node corresponds to the number of samples in each cluster, and the arrows are coloured according to the number of samples each cluster receives. A separate set of arrows, the transparent ones, called the incoming node proportion, are also coloured and show how samples from one group end up in another group — an indicator of cluster instability. In this graph we see that as we move from k=2 to k=3 a number of species from the left-hand cluster are reassigned to the third cluster on the right. As we move from k=8 to k=9 we see one node with multiple incoming edges, an indicator that we have over-clustered the data.
It can also be useful to overlay this dimension on other dimensions in the data, particularly those that come from dimensionality reduction techniques. We can do this using the clustree_overlay() function:
df_subset <- df %>% select(1:8,12:13)
clustree_overlay(df_subset, prefix = "k", x_value = "PC1", y_value = "PC2")
I prefer to see it from the side, showing one of the x or y dimensions against the resolution dimension.
overlay_list <- clustree_overlay(df_subset, prefix = "k", x_value = "PC1",
overlay_list$x_side
overlay_list$y_side
This shows that we can get an indication of the correct clustering resolution by examining the edges, and we can overlay information to assess the quality of the clustering.

Choosing the appropriate algorithm
What about the choice of clustering algorithm? The clValid package can be used to simultaneously compare multiple clustering algorithms, to identify the best clustering approach and the optimal number of clusters. We will compare k-means, hierarchical and PAM clustering.
intern <- clValid(mammals_scaled, nClust = 2:24,
# Summary
Clustering Methods:
Connectivity and Silhouette are both measurements of connectedness, while the Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance.

Extracting Features of Clusters
We would like to answer questions like "what is it that makes this cluster unique from others?" and "what are the clusters that are similar to one another?" As mentioned earlier, it's difficult to assess the quality of results from clustering. We don't have true labels, so clustering is a good EDA starting point for exploring the differences between these clusters in greater detail. Let's select five clusters and interrogate the features of these clusters.
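The loop that fills tmp with the k = 1…11 cluster assignments, and the clValid() call, are truncated above. A sketch of one way to build the input that clustree() expects (one column of cluster labels per k, all sharing a prefix), assuming mammals_scaled and the tidyverse loaded earlier:

library(clustree)
library(clValid)

# One column of cluster assignments per value of k, named k1 ... k11
assignments <- sapply(1:11, function(k) {
  kmeans(mammals_scaled, centers = k, nstart = 30)$cluster
})
df <- as.data.frame(assignments)
colnames(df) <- paste0("k", 1:11)   # add a prefix to the column names

# Attach PCA coordinates so clusters can be overlaid on PC1/PC2
df.pca    <- prcomp(mammals_scaled)
ind.coord <- df.pca$x
df <- bind_cols(as.data.frame(df), as.data.frame(ind.coord))

clustree(df, prefix = "k")

# Internal validation of k-means, hierarchical and PAM clustering
intern <- clValid(mammals_scaled, nClust = 2:24,
                  clMethods = c("kmeans", "hierarchical", "pam"),
                  validation = "internal")
summary(intern)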
# Compute dissimilarity matrix with euclidean distances
# Hierarchical clustering using Ward's method
# Cut tree into 5 groups
# Visualize
# Execution of k-means with k=5
fviz_cluster(final, data = mammals_scaled) + theme_minimal() + ggtitle("k = 5")
Let's extract the clusters and add them back to our initial data to do some descriptive statistics at the cluster level:
as.data.frame(mammals_scaled) %>% mutate(Cluster = final$cluster) %>% group_by(Cluster) %>% summarise_all("mean") %>% kable() %>% kable_styling()
We see that cluster 2, composed solely of the Rabbit, has a high ash content. Group 3, composed of the seal and dolphin, is high in fat, which makes sense given the harsh demands of such a cold climate, while group 4 has a large lactose content.
mammals_df <- as.data.frame(mammals_scaled) %>% rownames_to_column()
cluster_pos <- as.data.frame(final$cluster) %>% rownames_to_column()
mammals_final <- inner_join(cluster_pos, mammals_df)
ggRadar(mammals_final[-1], aes(group = cluster), rescale = FALSE, legend.position = "none", size = 1, interactive = FALSE, use.label = TRUE) + facet_wrap(~cluster) + scale_y_discrete(breaks = NULL) + # don't show ticks
mammals_df <- as.data.frame(mammals_scaled)
ggpairs(mammals_df, 1:5, mapping = ggplot2::aes(color = cluster, alpha = 0.5),
# plot specific graphs from previous matrix with scatterplot
g <- ggplot(mammals_df, aes(x = water, y = lactose, color = cluster)) +
b <- ggplot(mammals_df, aes(x = protein, y = fat, color = cluster)) +
ggplot(mammals_df, aes(x = cluster, y = protein)) +
ggplot(mammals_df, aes(x = cluster, y = fat)) +
ggplot(mammals_df, aes(x = cluster, y = lactose)) +
ggplot(mammals_df, aes(x = cluster, y = ash)) +
ggplot(mammals_df, aes(x = cluster, y = water)) +
# Parallel coordinate plots allow us to put each feature in a separate column, with lines connecting each column
ggparcoord(data = mammals_df, columns = 1:5, groupColumn = 6, alphaLines = 0.4, title = "Parallel Coordinate Plot for the Mammals Milk Data", scale = "globalminmax", showPoints = TRUE) + theme(legend.position = "bottom")
If you find this article useful feel free to share it with others or recommend this article! As always, if you have any questions or comments feel free to leave your feedback below, or you can always reach me on LinkedIn. Till then, see you in the next post! 10 Tips for Choosing the Optimal Number of Clusters was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
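The code behind the five comments at the top of this section is not shown in this excerpt. A minimal sketch of those steps, using Ward linkage and k = 5 as the comments indicate (the object name grp is my own; final mirrors how the rest of the section uses it), assuming mammals_scaled from above:

library(factoextra)

d   <- dist(mammals_scaled, method = "euclidean")   # dissimilarity matrix (euclidean)
hc  <- hclust(d, method = "ward.D2")                # hierarchical clustering, Ward's method
grp <- cutree(hc, k = 5)                            # cut the tree into 5 groups
fviz_dend(hc, k = 5, rect = TRUE)                   # visualize the dendrogram

final <- kmeans(mammals_scaled, centers = 5, nstart = 30)  # execution of k-means with k = 5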
To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium.
R tips and tricks – higher-order functions Posted: 27 Jan 2019 08:35 AM PST (This article was first published on R – Eran Raviv, and kindly contributed to R-bloggers) A higher-order function is a function that takes one or more functions as arguments, and/or returns a function as its result. This can be super handy in programming when you want to tilt your code towards readability and still keep it concise.

# Load required libraries
> library(quantreg)
> library(magrittr)
# Generate some fake data
> eps <- rnorm(10, sd= 5)
> x <- c(1:10)
> y <- 2+2*x + eps
# create a higher order function
> higher_order_function <- function(func){
+   func(y ~ x) %>% summary
+ }
> # Give as an argument the function "lm"
> higher_order_function(lm)

Call:
func(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-12.0149  -0.7603   1.0969   2.7483   4.2373

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.3214     3.3338   0.396  0.70219
x             2.1690     0.5373   4.037  0.00375 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.88 on 8 degrees of freedom
Multiple R-squared:  0.6708, Adjusted R-squared:  0.6296
F-statistic: 16.3 on 1 and 8 DF,  p-value: 0.003751

# Now give as an argument the function rq (for regression quantile)
> higher_order_function(rq)

Call:
func(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd  upper bd
(Intercept) 3.80788      -1.26475  6.15759
x           1.83968       1.59747  2.98423

It's also quite safe to use in that if you provide a non-existent function it will not default to some unknown behavior but will return an error:

> higher_order_function(mm)
Error in eval(lhs, parent, parent) : object 'mm' not found

However, this function can also be written as a sequence of if statements, like so:

> if_function <- function(x,y, which_reg){
+   if (which_reg== "OLS") { lm(y~x) %>% summary }
+   else if (which_reg== "LAD") { rq(y~x) %>% summary }
+ }
> if_function(x,y, which_reg= "OLS")

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-12.0149  -0.7603   1.0969   2.7483   4.2373

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.3214     3.3338   0.396  0.70219
x             2.1690     0.5373   4.037  0.00375 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.88 on 8 degrees of freedom
Multiple R-squared:  0.6708, Adjusted R-squared:  0.6296
F-statistic: 16.3 on 1 and 8 DF,  p-value: 0.003751

> if_function(x,y, which_reg= "LAD")

Call:
rq(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd  upper bd
(Intercept) 3.80788      -1.26475  6.15759
x           1.83968       1.59747  2.98423

Using higher-order functions does not seem to create any additional computational cost:

> library(microbenchmark)
> microbenchmark( higher_order_function(rq), if_function(x, y, "LAD") )
Unit: milliseconds
                      expr      min       lq     mean   median       uq      max neval
 higher_order_function(rq) 1.463210 1.498967 1.563553 1.527253 1.624969 2.280419   100
  if_function(x, y, "LAD") 1.468262 1.498464 1.584453 1.618997 1.644462 2.082765   100

> microbenchmark( higher_order_function(lm), if_function(x, y, "OLS") )
Unit: microseconds
                      expr     min       lq     mean   median      uq      max neval
 higher_order_function(lm) 916.858 928.8825 946.9838 935.3930 955.791 1025.575   100
  if_function(x, y, "OLS") 918.674 928.1260 953.2587 938.0465 958.284 1433.167   100

So you can make your code more concise with little computational overhead.
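The definition at the top also mentions functions that return a function as their result. A small sketch of that side of the idea, reusing the x, y and libraries from the snippet above (make_fitter is a made-up name, not from the original post):

# A function factory: takes a model-fitting function and returns a new function
make_fitter <- function(func) {
  function(formula, ...) summary(func(formula, ...))
}

ols_fit <- make_fitter(lm)
lad_fit <- make_fitter(rq)

ols_fit(y ~ x)   # same result as higher_order_function(lm)
lad_fit(y ~ x)   # same result as higher_order_function(rq)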
To leave a comment for the author, please follow the link and comment on their blog: R – Eran Raviv.
Statistics Sunday: Creating a Stacked Bar Chart for Rank Data Posted: 27 Jan 2019 06:00 AM PST (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I wanted them all on the same chart. Basically, I needed to create a dataset with:
- 1 variable to represent the factor being ranked
What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merge them together with rbind. I then created the chart with ggplot. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3. First, let's read in our sample dataset – note that these data were generated only for this example and are not real data:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.5.1
ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)
This dataset contains 7 variables – 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not-so-pretty way:
salary <- as.data.frame(table(ranks$Salary))
With my not-so-pretty data wrangling, the chart itself is actually pretty easy:
ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the top 2 and 3 factors. Recognition wasn't ranked by most people, but those who did considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.
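The wrangling and plotting code above is truncated in this excerpt (only the salary frequency table and the start of the ggplot() call survive). A sketch of one way the full pipeline could look — the helper name, the column names other than Salary, and the display labels are guesses for illustration, not the original code:

# Frequency table for one rank variable, tagged with the factor's display name
rank_freq <- function(var, name) {
  freqs <- as.data.frame(table(var))   # rank given and its count
  names(freqs) <- c("Var1", "Freq")
  freqs$Name <- name
  freqs
}

rank_chart <- rbind(
  rank_freq(ranks$Salary,      "Salary"),
  rank_freq(ranks$Recognition, "Recognition"),
  rank_freq(ranks$PTO,         "Paid Time Off"),
  rank_freq(ranks$Insurance,   "Insurance"),
  rank_freq(ranks$Flexible,    "Flexible Scheduling"),
  rank_freq(ranks$Learn,       "Opportunity to Learn")
)

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "Factor", y = "Number of Respondents", fill = "Rank")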
To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
littler 0.3.6: Two neat enhancements Posted: 26 Jan 2019 08:45 AM PST (This article was first published on Thinking inside the box, and kindly contributed to R-bloggers) The seventh release of littler as a CRAN package is now available, following in the now more than twelve-year history as a package started by Jeff in 2006, and joined by me a few weeks later. littler is the first command-line interface for R and predates Rscript. littler lives on Linux and Unix, has its difficulties on macOS due to yet-another-braindeadedness there (whoever thought case-insensitive filesystems as a default were a good idea?) and simply does not exist on Windows (yet – the build system could be extended – see RInside for an existence proof, and volunteers are welcome!). A few examples are highlighted at the GitHub repo, as well as in the examples vignette. This release brings updates to the two (or three) scripts I use all the time to install or update packages.
CRANberries provides a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs, and now of course also from its CRAN page. Comments and suggestions are welcome at the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.