[R-bloggers] Building Big Shiny Apps — A Workflow (1/2) (and 4 more aRticles)


Building Big Shiny Apps — A Workflow (1/2)

Posted: 27 Jan 2019 11:00 AM PST

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

During the rstudio::conf(2019L), I presented an e-poster called "Building Big Shiny Apps — A Workflow". You can find the poster here, and this blog post is an attempt at transcribing what I talked about while presenting it.

As this is a rather long topic, I've divided this post into two parts: this first post will talk about the background and motivation, and the second post will present a step by step workflow and the necessary tools.

Motivation

The idea behind this poster (and now this blog post) is not to talk about how to deploy and scale, but about the process of building the app. Why? A lot of blog posts and books talk about putting apps in production, but very few talk about the work of building these apps. This is why I've chosen to talk about the process, workflow, and tools we use at ThinkR when building big Shiny Apps.

So, we won't talk about what to do once the app is ready; we'll talk about how to make it ready.

About Big Shiny Apps

If you're reading this page, chances are you already know what a Shiny App is — a web application that communicates with R, built in R, and working with R. With Shiny, almost anybody can create a prototype for a small data product in a matter of hours, with no knowledge of HTML, CSS or JavaScript required — making it really easy to rapidly create a proof of concept (POC).

But what to do now that you want to build a big Shiny App?

What's a big Shiny App?

  • Well, first, one that includes several thousand lines of code (R and others).
  • It's also one that is potentially developed by several coders, working on the same application at the same time.
  • It's an application where scaling matters.
  • Maintainability and ease of upgrading are important.
  • In many cases, Shiny Apps in production are not used by "tech literate" users.
  • People rely on this application for making real-world decisions, with real consequences — just as Joe Cheng phrased it in his definition of what "in production" means:

Building a Big Shiny App, the Challenges

Finding a good UI (and sticking with it)

Choosing a UI is hard — as coders, we have a natural tendency to focus on the backend, i.e., the algorithmic part of the application. But let's state the truth: no matter how complex and innovative your backend is, your application is bad if your UI is bad. That's the hard truth. If people can't understand how to use your application, your application doesn't work, no matter how incredible the backend is.

So try to find a simple and efficient UI (I'm personally a big fan of minimalistic UI). One that people can understand and use in a matter of seconds. Don't implement features or visual elements that are not actually needed, just "in case". And spend time working on that UI, really thinking about what visual elements you are implementing.

This part is crucial, as it will influence the rest of the work — a big app means numerous features, and it can be hard to organise all of them into an understandable, easy-to-use user interface.

Working as a team

Big Shiny Apps usually mean that several people will work on the application — at ThinkR, for example, three to four people usually work on the same app. So, how do we organize that?

From the tools point of view:

  • Use version control (not sure I have to expand on that topic 😉 )
  • Think of your Shiny app as a tree, and divide it into as many little pieces as possible. Then create one Shiny module per piece (see the sketch below). This allows you to split the work, and also to have smaller files — it's easier to work on 40 files of 200 lines than on one big app.R file.
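For illustration, here is a minimal sketch of what one such piece can look like (the module name and its content are hypothetical, not taken from the poster) — a ui/server pair that can live in its own file:

library(shiny)

# mod_hist.R -- a hypothetical leaf of the module tree
mod_hist_ui <- function(id) {
  ns <- NS(id)                       # namespace the inputs and outputs
  tagList(
    selectInput(ns("var"), "Variable", choices = names(mtcars)),
    plotOutput(ns("plot"))
  )
}

mod_hist_server <- function(input, output, session) {
  output$plot <- renderPlot({
    hist(mtcars[[input$var]], main = input$var)
  })
}

# In the main app: mod_hist_ui("hist1") in the UI,
# and callModule(mod_hist_server, "hist1") in the server function.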

Here is, for example, the division of an app into modules and sub modules:

From the organisational point of view

  • Define one person who is in charge of the big picture. This person will kick off the project and write the skeleton of the app, with the right module and file structure. They will also be in charge of accepting merge requests from the other developers and of orchestrating the master and dev branches.
  • List the tasks, and open one issue for each task on your version control system. Each issue will be solved in a separate branch.
  • Finally, assign one module to one developer — if working on one module looks like a two-person job, divide it again into two submodules. This is a relatively complex task, as the output of one module influences the input of another, so be sure to assign them carefully. This division into modules allows working simultaneously on the same app without interfering with other people's code — as each developer works on their own modules, collaboration is much simpler.

Making the app production ready

This includes two things: scaling and maintaining. As said in the disclaimer, I won't expand on the topic of scaling, as many have written about that, but here is one piece of advice: make the R process running the app do as little as possible, and in particular prevent it from doing what it's not supposed to do. That means: use JavaScript so that the client browser renders things (instead of making R do the work — and don't worry, basic JS is easy to learn 😉 ), use parallelization and async, and if possible, have the heavy lifting done outside the R session running the app.
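To illustrate the async point, here is a minimal sketch using the future and promises packages (the numeric input and the simulated slow computation are placeholders, not from the original post):

library(shiny)
library(future)
library(promises)
plan(multisession)                       # futures run in separate R processes

ui <- fluidPage(
  numericInput("n", "Number of draws", value = 1e6),
  textOutput("result")
)

server <- function(input, output, session) {
  output$result <- renderText({
    n <- input$n                         # read reactive values before the future
    future({ mean(rnorm(n)) }) %...>%    # heavy work happens off the main process
      format(digits = 3)                 # then post-process the resolved value
  })
}

shinyApp(ui, server)

While the future is running, the R session serving the app stays free to answer other users.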

Maintenance, on the other hand, is something to think about from the beginning. It includes being able to ensure that the application will keep working in the long run, and that new features can be easily implemented.

  • Working in the long run: separate the "business logic" code (the data manipulation and the algorithms, which can work outside the context of the app) from the code building the application. That way, you can write regression tests for these functions to ensure they are stable (see the sketch after this list).
  • Implementing new elements: as we are working with modules, it's easy to insert new elements into the global application.
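A minimal sketch of that separation (the function and test are illustrative, not from the original post): the business logic is a plain R function, so it can be regression-tested without ever launching the app.

# R/compute_top_n.R -- business logic: plain R, no Shiny
compute_top_n <- function(df, n = 5) {
  stopifnot(is.data.frame(df), "value" %in% names(df))
  df[order(df$value, decreasing = TRUE), , drop = FALSE][seq_len(n), ]
}

# tests/testthat/test-compute_top_n.R -- regression test, runs without the app
library(testthat)
test_that("compute_top_n returns the n largest rows", {
  df <- data.frame(id = 1:10, value = c(5, 1, 9, 3, 10, 2, 8, 4, 7, 6))
  res <- compute_top_n(df, n = 3)
  expect_equal(nrow(res), 3)
  expect_equal(res$value, c(10, 9, 8))
})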

That's it for today! Stay tuned for the second part of this post, which will describe the workflow in detail and talk about the tools to use 🙂

The post Building Big Shiny Apps — A Workflow (1/2) appeared first on (en) The R Task Force.

To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.


10 Tips for Choosing the Optimal Number of Clusters

Posted: 27 Jan 2019 08:59 AM PST

(This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

Photo by Pakata Goh on Unsplash

Clustering is one of the most common unsupervised machine learning problems. Similarity between observations is defined using some inter-observation distance measures or correlation-based distance measures.

There are 5 classes of clustering methods:

  • Hierarchical Clustering
  • Partitioning Methods (k-means, PAM, CLARA)
  • Density-Based Clustering
  • Model-based Clustering
  • Fuzzy Clustering

My desire to write this post came mainly from reading about the clustree package, the dendextend documentation, and the Practical Guide to Cluster Analysis in R book written by Alboukadel Kassambara, author of the factoextra package.

Data Set

I will be using a lesser known data set from the cluster package: all.mammals.milk.1956, one which I haven't looked at before.

This small dataset contains a list of 25 mammals and the constituents of their milk (water, protein, fat, lactose, ash percentages) from John Hartigan, Clustering Algorithms, Wiley, 1975.

First let's load the required packages.

library(tidyverse)
library(magrittr)
library(cluster)
library(cluster.datasets)
library(cowplot)
library(NbClust)
library(clValid)
library(ggfortify)
library(clustree)
library(dendextend)
library(factoextra)
library(FactoMineR)
library(corrplot)
library(GGally)
library(ggiraphExtra)
library(knitr)
library(kableExtra)

Now load the data.

data("all.mammals.milk.1956")
raw_mammals <- all.mammals.milk.1956
# drop the name column; it will be used later to label the rows
mammals <- raw_mammals %>% select(-name)
mammals <- as_tibble(mammals)

Let's explore and visualize the data.

# Glimpse the data set
glimpse(mammals)
Observations: 25
Variables: 5
$ water   <dbl> 90.1, 88.5, 88.4, 90.3, 90.4, 87.7, 86.9, 82.1, 81.9, 81.6, 81.6, 86.5, 90.0,...
$ protein <dbl> 2.6, 1.4, 2.2, 1.7, 0.6, 3.5, 4.8, 5.9, 7.4, 10.1, 6.6, 3.9, 2.0, 7.1, 3.0, 5...
$ fat     <dbl> 1.0, 3.5, 2.7, 1.4, 4.5, 3.4, 1.7, 7.9, 7.2, 6.3, 5.9, 3.2, 1.8, 5.1, 4.8, 6....
$ lactose <dbl> 6.9, 6.0, 6.4, 6.2, 4.4, 4.8, 5.7, 4.7, 2.7, 4.4, 4.9, 5.6, 5.5, 3.7, 5.3, 4....
$ ash     <dbl> 0.35, 0.24, 0.18, 0.40, 0.10, 0.71, 0.90, 0.78, 0.85, 0.75, 0.93, 0.80, 0.47,...

All the variables are expressed as numeric. What about the statistical distribution?

# Summary of data set
summary(mammals) %>% kable() %>% kable_styling()
# Histogram for each attribute
mammals %>%
  gather(Attributes, value, 1:5) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "lightblue2", color = "black") +
  facet_wrap(~Attributes, scales = "free_x") +
  labs(x = "Value", y = "Frequency")

What's the relationship between the different attributes? Use `corrplot()` to create a correlation matrix.

corrplot(cor(mammals), type = "upper", method = "ellipse", tl.cex = 0.9)

When you have variables which are measured in different scales it is useful to scale the data.

mammals_scaled <- scale(mammals)
rownames(mammals_scaled) <- raw_mammals$name

Dimensionality reduction can help with data visualization (e.g. PCA method).

res.pca <- PCA(mammals_scaled,  graph = FALSE)
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

These are the 5 PCs that capture 80% of the variance. The scree plot shows that PC1 captured ~ 75% of the variance.

# Extract the results for variables
var <- get_pca_var(res.pca)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
# Control variable colors using their contributions to the principal axes
fviz_pca_var(res.pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
) + theme_minimal() + ggtitle("Variables - PCA")

From these visualizations it's apparent that water and lactose tend to increase together, and that protein, ash and fat increase together; the two groups are inversely related.

Naïve K-means clustering

Partitioning clustering methods, like k-means and Partitioning Around Medoids (PAM), require that you specify the number of clusters to be generated.

k-means is probably the best-known partitioning method. The idea behind k-means clustering is to define clusters such that the total within-cluster variation (which measures the compactness of the clusters) is minimized.

We can compute k-means in R with the kmeans() function:

km2 <- kmeans(mammals_scaled, centers = 2, nstart = 30)

The above example groups the data into two clusters (centers = 2) and attempts multiple initial configurations, reporting the best one. Because the algorithm is sensitive to the initial positions of the cluster centroids, nstart = 30 generates 30 initial configurations and keeps the one with the lowest total within-cluster sum of squares.
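As a quick sanity check of the quantity being minimised (a small sketch using the km2 object created above):

km2$withinss                                     # within-cluster sum of squares, one per cluster
km2$tot.withinss                                 # their total: the quantity k-means minimises
all.equal(sum(km2$withinss), km2$tot.withinss)   # TRUE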

Because the number of clusters (k) needs to be set before we start it can be advantageous to examine several different values of k.

kmean_calc <- function(df, k){
  # k-means with several random starts, keeping the best solution
  kmeans(df, centers = k, nstart = 30)
}
km2 <- kmean_calc(mammals_scaled, 2)
km3 <- kmean_calc(mammals_scaled, 3)
km4 <- kmean_calc(mammals_scaled, 4)
km5 <- kmean_calc(mammals_scaled, 5)
km6 <- kmean_calc(mammals_scaled, 6)
km7 <- kmean_calc(mammals_scaled, 7)
km8 <- kmean_calc(mammals_scaled, 8)
km9 <- kmean_calc(mammals_scaled, 9)
km10 <- kmean_calc(mammals_scaled, 10)
km11 <- kmean_calc(mammals_scaled, 11)
p1 <- fviz_cluster(km2, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 2") 
p2 <- fviz_cluster(km3, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 3")
p3 <- fviz_cluster(km4, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 4")
p4 <- fviz_cluster(km5, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 5")
p5 <- fviz_cluster(km6, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 6")
p6 <- fviz_cluster(km7, data = mammals_scaled, frame.type = "convex") + theme_minimal() + ggtitle("k = 7")
plot_grid(p1, p2, p3, p4, p5, p6, labels = c("k2", "k3", "k4", "k5", "k6", "k7"))

Although this visual assessment tells us where delineations occur between clusters, it does not tell us what the optimal number of clusters is.

Determining Optimal Number of Clusters

A variety of measures have been proposed in the literature for evaluating clustering results. The term clustering validation is used to designate the procedure of evaluating the results of a clustering algorithm. There are more than thirty indices and methods for identifying the optimal number of clusters, so I'll just focus on a few here, including the very neat clustree package.

The "Elbow" Method

Probably the best-known method is the elbow method, in which the within-cluster sum of squares is calculated and graphed for each number of clusters, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.

set.seed(31)
# function to compute total within-cluster sum of squares
fviz_nbclust(mammals_scaled, kmeans, method = "wss", k.max = 24) + theme_minimal() + ggtitle("the Elbow Method")

The elbow curve method is helpful because it shows how increasing the number of clusters contributes to separating the clusters in a meaningful way rather than a marginal one. The bend indicates that additional clusters beyond the third add little value. (See [here] for a more mathematically rigorous interpretation and implementation of this method.)

The Gap Statistic

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., the one that yields the largest gap), meaning the clustering structure is far from a random uniform distribution of points.

gap_stat <- clusGap(mammals_scaled, FUN = kmeans, nstart = 30, K.max = 24, B = 50)
fviz_gap_stat(gap_stat) + theme_minimal() + ggtitle("fviz_gap_stat: Gap Statistic")

The gap stats plot shows the statistics by number of clusters (k) with standard errors drawn with vertical segments and the optimal value of k marked with a vertical dashed blue line. According to this observation k = 2 is the optimal number of clusters in the data.

The Silhouette Method

Another visualization that can help determine the optimal number of clusters is the silhouette method. The average silhouette method computes the average silhouette of observations for different values of k; the optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values.

fviz_nbclust(mammals_scaled, kmeans, method = "silhouette", k.max = 24) + theme_minimal() + ggtitle("The Silhouette Plot")

This also suggests an optimal of 2 clusters.

The Sum of Squares Method

Another clustering validation approach is to choose the number of clusters that minimizes the within-cluster sum of squares (a measure of how tight each cluster is) while maximizing the between-cluster sum of squares (a measure of how separated each cluster is from the others).

ssc <- data.frame(
kmeans = c(2,3,4,5,6,7,8),
within_ss = c(mean(km2$withinss), mean(km3$withinss), mean(km4$withinss), mean(km5$withinss), mean(km6$withinss), mean(km7$withinss), mean(km8$withinss)),
between_ss = c(km2$betweenss, km3$betweenss, km4$betweenss, km5$betweenss, km6$betweenss, km7$betweenss, km8$betweenss)
)
ssc %<>% gather(., key = "measurement", value = value, -kmeans)
#ssc$value <- log10(ssc$value)
ssc %>% ggplot(., aes(x=kmeans, y=log10(value), fill = measurement)) + geom_bar(stat = "identity", position = "dodge") + ggtitle("Cluster Model Comparison") + xlab("Number of Clusters") + ylab("Log10 Total Sum of Squares") + scale_x_discrete(name = "Number of Clusters", limits = c("0", "2", "3", "4", "5", "6", "7", "8"))

From this measurement it appears 7 clusters would be the appropriate choice.

NbClust

The NbClust package provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

res.nbclust <- NbClust(mammals_scaled, distance = "euclidean",
min.nc = 2, max.nc = 9,
method = "complete", index ="all")
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() + ggtitle("NbClust's optimal number of clusters")

This suggests the optimal number of clusters is 3.

Clustree

The statistical methods above produce a single score that only considers a single set of clusters at a time. The clustree R package takes an alternative approach by considering how samples change groupings as the number of clusters increases. This is useful for showing which clusters are distinct and which are unstable. It doesn't explicitly tell you which number of clusters is optimal, but it is useful for exploring possible choices.

Let's take a look at 1 to 11 clusters.

tmp <- vector("list", 11)
for (k in 1:11){
  # keep only the cluster assignments for each value of k
  tmp[[k]] <- kmeans(mammals_scaled, k, nstart = 30)$cluster
}
# one column per value of k, named k1, k2, ..., k11
names(tmp) <- paste0("k", 1:11)
df <- as.data.frame(tmp)
# get individual PCA
df.pca <- prcomp(df, center = TRUE, scale. = FALSE)
ind.coord <- df.pca$x
ind.coord <- ind.coord[,1:2]
df <- bind_cols(as.data.frame(df), as.data.frame(ind.coord))
clustree(df, prefix = "k")

In this figure the size of each node corresponds to the number of samples in each cluster, and the arrows are coloured according to the number of samples each cluster receives. A separate set of arrows, the transparent ones, shows the incoming node proportion — how samples from one group end up in another group — an indicator of cluster instability.

In this graph we see that as we move from k=2 to k=3 a number of species from the left cluster are reassigned to the third cluster on the right. As we move from k=8 to k=9 we see one node with multiple incoming edges, an indicator that we have over-clustered the data.

It can also be useful to overlay this dimension on other dimensions in the data, particularly those that come from dimensionality reduction techniques. We can do this using the clustree_overlay() function:

df_subset <- df %>% select(1:8,12:13)
clustree_overlay(df_subset, prefix = "k", x_value = "PC1", y_value = "PC2")

I prefer to see it from the side, showing one of the x or y dimensions against the resolution dimension.

overlay_list <- clustree_overlay(df_subset, prefix = "k", x_value = "PC1",
y_value = "PC2", plot_sides = TRUE)
overlay_list$x_side
overlay_list$y_side

This shows that we can get an indication of the correct clustering resolution by examining the edges, and that we can overlay additional information to assess the quality of the clustering.

Choosing the appropriate algorithm

What about the choice of the appropriate clustering algorithm? The clValid package can be used to compare multiple clustering algorithms simultaneously, to identify the best clustering approach and the optimal number of clusters. We will compare k-means, hierarchical and PAM clustering.

intern <- clValid(mammals_scaled, nClust = 2:24, 
clMethods = c("hierarchical","kmeans","pam"), validation = "internal")
# Summary
summary(intern) %>% kable() %>% kable_styling()
Clustering Methods:
hierarchical kmeans pam

Cluster sizes:
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Validation Measures:
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

hierarchical Connectivity 4.1829 10.5746 13.2579 20.1579 22.8508 25.8258 32.6270 35.3032 38.2905 39.2405 41.2405 45.7742 47.2742 50.6075 52.6075 55.8575 58.7242 60.7242 63.2242 65.2242 67.2242 69.2242 71.2242
Dunn 0.3595 0.3086 0.3282 0.2978 0.3430 0.3430 0.4390 0.4390 0.5804 0.5938 0.5938 0.8497 0.8497 0.5848 0.5848 0.4926 0.9138 0.9138 0.8892 0.9049 0.9335 1.0558 2.1253
Silhouette 0.5098 0.5091 0.4592 0.4077 0.4077 0.3664 0.3484 0.4060 0.3801 0.3749 0.3322 0.3646 0.3418 0.2650 0.2317 0.2166 0.2469 0.2213 0.1659 0.1207 0.1050 0.0832 0.0691
kmeans Connectivity 7.2385 10.5746 15.8159 20.1579 22.8508 25.8258 33.5198 35.3032 38.2905 39.2405 41.2405 45.7742 47.2742 51.8909 53.8909 57.1409 58.7242 60.7242 63.2242 65.2242 67.2242 69.2242 71.2242
Dunn 0.2070 0.3086 0.2884 0.2978 0.3430 0.3430 0.3861 0.4390 0.5804 0.5938 0.5938 0.8497 0.8497 0.5866 0.5866 0.5725 0.9138 0.9138 0.8892 0.9049 0.9335 1.0558 2.1253
Silhouette 0.5122 0.5091 0.4260 0.4077 0.4077 0.3664 0.3676 0.4060 0.3801 0.3749 0.3322 0.3646 0.3418 0.2811 0.2478 0.2402 0.2469 0.2213 0.1659 0.1207 0.1050 0.0832 0.0691
pam Connectivity 7.2385 14.1385 17.4746 24.0024 26.6857 32.0413 33.8913 36.0579 38.6607 40.6607 42.7869 45.7742 47.2742 51.7242 53.7242 56.9742 58.7242 60.7242 62.7242 64.7242 66.7242 69.2242 71.2242
Dunn 0.2070 0.1462 0.2180 0.2180 0.2978 0.2980 0.4390 0.4390 0.4390 0.4390 0.4390 0.8497 0.8497 0.5314 0.5314 0.4782 0.9138 0.9138 0.8333 0.8189 0.7937 1.0558 2.1253
Silhouette 0.5122 0.3716 0.4250 0.3581 0.3587 0.3318 0.3606 0.3592 0.3664 0.3237 0.3665 0.3646 0.3418 0.2830 0.2497 0.2389 0.2469 0.2213 0.1758 0.1598 0.1380 0.0832 0.0691

Optimal Scores:

Score Method Clusters
Connectivity 4.1829 hierarchical 2
Dunn 2.1253 hierarchical 24
Silhouette 0.5122 kmeans 2

Connectivity and Silhouette are both measurements of connectedness while the Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance.
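These scores can also be computed directly for a single solution; a small sketch for the k = 2 k-means result from earlier (clValid::dunn() and cluster::silhouette() come from packages already loaded above):

d2 <- dist(mammals_scaled, method = "euclidean")
dunn(distance = d2, clusters = km2$cluster)           # Dunn index for k = 2
mean(silhouette(km2$cluster, d2)[, "sil_width"])      # average silhouette width for k = 2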

Extracting Features of Clusters

We would like to answer questions like "what is it that makes this cluster unique from others?" and "which clusters are similar to one another?"

As mentioned earlier it's difficult to assess the quality of results from clustering. We don't have true labels so clustering is a good EDA starting point for exploring the differences between these clusters in greater detail. Let's select five clusters and interrogate the features of these clusters.

# Compute dissimilarity matrix with euclidean distances
d <- dist(mammals_scaled, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Cut tree into 5 groups
grp <- cutree(res.hc, k = 5)
# Visualize
plot(res.hc, cex = 0.6) # plot tree
rect.hclust(res.hc, k = 5, border = 2:5) # add rectangle
# Execution of k-means with k=5
final <- kmeans(mammals_scaled, 5, nstart = 30)
fviz_cluster(final, data = mammals_scaled) + theme_minimal() + ggtitle("k = 5")

Let's extract the clusters and add them back to our initial data to do some descriptive statistics at the cluster level:

as.data.frame(mammals_scaled) %>% mutate(Cluster = final$cluster) %>% group_by(Cluster) %>% summarise_all("mean") %>% kable() %>% kable_styling()

We see that cluster 2, composed solely of the Rabbit, has a high ash content. Group 3, composed of the seal and dolphin, is high in fat, which makes sense given the harsh demands of such a cold climate, while group 4 has a high lactose content.

mammals_df <- as.data.frame(mammals_scaled) %>% rownames_to_column()
cluster_pos <- as.data.frame(final$cluster) %>% rownames_to_column()
colnames(cluster_pos) <- c("rowname", "cluster")
mammals_final <- inner_join(cluster_pos, mammals_df)
ggRadar(mammals_final[-1], aes(group = cluster), rescale = FALSE, legend.position = "none", size = 1, interactive = FALSE, use.label = TRUE) + facet_wrap(~cluster) + scale_y_discrete(breaks = NULL) + # don't show ticks
theme(axis.text.x = element_text(size = 10)) + scale_fill_manual(values = rep("#1c6193", nrow(mammals_final))) +
scale_color_manual(values = rep("#1c6193", nrow(mammals_final))) +
ggtitle("Mammals Milk Attributes")
mammals_df <- as.data.frame(mammals_scaled)
mammals_df$cluster <- final$cluster
mammals_df$cluster <- as.character(mammals_df$cluster)
ggpairs(mammals_df, 1:5, mapping = ggplot2::aes(color = cluster, alpha = 0.5), 
diag = list(continuous = wrap("densityDiag")),
lower=list(continuous = wrap("points", alpha=0.9)))
# plot specific graphs from previous matrix with scatterplot
g <- ggplot(mammals_df, aes(x = water, y = lactose, color = cluster)) +
geom_point() +
theme(legend.position = "bottom")
ggExtra::ggMarginal(g, type = "histogram", bins = 20, color = "grey", fill = "blue")
b <- ggplot(mammals_df, aes(x = protein, y = fat, color = cluster)) +
geom_point() +
theme(legend.position = "bottom")
ggExtra::ggMarginal(b, type = "histogram", bins = 20, color = "grey", fill = "blue")
ggplot(mammals_df, aes(x = cluster, y = protein)) + 
geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = fat)) + 
geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = lactose)) + 
geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = ash)) + 
geom_boxplot(aes(fill = cluster))
ggplot(mammals_df, aes(x = cluster, y = water)) + 
geom_boxplot(aes(fill = cluster))
# Parallel coordinate plots put each feature in a separate column, with lines connecting each observation's values across columns
ggparcoord(data = mammals_df, columns = 1:5, groupColumn = 6, alphaLines = 0.4, title = "Parallel Coordinate Plot for the Mammals Milk Data", scale = "globalminmax", showPoints = TRUE) + theme(legend.position = "bottom")

If you find this article useful feel free to share it with others or recommend this article! 😃

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄


10 Tips for Choosing the Optimal Number of Clusters was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium.


R tips and tricks – higher-order functions

Posted: 27 Jan 2019 08:35 AM PST

(This article was first published on R – Eran Raviv, and kindly contributed to R-bloggers)

A higher-order function is a function that takes one or more functions as arguments, and/or returns a function as its result. This can be super handy in programming when you want to tilt your code towards readability and still keep it concise.
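The example that follows passes functions as arguments; for the other half of the definition, here is a minimal sketch of a function that returns a function (the names are illustrative):

# A "function factory": make_power() returns a new function that remembers p
make_power <- function(p) {
  function(x) x^p
}
square <- make_power(2)
cube   <- make_power(3)
square(4)   # 16
cube(2)     # 8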

Consider the following code:

# Generate some fake data
> eps <- rnorm(10, sd= 5)
> x <- c(1:10)
> y <- 2+2*x + eps

# Load libraries required
> library(quantreg)
> library(magrittr)
> eps <- rnorm(10, sd= 5)
> x <- c(1:10)
> y <- 2+2*x + eps

# create a higher order function
> higher_order_function <- function(func){
+   func(y ~ x) %>% summary
+ }

# Give as an argument the function "lm"
> higher_order_function(lm)

Call:
func(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0149  -0.7603   1.0969   2.7483   4.2373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.3214     3.3338   0.396  0.70219   
x             2.1690     0.5373   4.037  0.00375 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.88 on 8 degrees of freedom
Multiple R-squared:  0.6708,  Adjusted R-squared:  0.6296 
F-statistic: 16.3 on 1 and 8 DF,  p-value: 0.003751

# Now give as an argument the function rq (for regression quantile)
> higher_order_function(rq)

Call: func(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept)  3.80788     -1.26475  6.15759
x            1.83968      1.59747  2.98423

It's also quite safe to use in that if you provide a non-existent function it would not default to some unknown behavior but will return an error:

> higher_order_function(mm)
Error in eval(lhs, parent, parent) : object 'mm' not found

However, this function can also be written as a sequence of if statements, like so:

> if_function <- function(x,y, which_reg){
+   if (which_reg== "OLS") { lm(y~x) %>% summary }
+   else if (which_reg== "LAD") { rq(y~x) %>% summary }
+ }

> if_function(x,y, which_reg= "OLS")

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0149  -0.7603   1.0969   2.7483   4.2373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.3214     3.3338   0.396  0.70219   
x             2.1690     0.5373   4.037  0.00375 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.88 on 8 degrees of freedom
Multiple R-squared:  0.6708,  Adjusted R-squared:  0.6296 
F-statistic: 16.3 on 1 and 8 DF,  p-value: 0.003751

> if_function(x,y, which_reg= "LAD")

Call: rq(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept)  3.80788     -1.26475  6.15759
x            1.83968      1.59747  2.98423

Using higher-order functions does not seem to create any additional computational cost:

> library(microbenchmark)
> microbenchmark( higher_order_function(rq), if_function(x, y, "LAD") )

Unit: milliseconds
                      expr      min       lq     mean   median       uq      max neval
 higher_order_function(rq) 1.463210 1.498967 1.563553 1.527253 1.624969 2.280419   100
  if_function(x, y, "LAD") 1.468262 1.498464 1.584453 1.618997 1.644462 2.082765   100

> microbenchmark( higher_order_function(lm), if_function(x, y, "OLS") )

Unit: microseconds
                      expr     min       lq     mean   median      uq      max neval
 higher_order_function(lm) 916.858 928.8825 946.9838 935.3930 955.791 1025.575   100
  if_function(x, y, "OLS") 918.674 928.1260 953.2587 938.0465 958.284 1433.167   100

So you can make your code more concise with little computational overhead.

To leave a comment for the author, please follow the link and comment on their blog: R – Eran Raviv.


Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

Posted: 27 Jan 2019 06:00 AM PST

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Stacked Bar Chart for Rank Data At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn), meaning each respondent chose and ranked the top 5 from those 10, and the remaining 5 were unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate importance of the factors and a mean seemed the best way to do it.) Of course, then we noticed one of the factors had a pretty high average rank, even though few people ranked it in the top 5. Oops.

So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I wanted them all on the same chart. Basically, I needed to create a dataset with:

  • 1 variable to represent the factor being ranked

  • 1 variable to represent the ranking given (1-5, or 6 that I called "Not Ranked")
  • 1 variable to represent the number of people giving that particular rank that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merge them together with rbind. I then created the chart with ggplot2. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let's read in our sample dataset – note that these data were generated only for this example and are not real data:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
## RespID = col_integer(),
## Salary = col_integer(),
## Recognition = col_integer(),
## PTO = col_integer(),
## Insurance = col_integer(),
## FlexibleHours = col_integer(),
## OptoLearn = col_integer()
## )

This dataset contains 7 variables – 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not so pretty way:

salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
rank_chart$Var1 <- as.numeric(rank_chart$Var1)
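In the spirit of the "cleaner ways" invitation above, here is a more compact sketch that builds the same three columns (Name, Var1, Freq) by reshaping to long format and counting — assuming unranked factors are coded with their own value in the CSV, and skipping the prettier display labels used above:

rank_chart2 <- ranks %>%
  gather(key = "Name", value = "Var1", -RespID) %>%  # long format: one row per respondent x factor
  group_by(Name, Var1) %>%
  summarise(Freq = n()) %>%                          # frequency of each rank for each factor
  ungroup()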

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                        breaks = c(1:4),
                        labels = c("1", "2", "3", "Not Ranked")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the #2 and #3 factors. Recognition wasn't ranked by most people, but those who did rank it considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


littler 0.3.6: Two neat enhancements

Posted: 26 Jan 2019 08:45 AM PST

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)


The seventh release of littler as a CRAN package is now available, following in the now more than twelve-year history as a package started by Jeff in 2006, and joined by me a few weeks later.

littler is the first command-line interface for R and predates Rscript. And it is (in my very biased eyes) better, as it allows for piping as well as shebang scripting via #!, uses command-line arguments more consistently, and still starts faster. It has also always loaded the methods package, which Rscript only started doing rather recently.
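For instance, a littler script is just an R file with a shebang line; a minimal sketch (littler exposes the command-line arguments as the character vector argv):

#!/usr/bin/env r
# sum.r -- print the sum of the numbers given on the command line
# usage: ./sum.r 1 2 3   (after chmod +x sum.r)
nums <- as.numeric(argv)
cat(sum(nums), "\n")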

littler lives on Linux and Unix, has its difficulties on macOS due to yet-another-braindeadedness there (who ever thought case-insensitive filesystems as a default were a good idea?) and simply does not exist on Windows (yet – the build system could be extended – see RInside for an existence proof, and volunteers are welcome!).

A few examples are highlighted at the Github repo, as well as in the examples vignette.

This release brings updates to the two (or three) scripts I use all the time to install or update packages. First, install.r (and install2.r) now recognise when they are in a source directory and will do the right thing when called without an argument there. They also recognise . as an argument. Second, Colin sent in a nice PR to use Ncpus for package installation in install2.r, which I then carried over to update.r. Now, if you add, say, options(Ncpus=4) to your startup file, installations and upgrades will operate in parallel, which is nice. (That also made me realize that R had a micro-bug, for which I sent this patch now in R-devel.) Lastly, we also started a new (and still very small) FAQ vignette about littler.
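Concretely, the startup-file line mentioned above is just:

# in ~/.Rprofile (or another R startup file): let install.packages() run
# up to four installations in parallel; install2.r and update.r pick this up
options(Ncpus = 4)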

The NEWS file entry is below.

Changes in littler version 0.3.6 (2019-01-26)

  • Changes in examples

    • The scripts install.r and install2.r now support argument ".", and add it if called in a source directory.

    • The script install2.r can set Ncpus for install.packages() (Colin Gillespie in #63 fixing #62)

    • The script update.r can also set Ncpus for install.packages().

    • A new vignette "litter-faq" was added.

CRANberries provides a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs and now of course all from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter.

Comments and suggestions are welcome at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

