[R-bloggers] ‘There is a game I play’ – Analyzing Metacritic scores for video games (and 5 more aRticles)

‘There is a game I play’ – Analyzing Metacritic scores for video games
Explaining Predictions: Random Forest Post-hoc Analysis (randomForestExplainer package)
Seeking postdoc (or contractor) for next generation Stan language research and development
It is Time for CRAN to Ban Package Ads
Break up with Excel: Intro and Advanced R Data Science Courses at MSACL.org Salzburg Austria, September 21–23, 2019
rstudio::conf(2020) Diversity and international scholarships

‘There is a game I play’ – Analyzing Metacritic scores for video games

Posted: 30 Aug 2019 05:00 PM PDT

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There is a game I play / try to make myself okay / try so hard to make the pieces all fit / smash it apart / just for the f**k of it (Nine Inch Nails: The Big Come Down)

After this rather distressing opening by the Nine Inch Nails, let's turn to a more uplifting topic: video games! There is a dataset on Kaggle with ratings for over 4000 video games. Let's see what we can do with that, shall we?

Preparations

First, I'll load the file. I like fread() from {datatable} because it does most of the boring work for you like checking the separator – and it's really fast. It creates a data.table, of course, and for this post, I don't want that. So, I'm converting it 'back' to a data.frame.

library(data.table)  games <- as.data.frame(fread("metacritic_games.csv"))

This is how the first six rows of this dataset look like. There are more columns, so feel free to scroll to the left.

scroll_box(kable_styling(kable(head(games))),             width = "100%", height = "500px")

game	platform	developer	genre	number_players	rating	release_date	positive_critics	neutral_critics	positive_users	neutral_users	negative_users	metascore	user_score
Portal 2	PC	Valve Software	Action		E10+	Apr 18, 2011	51	1	1700	107	19	95	90
The Elder Scrolls V: Skyrim	PC	Bethesda Game Studios	Role-Playing	No Online Multiplayer	M	Nov 10, 2011	32	0	1616	322	451	94	82
The Legend of Zelda: Ocarina of Time 3D	3DS	GREZZO	Miscellaneous	No Online Multiplayer	E10+	Jun 19, 2011	84	1	283	20	5	94	90
Batman: Arkham City	PC	Rocksteady Studios	Action Adventure		T	Nov 21, 2011	27	0	240	34	27	91	87
Super Mario 3D Land	3DS	Nintendo	Action	No Online Multiplayer	E	Nov 13, 2011	81	1	251	39	11	90	84
Deus Ex: Human Revolution	PC	Nixxes Software	Action	No Online Multiplayer	M	Aug 23, 2011	52	0	520	112	78	90	85

You don't see this in the first six lines, but some games appear several times in the list for releases on several platforms. For now, I am aggregating these different releases to get only one (mean) value for each game. The resulting data.table has over 1k fewer rows than the original one.

games.agg <- aggregate(cbind(metascore, user_score) ~ game, FUN = mean, data = games)  nrow(games)

## [1] 5699

nrow(games.agg)

## [1] 4018

Comparing metascores and user ratings

Metacritic has – of course – metascores, but it also has user ratings. Both of these are available in our dataset. So, let's start by comparing those two. I suspect them to be highly correlated, but maybe there are also some interesting differences. I will start by drawing a simple scatterplot with a center diagonal. If a data point lies on the dashed red line, the metascore is the same as the user score. We will see, that this is not the case for all of the points.

library(ggplot2)  ggplot(games.agg, aes(x = metascore, y = user_score)) +    geom_point(alpha = .3, pch = 15) +    geom_density_2d() +    geom_segment(data = data.frame(),                 inherit.aes = F,                 aes(x = 0, y = 0, xend = 100, yend = 100),                 lty = "dashed", col = "red") +    scale_y_continuous(limits = c(0, 100)) +    scale_x_continuous(limits = c(0, 100)) +    labs(x = "Metascore", y = "User score", fill = "Count")

The 'contour lines' (created by geom_density_2d()) show us where the majority of the data points lie. There is a strong concentration at around 75/75. But we also see that there are some major disagreements between metascores and user scores. I will investigate them a bit further. If we say that there is a "disagreement" between metascores and user scores, we can try to capture this in a statistical way by calculating the residuals of a linear regression (a prediction) from metascores on user scores. The residuals are prediction errors, i.e. everything in a game's user score that cannot be explained by the game's metascore.

I will calculate the linear regression model lm(). Then, I extract the residuals and save them in the variable res.user.score. Using these, I am extracting the 10 highest negative residuals (user score is much lower than metascore) and the 5 highest positive residuals (user score is much higher than metascore). I am then creating a column Label which holds the game's name if the game has one of these highest or lowest residuals. The rest is some ggplot() trickery. Drop me a comment if you want to know more about specific things. The blue line represents the regression line – the visualization of the regression model.

library(ggrepel)  lmod <- lm(user_score ~ metascore, data = games.agg)    games.agg$res.user.score <- residuals(lmod)    lowest.res <- games.agg$game[order(games.agg$res.user.score)][1:10]  highest.res <- games.agg$game[order(games.agg$res.user.score,                                      decreasing = T)][1:5]    games.agg$Label <- ifelse(games.agg$game %in% c(lowest.res, highest.res),                            games.agg$game, NA)    ggplot(games.agg, aes(x = metascore, y = user_score)) +    geom_point(pch = 15, aes(color = is.na(Label), alpha = is.na(Label))) +    scale_color_manual(values = c("TRUE" = "black", "FALSE" = "red")) +    scale_alpha_manual(values = c("TRUE" = .3, "FALSE" = 1)) +    geom_smooth(method = "lm", formula = "y ~ x", se =  F) +    scale_y_continuous(limits = c(0, 100)) +    scale_x_continuous(limits = c(0, 100)) +    geom_text_repel(aes(label = Label), force = 3, size = 3) +    guides(col = F, alpha = F) +    labs(x = "Metascore", y = "User score", fill = "Count")

Ouch, EA Sports – some of your recent games don't seem to meet your customers' standards. If I am not completely mistaken, 'NBA 2K19' and 'NBA 2K18' (what's even the point of this 'K'? You don't even save a single character by using it… get your shit together, EA Sports!) as well as 'FIFA 19' are (were?) all bloated with lootbox mechanics. This kind of gambling mechanics in a video game doesn't seem to go well with the audience – although most of the critics don't seem to care.

The labelled games above the regression line are the five games with the highest positive residuals. As I wrote above, these are the games where the customers gave very high scores in comparison to the critics. I have to say that I don't know any one of these games…

Comparing the busiest developers

Now, I am going to compare the developers that are represented by the most games in the dataset. For this, I am first creating a sorted table of games per developer and am subsetting games to only extract the games from these developers. Then, we have to aggregate again to yield one datapoint per game, even if it was published on several platforms.

I am sorting the developers' factor levels by the mean metascore (that sorts the boxplots in decreasing order), and in the end, I am assigning the worst 7 games a label to make them show up in the plot later.

dev.tab <- sort(table(games$developer), decreasing = T)  games.dev <- games[games$developer %in% names(dev.tab[1:10]),]      games.dev.agg <- aggregate(cbind(metascore, user_score) ~ game + developer,                             FUN = mean, data = games.dev)    games.dev.agg$developer <-    factor(games.dev.agg$developer,           levels = names(sort(tapply(games.dev.agg$metascore,                                      games.dev.agg$developer,                                      mean), decreasing = T)))    games.dev.agg <- games.dev.agg[order(games.dev.agg$metascore),]  games.dev.agg[1:7, "Label"] <- games.dev.agg[1:7, "game"]

There is a design principle in data visualizations that, whenever possible, you should show all datapoints. To do so, I am overlaying the boxplots with a beeswarm plot created by the {ggbeeswarm} package. Furthermore, I am adding a red circle for the mean value of each developer.

library(ggbeeswarm)  ggplot(games.dev.agg, aes(x = developer, y = metascore)) +    geom_boxplot(outlier.shape = NA) +    geom_beeswarm(alpha = .5, pch = 15, col = "blue") +    geom_point(stat = "summary", size = 4, col = "red", alpha = .5) +    geom_text_repel(aes(label = Label), size = 3) +    labs(x = "", y = "Metascore") +    theme(axis.text.x = element_text(angle = 45,                                     hjust = 1,                                     size = 12))

Remember that we are only seeing metascores here. Nintendo seems to make really good games. There are no outliers on their track record and they have the highest average metascore. Capcom is quite interesting – although they're the second best developer, there is one game ('Umbrella Corps') that is a huge outlier. Let's do exactly the same with user scores.

games.dev.agg$developer <-    factor(games.dev.agg$developer,           levels = names(sort(tapply(games.dev.agg$user_score,                                      games.dev.agg$developer,                                      mean), decreasing = T)))    games.dev.agg <- games.dev.agg[order(games.dev.agg$user_score),]  games.dev.agg$Label <- NA  games.dev.agg[1:7, "Label"] <- games.dev.agg[1:7, "game"]  ggplot(games.dev.agg, aes(x = developer, y = user_score)) +    geom_boxplot(outlier.shape = NA) +    geom_beeswarm(alpha = .5, pch = 15, col = "blue") +    geom_point(stat = "summary", size = 4, col = "red", alpha = .5) +    geom_text_repel(aes(label = Label), size = 3) +    labs(x = "", y = "User score") +    theme(axis.text.x = element_text(angle = 45,                                     hjust = 1,                                     size = 12))

My gosh, users really hate 'FIFA 19', don't they? I don't know why, but 'The LEGO Movie 2 Videogame' is the worst user-rated game in this subset – and it's a huge outlier compared to the rest of TT Games' games. And as we can see: users agree with the critics concerning 'Umbrella Corps'.

Ratings through time

Are video games getting worse or better over time? And do critics and users agree? Let's find out. Here, I am doing a simple linear regression from date on time. This might not be the best choice because time series data might need other statistical methods, but for now, let's play around with "normal" regression a bit. First, we have to define the release_date column as date variable. The mdy() function from the {lubridate} package does an excellent job here.

library(lubridate)  games$release <- as.Date(mdy(games$release_date))

Outliers can be a problem during regression analysis, so I want to get rid of them. Here, I am using the simple boxplot heuristic with a range of 1.5 times the interquartile range. Everything below or above is treated as an outlier and will not be used in the linear model.

is.outlier <- function (x) {    x < quantile(x, .25) - 1.5 * IQR(x) |      x > quantile(x, .75) + 1.5 * IQR(x)  }    games$is.out.meta <- is.outlier(games$metascore)  games$is.out.user <- is.outlier(games$user_score)    meta.mod <- lm(metascore ~ release, data = games, subset = !is.out.meta)  user.mod <- lm(user_score ~ release, data = games, subset = !is.out.meta)  summary(meta.mod)

##   ## Call:  ## lm(formula = metascore ~ release, data = games, subset = !is.out.meta)  ##   ## Residuals:  ##      Min       1Q   Median       3Q      Max   ## -28.6996  -6.0113   0.9763   7.1450  24.6318   ##   ## Coefficients:  ##               Estimate Std. Error t value      Pr(>|t|)      ## (Intercept) 55.7347579  2.8630670  19.467       < 2e-16 ***  ## release      0.0010147  0.0001701   5.964 0.00000000261 ***  ## ---  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  ##   ## Residual standard error: 9.677 on 5549 degrees of freedom  ## Multiple R-squared:  0.00637,    Adjusted R-squared:  0.006191   ## F-statistic: 35.57 on 1 and 5549 DF,  p-value: 0.000000002611

summary(user.mod)

##   ## Call:  ## lm(formula = user_score ~ release, data = games, subset = !is.out.meta)  ##   ## Residuals:  ##     Min      1Q  Median      3Q     Max   ## -57.965  -6.152   2.577   8.594  25.826   ##   ## Coefficients:  ##               Estimate Std. Error t value      Pr(>|t|)      ## (Intercept) 90.1050410  3.7616060   23.95       < 2e-16 ***  ## release     -0.0013233  0.0002235   -5.92 0.00000000342 ***  ## ---  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  ##   ## Residual standard error: 12.71 on 5549 degrees of freedom  ## Multiple R-squared:  0.006276,   Adjusted R-squared:  0.006097   ## F-statistic: 35.04 on 1 and 5549 DF,  p-value: 0.000000003416

The summaries of the two models show us that there are, indeed, significant effects of release on the scores. More interestingly, the effects point in opposite directions: metascores increase through time while user scores decrease through time (as indicated by the signs of the estimates). However, we also have to check the size of the effect. And, well, it's not very big. Indeed, it's tiny. Also, the variance explained in scores by release date (as indicated by the R-squared values) is only around 0.6 percent. So, let's not overestimate these effects. The following regression plots also suggest that there seem to be a lot more things going on there – the regression lines are almost horizontal. There are a lot more things one can do to evaluate regression models – but let's move on.

ggplot(games[!games$is.out.meta,], aes(x = release, y = metascore)) +    geom_point(alpha = .3, pch = 15) +    geom_smooth(se = F, method = "lm") +    labs(subtitle = paste0("beta = ", signif(meta.mod$coefficients["release"], 3),                           ", Outliers removed (Boxplot with r = 1.5)"),         x = "Release date", y = "Metascore", title = "Release date vs. Metascore")

ggplot(games[!games$is.out.user,], aes(x = release, y = user_score)) +    geom_point(alpha = .3, pch = 15) +    geom_smooth(se = F, method = "lm") +    labs(subtitle = paste0("beta = ", signif(user.mod$coefficients["release"], 3),                           ", Outliers removed (Boxplot with r = 1.5)"),         x = "Release date", y = "Metascore", title = "Release date vs. User score")

ESRB ratings by genre

For our last analysis, I want to see if certain genres are associated with certain ESRB ratings. To do so, I am selecting the genres with the most games (and 'shooter' because it should be the genre most closely associated with the 'Mature' rating). I am then recoding the ratings to make them more readable.

library(car)  sub.games <- unique(games[games$rating %in% c("E", "E10+", "M", "T") &                              games$genre %in% c("Action", "Action Adventure",                                                 "Role-Playing", "Adventure",                                                 "Strategy", "Sports",                                                 "Simulation", "Racing",                                                 "Puzzle", "Shooter"),                            c("rating", "genre", "game")])    sub.games$rating <- recode(sub.games$rating, "'E'='Everyone'; 'E10+'='Everyone 10+'; 'T'='Teen'; 'M'='Mature'")    sub.games.tab <- table(sub.games$rating, sub.games$genre)  scroll_box(kable_styling(kable(sub.games.tab)), width = "100%")

	Action	Action Adventure	Adventure	Puzzle	Racing	Role-Playing	Shooter	Simulation	Sports	Strategy
Everyone	215	28	20	48	53	19	0	35	89	18
Everyone 10+	274	81	26	13	13	64	2	11	15	48
Mature	253	205	76	2	1	100	10	8	2	27
Teen	371	93	70	9	8	176	4	38	20	107

So, that is our contingency table. To see if there are any clear associations between genres and ESRB ratings, we have to deal with the marginal probabilities of ESRB ratings (how often specific ratings are given overall) and genres (how many games there are in the dataset for a specific genre). We could do a Chi-Squared test and visualize this with a mosaicplot. Here, I want to do something different – a correspondence analysis. I introduced the {ca} package with a different dataset in my previous post. All of the code I am now using is explained there.

library(ca)  ca.fit <- ca(sub.games.tab) # do the ca  ca.sum <- summary(ca.fit)   # needed later for dimension percentages  ca.plot.obj <- plot(ca.fit)

make.ca.plot.df <- function (ca.plot.obj,                               row.lab = "Rows",                               col.lab = "Columns") {    df <- data.frame(Label = c(rownames(ca.plot.obj$rows),                               rownames(ca.plot.obj$cols)),                     Dim1 = c(ca.plot.obj$rows[,1], ca.plot.obj$cols[,1]),                     Dim2 = c(ca.plot.obj$rows[,2], ca.plot.obj$cols[,2]),                     Variable = c(rep(row.lab, nrow(ca.plot.obj$rows)),                                  rep(col.lab, nrow(ca.plot.obj$cols))))    rownames(df) <- 1:nrow(df)    df  }  # Create plotting data.frame for ggplot  ca.plot.df <- make.ca.plot.df(ca.plot.obj,                                row.lab = "Rating",                                col.lab = "Genre")  # Extract percentage of variance explained for dims   dim.var.percs <- ca.sum$scree[,"values2"]

library(ggrepel)  ggplot(ca.plot.df, aes(x = Dim1, y = Dim2,                         col = Variable, shape = Variable,                         label = Label)) +    geom_vline(xintercept = 0, lty = "dashed", alpha = .5) +    geom_hline(yintercept = 0, lty = "dashed", alpha = .5) +    geom_point(size = 4) +    scale_x_continuous(limits = range(ca.plot.df$Dim1) + c(diff(range(ca.plot.df$Dim1)) * -0.2,                                                           diff(range(ca.plot.df$Dim1)) * 0.2)) +    scale_y_continuous(limits = range(ca.plot.df$Dim2) + c(diff(range(ca.plot.df$Dim2)) * -0.2,                                                           diff(range(ca.plot.df$Dim2)) * 0.2)) +    geom_label_repel(show.legend = F, segment.alpha = .5) +    guides(colour = guide_legend(override.aes = list(size = 4))) +    labs(x = paste0("Dimension 1 (", signif(dim.var.percs[1], 3), "%)"),         y = paste0("Dimension 2 (", signif(dim.var.percs[2], 3), "%)"),         col = "", shape = "") +    theme_minimal()

This seems to have worked quite fine. It's not surprising that Sports, Racing, and Puzzle are closely associated with an 'Everyone' rating. What I find a little surprising is that the genre Action is so close to the middle of the graph and not closer to a 'Mature' rating. If we look at the contingency table again, this makes a little more sense: Action is by far the most assigned genre. It seems as if more 'intense' or 'gory' action titles are assigned to other genres (e.g., the quite fittingly placed Shooter genre which is clearly in the 'Mature' part of the graph). Even 'Action Adventure' is much closer to a 'Mature' rating than 'Action'…

Bye

Well, that's it for our little tour around the video games ratings dataset. Feel free to let me know what you would be interested in or to point me to your analyses on the same dataset. For now, I am having a break with some good old Nine Inch Nails music…

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Explaining Predictions: Random Forest Post-hoc Analysis (randomForestExplainer package)

Posted: 30 Aug 2019 05:00 PM PDT

[This article was first published on R on notast, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recap

This is a continuation on the explanation of machine learning model predictions. Specifically, random forest models. We can depend on the random forest package itself to explain predictions based on impurity importance or permutation importance. Today, we will explore external packages which aid in explaining random forest predictions.

External packages

There are external a few packages which offer to calculate variable importance for random forest models apart from the conventional measurements found within the random forest package.

`rfVarImpOOB`

The first package is rfVarImpOOB which was recently released on CRAN. It "computes a novel variable importance for random forests". Impurity reduction importance scores for out-of-bag data complements the existing in bag Gini importance. The vignette is unfortunately too brief in helping me understand the functions and the mechanism of this novel scoring of feature importance.

`VSURF`

The next package is VSURF which can handle high dimension datasets. It adopts a methodological three step approach to select important variables. Though there is no vignette, there is a BMC Medical Informatics and Decision Making research article using the VSURF package to look at features which are important to predict survival of individuals with breast cancer. The article explains the mechanism of how VSURF selects important features.

`randomForestExplainer`

The above journal article which used VSURF also used another package to select important features from a random forest. This other package is the randomForestExplainer. We will elaborate and demonstrate this package for the rest of the post. In the randomForestExplainer, the "various variable importance measures are calculated and visualized in different settings in order to get an idea on how their importance changes depending on our criteria"

Dataset

Similar to the previous posts, the Cleveland heart dataset will be used as well as principles of tidymodels.

#library  library(tidyverse)  library(tidymodels)  #import  heart<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F)  # Renaming var   colnames(heart)<- c("age", "sex", "rest_cp", "rest_bp",  "chol", "fast_bloodsugar","rest_ecg","ex_maxHR","ex_cp",  "ex_STdepression_dur", "ex_STpeak","coloured_vessels", "thalassemia","heart_disease")  #elaborating cat var  ##simple ifelse conversion   heart<-heart %>% mutate(sex= ifelse(sex=="1", "male", "female"),fast_bloodsugar= ifelse(fast_bloodsugar=="1", ">120", "<120"), ex_cp=ifelse(ex_cp=="1", "yes", "no"),  heart_disease=ifelse(heart_disease=="0", "no", "yes"))   ## complex ifelse conversion using `case_when`  heart<-heart %>% mutate(  rest_cp=case_when(rest_cp== "1" ~ "typical",rest_cp=="2" ~ "atypical", rest_cp== "3" ~ "non-CP pain",rest_cp== "4" ~ "asymptomatic"), rest_ecg=case_when(rest_ecg=="0" ~ "normal",rest_ecg=="1" ~ "ST-T abnorm",rest_ecg=="2" ~ "LV hyperthrophy"), ex_STpeak=case_when(ex_STpeak=="1" ~ "up/norm", ex_STpeak== "2" ~ "flat",ex_STpeak== "3" ~ "down"), thalassemia=case_when(thalassemia=="3.0" ~ "norm",     thalassemia== "6.0" ~ "fixed", thalassemia== "7.0" ~ "reversable"))   # convert missing value "?" into NA  heart<-heart%>% mutate_if(is.character, funs(replace(., .=="?", NA)))  # convert char into factors  heart<-heart %>% mutate_if(is.character, as.factor)  #train/test set   set.seed(4595)  data_split <- initial_split(heart, prop=0.75, strata = "heart_disease")  heart_train <- training(data_split)  heart_test <- testing(data_split)

Though random forest model itself doesn't need explicit one hot encoding, some of randomForestExplainer functions need dummy variables. Thus we will create them in our recipe.

# create recipe object  heart_recipe<-recipe(heart_disease ~., data= heart_train) %>%    step_knnimpute(all_predictors()) %>%   step_dummy(all_nominal(), -heart_disease)    # process the traing set/ prepare recipe(non-cv)  heart_prep <-heart_recipe %>% prep(training = heart_train, retain = TRUE)

In the randomForestExplainer vignette, computation of casewise importance measure was activiated (i.e. localImp=T). Likewise, we'll do the same when we train our model.

set.seed(69)  rf_model<-rand_forest(trees = 2000, mtry = 4, mode = "classification") %>% set_engine("randomForest",  importance=T, localImp = T, ) %>% fit(heart_disease ~ ., data = juice(heart_prep))

`randomForestExplainer` (variable importance)

Measuring variable importance

The randomForestExplainer generates more feature importance score than the randomForest package. These scores generated by the randomForestExplainer::measure_importance function can be categorized into 3 types of feature importance score.

Measures of importance based on the structure of the forest which is not examine in the randomForest package.

1i) Mean_minimal_depth: Minimal depth for a variable in a tree equals to the depth of the node which splits on that variable and is the closest to the root of the tree. If it is low then a lot of observations are divided into groups on the basis of this variable.

1ii) no_of_nodes: Usually, number of trees and number of nodes are the same if trees are shallow. The number of nodes contain similar information as the number of trees and p-values. Thus, no_of_tress and p_value are omitted in plot_importance_ggpairs graph (which will be described later).

1iii) no_of_trees

1iv) p_value: Number of nodes in which predictor X was used for splitting exceeds the theoretical number of successes if they were random, following the binomial distribution given.

1v) times_a_root

Measures of importance based on the decrease in predictive accuracy post variable perturbation

2i) accuracy_decrease: Use for classification problems. Same value as randomForest::importance(rf_model$fit, type=2)

2ii) mse_increase: Use for regression problems.

Measure of importance based on loss function

3i) gini_decrease Use for classifcation cases. Same value as randomForest::importance(rf_model$fit, type=1).

3ii) node_purity_increase: Use for regression cases. Measured by the decrease in sum of squares

library(randomForestExplainer)  impt_frame<-measure_importance(rf_model$fit)    impt_frame %>% head()

##                variable mean_min_depth no_of_nodes accuracy_decrease  ## 1                   age       3.076865        8635       0.004269917  ## 2                  chol       3.377786        8872      -0.003120879  ## 3 coloured_vessels_X1.0       4.969157        3202       0.009133193  ## 4 coloured_vessels_X2.0       4.155141        2600       0.010876652  ## 5 coloured_vessels_X3.0       5.476794        1756       0.006136837  ## 6             ex_cp_yes       3.791160        3033       0.009524295  ##   gini_decrease no_of_trees times_a_root p_value  ## 1      8.686010        1979           54       0  ## 2      7.997923        1983            3       0  ## 3      2.814661        1749           13       1  ## 4      3.769974        1675           82       1  ## 5      2.044707        1280           20       1  ## 6      5.557652        1766          187       1

Ploting

plot_multi_way_importance plots variable importance scores from importance_frame. By default, the function plots scores measuring importance based on the structure of the forest, mean_min_depth against times_a_root.

plot_multi_way_importance(impt_frame, no_of_labels = 6)

plot_multi_way_importance can plot up to 3 feature importance scores using any of the three kinds of importance measurements.

plot_multi_way_importance(impt_frame, x_measure = "accuracy_decrease", y_measure = "gini_decrease", size_measure = "p_value", no_of_labels = 6)

Comments

Comparing the top 6 variables from both plots, only 4 variables thalaseemia_norm, thalassemia_reversable, ex_maxHR and ex_STdepression_dur appear in both plots. These variables are more likely to be essential in the random forest's prediction.
thalaseemia_norm and thalassemia_reversable are identified as one of the six variables with good gini_decrease and accuracy_decrease but they are insignificant as p_value >=.1.

Selecting optimal importance scores to plot

There are many multi-way importance plots which one can create thus we need to identify which sub-set of importance scores will provide us with more helpful plots. Examining the relationship between the importance measurements with plot_importance_ggpairs can assist us in identifying a sub-set of scores to use. We can pick three scores that least agree with each other, points in plots which are most dispersed.

plot_importance_ggpairs(impt_frame)

Plotting the rankings of importance measures with a LOESS curve using plot_importance_rankings informs us of pairs of measures almost exactly agreeing in their rankings of variables. This approach might be useful as rankings are more evenly spread than corresponding importance measures. This may also more clearly show where the different measures of importance disagree or agree.

plot_importance_rankings(impt_frame)

mean_min_depth and gini_decrease agree the most thus we should avoid this combination for plot_multi_way_importance.

`randomForestExplainer`: variable depth

Previously, we looked at various types of importance measures. Now we are specifically examining the mean minimal depth in detail. The distribution of the mean minimal depth allows us to appreciate the variable's role in the random forest's structure and prediction.
plot_min_depth_distribution plots the top ten variables according to mean minimal depth calculated using top trees.
The mean minimal depth can be calculated in 3 different ways in plot_min_depth_distribution using the mean_sample argument. The calculation differs in the way they treat missing values that appear when a feature is not used for tree splitting. As a result, the ranking of variables may change for each calculation.

md_frame <- min_depth_distribution(rf_model$fit)    plot_min_depth_distribution(md_frame, mean_sample = "top_trees") # default mean_sample arg

The mean minimal depth is indicated by a vertical bar with the mean value beside it. The smaller the mean minimal depth, the more important the variable is and the higher up the y-axis the variable will be.
The rainbow gradient reveals the min and max minimal depth for each variable. The bigger the proportion of minimal depth zero (red blocks), the more frequent the variable is the root of a tree. The smaller the proportion of NA minimal depth (gray blocks), the more frequent the variable is used for splitting trees.
The range of the x-axis is from zero to the maximum number of trees for the feature.

`randomForestExplainer` : variable interaction

We will use the important_variables function to select the top 6 variables based on the following variable importance measurements, times_a_root and no_of_nodes.

vars<- important_variables(impt_frame, k = 6, measures = c("times_a_root", "no_of_nodes"))

After identifying the top 6 variables, we can examine the interactions between the variables with the min_depth_interactions function.

interactions_frame <- min_depth_interactions(rf_model$fit, vars)

The interaction is reflected as the mean_min_depth which is the mean conditional minimal depth, where a variable is taken as a root node/root_variable and the mean minimal depth is calculated for the other variable.
The uncond_mean_min_depth represents the unconditional mean minimal depth of the variable which is the mean minimal depth of the variable without having a stipulated root variable. This value is the same as the mean value seen on the vertical bar in plot_min_depth_distribution.

head(interactions_frame)

##   variable          root_variable mean_min_depth occurrences  ## 1      age                    age       2.766519         838  ## 2      age              ex_cp_yes       2.678235         940  ## 3      age               ex_maxHR       2.068416        1251  ## 4      age    ex_STdepression_dur       2.142738        1215  ## 5      age       thalassemia_norm       2.719905        1036  ## 6      age thalassemia_reversable       2.539302        1064  ##                  interaction uncond_mean_min_depth  ## 1                    age:age              3.076865  ## 2              ex_cp_yes:age              3.076865  ## 3               ex_maxHR:age              3.076865  ## 4    ex_STdepression_dur:age              3.076865  ## 5       thalassemia_norm:age              3.076865  ## 6 thalassemia_reversable:age              3.076865

We can plot the outputs from min_depth_interactions with plot_min_depth_interactions.

plot_min_depth_interactions(interactions_frame)

The interactions are arranged from the most frequent occurring on the left side of the plot and in lighter blue to the least frequent occurring interaction on the right side of the plot and in darker blue.
The horizontal red line represents the minimum mean_min_depth.
The black lollipop represents the uncond_mean_min_depth.
Similar to plot_min_depth_distribution, the ranking of interactions in plot_min_depth_interactions may change depending on the arguments for mean_sample and uncond_mean_sample.

Interactive variables and forest prediction

We can further evaluate the variable interactions by plotting the probability of a prediction against the variables making up the interaction. For instance we plot the probability of having heart disease against resting blood pressure rest_bp and ST depression duration during exercise test ex_STdepression_dur. The interaction of these two variables are the most frequent interaction as seen in plot_min_depth_interactions. We plot the forest prediction against interactive variables with plot_predict_interaction.

#plot_predict_interaction(rf_model$fit, bake(heart_prep, new_data = heart_train), "rest_bp", "ex_STdepression_dur")

However, there is an error when the input supplied is a model created with parsnip. There is no error when the model is created directly from the randomForest package.

set.seed(69)  forest <- randomForest::randomForest(heart_disease ~ ., data = bake(heart_prep, new_data = heart_train), localImp = TRUE, importance=T, ntree = 2000, mtry = 4, type= "classification")    plot_predict_interaction(forest, bake(heart_prep, new_data = heart_train), "rest_bp", "ex_STdepression_dur", main = "Distribution of Predicted Probability of Having Heart Disease") + theme(legend.position="bottom") + geom_hline(yintercept = 2, linetype="longdash") + geom_vline(xintercept = 140, linetype="longdash")

ST depression duration during exercise test ex_STdepression_dur longer than 2s results in higher predicted probability of having a heart disease. The predicted probability of having heart disease for individuals with shorter ST depression duration (<2s) increases if they have high resting blood pressure rest_bp above 140.

As plot_predict_interaction is a ggplot object which you can treat it like any other ggplot. In this case, we can place it side by side with the ggplot of the distribution of heart disease in the test set.

predict_plot<-plot_predict_interaction(forest, bake(heart_prep, new_data = heart_train), "rest_bp", "ex_STdepression_dur", main = "Distribution of Predicted\n Probability of Having Heart\n Disease") + theme(legend.position="bottom") + geom_hline(yintercept = 2, linetype="dotted")     test_plot<-ggplot(heart_test, aes(rest_bp, ex_STdepression_dur, colour=heart_disease)) +geom_jitter(size=1.8) + labs(title="Distribution of Heart Disease\n in Test Set", y="") + theme_minimal() +  scale_color_manual(values=c("blue", "red")) +  theme(legend.position="bottom") + geom_hline(yintercept = 2, linetype="dotted")     gridExtra::grid.arrange(predict_plot, test_plot, ncol=2)

Conclusion

In this post, we learned how random forest predictions can be explained based on various variable importance measures, variable interactions and variable depth. The post-hoc analysis was aided with the randomForestExplainer package.

To leave a comment for the author, please follow the link and comment on their blog: R on notast.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Seeking postdoc (or contractor) for next generation Stan language research and development

Posted: 30 Aug 2019 12:27 PM PDT

[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Stan group at Columbia is looking to hire a postdoc^* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation.

The language features on the roadmap include lambdas with closures, sparse matrices and vectors, ragged arrays, tuples and structs, user-defined Jacobians, and variadic functions. The parser, intermediate representation, and code generation are written in OCaml using the Menhir compiler framework. The code is hosted on GitHub in the stanc3 repo; the current design documents are in the design docs repo. The generated code is templated C++ that depends on the automatic differentiation framework in the Stan math library and is used by Stan's statistical inference algorithms.

The research and development for Stan will be carried out in collaboration with the larger Stan development team, which includes a large group of friendly and constructive collaborators within and outside of Columbia University. In addition to software development, the team has a diverse set of research interests in programming language semantics, applied statistics, statistical methodology, and statistical computation. Of particular relevance to this project is foundational theory work on programming language semantics for differentiable and probabilistic programming languages.

The position would be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for one year with a possible reappointment for a second year.

To apply, please send a CV and a statement of interest and experience in this area if not included in the CV to Bob Carpenter, carp@alias-i.com. The position is available immediately and we will review applications as they arrive.

Thanks to Schmidt Futures for the funding to make this possible!

^* We could also hire a contractor on an hourly basis. For that, I'd be looking for someone with experience who could hit the ground running with the OCaml code.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

It is Time for CRAN to Ban Package Ads

Posted: 30 Aug 2019 08:53 AM PDT

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

NPM (a popular Javascript package repository) just banned package advertisements. I feel the CRAN repository should do the same.

Not all R-users are fully aware of package advertisements. But they clutter up work, interfere with reproducibility, and frankly are just wrong.

Here is an example which could be considered to contain advertisements: .onAttach() from ggplot2 (promoting websites, and books):

    .onAttach <- function(...) {    withr::with_preserve_seed({      if (!interactive() || stats::runif(1) > 0.1) return()        tips <- c(        "RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.",        "Find out what's changed in ggplot2 at https://github.com/tidyverse/ggplot2/releases.",        "Use suppressPackageStartupMessages() to eliminate package startup messages.",        "Need help? Try Stackoverflow: https://stackoverflow.com/tags/ggplot2.",        "Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/",        "Want to understand how all the pieces fit together? See the R for Data Science book: http://r4ds.had.co.nz/"      )        tip <- sample(tips, 1)      packageStartupMessage(paste(strwrap(tip), collapse = "\n"))    })  }

The above means: when you attach ggplot2, it is needlessly monkeying with the pseudo-random number generator (in addition to running it is attempting to re-wind its state to appear to not have run it), and 10 percent of the time inserting an advertisement. This is needlessly non-deterministic and can spoil work if you don't take the (advertised) precaution of using suppressPackageStartupMessages().

In R the correct way to attach a package should be:

    library(PACKAGE_NAME)

and not:

    suppressPackageStartupMessages(library(PACKAGE_NAME))

This specific instance may or may not bother you. But if this style of messaging is allowed, what advertising is also allowed and where will that take us?

Users should insist on clean packages that can be attached deterministically and neatly. Users should insist on having access to "free software", ("advertising supported" is often not considered the same as "free").

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Break up with Excel: Intro and Advanced R Data Science Courses at MSACL.org Salzburg Austria, September 21–23, 2019

Posted: 30 Aug 2019 06:58 AM PDT

[This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

MSACL Conference

There are two RStats Data Science courses happening in Salzburg Austria on September 22–24, 2019 at the 6th annual MSACL Clinical Mass Spectrometry Conference. These courses are held twice annually, once in Europe and once in Palm Springs.

plot of chunk msaclAd

Introductory Course

The introductory course will be taught by Dan Holmes, MD of the University of British Columbia and Will Slade, PhD of Laboratory Corporation of America.
Course description is here

Intermediate/Advanced Course

The intermediate/advance course will be taught by Shannon Haymond, PhD of Northwestern University and Patrick Mathias, MD PhD of the University of Washington.
Course description is here

Registration

Although the conference is for clinical mass spectrometry, the courses are generic in nature and generally geared towards the biological and health sciences rather than mass spectometry per se.
You do not need to register for the conference to attend the pre-conference courses.
Academic rates apply. Registration for the course includes lots of snacks and coffee… like, good coffee.
Registration details are here.

Details

Both courses will take place in the following location:

Salzburg Congress Centre
- Day 1: September 22, 2019, 1300h–1800h
- Day 2: September 23, 2019, 0830h–1730h
- Day 3: September 24, 2019, 0830h–1130h

To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

rstudio::conf(2020) Diversity and international scholarships

Posted: 29 Aug 2019 05:00 PM PDT

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

rstudio::conf(2020L) continues our tradition of diversity scholarships, and this year we're increasing the program size to 44 recipients. As a result of thinking about our goals, this year we have two components to the program:

38 domestic diversity scholarships available to anyone living in the US or
Canada who is a member of a group that is under-represented at rstudio::conf.
6 international scholarships available to people living in South America,
Central America, or Mexico.

In the long run, we hope that the rstudio::conf participants reflect the full diversity of the world around us. We believe in building on-ramps so that people from diverse backgrounds can learn R, build their knowledge, and then contribute back to the community. Through this program, we want to support scholars by covering the costs of conference registration (including workshops) and providing funds for travel and accommodation (up to $1000). At the conference, scholars will also have networking opportunities with past diversity scholarship recipients as well as with leaders in the field.

We also recognise that there are many parts of the world that do not offer easy access to R events. In addition to South America, Central America, and Mexico, mentioned above, Africa has an active and growing community of R users, but relatively few R events hosted within the region. Due to the persistent challenges African scholars have faced in obtaining visas to attend rstudio::conf in recent years, we are currently looking into alternative ways we might involve our African colleagues in this year's conference, and will be sharing more about these plans soon. We will continue to re-evaluate the regions where our scholarships can have the greatest impact and will adjust this program as rstudio::conf grows. International scholars will receive complimentary conference and workshop registration, funds for travel and accommodation funding (up to $3000), and additional networking opportunities.

Scholarship applications will be evaluated on three main criteria:

How will attending the conference impact you? What important problems will
you be able to tackle that you couldn't before?
You will learn the most if you already have some experience with R, so show us
what you've achieved so far (GitHub is the easiest way for us to assess your
skills, if you're new to Github, check out
https://guides.github.com/activities/hello-world/ to get up something
informative in under an hour).
How will you share your knowledge with others? We can't help everyone, so
we're particularly interested in helping those who will go back to their
communities and spread the love. Show us how you've shared your skills and
knowledge in the past, and tell us what you have planned for the future.

The scholarships are competitive, so make sure you highlight what makes you special.

Apply now!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

THE BENEFIT

Search This Blog