[R-bloggers] ANOVA vs Multiple Comparisons (and 8 more aRticles)


ANOVA vs Multiple Comparisons

Posted: 15 Oct 2020 05:56 AM PDT

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When we run an ANOVA, we analyze the differences among group means in a sample. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

ANOVA Null and Alternative Hypotheses

The null hypothesis in ANOVA is that there is no difference between means and the alternative is that the means are not all equal.

\(H_0: \mu_1 = \mu_2 = \dots = \mu_K\)
\(H_1: \text{the } \mu_k \text{ are not all equal}\)

This means that when we are dealing with many groups, ANOVA does not compare them pairwise; it only answers whether the group means can be considered equal or not.


Tukey's HSD

What if we want to compare all the groups pairwise? In this case, we can apply Tukey's HSD, which is a single-step multiple comparison procedure and statistical test. It can be used to find means that are significantly different from each other.


Example of ANOVA vs Tukey's HSD

Let's assume that we are dealing with the following 4 groups:

  • Group "a": 100 observations from the Normal Distribution with mean 10 and standard deviation 5
  • Group "b": 100 observations from the Normal Distribution with mean 10 and standard deviation 5
  • Group "c": 100 observations from the Normal Distribution with mean 11 and standard deviation 6
  • Group "d": 100 observations from the Normal Distribution with mean 11 and standard deviation 6

Clearly, we expect the ANOVA to reject the null hypothesis, but we would also like to confirm that Group a and Group b are not statistically different from each other, and the same for Group c and Group d.

Let's work in R:

library(multcomp)
library(tidyverse)

# Create the four groups
set.seed(10)
df1 <- data.frame(Var="a", Value=rnorm(100,10,5))
df2 <- data.frame(Var="b", Value=rnorm(100,10,5))
df3 <- data.frame(Var="c", Value=rnorm(100,11,6))
df4 <- data.frame(Var="d", Value=rnorm(100,11,6))

# merge them in one data frame
df<-rbind(df1,df2,df3,df4)

# convert Var to a factor
df$Var<-as.factor(df$Var)

df%>%ggplot(aes(x=Value, fill=Var))+geom_density(alpha=0.5)
(Figure: density plots of the four groups)

ANOVA

# ANOVA
model1<-lm(Value~Var, data=df)
anova(model1)

Output:

Analysis of Variance Table

Response: Value
           Df  Sum Sq Mean Sq F value    Pr(>F)    
Var         3   565.7 188.565   6.351 0.0003257 ***
Residuals 396 11757.5  29.691                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clearly, we reject the null hypothesis, since the p-value is 0.0003257.

Tukey's HSD

Let's apply Tukey's HSD test to compare all pairs of means.

# Tukey multiple comparisons
summary(glht(model1, mcp(Var="Tukey")))

Output:

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Value ~ Var, data = df)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)   
b - a == 0   0.2079     0.7706   0.270  0.99312   
c - a == 0   1.8553     0.7706   2.408  0.07727 . 
d - a == 0   2.8758     0.7706   3.732  0.00129 **
c - b == 0   1.6473     0.7706   2.138  0.14298   
d - b == 0   2.6678     0.7706   3.462  0.00329 **
d - c == 0   1.0205     0.7706   1.324  0.54795   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

As we can see from the output above, the differences c vs a and c vs b were not found to be statistically significant, although these groups come from different distributions. The reason for this is the adjustment for multiple comparisons. Let's compare them directly by applying the t-test.

t-test a vs c

  t.test(df%>%filter(Var=="a")%>%pull(), df%>%filter(Var=="c")%>%pull())  

Output:

	Welch Two Sample t-test

data:  df %>% filter(Var == "a") %>% pull() and df %>% filter(Var == "c") %>% pull()
t = -2.4743, df = 189.47, p-value = 0.01423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3343125 -0.3761991
sample estimates:
mean of x mean of y 
 9.317255 11.172511

t-test b vs c

  t.test(df%>%filter(Var=="b")%>%pull(), df%>%filter(Var=="c")%>%pull())     

Output:

	Welch Two Sample t-test

data:  df %>% filter(Var == "b") %>% pull() and df %>% filter(Var == "c") %>% pull()
t = -2.1711, df = 191.53, p-value = 0.03115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.1439117 -0.1507362
sample estimates:
mean of x mean of y 
 9.525187 11.172511

As we can see above, in both cases the difference in means is found to be statistically significant when we ignore the multiple comparisons.

Discussion

When we want to apply pairwise comparisons across many groups, Tukey's HSD is a good option. Another approach is to apply p-value adjustments to the pairwise tests, as sketched below.
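As an illustrative sketch (not part of the original analysis), the p-value adjustment route can be tried on the same df with base R's pairwise.t.test(); "bonferroni" and "BH" are just two of the methods accepted by p.adjust():

# Pairwise t-tests on the df built above, with p-values adjusted
# for multiple comparisons (Bonferroni as an example)
pairwise.t.test(df$Value, df$Var, p.adjust.method = "bonferroni")

# A less conservative alternative: Benjamini-Hochberg ("BH")
pairwise.t.test(df$Value, df$Var, p.adjust.method = "BH")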

You can also have a look at how you can consider the multiple comparisons in A/B/n Testing

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.


The post ANOVA vs Multiple Comparisons first appeared on R-bloggers.


The Shift and Balance Fallacies

Posted: 15 Oct 2020 01:06 AM PDT

[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Two related fallacies I see in machine learning practice are the shift and balance fallacies (for an earlier simple fallacy, please see here). They involve thinking logistic regression has a bit simpler structure than it actually does, and also thinking logistic regression is a bit less powerful than it actually is.

The fallacies are somewhat opposite: the first is that shifting or re-weighting data doesn't change much, and the second is that re-balancing is a necessary pre-processing step. As the two ideas seem to contradict each other, it would be odd if they were both true. In fact, we are closer to both being false.

The shift fallacy

The shift fallacy is as follows. We fit two models m and m_shift with data-weights one (the all ones vector) and a * (one - y) + b * y (y being the dependent variable). We are re-sampling according to outcome, a (not always advisable) technique popular with some for un-balanced classification problems (note: we think this technique is popular due to the common error of using classification rules for classification problems). The fallacy is then to (falsely) believe that the two models differ only in the intercept term.

This is easy to disprove in R.

library(wrapr)

# build our example data
# modeling y as a function of x1 and x2 (plus intercept)

d <- wrapr::build_frame(
  "x1"  , "x2", "y" |
    0   , 0   , 0   |
    0   , 0   , 0   |
    0   , 1   , 1   |
    1   , 0   , 0   |
    1   , 0   , 0   |
    1   , 0   , 1   |
    1   , 1   , 0   )

knitr::kable(d)
x1 x2 y
0 0 0
0 0 0
0 1 1
1 0 0
1 0 0
1 0 1
1 1 0

First we fit the model with each data-row having the same weight.

m <- glm(
  y ~ x1 + x2,
  data = d,
  family = binomial())

m$coefficients
## (Intercept)          x1          x2 
##  -1.2055937  -0.3129307   1.3620590

Now we build a balanced weighting. We are up-sampling both classes so we don't have any fractional weights (fractional weights are fine, but they trigger a warning in glm()).

w <- ifelse(d$y == 1, sum(1 - d$y), sum(d$y))
w
## [1] 2 2 5 2 2 5 2
# confirm prevalence is 0.5 under this weighting
sum(w * d$y) / sum(w)
## [1] 0.5
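(As an aside, and not part of the original write-up: a fractional weighting reaches the same 0.5 prevalence and is statistically fine, but with a binomial family glm() warns about non-integer successes, which is why integer up-sampling weights are used above.)

# Hypothetical fractional alternative: give each class a total weight of 0.5
w_frac <- ifelse(d$y == 1, 0.5 / sum(d$y), 0.5 / sum(1 - d$y))
sum(w_frac * d$y) / sum(w_frac)  # prevalence is again 0.5 under this weighting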

Now we fit the model for the balanced data situation.

m_shift <- glm(
  y ~ x1 + x2,
  data = d,
  family = binomial(),
  weights = w)

m_shift$coefficients
## (Intercept)          x1          x2 
##  -0.5512784   0.1168985   1.4347723

Notice that all of the coefficients changed, not just the intercept term. And we have thus demonstrated the shift fallacy.

The balance fallacy

An additional point is: the simple model without re-weighting is the better model on this training data. There appears to be an industry belief that to work with unbalanced classes one must re-balance the data. In fact, moving to "balanced data" doesn't magically improve model quality; what it does is help hide some of the bad consequences of using classification rules instead of probability models (please see here for some discussion).

For instance our original model has the following statistical deviance (lower is better):

deviance <- function(prediction, truth) {
  -2 * sum(truth * log(prediction) + (1 - truth) * log(1 - prediction))
}

deviance(
  prediction = predict(m, newdata = d, type = 'response'),
  truth = d$y)
## [1] 7.745254

And our balanced model has a worse deviance.

deviance(
  prediction = predict(m_shift, newdata = d, type = 'response'),
  truth = d$y)
## [1] 9.004022

Part of the issue is that the balanced model is scaled wrong. Its average prediction is, by design, inflated.

mean(predict(m_shift, newdata = d, type = 'response'))
## [1] 0.4784371

Whereas the original model's average prediction equals the average of the truth values (a property of logistic regression).

mean(predict(m, newdata = d, type = 'response'))
## [1] 0.2857143
mean(d$y)
## [1] 0.2857143

So let's adjust the balanced predictions back to the correct expected value (essentially Platt scaling).

d$balanced_pred <- predict(m_shift, newdata = d, type = 'link')

m_scale <- glm(
  y ~ balanced_pred,
  data = d,
  family = binomial())

corrected_balanced_pred <- predict(m_scale, newdata = d, type = 'response')

mean(corrected_balanced_pred)
## [1] 0.2857143

We now have a prediction with the correct expected value. However, notice this deviance is still larger than the simple un-weighted original model.

deviance(
  prediction = corrected_balanced_pred,
  truth = d$y)
## [1] 7.803104

Our opinion is: re-weighting or re-sampling data for a logistic regression is pointless. The fitting procedure deals with un-balanced data quite well, and doesn't need any attempt at help. We think this sort of re-weighting and re-sampling introduces complexity, the possibility of data-leaks with up-sampling, and a loss of statistical efficiency with down-sampling. Likely the re-sampling fallacy is driven by a need to move model scores to near 0.5 when using 0.5 as a default classification rule threshold (which we argue against in "Don't Use Classification Rules for Classification Problems"). This is a problem that is more easily avoided by insisting on a probability model over a classification rule.

Conclusion

Some tools, such as logistic regression, work best on training data that accurately represents the distributional facts of the problem, and do not require artificially balanced training data. Also, re-balancing training data is a bit more involved than one might think, as we have seen that more than just the intercept term changes when we re-balance data.

Take logistic regression as the entry-level probability model for classification problems. If it doesn't need data re-balancing, then any other tool claiming to be universally better than it should also not need artificial re-balancing (though if it internally uses classification rule metrics, some hyper-parameters or internal procedures may need to be adjusted).

Prevalence re-balancing works around mere operational issues, such as using classification rules (instead of probability models) or using sub-optimal metrics (such as accuracy). However, these operational issues are better corrected directly than worked around. A lot of the complexity we see in modern machine learning pipelines is patches patching the unwanted effects of previous patches.

(The source for this article can be found here, and a rendering of it here.)

To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC.


The post The Shift and Balance Fallacies first appeared on R-bloggers.


Benford’s law meets IPL, Intl. T20 and ODI cricket

Posted: 15 Oct 2020 12:28 AM PDT

[This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

"To grasp how different a million is from a billion, think about it like this: A million seconds is a little under two weeks; a billion seconds is about thirty-two years."

"One of the pleasures of looking at the world through mathematical eyes is that you can see certain patterns that would otherwise be hidden."

               Steven Strogatz, Prof at Cornell University

Introduction

Within the last two weeks, I was introduced to Benford's Law by two of my friends. Initially, I looked it up on Google and was quite intrigued by the law. Subsequently, another friend asked me to check out the 'Digits' episode from the "Connected" series on Netflix by Latif Nasser, which I strongly recommend you watch.

Benford's Law, also called the Newcomb–Benford law, the law of anomalous numbers, or the First Digit Law, states that, when dealing with quantities obtained from Nature, the frequency of appearance of each digit in the first significant place is logarithmic. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30.1% of the time, the number 2 about 17.6%, the number 3 about 12.5%, all the way down to the number 9 at about 4.6%. This interesting logarithmic pattern is observed in most natural datasets, from population densities, river lengths and heights of skyscrapers to tax returns. What is really curious about this law is that when we measure the lengths of rivers, it holds regardless of the units used to measure. So the lengths of the rivers would obey the law whether we measure in meters, feet, miles, etc. There is something almost mystical about this law.

The law has also been used widely to detect financial fraud, manipulations in tax statements, bots on Twitter, fake accounts in social networks, image manipulation, etc. In this age of deep fakes, the ability to detect fake images will assume paramount importance. While deviations from Benford's Law do not always signify fraud, to a large extent they point to an aberration. Prof. Nigrini of Cape Town used this law to identify financial discrepancies in Enron's financial statements, which resulted in the infamous scandal. Also, the 2009 Iranian election was suspected to be fraudulent, as the first-digit percentages did not conform to those specified by Benford's Law.

While it cannot be said with absolute certainty, marked deviations from Benford's law could possibly indicate that there has been manipulation of natural processes. Possibly, Benford's law could be used to detect large-scale match-fixing in cricket tournaments. However, we cannot look at this in isolation, and other statistical and forensic methods may be required to determine whether there is fraud. Here is an interesting paper: Promises and perils of Benford's law.

A set of numbers is said to satisfy Benford's law if the leading digit d (d ∈ {1, …, 9}) occurs with probability

\(P(d) = \log_{10}(1 + 1/d)\)

This law also works for numbers in other bases; for base b >= 2,

\(P(d) = \log_{b}(1 + 1/d)\)
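(A quick illustrative check, added here and not from the original post: the expected base-10 first-digit frequencies can be computed directly in R and match the percentages quoted above.)

# Expected Benford's-law frequencies for leading digits 1..9 in base 10
digits <- 1:9
round(log10(1 + 1 / digits), 3)
# approximately: 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046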

Interestingly, this law also applies to sports, for example to the number of points scored in basketball. I was curious to see if it applied to cricket. Previously, using my R package yorkr, I had already converted all the T20 and ODI data from Cricsheet, which is available at yorkrData2020. I wanted to check whether Benford's Law worked on the runs scored, or the deliveries faced, by batsmen at the team level or at a tournament level (IPL, Intl. T20 or ODI).

Thankfully, R has a package, benford.analysis, to check whether data behaves in accordance with Benford's Law, and I have used this package in this post.

This post is also available in RPubs as Benford's Law meets IPL, Intl. T20 and ODI

library(data.table)
library(reshape2)
library(dplyr)
library(benford.analysis)
library(yorkr)

In this post, I check a random selection of data against Benford's law. The fully converted dataset is available in yorkrData2020, linked above. You can try it on any dataset, including ODI (men, women), Intl. T20 (men, women), IPL, BBL, PSL, NTB and WBB.

1. Check the runs distribution by Royal Challengers Bangalore

We can see that the behaviour is as expected under Benford's law, with minor deviations.

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")  rcbRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive")   rcbRunsTrends  ##   ## Benford object:  ##    ## Data: battingDetails$runs   ## Number of observations used = 1205   ## Number of obs. for second order = 99   ## First digits analysed = 1  ##   ## Mantissa:   ##   ##    Statistic  Value  ##         Mean  0.458  ##          Var  0.091  ##  Ex.Kurtosis -1.213  ##     Skewness -0.025  ##   ##   ## The 5 largest deviations:   ##   ##   digits absolute.diff  ## 1      1         14.26  ## 2      7         13.88  ## 3      9          8.14  ## 4      6          5.33  ## 5      4          4.78  ##   ## Stats:  ##   ##  Pearson's Chi-squared test  ##   ## data:  battingDetails$runs  ## X-squared = 5.2091, df = 8, p-value = 0.735  ##   ##   ##  Mantissa Arc Test  ##   ## data:  battingDetails$runs  ## L2 = 0.0022852, df = 2, p-value = 0.06369  ##   ## Mean Absolute Deviation (MAD): 0.004941381  ## MAD Conformity - Nigrini (2012): Close conformity  ## Distortion Factor: -18.8725  ##   ## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

2. Check the 'balls played' distribution by Royal Challengers Bangalore

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")  rcbBallsPlayedTrends = benford(battingDetails$ballsPlayed, number.of.digits = 1, discrete = T, sign = "positive")   plot(rcbBallsPlayedTrends)

 

3. Check the runs distribution by Chennai Super Kings

The trend seems to deviate from the expected behaviour to some extent for the digits 5 and 7.

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Chennai Super Kings-BattingDetails.RData")  cskRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive")   cskRunsTrends  ##   ## Benford object:  ##    ## Data: battingDetails$runs   ## Number of observations used = 1054   ## Number of obs. for second order = 94   ## First digits analysed = 1  ##   ## Mantissa:   ##   ##    Statistic  Value  ##         Mean  0.466  ##          Var  0.081  ##  Ex.Kurtosis -1.100  ##     Skewness -0.054  ##   ##   ## The 5 largest deviations:   ##   ##   digits absolute.diff  ## 1      5         27.54  ## 2      2         18.40  ## 3      1         17.29  ## 4      9         14.23  ## 5      7         14.12  ##   ## Stats:  ##   ##  Pearson's Chi-squared test  ##   ## data:  battingDetails$runs  ## X-squared = 22.862, df = 8, p-value = 0.003545  ##   ##   ##  Mantissa Arc Test  ##   ## data:  battingDetails$runs  ## L2 = 0.002376, df = 2, p-value = 0.08173  ##   ## Mean Absolute Deviation (MAD): 0.01309597  ## MAD Conformity - Nigrini (2012): Marginally acceptable conformity  ## Distortion Factor: -17.90664  ##   ## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

4. Check runs distribution in all of Indian Premier League (IPL)

battingDF <- NULL
teams <-c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils",
          "Kings XI Punjab", 'Kochi Tuskers Kerala',"Kolkata Knight Riders",
          "Mumbai Indians", "Pune Warriors","Rajasthan Royals",
          "Royal Challengers Bangalore","Sunrisers Hyderabad","Gujarat Lions",
          "Rising Pune Supergiants")

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails")
for(team in teams){
  battingDetails <- NULL
  val <- paste(team,"-BattingDetails.RData",sep="")
  print(val)
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
             setNext=TRUE
           }
  )
  details <- battingDetails
  battingDF <- rbind(battingDF,details)
}
## [1] "Chennai Super Kings-BattingDetails.RData"
## [1] "Deccan Chargers-BattingDetails.RData"
## [1] "Delhi Daredevils-BattingDetails.RData"
## [1] "Kings XI Punjab-BattingDetails.RData"
## [1] "Kochi Tuskers Kerala-BattingDetails.RData"
## [1] "Kolkata Knight Riders-BattingDetails.RData"
## [1] "Mumbai Indians-BattingDetails.RData"
## [1] "Pune Warriors-BattingDetails.RData"
## [1] "Rajasthan Royals-BattingDetails.RData"
## [1] "Royal Challengers Bangalore-BattingDetails.RData"
## [1] "Sunrisers Hyderabad-BattingDetails.RData"
## [1] "Gujarat Lions-BattingDetails.RData"
## [1] "Rising Pune Supergiants-BattingDetails.RData"
trends = benford(battingDF$runs, number.of.digits = 1, discrete = T, sign = "positive") 
trends
## 
## Benford object:
##  
## Data: battingDF$runs 
## Number of observations used = 10129 
## Number of obs. for second order = 123 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic   Value
##         Mean  0.4521
##          Var  0.0856
##  Ex.Kurtosis -1.1570
##     Skewness -0.0033
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      2        159.37
## 2      9        121.48
## 3      7         93.40
## 4      8         83.12
## 5      1         61.87
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  battingDF$runs
## X-squared = 78.166, df = 8, p-value = 1.143e-13
## 
## 
##  Mantissa Arc Test
## 
## data:  battingDF$runs
## L2 = 5.8237e-05, df = 2, p-value = 0.5544
## 
## Mean Absolute Deviation (MAD): 0.006627966
## MAD Conformity - Nigrini (2012): Acceptable conformity
## Distortion Factor: -20.90333
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

5. Check Benford's law in India matches

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")  load("India-BattingDetails.RData")    indiaTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive")   plot(indiaTrends)

 

6. Check Benford's law in all of Intl. T20

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")  teams <-c("Australia","India","Pakistan","West Indies", 'Sri Lanka',            "England", "Bangladesh","Netherlands","Scotland", "Afghanistan",            "Zimbabwe","Ireland","New Zealand","South Africa","Canada",            "Bermuda","Kenya","Hong Kong","Nepal","Oman","Papua New Guinea",            "United Arab Emirates","Namibia","Cayman Islands","Singapore",            "United States of America","Bhutan","Maldives","Botswana","Nigeria",            "Denmark","Germany","Jersey","Norway","Qatar","Malaysia","Vanuatu",            "Thailand")    for(team in teams){    battingDetails <- NULL    val <- paste(team,"-BattingDetails.RData",sep="")    print(val)    tryCatch(load(val),             error = function(e) {               print("No data1")               setNext=TRUE             }                              )    details <- battingDetails    battingDF <- rbind(battingDF,details)      }  intlT20Trends = benford(battingDF$runs, number.of.digits = 1, discrete = T, sign = "positive")   intlT20Trends  ##   ## Benford object:  ##    ## Data: battingDF$runs   ## Number of observations used = 21833   ## Number of obs. for second order = 131   ## First digits analysed = 1  ##   ## Mantissa:   ##   ##    Statistic  Value  ##         Mean  0.447  ##          Var  0.085  ##  Ex.Kurtosis -1.158  ##     Skewness  0.018  ##   ##   ## The 5 largest deviations:   ##   ##   digits absolute.diff  ## 1      2        361.40  ## 2      9        276.02  ## 3      1        264.61  ## 4      7        210.14  ## 5      8        198.81  ##   ## Stats:  ##   ##  Pearson's Chi-squared test  ##   ## data:  battingDF$runs  ## X-squared = 202.29, df = 8, p-value < 2.2e-16  ##   ##   ##  Mantissa Arc Test  ##   ## data:  battingDF$runs  ## L2 = 5.3983e-06, df = 2, p-value = 0.8888  ##   ## Mean Absolute Deviation (MAD): 0.007821098  ## MAD Conformity - Nigrini (2012): Acceptable conformity  ## Distortion Factor: -24.11086  ##   ## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

Conclusion

Maths rules our lives, more than we are aware, more than we like to admit. It is there in all of nature. Whether it is the recursive patterns of Mandelbrot sets, the intrinsic notion of beauty in the golden ratio, the murmuration of swallows, the synchronous blinking of fireflies or the near universality of Benford's law on natural datasets, mathematics governs us.

Isn't it strange that, while we humans pride ourselves on free will, the runs scored by batsmen in particular formats conform to Benford's rule for the first digits? It almost looks as if the runs that will be scored are, to an extent, predetermined to fall within specified ranges obeying Benford's law. So much for choice.

Something to be pondered over!

Also see

  1. Introducing GooglyPlusPlus!!!
  2. Deconstructing Convolutional Neural Networks with Tensorflow and Keras
  3. Going deeper into IBM's Quantum Experience!
  4. Experiments with deblurring using OpenCV
  5. Big Data 6: The T20 Dance of Apache NiFi and yorkpy
  6. Deep Learning from first principles in Python, R and Octave – Part 4
  7. Practical Machine Learning with R and Python – Part 4
  8. Re-introducing cricketr! : An R package to analyze performances of cricketers
  9. Bull in a china shop – Behind the scenes in Android

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….


The post Benford's law meets IPL, Intl. T20 and ODI cricket first appeared on R-bloggers.

10 Must-Know Tidyverse Features!

Posted: 14 Oct 2020 01:00 PM PDT

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R Tutorials Update

Interested in more R tutorials? Learn more R tips:

👉 Register for our blog to get new articles as we release them.


Tidyverse Updates

There is no doubt that the tidyverse, an opinionated collection of R packages, offers attractive, intuitive ways of wrangling data for data science. In earlier versions of the tidyverse, some elements of user control were sacrificed in favor of simplifying functions so they could be picked up and used easily by rookies. The 2020 updates to dplyr and tidyr have made progress toward restoring some of that finer control.

This means that there are new methods available in the tidyverse that some may not be aware of. These methods allow you to transform your data directly into the shape you want and to perform operations more flexibly. They also provide new, more readable ways to perform common tasks like nesting, modeling and graphing. Often users are only just scratching the surface of what can be done with the latest updates to this important set of packages.

It's incumbent on any analyst to stay up to date with new methods. This post covers ten examples of approaches to common data tasks that are better served by the latest tidyverse updates. We will use the new Palmer Penguins dataset, a great all round dataset for illustrating data wrangling.

First let's load our tidyverse packages and the Palmer Penguins dataset and take a quick look at it. Please be sure to install the latest versions of these packages before trying to replicate the work here.

library(tidyverse)
library(palmerpenguins)

penguins <- palmerpenguins::penguins %>% 
  filter(!is.na(bill_length_mm))

penguins
## # A tibble: 342 x 8  ##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g  ##                                                ##  1 Adelie  Torge~           39.1          18.7              181        3750  ##  2 Adelie  Torge~           39.5          17.4              186        3800  ##  3 Adelie  Torge~           40.3          18                195        3250  ##  4 Adelie  Torge~           36.7          19.3              193        3450  ##  5 Adelie  Torge~           39.3          20.6              190        3650  ##  6 Adelie  Torge~           38.9          17.8              181        3625  ##  7 Adelie  Torge~           39.2          19.6              195        4675  ##  8 Adelie  Torge~           34.1          18.1              193        3475  ##  9 Adelie  Torge~           42            20.2              190        4250  ## 10 Adelie  Torge~           37.8          17.1              186        3300  ## # ... with 332 more rows, and 2 more variables: sex , year 

The dataset presents several observations of anatomical parts of penguins of different species, sexes and locations, and the year that the measurements were taken.

1. Selecting columns

tidyselect helper functions are now built in to allow you to save time by selecting columns using dplyr::select() based on common conditions. In this case, if we want to reduce the dataset to just bill measurements we can use this, noting that all measurement columns contain an underscore:

penguins %>% 
  dplyr::select(!contains("_"), starts_with("bill"))
## # A tibble: 342 x 6  ##    species island    sex     year bill_length_mm bill_depth_mm  ##                                   ##  1 Adelie  Torgersen male    2007           39.1          18.7  ##  2 Adelie  Torgersen female  2007           39.5          17.4  ##  3 Adelie  Torgersen female  2007           40.3          18    ##  4 Adelie  Torgersen female  2007           36.7          19.3  ##  5 Adelie  Torgersen male    2007           39.3          20.6  ##  6 Adelie  Torgersen female  2007           38.9          17.8  ##  7 Adelie  Torgersen male    2007           39.2          19.6  ##  8 Adelie  Torgersen     2007           34.1          18.1  ##  9 Adelie  Torgersen     2007           42            20.2  ## 10 Adelie  Torgersen     2007           37.8          17.1  ## # ... with 332 more rows

A full set of tidyselect helper functions can be found in the documentation here.
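As a further illustrative sketch (not from the original post), two more helpers applied to the same penguins data; the column choices here are arbitrary:

# where() selects columns whose contents satisfy a predicate
penguins %>% dplyr::select(where(is.numeric))

# matches() selects columns via a regular expression
penguins %>% dplyr::select(species, matches("length"))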

2. Reordering columns

dplyr::relocate() allows a new way to reorder specific columns or sets of columns. For example, if we want to make sure that all of the measurement columns are at the end of the dataset, we can use this, noting that my last column is year:

penguins <- penguins %>% 
  dplyr::relocate(contains("_"), .after = year)

penguins
## # A tibble: 342 x 8  ##    species island sex    year bill_length_mm bill_depth_mm flipper_length_~  ##                                           ##  1 Adelie  Torge~ male   2007           39.1          18.7              181  ##  2 Adelie  Torge~ fema~  2007           39.5          17.4              186  ##  3 Adelie  Torge~ fema~  2007           40.3          18                195  ##  4 Adelie  Torge~ fema~  2007           36.7          19.3              193  ##  5 Adelie  Torge~ male   2007           39.3          20.6              190  ##  6 Adelie  Torge~ fema~  2007           38.9          17.8              181  ##  7 Adelie  Torge~ male   2007           39.2          19.6              195  ##  8 Adelie  Torge~    2007           34.1          18.1              193  ##  9 Adelie  Torge~    2007           42            20.2              190  ## 10 Adelie  Torge~    2007           37.8          17.1              186  ## # ... with 332 more rows, and 1 more variable: body_mass_g 

Similar to .after, you can also use .before as an argument here.
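For example, a minimal sketch (my own column choice, not from the original post) that moves year so it sits before sex:

# relocate() with .before instead of .after
penguins %>% 
  dplyr::relocate(year, .before = sex)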

3. Controlling mutated column locations

Note in the penguins dataset that there are no unique identifiers for each study group. This can be problematic when we have multiple penguins of the same species, island, sex and year in the dataset. To address this and prepare for later examples, let's add a unique identifier using dplyr::mutate(), and here we can illustrate how mutate() now allows us to position our new column in a similar way to relocate():

penguins_id <- penguins %>% 
  dplyr::group_by(species, island, sex, year) %>% 
  dplyr::mutate(studygroupid = row_number(), .before = contains("_"))

penguins_id
## # A tibble: 342 x 9  ## # Groups:   species, island, sex, year [35]  ##    species island sex    year studygroupid bill_length_mm bill_depth_mm  ##                                       ##  1 Adelie  Torge~ male   2007            1           39.1          18.7  ##  2 Adelie  Torge~ fema~  2007            1           39.5          17.4  ##  3 Adelie  Torge~ fema~  2007            2           40.3          18    ##  4 Adelie  Torge~ fema~  2007            3           36.7          19.3  ##  5 Adelie  Torge~ male   2007            2           39.3          20.6  ##  6 Adelie  Torge~ fema~  2007            4           38.9          17.8  ##  7 Adelie  Torge~ male   2007            3           39.2          19.6  ##  8 Adelie  Torge~    2007            1           34.1          18.1  ##  9 Adelie  Torge~    2007            2           42            20.2  ## 10 Adelie  Torge~    2007            3           37.8          17.1  ## # ... with 332 more rows, and 2 more variables: flipper_length_mm ,  ## #   body_mass_g 

4. Transforming from wide to long

The penguins dataset is clearly in a wide form, as it gives multiple observations across the columns. For many reasons we may want to transform data from wide to long. In long data, each observation has its own row. The older function gather() in tidyr was popular for this sort of task but its new version pivot_longer() is even more powerful. In this case we have different body parts, measures and units inside these column names, but we can break them out very simply like this:

penguins_long <- penguins_id %>% 
  tidyr::pivot_longer(contains("_"), # break out the measurement cols
                      names_to = c("part", "measure", "unit"), # break them into these three columns
                      names_sep = "_") # use the underscore to separate

penguins_long
## # A tibble: 1,368 x 9  ## # Groups:   species, island, sex, year [35]  ##    species island    sex     year studygroupid part    measure unit   value  ##                                 ##  1 Adelie  Torgersen male    2007            1 bill    length  mm      39.1  ##  2 Adelie  Torgersen male    2007            1 bill    depth   mm      18.7  ##  3 Adelie  Torgersen male    2007            1 flipper length  mm     181    ##  4 Adelie  Torgersen male    2007            1 body    mass    g     3750    ##  5 Adelie  Torgersen female  2007            1 bill    length  mm      39.5  ##  6 Adelie  Torgersen female  2007            1 bill    depth   mm      17.4  ##  7 Adelie  Torgersen female  2007            1 flipper length  mm     186    ##  8 Adelie  Torgersen female  2007            1 body    mass    g     3800    ##  9 Adelie  Torgersen female  2007            2 bill    length  mm      40.3  ## 10 Adelie  Torgersen female  2007            2 bill    depth   mm      18    ## # ... with 1,358 more rows

5. Transforming from long to wide

It's just as easy to move back from long to wide. pivot_wider() gives much more flexibility compared to the older spread():

penguins_wide <- penguins_long %>% 
  tidyr::pivot_wider(names_from = c("part", "measure", "unit"), # pivot these columns
                     values_from = "value", # take the values from here
                     names_sep = "_") # combine col names using an underscore

penguins_wide
## # A tibble: 342 x 9  ## # Groups:   species, island, sex, year [35]  ##    species island sex    year studygroupid bill_length_mm bill_depth_mm  ##                                       ##  1 Adelie  Torge~ male   2007            1           39.1          18.7  ##  2 Adelie  Torge~ fema~  2007            1           39.5          17.4  ##  3 Adelie  Torge~ fema~  2007            2           40.3          18    ##  4 Adelie  Torge~ fema~  2007            3           36.7          19.3  ##  5 Adelie  Torge~ male   2007            2           39.3          20.6  ##  6 Adelie  Torge~ fema~  2007            4           38.9          17.8  ##  7 Adelie  Torge~ male   2007            3           39.2          19.6  ##  8 Adelie  Torge~    2007            1           34.1          18.1  ##  9 Adelie  Torge~    2007            2           42            20.2  ## 10 Adelie  Torge~    2007            3           37.8          17.1  ## # ... with 332 more rows, and 2 more variables: flipper_length_mm ,  ## #   body_mass_g 

6. Running group statistics across multiple columns

dplyr can now apply multiple summary functions to grouped data using the across adverb, helping you be more efficient. If we wanted to summarize all bill and flipper measurements in our penguins we would do this:

penguin_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(across(ends_with("mm"), # do this for columns ending in mm
                          list(~mean(.x, na.rm = TRUE), 
                               ~sd(.x, na.rm = TRUE)))) # calculate a mean and sd

penguin_stats
## # A tibble: 3 x 7  ##   species bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2  ##                                                     ## 1 Adelie              38.8             2.66            18.3           1.22   ## 2 Chinst~             48.8             3.34            18.4           1.14   ## 3 Gentoo              47.5             3.08            15.0           0.981  ## # ... with 2 more variables: flipper_length_mm_1 ,  ## #   flipper_length_mm_2 

7. Controlling output column names when summarizing columns

The columns in penguin_stats have been given default names which are not that intuitive. If we name our summary functions, we can then use the .names argument to control precisely how we want these columns named. This uses glue notation. For example, here we want to construct the new column names by taking the existing column names, removing any underscores or 'mm' metrics, and pasting to the summary function name using an underscore:

penguin_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(across(ends_with("mm"), 
                          list(mean = ~mean(.x, na.rm = TRUE), 
                               sd = ~sd(.x, na.rm = TRUE)), # name summary functions
                          .names = "{gsub('_|_mm', '', col)}_{fn}")) # column names structure

penguin_stats
## # A tibble: 3 x 7  ##   species billlength_mean billlength_sd billdepth_mean billdepth_sd  ##                                             ## 1 Adelie             38.8          2.66           18.3        1.22   ## 2 Chinst~            48.8          3.34           18.4        1.14   ## 3 Gentoo             47.5          3.08           15.0        0.981  ## # ... with 2 more variables: flipperlength_mean , flipperlength_sd 

8. Running models across subsets

The output of summarize() can now be literally anything, because dplyr now allows different column types. We can generate summary vectors, dataframes or other objects like models or graphs.

If we wanted to run a model for each species you could do it like this:

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(model = list(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm))) # store models in a list column

penguin_models
## # A tibble: 3 x 2  ##   species   model   ##          ## 1 Adelie        ## 2 Chinstrap     ## 3 Gentoo    

It's not usually that useful to keep model objects in a dataframe, but we could use other tidy-oriented packages to summarize the statistics of the models and return them all as nicely integrated dataframes:

library(broom)

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(broom::glance(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm))) # summarize model stats

penguin_models
## # A tibble: 3 x 13  ##   species r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC  ##                                   ## 1 Adelie      0.508         0.498  325.      50.6 1.55e-22     3 -1086. 2181.  ## 2 Chinst~     0.504         0.481  277.      21.7 8.48e-10     3  -477.  964.  ## 3 Gentoo      0.625         0.615  313.      66.0 3.39e-25     3  -879. 1768.  ## # ... with 4 more variables: BIC , deviance , df.residual ,  ## #   nobs 

9. Nesting data

Often we have to work with subsets, and it can be useful to apply a common function across all subsets of the data. For example, maybe we want to take a look at our different species of penguins and make some different graphs of them. Grouping based on subsets would previously be achieved by the following somewhat awkward combination of tidyverse functions.

penguins %>% 
  dplyr::group_by(species) %>% 
  tidyr::nest() %>% 
  dplyr::rowwise()
## # A tibble: 3 x 2  ## # Rowwise:  species  ##   species   data                ##                      ## 1 Adelie      ## 2 Gentoo      ## 3 Chinstrap 

The new function nest_by() provides a more intuitive way to do the same thing:

penguins %>% 
  nest_by(species)
## # A tibble: 3 x 2  ## # Rowwise:  species  ##   species                 data  ##        >  ## 1 Adelie             [151 x 7]  ## 2 Chinstrap           [68 x 7]  ## 3 Gentoo             [123 x 7]

The nested data will be stored in a column called data unless we specify otherwise using a .key argument.
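A minimal sketch of that option (the column name here is arbitrary and not from the original post):

# store the nested data in a column called "measurements" instead of "data"
penguins %>% 
  nest_by(species, .key = "measurements")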

10. Graphing across subsets

Armed with nest_by() and the fact that we can now summarize or mutate virtually any type of object, we can generate graphs across subsets and store them in a dataframe for later use. Let's scatter plot bill length and depth for our three penguin species:

# generic function for generating a simple scatter plot in ggplot2
# {{ }} (embracing) lets the caller pass bare column names via tidy evaluation
scatter_fn <- function(df, col1, col2, title) {
  df %>% 
    ggplot2::ggplot(aes(x = {{ col1 }}, y = {{ col2 }})) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "loess", formula = "y ~ x") +
    ggplot2::labs(title = title)
}

# run function across species and store plots in a list column
penguin_scatters <- penguins %>% 
  dplyr::nest_by(species) %>% 
  dplyr::mutate(plot = list(scatter_fn(data, bill_length_mm, bill_depth_mm, species)))

penguin_scatters
## # A tibble: 3 x 3  ## # Rowwise:  species  ##   species                 data plot    ##        >   ## 1 Adelie             [151 x 7]     ## 2 Chinstrap           [68 x 7]     ## 3 Gentoo             [123 x 7] 

Now we can easily display the different scatter plots to show, for example, that our penguins exemplify Simpson's Paradox:

library(patchwork)

# generate scatter for entire dataset
p_all <- scatter_fn(penguins, bill_length_mm, bill_depth_mm, "All Species")

# get species scatters from penguin_scatters dataframe
for (i in 1:3) {
  assign(paste("p", i, sep = "_"),
         penguin_scatters$plot[i][[1]])
}

# display nicely using patchwork in R Markdown
p_all / (p_1 | p_2 | p_3) +
  plot_annotation(caption = "{palmerpenguins} dataset")

(Figure: bill depth vs. bill length scatter plots for all species combined and for each species separately)

Author: Jim Gruman, Data Analytics Leader

Serving enterprise needs with innovators in mobile power, decision intelligence, and product management, Jim can be found at https://jimgruman.netlify.app.

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


The post 10 Must-Know Tidyverse Features! first appeared on R-bloggers.


My year in R

Posted: 14 Oct 2020 11:00 AM PDT

[This article was first published on R on Amit Levinson, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A figure of an animated R Icon
Image by Allison Horst

I have been learning R for a little over a year now, and it was, and still is, a great experience. But a year isn't a lot, so why write a blog post about it?

I believe that pausing what one is doing and periodically evaluating whether the pursuit is heading in the right direction is a healthy process. Doing so can help you acknowledge your accomplishments and think about where you're heading. And I'm glad I have the opportunity to do it in the following post.

The Journey

I was wondering how to summarize my past year: a list of resources? A story? A listicle? I decided to go with an item list, somewhat chronologically ordered, that I believe captures my experience. Of course, like many other things in life, the timeline isn't completely rigid, as I did some items concurrently or jumped back and forth between them.

1. Hearing about R

I first heard of R when it was used in a hierarchical linear models workshop I attended. The workshop focused more on the statistics part of the analysis, so we didn't go in depth into the code. Subsequently, I heard about R twice more – once from a friend studying psychology, Yarden Ashur, and once from my sister, Maayan Levinson, a statistician with the CBS. I'll admit it took me some time to pick it up, but eventually I did.

2. First learning steps

My friend from psychology also told me there was a recorded R course for psychology available on Moodle (a platform for online course information) that I could use freely. The course was led by Yoav Kessler and proved to be a fantastic introduction. I followed along with the course and did the different assignments until we reached ggplot, a library for plotting in R.

3. Joining the TidyTuesday community

By the time we reached ggplot in the psychology course, I was already somewhat familiar with Twitter, where #TidyTuesday mostly takes place. #TidyTuesday is an amazing project where every week a new dataset is published for the #rstats community to analyze, visualize and post their results on Twitter. My excitement and motivation to participate were extremely high: so many professional and experienced R users working on the same dataset, conjuring amazing visualizations and posting their code for others to explore (and all this for free)?! I was blown away. So I followed along on Twitter for a week or two until I said OK, let's give it a try.

It was week 38 of 2019 and we were working on visualizing national parks. I wasn't really sure what to do, so I did a minimal exploration of the data and noticed an interesting increase in visitors to national parks across the years, which seemed intuitive and perfect for a first visualization. After the basic area graph I made, I remembered a visualization from a week earlier by Ariane Aumaitre, who used roller-coaster icons in her graph. Knowing nothing about how to integrate icons, I adapted her code into my visualization to create a nice scenery for the mountain shape the data displayed (see tweet below). I was pretty satisfied at the time and the feedback from the #rstats community was incredible – I was hooked on the project.

First visualization and participation in TidyTuesday

I believe the project was a fantastic introduction to continuously analyzing and visualizing data in R. Participating in the project provides a safe, motivating and rich setting to practice and learn R. Additionally, I didn't have anything that 'forced' me to learn R, so knowing that every week I had a new data set to analyze and visualize along with others provided me with a sense of routine and commitment.

4. Opening a GitHub account

Following the first visualization for #TidyTuesday, I wanted to share the code I wrote. At the time I was only using GitHub to read code written by others. Using the Happy Git with R guide, I was able to properly upload my code and synchronize future work. Since then, using GitHub has taught me so much: reading others' code and discovering new functions; organizing my own code so others can easily read it, thus 'forcing me' to clean it once I finished a project; and having a place to host all my efforts. I sincerely believe that opening a GitHub account to share everything I did was an important and pivotal moment in learning R. Although I still have so much more to learn when it comes to cleaning code and project management, a lot of what I know now is attributed to having GitHub repositories and code as accessible as possible for others to explore and learn from.

5. Visualizing things I was interested in

As I was participating in #Tidytuesday Eliud Kipchoge broke (unofficially) the two-hour marathon barrier. I found this amazing and wanted to visualize the comparison between the new record and older ones. I manually copied the marathon record values from Wikipedia and used that to plot running icons representing the different records. It wasn't an aesthetic plot but it was definitely rewarding. I've since improved the visualization by making it reproducible and eventually wrote a blog post explaining the process of how I made it. Similarly, a month or two later I plotted bomb shelter locations around my house amidst missiles fired towards Israel, all in R while using Google maps. I finally took an opportunity to make a visualization that related to my daily routine.

6. Continuous learning

Well, it's kind of redundant to say this, as we're always learning, but it is important: After I joined the #Tidytuesday community, I started again to actively learn about visualizations and data wrangling in addition to solidifying my basic knowledge of R. For this I relied on the following sources:

  • R & R4DS book – I decided to start with the R for data science online book. Every morning I would spend 30-60 minutes either reading and learning something new or attempting to answer the book's questions. I'd re-write the code into my R console while following along, exploring and trying to understand what was going on. To validate my answers to the questions or understand those I didn't know I cross-checked them with the Excercise solutions book by Jeffrey Arnold. I also bought Data Visualizaiton, A practical introduction by Kieran Healy and Text Mining with R, A tidy Approach by Julia Silge and David Robinson, but I used them more on the go and less of a sit-down.

  • Joining online courses – While reading the R4DS book I decided to seek out some other courses at my university, mainly recreationally, as I had no course credits left to take. The friend who introduced me to R told me about a workshop on 'algorithms and research in the intersection of psychology and big data' by Michael Gilead. After getting approval to sit in on some classes, I found it a great introduction to working with big data. While joining in I met Almog Simchon, who led a fantastic 'text analysis in R' workshop the following semester. I'd also join Mattan Ben-Shachar's TA class and learn more about data wrangling and statistical methods. Additionally, a friend told me about Jonathan Rosenblatt's R course in the department of industrial engineering and management (recorded lectures in Hebrew are freely available). Although I didn't understand some of what was going on in these courses, I was glad to expose myself to new things I could follow up on later. If any of you are reading this, thank you very much for the opportunity.

  • Other visualization books – I found myself leaning towards books that are not related to the R ecosystem, but that I discovered during my R journey. That is, I found a strong liking towards visualizing data and bought books on that topic too. These books immensely improved my visualizations and how I look at visualizing data. I still have a lot more to learn – both theoretically and technically – but these books definitely inspired and opened my mind when it comes to visualizations. I highly recommend reading Storytelling with data by Cole Nussbaumer Knaflic and How Charts Lie by Alberto Cairo which I started with.

  • Watch live webinars & videos; attend meetups – An invaluable source of learning was participating in various online meetups and webinars. A good place to find many of them was Twitter, but I'm sure you can also find them in Facebook groups, the RStudio newsletter, etc. I sometimes didn't understand what they were talking about, but just exposing myself to it felt great (seems like a recurring theme). It motivated me to want to learn more in order to succeed in doing what was presented. I also highly recommend exploring the RStudio Videos page.

    Luckily, I started learning R before COVID-19 prevailed, so I was able to join two Israeli R meetups. It was a great experience: although I was a novice when I attended them, the community was great, people were welcoming, there was great pizza and beer, and the presentations were fantastic. It was a great source of inspiration for what was to come. Plus, seeing so many people enthusiastically talking about R made me understand that I'm not alone in liking this world.

7. Making my own website

A month or two into learning R, I noticed people had their own websites they had made in R. Again I was fascinated that this was possible – not only could I wrangle data and visualize it beautifully, but I could also build my own website? And for $10 I could use my own domain? This was crazy!



Opening my own website – A motivation to learn and write

Nearing mid-January (4± months into R), I decided it was time to open my own website. I had a few things I had already made (Eliud Kipchoge's record and the bomb shelters around my house) and also wanted a place for others to learn more about me. I scrolled through and followed along with the blogdown book for creating websites with R, viewed some of Allison Hill's blogdown workshops, and other resources. Eventually, I was set up and had my website live, done in R, hosted for free on Netlify and GitHub with an elegant Hugo Academic theme, and my own domain for only $10! I was amazed at how easy and rewarding this was. I mean, I had no knowledge (and still don't) of HTML, CSS or anything else needed to build a website, and here I conjured one, and pretty easily!

I highly recommend creating a website. Even if you're not an R user, I think a personal website is a great motivator for writing blogs; a platform for others to learn more about you and a not so difficult thing to do today. Opening a website has definitely motivated me to learn much more by writing about it (here's a great talk by David Robinson on The unreasonable effectiveness of public work).

8. Giving a talk about R

During Passover (April) 2020, the Israel-2050 fellows group sent out a call inviting individuals to talk about anything they wanted. I decided to take the opportunity and give a talk there, and following that to a group of friends of mine who meet periodically with someone presenting something. Although I was only ~7 months into learning R, I wanted to share its amazing abilities for wrangling and visualizing data, how extremely different it is to use compared to the SPSS I had learned, and how it helped me explore intriguing questions I had. So I sat down, wrote an outline, and made a presentation using the {xaringan} package. You can find the slides here.

The talk went well (I think), and some of the participants even followed up to ask about resources for getting started, how they can do this in R, and so on. More importantly, though, making and giving the talk forced me to think about what it is in R that I like. Organizing these thoughts and communicating them in a way that appeals to an audience was a fantastic opportunity to stop and think about exactly that: why do I like working in R, and why should they join in?




My first talk about R (use your keyboard arrows to scroll through it).

9. Integrating R into my daily work

Using R as a research assistant – I was very fortunate that the researcher I work for, Dr. Jennifer Oser, was (and still is) very supportive of integrating R into our daily work. I remember that as we started analyzing our data and trying to make sense of it, I was debating whether to open SPSS, Excel or R. Luckily, I knew how to do some of what we wanted in R, so I turned to that. I believe we've greatly progressed since, so much so that I now find it absurd to use anything else. If you can integrate R into your daily work it's definitely a bonus; I know I learned a lot (I mean a lot) about rmarkdown and version control once I started using R in my research assistant position.

Integrating R into my thesis – The reason I initially started learning R was so that I could analyze my thesis findings and finish my MA with a new skill. No one forced me to use R, and I'm sure I could have done OK with SPSS (or maybe not?), but I was keen on using R in my thesis; it was an exciting and challenging experience. Prior to my thesis I had mostly done visualizations and descriptive reports, so it was great to work on regression models, reliability analyses and other kinds of reports. I also learned more about version control, and about reusing the same functions I wrote for the pilot study in my main analysis, and so forth. I couldn't imagine producing SPSS tables and pasting them each time into a separate text document; plus, it was very rewarding to automate the process as much as possible.
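As a rough illustration of that kind of automation (the variable and data names here are placeholders, not from the actual thesis), a regression table can be rendered straight into an R Markdown report instead of being copy-pasted from SPSS output:

  library(broom)
  library(knitr)

  # fit a model on a (hypothetical) survey data frame
  fit <- lm(outcome ~ predictor1 + predictor2, data = survey_data)

  # tidy() turns the fit into a data frame of coefficients;
  # kable() prints it as a table in the knitted document
  kable(tidy(fit), digits = 2)

Re-knitting the document after the data changes regenerates every table, which is exactly the point.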

10. Blog, and then blog some more

I imagine you've heard this saying a lot, but I definitely agree with it: if you like it, then you should write about it. Don't write for others, write for yourself. While I mentioned earlier that I wrote about my visualizations, I also wanted to learn about the statistical analyses I came across. I would read about a topic and think of an example I could easily use to explain the concept. For myself, not for others.

For example, to learn what term frequency-inverse document frequency (tf-idf) was, I implemented it by analyzing the tf-idf of 4 books by political theorists I like. At one point I wanted to learn more about Bernoulli trials, so I explored the uncertainty in the Israeli lottery. Alternatively, write about a challenge you faced and how you solved it. In another example, I wrote about presenting a static summary of categorical variables from my thesis pilot survey (found here).
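As a minimal sketch in the spirit of that analysis (the `books` data frame is a placeholder with one row per line of text and a `title` column), tf-idf is only a few lines with {tidytext}:

  library(dplyr)
  library(tidytext)

  book_tfidf <- books %>%
    unnest_tokens(word, text) %>%        # split each line into words
    count(title, word, sort = TRUE) %>%  # word counts per book
    bind_tf_idf(word, title, n)          # adds tf, idf and tf_idf columns

  # the highest tf-idf words are the ones that characterize each book
  book_tfidf %>% arrange(desc(tf_idf))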

Don't write for others to click on your website; rather, write for you to learn or communicate something you want to share with your future self and the world, no matter who reads it.

Summary

So this was a not-so-short recap of my last year, which I hope was of value. A lot of the above is owed to the amazing R community: anyone and everyone who blogs, shares their code, and engages about R on social media has been forthcoming. I'm very grateful to the many people I've reached out to with random questions, asking to join their course or inquiring about further reading.

It's interesting to look back on something you've done and ask whether and how you would have done it differently. As for that, I'm not sure, and I'm kind of glad that it happened the way it did.

I think my main takeaways are:

  1. Learn a bit from a course, website, blog, book, etc. Don't get too caught up in any one of them, in my opinion; rather, try to apply what you learn to examples that are similar to those in the book but aren't handed to you on a silver platter (for example, search for your own dataset on Kaggle and the like).

  2. Find a community and project that will keep you hooked. If you're interested in R, then join the #TidyTuesday project!

  3. Share what you learn – whether by uploading your code to GitHub, opening your own website or giving a talk. One of the things that helped me overcome my hesitation to write after only 4 months of using R (a touch of "imposter syndrome") was discovering that a YouTube video on how to unzip a file had 650,000 views. That is, there's an audience for everything. Don't think "I'm not good enough to write about it". One of the best ways I learned something was understanding it well enough to communicate it to others.

  4. Have fun! The more, the better!

If you're looking for a place to start, Oscar Baruffa compiled a fantastic resource aggregating ~100 books about R (most of them free).

What's next for me

Great question! Honestly, I don't know. I hope to finish my thesis soon and find a job that will require me to work with R and visualize data. In addition, I'll probably also try to learn some Tableau and improve my SQL skills, as they are somewhat sought after in various jobs I've looked at. As for R, I hope to learn some new concepts and statistical analyses, incorporate more #TidyTuesdays into my weekly routine, and analyze some data I have waiting around for a blog post. Of course everything is flexible, in which case I really don't know what's coming, but I'm definitely excited about it!

To leave a comment for the author, please follow the link and comment on their blog: R on Amit Levinson.


The post My year in R first appeared on R-bloggers.


2 Months in 2 Minutes – rOpenSci News, October 2020

Posted: 14 Oct 2020 11:00 AM PDT

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)




rOpenSci HQ




rOpenSci at R-Ladies

Our community manager Stefanie Butland and one of our software review editors, Brooke Anderson, are speaking remotely at an R-Ladies East Lansing meetup on Thursday, October 22nd. They will talk about how to get involved in rOpenSci, using our new Contributing Guide as an entry point, and through participating in software review as a package author or reviewer.




Contributing Guide Release

The purpose of our new Contributing Guide is to welcome you to rOpenSci and help you recognize yourself as a potential contributor. It will help you figure out what you might gain by giving your time, expertise, and experience; match your needs with things that will help rOpenSci's mission; and connect you with resources to help you along the way.




Software Peer Review

3 community-contributed packages passed software peer review.

[hex logos of the medrxivr and treedata.table packages]

Consider submitting your package or volunteering to review. If you want to be a reviewer, fill out this short form, and we'll ping you when there's a submission that fits your area of expertise.




Software

2 new peer-reviewed packages from the community are on CRAN.

1 new package from the rOpenSci team is on CRAN.




On the Blog




From the community




From the rOpenSci team




Use Cases

  • 74 published works cited or used rOpenSci software (listed in individual newsletters)

  • 6 use cases for our packages or resources were posted in our discussion forum. Look for citecorp, europepmc, fulltext, magick, oai, rAltmetric, rcrossref, refsplitr, rentrez, rtweet, roadoi

Have you used an rOpenSci package? Share your use case and we'll tweet about it.




In the News




Call For Maintainers

Part of our mission is making sustainable software that users can rely on. Sometimes software maintainers need to give up maintenance due to a variety of circumstances. When that happens we try to find new maintainers. Check out our guidance for taking over maintenance of a package.

monkeylearn is in need of a new maintainer. Comment on this thread if you're interested.




Keep up with rOpenSci

We create a newsletter every two weeks. You can subscribe via RSS feed in XML or JSON, or via our one-way mailing list.

Follow @rOpenSci on Twitter.

Find out how you can contribute to rOpenSci as a user or developer.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


The post 2 Months in 2 Minutes - rOpenSci News, October 2020 first appeared on R-bloggers.


Overengineering in ML – business life is not a Kaggle competition

Posted: 14 Oct 2020 03:41 AM PDT

[This article was first published on That's so Random, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

"Overengineering is the act of designing a product to be more robust or have more features than often necessary for its intended use, or for a process to be unnecessarily complex or inefficient." This is how the Wikipedia page on overengineering starts. It is the diligent engineer who wants to make sure that every possible feature is incorporated in the product, that creates an overengineered product. We find overengineering in real world products, as well as in software. It is a relevant concept in data science as well. First of all, because software engineering is very much a part of data science. We should be careful not to create dashboards, reports and other products that are too complex and contain more information than the user can stomach. But maybe there is a second, more subtle lesson, in overengineering for data scientists. We might create machine learning models that predict too well. Sounds funny? Let me explain what I mean by it.

In machine learning, theoretically at least, there is an optimal model given the available data in the train set. It is the one that gives the best predictions on new data, the one with just the right level of complexity. It is not so simple that it misses predictive relationships between features and target (i.e. it is not underfitting), but also not so complex that it incorporates random noise from the train set (i.e. it is not overfitting). The gold standard within machine learning is to hold out part of the train set to represent new data, to gauge where on the bias-variance continuum the predictor is: either by using a test set, by using cross-validation, or, ideally, both.
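As a minimal sketch of that gold standard (the data frame, target and model choice are placeholders, not tied to any particular project), {caret} makes the combination of a held-out test set and cross-validation straightforward:

  library(caret)

  set.seed(1)
  idx      <- createDataPartition(df$y, p = 0.8, list = FALSE)
  train_df <- df[idx, ]
  test_df  <- df[-idx, ]

  # 5-fold cross-validation on the training data
  fit <- train(y ~ ., data = train_df, method = "lm",
               trControl = trainControl(method = "cv", number = 5))

  fit$results                                     # cross-validated performance
  postResample(predict(fit, test_df), test_df$y)  # performance on the held-out test set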

Machine learning competitions, like the ones on Kaggle, challenge data scientists to find a model that is as close to the theoretical optimum as possible. Since different models and machine learning algorithms typically excel in different areas, the optimal result is often attained by combining them in what is called an ensemble. Not seldom are ML competitions won by multiple contestants who joined forces and combined their models into one big super model.
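A toy version of such an ensemble, continuing the hypothetical split from the sketch above, is simply averaging the predictions of two different models:

  library(randomForest)

  fit_lm <- lm(y ~ ., data = train_df)
  fit_rf <- randomForest(y ~ ., data = train_df)

  # equal-weight blend of the two models' predictions
  pred_ensemble <- (predict(fit_lm, newdata = test_df) +
                    predict(fit_rf, newdata = test_df)) / 2

Competition-winning ensembles are of course far more elaborate (weighted blends, stacking), but the principle is the same.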

In the ML competition context, there is no such thing as "predicting too well"; predicting as well as you can is the sheer goal of these competitions. However, in real-world applications this is not the case, in my opinion. There the objective is (or maybe should be) creating as much business value as possible. With this goal in mind we should realize that optimizing machine learning models comes with costs. Obviously, there is the salary of the data scientist(s) involved. The closer you come to the optimal model, the harder you'll need to scrape for improvements. Most likely, there will be diminishing returns on the time spent as the project progresses, in terms of gained prediction accuracy.

But costs can also lie in the complexity of the implementation. I don't mean the model complexity here, but the complexity of the product as a whole. The amount of code written might increase sharply when more complex features are introduced. Or a more involved model might require training to run on multiple cores, or increase the training time by, say, fivefold. Making your product more complex makes it more vulnerable to bugs and more difficult to maintain in the future. Although the predictions of a more complex model might be (slightly) better, its business value might actually be lower than that of a simpler solution because of this vulnerability.

The strange-sounding statement in the introduction of this blog, "we might create machine learning models that predict too well", might make more sense now. Too much time and money can be invested, creating a product that is too complex and performs too well for the business needs it serves. In other words, we are overengineering the machine learning solution.

Fighting overengineering

There are at least two ways to help you avoid overengineering a machine learning product. First, build the product incrementally. Probably no surprise coming from a proponent of working in an agile way: I think starting small and simple is the way to go. If the predictions are not up to par with the business requirements, see where the biggest improvement can be made in the least amount of time, adding the least amount of complexity to the product. Then assess again and start another cycle if needed, until you arrive at a solution that is just good enough for the business need. We could call this Occam's model: the simplest possible solution that fulfills the requirements.

Second, realise that judging whether the predictions are good enough to meet business needs is a business decision, not a data science choice. If you have someone on your team who is responsible for allocating resources, planning, etc. (PO, manager, business lead, whatever they are called), it should be predominantly their call whether further improvement is needed. The question these people ask data scientists is too often "Is the model good enough already?", where it should be "What is the current performance of the model?". As a data scientist in the midst of optimisation, you might not be the best judge of good enough; our ideas for further optimisation and general perfectionism could cloud our judgement. Rather, we should make it our job to inform the business people as well as we can about the current performance, and leave the final call to them.

To leave a comment for the author, please follow the link and comment on their blog: That's so Random.


The post Overengineering in ML - business life is not a Kaggle competition first appeared on R-bloggers.


Zoom talk on “Organising exams in R” from the Grenoble R user group

Posted: 14 Oct 2020 01:55 AM PDT

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Due to the current health situation, the Grenoble (France) R user group has decided to switch 100% online. The advantage is that our talks will now be open to anybody around the globe 🙂

The next talk will be on October 22nd, 2020 at 5PM (FR):

Organising exams in R

The {exams} package enables creating questionnaires that combine program output, graphs, etc. in an automated and dynamic fashion. They may be exported in many different formats: HTML, PDF, NOPS and, most interestingly, XML. The XML output is compatible with Moodle, which makes it possible to reproducibly generate random questions in R, create a great number of different exams, and turn them into an online Moodle exam.
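As a rough sketch of that workflow (the exercise file names below are hypothetical), the package's exams2* functions render a set of dynamic .Rmd exercises into randomised exams:

  library(exams)

  exercises <- c("question1.Rmd", "question2.Rmd", "question3.Rmd")

  # 10 randomised versions of the exam as Moodle XML, ready for import
  exams2moodle(exercises, n = 10, name = "midterm")

  # the same exercises can also be rendered to PDF or HTML
  exams2pdf(exercises, n = 1)
  exams2html(exercises, n = 1)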

Link to the event:

https://www.eventbrite.com/e/organising-exams-in-r-tickets-125308530187

Link to Zoom: https://us04web.zoom.us/j/76885441433?pwd=bUhvejdUb2sxa29saEk5M3NlMldBdz09

Link to the Grenoble R user group 2020/2021 calendar: 

https://r-in-grenoble.github.io/sessions.html

Hope to see you there!


Zoom talk on "Organising exams in R" from the Grenoble R user group was first posted on October 14, 2020 at 2:55 pm.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


The post Zoom talk on "Organising exams in R" from the Grenoble R user group first appeared on R-bloggers.


Climate Change & AI for GOOD | Online Open Forum Oct 15th

Posted: 14 Oct 2020 01:55 AM PDT

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Join Data Natives for a discussion on how to curb Climate Change and better protect our environment for the next generation. Get inspired by innovative solutions which use data, machine learning and AI technologies for GOOD. Lubomila Jordanova, Founder of Plan A, and featured speaker, explains that "the IT sector will use up to 51% of the global energy output in 2030. Let's adjust the digital industry and use Data for Climate Action, because carbon reduction is key to making companies future-proof." When used carefully, AI can help us solve some of the most serious challenges. However, key to that success is measuring impact with the right methods, mindsets, and metrics.

The founders of startups that developed innovative solutions to combat humanity's biggest challenge will share their experiences and thoughts: Brittany Salas (Co-Founder at Active Giving) | Peter Sänger (Co-Founder/Executive Managing Director at Green City Solutions GmbH) | Shaheer Hussam (CEO & Co-Founder at Aetlan) | Lubomila Jordanova (Founder at Plan A) | Oliver Arafat (Alibaba Cloud's Senior Solution Architect)

Details
What? Climate Change & AI for GOOD | DN Unlimited Open Forum powered by Alibaba Cloud
When? October 15th at 6 PM CET
Where? Online, worldwide
Register for FREE here: https://datanatives.io/climate-change-ai-for-good-open-forum/


Climate Change & AI for GOOD | Online Open Forum Oct 15th was first posted on October 14, 2020 at 2:55 pm.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


The post Climate Change & AI for GOOD | Online Open Forum Oct 15th first appeared on R-bloggers.

