[R-bloggers] Course sequence: Data analytics for the liberal arts (and 10 more aRticles)



Course sequence: Data analytics for the liberal arts

Posted: 31 Aug 2020 02:41 PM PDT

[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

I'm a proud liberal arts graduate myself who, with some fumbling, ended up in the world of data analytics. It may sound odd, but I never fancied myself much of a "math person," and I still love to explore the world through the arts and qualitative methods.

I still hold that it's because of these interests that I decided to dig into data analytics: I wanted to be as efficient, productive and automated as possible in working with quantitative data, so that I had the time and energy to explore it qualitatively.

This distinction holds a truth that we can use to unite data analytics with the liberal arts: computers are good at processing data, and humans are good at making sense of it.

The good news about computers is that they do what you tell them to do. The bad news is that they do what you tell them to do.

Ted Nelson

Why the liberal arts in an age of big data and technology? The liberal arts teach you to have a perspective and make sense of the world. As I said in my post on creating the future with the liberal arts, "The liberal arts student appreciates the grit, fragility and triumph of the human condition. We can only see where we are going if we know where we've been."

Humans may have a gift for contextualizing large amounts of long-term information, but we are terrible at holding things in short-term memory. For example, most of us can hold only about seven items in working memory at a time.

(If you don't believe how limited our short-term memory is, take this test.)

Computers, on the other hand, thrive at executing orders that require processing lots of well-defined chunks of data. This is where computational thinking and data literacy hold so much promise for liberal arts students.

Here is a basic two-semester course sequence that I proposed to a liberal arts professor friend of mine. The professor had mentioned the typical scenario of liberal arts grads being ill-prepared for the job market, and that the school was looking for ways to better position its students for career success.

I know this scenario well — I was one of those graduates myself, but I've never come to regret receiving my education. Instead, I've looked for ways to integrate data analytics education with liberal arts education, specifically so that liberal arts students, who have so much value to offer employers, are seen in a compelling light.

To do so, I've condensed a coding bootcamp-like experience into a three-credit-hour college course. Students will learn how to build spreadsheet models, read information from databases, and script data collection and analysis processes. These skills will make liberal arts graduates more competitive on the job market.

My second course is geared toward liberal arts students who wish to pursue graduate school and careers in academia. These students can benefit from computational processes as they are increasingly being used in those fields. The course is heavy on text analysis and natural language processing — again, to provide a method for comparing many bits of short-term memory at scale.

I hope the below learning guide can benefit your liberal arts organization. Universities must adapt, but I do not believe the best way to do so is by abandoning liberal arts programs. There is a way to augment this field of study with computational and data analytics methods.

Data analytics for liberal arts PDF

Data analytics for the liberal arts

I imagine this as a two-semester sequence for upperclassmen. The first course, "Intro to Data Analytics," is more of a general survey of data analytics with an emphasis on applications in the workforce. The second, "Data Analytics for the Liberal Arts," is geared toward students interested in graduate school in the humanities or similar fields. However, these topics would be of use to others as well, since natural language processing has become popular in industry.

Intro to Data Analytics

This is a full-semester, 3-credit-hour course. I think of it as a "data analytics bootcamp" delivered as a typical semester-long college course.

Learning outcomes:

  • Student can use multiple data analytics technologies in building a data analysis project
  • Student can create visualizations and other data-backed assets using information design principles
  • Student can conduct compelling presentations based on data analysis

Topics:

  • What is data analytics? Why is it so popular? Should it be so popular? What can liberal arts students bring to the table? (This can be a bit contrarian and liberal arts triumphalist.)
  • Basic spreadsheet modeling: be able to build a break-even pricing model or a staffing model. Almost anyone in business will have to build a budget at some point and it will probably be in Excel.
  • Databases: Understand the basic architecture that powers modern business and how it's changed in the last 15 years. Be able to write basic commands in structured query language (SQL) to read from databases. Understand the difference between relational and non-relational databases, structured and unstructured data.
  • Information design: Much of this is based on cognitive science. Learn when to use a chart versus a table, how to design effective slide decks, visualizations and dashboards, and how to use these assets to deliver sound data-backed presentations and recommendations. Pairs nicely with the liberal arts emphasis on sound communication.
  • Coding: Perform basic task automation and data analysis with R.

Assessments:

  • Build a break-even model for an ecommerce store. Present findings to management.
  • Build an interactive dashboard investigating sales performance over time
  • Perform basic data exploration and analysis of a relational database
  • Conduct reproducible data analysis exercise and presentation using a series of data analytics tools

Data Analytics for the Liberal Arts

This is a full-semester, 3-credit-hour course. The text could be Text Analysis with R for Students of Literature, with some supplemental material. It may be wise to make "Intro to Data Analytics" a prerequisite along with a basic statistics course.

This course will be conducted entirely in R. The "target demo" would be students interested in grad school for the humanities or related, but the appeal could easily extend as natural language processing is becoming quite popular in industry.

Learning outcomes:

  • Student can conduct exploratory data analysis of unstructured data sets
  • Student can build reproducible data analysis projects in line with the CRISP-DM or similar methodology
  • Student can create compelling visualizations to aid in their findings

Topics:

  • Advantages and disadvantages of data analytics in the liberal arts
  • Basic exploratory data analysis: describing, summarizing, visualizing one and multiple variables
  • Unstructured data collection: APIs, web scraping
  • What makes data "tidy" and why it matters
  • Structured vs unstructured data
  • Word frequency analysis
  • Document similarity
  • Document clustering & classification
  • Topic modeling
  • Network analysis

Assessments:

  • Collect data from an API or the Web, explore and prepare it for quantitative analysis
  • Visualize word frequencies and occurrences across the corpus of one or more authors
  • Build a model to classify documents based on predicted original author
  • Conduct network analysis on historical or literary figures 

If helping your undergraduate liberal arts population succeed in their careers and in grad school through data analytics is something you're wrestling with, please set up a free call with me and we can talk.

This download is part of my resource library. For exclusive free access, subscribe below.

To leave a comment for the author, please follow the link and comment on their blog: George J. Mount.


The post Course sequence: Data analytics for the liberal arts first appeared on R-bloggers.


Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios

Posted: 31 Aug 2020 07:33 AM PDT

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)


Introduction

Risk ratios (RR) and odds ratios (OR) are both used to analyze factors associated with increased risk of disease. They differ in that a risk ratio is the ratio of two risks and an odds ratio is the ratio of two odds, and risks and odds are calculated differently. For example, with one toss of a fair die, the risk of rolling a 1 is 1/6 = 0.167, while the odds of rolling a 1 are 1/5 = 0.200. In textbook terms, risk is successes divided by total trials and odds are successes divided by failures. In this example rolling a 1 is a success and rolling 2 through 6 is a failure.
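To make the relationship concrete, here is a minimal R sketch (not part of the original post) showing how the die-roll risk converts to odds:

# Risk and odds of rolling a 1 with a fair die
risk <- 1/6            # successes / total trials
odds <- 1/5            # successes / failures
# The two quantities are linked by odds = risk / (1 - risk)
risk / (1 - risk)      # 0.2, which matches the odds above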

In R, glm (generalized linear model) is used to compute adjusted risk ratios (ARR's) and adjusted odds ratios (AOR's). Adjusting eliminates the influence of correlations between factors on the ratios. In statistical terms, adjusting controls for confounding. In addition to confounding, the RR's and OR's can be influenced by interaction which is also called effect modification. This occurs when the ratio for one group is different from the ratio for another group. For example, non-smokers who work with asbestos might have 4.0 times the risk of getting lung cancer compared to non-smokers who do not work with asbestos. This is for non-smokers. For smokers, this risk ratio might be much larger at 8.0, meaning smokers who work with asbestos have 8.0 times the risk of getting lung cancer compared to smokers who do not work with asbestos. In this case there is interaction between smoking and asbestos exposure because the RR for non-smokers is 4.0 and the RR for smokers is 8.0. Alternatively, it could be said that smoking modifies the effect of the asbestos exposure on lung cancer.

For a given data set, the AOR's and ARR's will almost always agree in terms of statistical significance. The AOR's and ARR's will be different but the p values will be close. This is not true for interaction terms. Often, using the same data, an analysis of interaction using OR's will indicate significant interaction while the RR's will indicate no interaction. This phenomenon is explained very well in the article "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz. (Epidemiologic Perspectives & Innovations 2005, 2:1 doi:10.1186/1742-5573-2-1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/.

The authors described the problem for RR's and OR's and included mathematical explanations. The purpose of this article is to show how this problem occurs when using adjusted RR's and adjusted OR's with the R glm function to assess interaction.

Creating a test data set

The first step is to create a test data set. The R code below produces a data set of 1,000,000 lines with simulated variables for two risk factors, a grouping variable and an outcome variable. This is designed to simulate a situation where there are two groups of people that have different risks of disease. There are also two risk factors and these are designed to have the same risk ratios for the disease for simulated people in both groups.

>
> rm(list=ls(all=TRUE))
>
> nx=1000000
>
> set.seed(523)
>
> x1=runif(nx)
> x2=runif(nx)
> grpx=runif(nx)
> dep1=runif(nx)
>
> rf1=ifelse(x1 < 0.20,1,0)
> rf2=ifelse(x2 < 0.30,1,0)
> grp=ifelse(grpx < 0.5,1,0)
>
> d1=as.data.frame(cbind(rf1,rf2,grp,dep1))

The R code above creates the two risk factor variables (rf1 and rf2) and the group variable (grp). The proportions are 0.20 for rf1, 0.30 for rf2, and 0.50 for each of grp 0 and grp 1. This simulates a scenario where the risk factors are not too rare and the two groups are similar to a male/female grouping. The realized proportions will vary slightly because of the randomization with the runif function.
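As a quick sanity check (not part of the original transcript), the realized proportions can be inspected directly:

# Realized proportions; these should be close to 0.20, 0.30 and 0.50
mean(d1$rf1)
mean(d1$rf2)
mean(d1$grp)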

First Simulation

> #Sim 1
>
> attach(d1,warn.conflict=F)
>
> xg12f=(grp*100)+(rf1*10)+rf2
>
> disprop=0.01
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.019679
>

The block of output above creates a new variable, xg12f, that combines the three variables, grp, rf1 and rf2. This makes it easier to create the outcome variable named disease. Each value of xg12f defines a unique combination of the variables grp, rf1 and rf2. The disease proportion (disprop) is initially set to 0.01 and then the series of ifelse statements increases the disease proportion by multiples of disprop based on the risk factors and group. This simulates a situation where the disease is rare for people with no risk factors (xg12f = 0) and then increases with various combinations of risk factors and group. For example, a simulated person in group 0 with risk factor 1 but not risk factor 2 would have a value for xg12f of 10. The third ifelse statement sets the disease proportion for simulated persons in this category to 1.7 times the disprop value. In general, simulated persons with risk factors have higher disease proportions and simulated persons in group 1 have higher disease proportions. Simulated persons in group 1 with both risk factors (xg12f = 111) have the highest multiple at 5.1 times the proportion of disease for simulated persons in group 0 with no risk factors (xg12f = 0). As shown above, the overall proportion of simulated persons with the disease is 0.019679.
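To see the designed risk structure in the simulated data, one option (a sketch, not part of the original transcript) is to tabulate the realized disease proportion within each grp/rf1/rf2 cell:

# Disease proportion by xg12f cell; values should track the designed multiples of disprop
# (e.g. roughly 0.017 for xg12f == 10 when disprop = 0.01)
round(tapply(disease, xg12f, mean), 4)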

 

>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-0.3265  -0.2001  -0.2001  -0.1422   3.0328  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.598839   0.018000 -255.492   <2e-16 ***
rf1          0.500622   0.029614   16.905   <2e-16 ***
rf2          0.401599   0.026798   14.986   <2e-16 ***
grp          0.677464   0.021601   31.362   <2e-16 ***
rf1:rf2     -0.004331   0.031359   -0.138    0.890    
rf1:grp      0.032458   0.032800    0.990    0.322    
rf2:grp      0.032862   0.030670    1.071    0.284    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 193574  on 999999  degrees of freedom
Residual deviance: 189361  on 999993  degrees of freedom
AIC: 189375

Number of Fisher Scoring iterations: 7

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.01  0.01   0.01
rf1         1.65  1.56   1.75
rf2         1.49  1.42   1.57
grp         1.97  1.89   2.05
rf1:rf2     1.00  0.94   1.06
rf1:grp     1.03  0.97   1.10
rf2:grp     1.03  0.97   1.10
>


The R output above uses the glm function to produce ARR's with 95% confidence intervals and p values. In this glm the family is binomial with link = log. This produces coefficients that are the logs of the ARR's, and these are used to produce the table of ARR's and 95% CI's that appears below the glm output. Note that the intercept is the log of the disease proportion for simulated persons in group 0 with no risk factors, so the intercept row in the table of ARR's is not an ARR but rather that baseline proportion of disease, 0.01.
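As a quick check on the back-transformation (a sketch using the coefficient values printed above), exponentiating the log-link estimates by hand reproduces the table:

exp(-4.598839)             # about 0.010, the baseline disease proportion (grp = 0, no risk factors)
exp(0.500622)              # about 1.65, the adjusted risk ratio for rf1
exp(0.500622 + 0.032458)   # about 1.70, the risk ratio for rf1 within grp = 1 (with rf2 = 0)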

As expected, the ARR's are statistically significant for rf1, rf2 and grp. Also as expected, the interaction terms rf1:rf2, rf1:grp and rf2:grp, are not statistically significant.

Next are the AOR's.

>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-0.3265  -0.2000  -0.2000  -0.1423   3.0326  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.588289   0.018205 -252.037   <2e-16 ***
rf1          0.505839   0.030181   16.760   <2e-16 ***
rf2          0.405558   0.027242   14.887   <2e-16 ***
grp          0.686728   0.021929   31.316   <2e-16 ***
rf1:rf2      0.002334   0.032418    0.072    0.943    
rf1:grp      0.042153   0.033593    1.255    0.210    
rf2:grp      0.040418   0.031327    1.290    0.197    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 193574  on 999999  degrees of freedom
Residual deviance: 189361  on 999993  degrees of freedom
AIC: 189375

Number of Fisher Scoring iterations: 7

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.01  0.01   0.01
rf1         1.66  1.56   1.76
rf2         1.50  1.42   1.58
grp         1.99  1.90   2.07
rf1:rf2     1.00  0.94   1.07
rf1:grp     1.04  0.98   1.11
rf2:grp     1.04  0.98   1.11
>
The R output above is similar to the previous glm output but in this glm the family is binomial with link = logit instead of log. This produces coefficients that are the logs of the AOR's and these are used to produce the table of AOR's and 95% CI. The line for the intercept in the table is not AOR's but is the odds of disease for simulated persons with no risk factors. As with ARR's, the AOR's are statistically significant for rf1, rf2 and grp. The interaction terms rf1:rf2, rf1:grp and rf2:grp, are not statistically significant which also agrees with the glm output for the ARR's.

Second Simulation

> #sim 2
>
> disprop=0.05
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.09847
>
The R output above uses the same data file but the disprop variable is increased from 0.01 to 0.05. This reflects a situation where the disease proportion is higher. As shown above, the overall proportion of simulated persons with the disease is 0.09847.
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-0.7718  -0.4588  -0.4168  -0.3207   2.4468  

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.9935084  0.0078756 -380.099   <2e-16 ***
rf1          0.5071062  0.0127189   39.870   <2e-16 ***
rf2          0.4029193  0.0115901   34.764   <2e-16 ***
grp          0.6901061  0.0093582   73.743   <2e-16 ***
rf1:rf2      0.0007967  0.0130359    0.061    0.951    
rf1:grp      0.0155518  0.0139306    1.116    0.264    
rf2:grp      0.0206406  0.0131138    1.574    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 6

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.66  1.62   1.70
rf2         1.50  1.46   1.53
grp         1.99  1.96   2.03
rf1:rf2     1.00  0.98   1.03
rf1:grp     1.02  0.99   1.04
rf2:grp     1.02  0.99   1.05
>
The R output above is glm output with family = binomial and link = log. As before, the coefficients are the logs of ARR's and are used to produce the table of ARR's. Also as before the line for the intercept in the table is not ARR's but is the prevalence of disease for simulated persons with no risk factors. In this case it is 0.05 in contrast to the earlier glm result with the previous data file which was 0.01. As expected, this output shows statistically significant ARR's for the risk factors and no statistically significant results for the interaction terms.
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-0.7701  -0.4586  -0.4157  -0.3210   2.4459  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.939696   0.008344 -352.305  < 2e-16 ***
rf1          0.534243   0.014054   38.012  < 2e-16 ***
rf2          0.423218   0.012628   33.515  < 2e-16 ***
grp          0.740304   0.010119   73.158  < 2e-16 ***
rf1:rf2      0.042163   0.015615    2.700  0.00693 ** 
rf1:grp      0.072082   0.015824    4.555 5.23e-06 ***
rf2:grp      0.063938   0.014672    4.358 1.31e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 5

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.71  1.66   1.75
rf2         1.53  1.49   1.57
grp         2.10  2.06   2.14
rf1:rf2     1.04  1.01   1.08
rf1:grp     1.07  1.04   1.11
rf2:grp     1.07  1.04   1.10
>
The R output above is glm output with family = binomial and link = logit. The coefficients are the logs of AOR's and are used to produce the table of AOR's. The line for the intercept in the table is not AOR's but is the odds of disease for simulated persons with no risk factors. As expected, this output shows statistically significant AOR's for the risk factors. However, in contrast to the results with the ARR's, this glm output shows statistically significant results for the interaction terms. In short, with this data set, the interaction terms are not statistically significant when using ARR's but they are statistically significant when using AOR's.
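The mechanism behind this divergence can be seen directly from the designed risks (a minimal sketch using the group 0 proportions with disprop = 0.05, not part of the original transcript): the simulated risks are exactly multiplicative, but the corresponding odds are not once the disease is no longer rare.

p00 <- 0.05               # no risk factors
p10 <- 1.7 * 0.05         # rf1 only
p01 <- 1.5 * 0.05         # rf2 only
p11 <- 1.5 * 1.7 * 0.05   # both risk factors
odds <- function(p) p / (1 - p)

(p11 / p00) / ((p10 / p00) * (p01 / p00))                                       # exactly 1: no interaction on the risk scale
(odds(p11) / odds(p00)) / ((odds(p10) / odds(p00)) * (odds(p01) / odds(p00)))   # about 1.02: interaction appears on the odds scale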

Lastly, the value of the disprop variable is increased again, this time to 0.10. The output below shows the creation of the new disease variable and shows the overall proportion of simulated persons with the disease is increased to 0.196368.

The output below is the by-now-familiar glm output for ARR's and AOR's. With the increased proportion of disease, the ARR's still show the interaction terms as not statistically significant, but the AOR's show the interaction terms as having p values less than 2 x 10^-16, which is very small. Once again, the ARR's show no interaction while the AOR's indicate strong interaction.

 Third Simulation

>
> #sim 3
>
> disprop=0.10
>
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
>
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
>
> sum(disease)/nrow(d1)
[1] 0.196368
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-1.2029  -0.6670  -0.5688  -0.4587   2.1465  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.303778   0.005401 -426.540   <2e-16 ***
rf1          0.512466   0.008491   60.356   <2e-16 ***
rf2          0.402541   0.007816   51.502   <2e-16 ***
grp          0.691635   0.006344  109.025   <2e-16 ***
rf1:rf2      0.008824   0.008221    1.073    0.283    
rf1:grp      0.013005   0.009147    1.422    0.155    
rf2:grp      0.011577   0.008706    1.330    0.184    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 990652  on 999999  degrees of freedom
Residual deviance: 938046  on 999993  degrees of freedom
AIC: 938060

Number of Fisher Scoring iterations: 6

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.10  0.10   0.10
rf1         1.67  1.64   1.70
rf2         1.50  1.47   1.52
grp         2.00  1.97   2.02
rf1:rf2     1.01  0.99   1.03
rf1:grp     1.01  1.00   1.03
rf2:grp     1.01  0.99   1.03
>
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+ family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp +
    rf2 * grp, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-1.1920  -0.6657  -0.5655  -0.4604   2.1433  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.190813   0.006088 -359.89   <2e-16 ***
rf1          0.561856   0.010543   53.29   <2e-16 ***
rf2          0.438542   0.009391   46.70   <2e-16 ***
grp          0.796766   0.007483  106.48   <2e-16 ***
rf1:rf2      0.134330   0.012407   10.83   <2e-16 ***
rf1:grp      0.169283   0.012129   13.96   <2e-16 ***
rf2:grp      0.124378   0.011124   11.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 990652  on 999999  degrees of freedom
Residual deviance: 938069  on 999993  degrees of freedom
AIC: 938083

Number of Fisher Scoring iterations: 4

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.11  0.11   0.11
rf1         1.75  1.72   1.79
rf2         1.55  1.52   1.58
grp         2.22  2.19   2.25
rf1:rf2     1.14  1.12   1.17
rf1:grp     1.18  1.16   1.21
rf2:grp     1.13  1.11   1.16
>


Discussion

These simulations show that, given the same data, different generalized linear models will sometimes lead to different conclusions about interaction among the independent variables. When the disease is rare, adjusted odds ratios and adjusted risk ratios tend to agree regarding the statistical significance of the interaction terms. However, as the prevalence of the disease increases, the adjusted odds ratios and adjusted risk ratios tend to produce conflicting results for the interaction terms. This is discussed and explained mathematically in the article referred to above: "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz (Epidemiologic Perspectives & Innovations 2005, 2:1, doi:10.1186/1742-5573-2-1), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/. The purpose of this article was to show how this plays out with adjusted risk ratios and adjusted odds ratios using the R glm function.

For practicing statisticians and epidemiologists, it might be good advice to use risk ratios whenever denominators are available, and to use odds ratios only when denominators are not available, as in case-control studies.

For example, with case-control studies, data is collected for selected persons with a disease (cases) and also for a specific number of persons without the disease (controls). With this data there are no denominators for persons at risk of getting the disease and disease rates cannot be calculated. Without disease rates, rate ratios cannot be calculated, but case control data can be used to calculate odds ratios even though the rates are unavailable. For more on this see "Statistical Methods in Cancer Research Volume I: The Analysis of Case-Control Studies", IARC Scientific Publication No. 32, Authors: Breslow NE, Day NE at: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Scientific-Publications/Statistical-Methods-In-Cancer-Research-Volume-I-The-Analysis-Of-Case-Control-Studies-1980

Here is the R program used in this analysis. It should run when copied and pasted into the R window.


rm(list=ls(all=TRUE))

nx=1000000

set.seed(523)

x1=runif(nx)
x2=runif(nx)
grpx=runif(nx)
dep1=runif(nx)

rf1=ifelse(x1 < 0.20,1,0)
rf2=ifelse(x2 < 0.30,1,0)
grp=ifelse(grpx < 0.5,1,0)

d1=as.data.frame(cbind(rf1,rf2,grp,dep1))

#Sim 1: baseline disease proportion of 0.01

attach(d1,warn.conflict=F)

xg12f=(grp*100)+(rf1*10)+rf2

disprop=0.01

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

#log link: exponentiated coefficients are adjusted risk ratios
r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

#logit link: exponentiated coefficients are adjusted odds ratios
r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)

#sim 2: baseline disease proportion of 0.05

disprop=0.05

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)

#sim 3: baseline disease proportion of 0.10

disprop=0.10

disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)

disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)

sum(disease)/nrow(d1)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=log))
summary(r1)
x1=exp(confint.default(r1))
rr=exp(coef(r1))
round(cbind(rr,x1),digits=2)

r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
family=binomial(link=logit))
summary(r1)
x1=exp(confint.default(r1))
or=exp(coef(r1))
round(cbind(or,x1),digits=2)


Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios was first posted on August 31, 2020 at 2:33 pm.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


The post Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios first appeared on R-bloggers.


Data manipulation in R using data frames – an extensive article of basics, part 2 – aggregation and sorting

Posted: 31 Aug 2020 06:42 AM PDT

[This article was first published on dataENQ, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

Welcome to the second part of this two-part series on data manipulation in R. This article aims to present the reader with different ways of aggregating and sorting data. Here is the composition of this article.

  1. Aggregation in R
    1. Why aggregate data?
    2. Data used in this article
    3. Aggregation using base R
      1. Aggregation using the aggregate function
      2. Assigning a name to an aggregated column in aggregate function in R
      3. Aggregating multiple columns using an aggregate function in R
      4. Aggregating multiple columns data by multiple functions using the aggregate function in R
      5. Aggregation using by() function
      6. Aggregation using sweep() function
    4. Aggregation using dplyr
      1. Aggregation using group_by and summarize functions
      2. Aggregation by group_by and summarize functions with multiple variables and functions
      3. Aggregation using group_by and summarize functions with a range of variables
      4. Aggregation using group_by and summarize_if function
      5. Aggregation using group_by and summarize_at function
      6. Aggregation using group_by and summarize_all function
    5. Aggregation using data.table
      1. Converting data frame to data table
      2. Aggregating single variable using data table
      3. Assigning a custom name to aggregated variable in data table
      4. Aggregating multiple variables using multiple aggregation functions in data table
      5. Aggregating variables as well as filtering observations in data table
      6. Counting the number of observations within a group in data table
    6. Aggregation using sqldf
  2. Sorting in R
    1. Why sort data?
    2. Sorting using base R
      1. Sorting data using order function
      2. Sorting data in descending order using order function
    3. Sorting using dplyr
      1. Sorting data using arrange function
      2. Sorting data in descending order using arrange function
    4. Sorting using data.table
      1. Sorting data using order function
      2. Sorting data in descending order using order function
      3. Sorting data using setorder function
      4. Sorting data in descending order using setorder function
    5. Sorting using sqldf

Aggregation in R


To begin with, let's look at the term aggregation. As per the Cambridge dictionary, "the process of combining things or amounts into a single group or total" is aggregation. 

Why aggregate data?


Usually, data comes in two types, qualitative and quantitative; both are statistical terms. Qualitative data describes the characteristics of the data: labels, properties, attributes, and categories are all examples of qualitative data. Quantitative data, on the other hand, represents numbers. In other disciplines like data warehousing or business intelligence, qualitative data is equivalent to dimensions and quantitative data to measures.


Data analysis is a complex process that involves several steps, some of which require examining the data by its qualitative attributes. Usually, qualitative data has the higher granularity, where granularity means the level of detail.


For example, a dataset that contains all countries of the world may have multiple variables describing qualitative data. The country name sits at the most granular level, since no two countries have the same name. The data becomes less granular as we group countries by continent and then by hemisphere.


Similarly, in an analysis, data is examined at various levels by its different qualitative attributes. This is where aggregation comes into the picture: it is required whenever you want to explore quantitative data at a level that sits above the most granular one.


In this article, we will practice different aggregation functions and options offered by base R and by packages like dplyr and data.table. Note that this article does not aim to list every function that can be used to aggregate data; the aim is to explain the concept of data aggregation. R is a rich language that offers many ways of doing the same thing.


Data used in this article


We will use the data that we have prepared in our previous post How to prepare data for analysis in R. Click on this link to download the original data. This data is available under the PDDL license. However, to save space and make it cleaner, we will remove a few columns from the data frame. It's an opportunity for us to revise how to remove a column from the data frame. Let's first create the data frame and understand the data.

financials <- read.csv("constituents-financials_csv.csv")
str(financials)
## 'data.frame':    505 obs. of  14 variables:
## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ...
## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ...
## $ Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ...
## $ Price : num 222.9 60.2 56.3 108.5 150.5 ...
## $ Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ...
## $ Dividend.Yield: num 2.33 1.15 1.91 2.5 1.71 ...
## $ Earnings.Share: num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ...
## $ X52.Week.Low : num 259.8 68.4 64.6 125.9 162.6 ...
## $ X52.Week.High : num 175.5 48.9 42.3 60 114.8 ...
## $ Market.Cap : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ...
## $ EBITDA : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ...
## $ Price.Sales : num 4.39 3.58 3.74 6.29 2.6 ...
## $ Price.Book : num 11.34 6.35 3.19 26.14 10.62 ...
## $ SEC.Filings : chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...
Now that our data frame is ready, let's reshape it by selecting a few columns to keep the results clean and tidy. Note that the %>% pipe and the select function used below come from the dplyr package, so dplyr needs to be loaded first.

financials <- financials %>% select(Symbol, Sector, Price, X52.Week.Low, X52.Week.High) 

head(financials,10)
##    Symbol                 Sector  Price X52.Week.Low X52.Week.High
## 1 MMM Industrials 222.89 259.770 175.490
## 2 AOS Industrials 60.24 68.390 48.925
## 3 ABT Health Care 56.27 64.600 42.280
## 4 ABBV Health Care 108.48 125.860 60.050
## 5 ACN Information Technology 150.51 162.600 114.820
## 6 ATVI Information Technology 65.83 74.945 38.930
## 7 AYI Industrials 145.41 225.360 142.000
## 8 ADBE Information Technology 185.16 204.450 114.451
## 9 AAP Consumer Discretionary 109.63 169.550 78.810
## 10 AMD Information Technology 11.22 15.650 9.700

Our final data now has five variables and 505 observations. 

Aggregation using base R

Aggregation using the aggregate function

When talking about aggregation, the aggregate function is the most obvious choice. It is a generic function that works with both data frames and time series; if we are working with a data frame we can call the method aggregate.data.frame explicitly, but aggregate and aggregate.data.frame return the same result.
We will look at this function through different scenarios. Before any aggregation, a use case must exist, that is, a reason why you want to aggregate the data. This is crucial because it guides you to extract correctly summarized data that answers your question.
So the scenario here is to "find out the total price of all shares by all sectors". If you recall the section above on qualitative and quantitative data, the variables Sector and Price represent qualitative and quantitative data, respectively. Knowing this helps put each variable in its correct place within the function.

aggregate(financials$Price, by = list(financials$Sector), FUN = sum)
##                       Group.1        x
## 1 Consumer Discretionary 10418.90
## 2 Consumer Staples 2711.98
## 3 Energy 1852.40
## 4 Financials 6055.81
## 5 Health Care 8083.46
## 6 Industrials 7831.47
## 7 Information Technology 8347.00
## 8 Materials 2559.67
## 9 Real Estate 2927.52
## 10 Telecommunication Services 100.81
## 11 Utilities 1545.45

Assigning a name to an aggregated column in aggregate function in R

The result above shows that the aggregate function split the original data into subsets by the Sector variable and applied the sum function over the Price variable. The summarised column is named x by default. If you want to rename it, a little tweak to the code using the setNames function will do the trick. Here is an example.

setNames(aggregate(financials$Price, by = list(financials$Sector), FUN = sum),c("Sector","Total.Price"))
##                        Sector Total.Price
## 1 Consumer Discretionary 10418.90
## 2 Consumer Staples 2711.98
## 3 Energy 1852.40
## 4 Financials 6055.81
## 5 Health Care 8083.46
## 6 Industrials 7831.47
## 7 Information Technology 8347.00
## 8 Materials 2559.67
## 9 Real Estate 2927.52
## 10 Telecommunication Services 100.81
## 11 Utilities 1545.45

Aggregating multiple columns using the aggregate function in R

Now, to apply the same function to multiple variables, all you have to do is supply an expression that subsets the required columns from the data frame as an argument. Here is an example; we have wrapped the call in setNames to assign meaningful names to the variables.

The scenario here is to "find out the average price, average 52-weeks low price, and the average 52-week high price of all shares by all sectors". 

setNames(aggregate(financials[,c("Price","X52.Week.Low","X52.Week.High")], by = list(financials$Sector), FUN = mean), c("Sector","Average.Price","Average.Low","Average.High"))
##                        Sector Average.Price Average.Low Average.High
## 1 Consumer Discretionary 124.03452 146.93143 96.09236
## 2 Consumer Staples 79.76412 92.83229 68.92944
## 3 Energy 57.88750 72.58969 48.14123
## 4 Financials 89.05603 101.82185 72.69447
## 5 Health Care 132.51574 160.75853 103.71925
## 6 Industrials 116.88761 134.57948 90.83702
## 7 Information Technology 119.24286 138.77864 91.89142
## 8 Materials 102.38680 118.03885 85.58325
## 9 Real Estate 88.71273 110.55045 82.87809
## 10 Telecommunication Services 33.60333 41.69333 29.50367
## 11 Utilities 55.19464 68.49732 52.80232

Aggregating multiple columns data by multiple functions using the aggregate function in R

Moving on to the next level of the scenario, we would like to apply multiple functions to the different variables. This is possible by defining a custom function. The requirement now is to "find out the minimum and maximum values of the price and 52-week low price of all shares by all sectors". Here is an example.

aggregate(financials[c("Price","X52.Week.Low")], by = list(financials$Sector), FUN = function(x) c(min(x),max(x)))
##                       Group.1 Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2
## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990
## 2 Consumer Staples 19.96 208.73 21.175 229.500
## 3 Energy 2.82 169.16 6.590 199.830
## 4 Financials 13.38 509.38 16.530 594.520
## 5 Health Care 25.20 601.00 29.930 697.260
## 6 Industrials 14.45 334.30 30.590 361.790
## 7 Information Technology 11.22 1007.71 15.650 1198.000
## 8 Materials 17.16 387.65 20.250 435.150
## 9 Real Estate 14.01 409.98 21.530 495.345
## 10 Telecommunication Services 16.20 49.04 27.610 54.770
## 11 Utilities 10.06 145.29 12.050 159.640

The above command can be written in different ways. Here are a couple of examples. Note the use of cbind and formula (~).

aggregate(cbind(financials$Price,financials$X52.Week.Low) ~financials$Sector, FUN = function(x) c(min(x),max(x)))
##             financials$Sector    V1.1    V1.2     V2.1     V2.2
## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990
## 2 Consumer Staples 19.96 208.73 21.175 229.500
## 3 Energy 2.82 169.16 6.590 199.830
## 4 Financials 13.38 509.38 16.530 594.520
## 5 Health Care 25.20 601.00 29.930 697.260
## 6 Industrials 14.45 334.30 30.590 361.790
## 7 Information Technology 11.22 1007.71 15.650 1198.000
## 8 Materials 17.16 387.65 20.250 435.150
## 9 Real Estate 14.01 409.98 21.530 495.345
## 10 Telecommunication Services 16.20 49.04 27.610 54.770
## 11 Utilities 10.06 145.29 12.050 159.640
aggregate(cbind(Price,X52.Week.Low) ~ Sector, data = financials, FUN = function(x) c(min(x),max(x)))
##                        Sector Price.1 Price.2 X52.Week.Low.1 X52.Week.Low.2
## 1 Consumer Discretionary 10.43 1806.06 13.480 2067.990
## 2 Consumer Staples 19.96 208.73 21.175 229.500
## 3 Energy 2.82 169.16 6.590 199.830
## 4 Financials 13.38 509.38 16.530 594.520
## 5 Health Care 25.20 601.00 29.930 697.260
## 6 Industrials 14.45 334.30 30.590 361.790
## 7 Information Technology 11.22 1007.71 15.650 1198.000
## 8 Materials 17.16 387.65 20.250 435.150
## 9 Real Estate 14.01 409.98 21.530 495.345
## 10 Telecommunication Services 16.20 49.04 27.610 54.770
## 11 Utilities 10.06 145.29 12.050 159.640

With the aggregate function, the possibilities are endless; what we have seen in the sections above is only a part of them. Here are some other arguments you can add to the aggregate function based on your requirements.

  • na.action: Use this to specify whether your data contain NA or missing values and how you want to handle them
  • simplify: If used, results will be presented in vector or matrix form
  • formula: Depending on the subsetting requirement, formulas like y ~ x or cbind(y1,y2) ~ x1 + x2 can be used
  • data: The data frame in question which has the variables to subset and aggregate
  • drop: Use if you want to drop unused combinations of grouping values
  • subset: Use it to limit the operation to certain observations

Lastly, to name a few, you can use sum, count, first, last, mean, median, min, max, and sd functions with the aggregate function. If you think this list should have more function names, then please add a comment in the comment box below.
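For illustration, here is a hedged sketch (not from the original post) that combines a few of these arguments with the formula interface; the cutoff of 100 is arbitrary:

aggregate(Price ~ Sector, data = financials, FUN = mean,
          subset = Price > 100, na.action = na.omit)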

Aggregation using by() function

The by function differs from aggregate in that it is a wrapper for tapply, which means it also encapsulates tapply's functionality. The return value of by depends on the simplify argument: if simplify is TRUE it returns a list or an array, otherwise it always returns a list. Here is an example of the by function in which we find the sum of the Price variable by Sector.
All other arguments are self-explanatory except INDICES, which takes a factor or a list of factors.

by(data=financials$Price, INDICES = financials$Sector, FUN = sum, simplify = TRUE)
## financials$Sector: Consumer Discretionary
## [1] 10418.9
## ------------------------------------------------------------
## financials$Sector: Consumer Staples
## [1] 2711.98
## ------------------------------------------------------------
## financials$Sector: Energy
## [1] 1852.4
## ------------------------------------------------------------
## financials$Sector: Financials
## [1] 6055.81
## ------------------------------------------------------------
## financials$Sector: Health Care
## [1] 8083.46
## ------------------------------------------------------------
## financials$Sector: Industrials
## [1] 7831.47
## ------------------------------------------------------------
## financials$Sector: Information Technology
## [1] 8347
## ------------------------------------------------------------
## financials$Sector: Materials
## [1] 2559.67
## ------------------------------------------------------------
## financials$Sector: Real Estate
## [1] 2927.52
## ------------------------------------------------------------
## financials$Sector: Telecommunication Services
## [1] 100.81
## ------------------------------------------------------------
## financials$Sector: Utilities
## [1] 1545.45


Aggregation using sweep() function

There is one more function worth mentioning here that you can use to aggregate data. The sweep function returns an array after sweeping out summary statistics. It may not fit every aggregation requirement, but it is worth knowing about if you want to perform some operation while summarising data. It wouldn't be my first choice, though it may fit yours. Here is an example.

sweep(financials["Price"], MARGIN = 1, STATS = 0,FUN = sum)
## [1] 52434.47
The call above returns the sum of the Price variable. Over the years I haven't used sweep in my own work; if you know an aggregation scenario where this function would be most useful, please leave a comment in the comment box below. All other arguments are obvious except STATS: my understanding is that this is the value applied to the variable in conjunction with the function argument. I have set STATS to 0 and used sum as the function, which returns the total of the Price variable. If I set STATS to 1, then every value of the Price variable would be increased by one, and the total would be different.

Though sweep doesn't produce grouped or subsetted data like the others, I thought it was worth mentioning. Similarly, other functions from the apply family can be used to aggregate data, but you may run into grouping limitations or performance issues. You can achieve aggregation results with these functions, but there are better and faster options available, and from this point onwards we will concentrate on those.
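For reference, a more typical use of sweep (a sketch, not from the original post) is to sweep a statistic out of each column, for example centring the numeric variables by their column means:

num_cols <- financials[, c("Price", "X52.Week.Low", "X52.Week.High")]
centered <- sweep(num_cols, MARGIN = 2, STATS = colMeans(num_cols), FUN = "-")
head(round(centered, 2))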

Aggregation using dplyr

Aggregation using group_by and summarize functions

The three functions we have seen above may fit different aggregation scenarios, but in terms of ease of use I prefer the group_by and summarize functions of the dplyr package. The performance of these functions is subjective and depends on the size of the data and the operation; I have personally never benchmarked them, though everywhere I have read, group_by is praised for its performance. Let's look at an example of aggregating data using the group_by function.
The scenario here is to "find out the total price of all shares by all sectors". 
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
## Sector total.price
##    <chr>       <dbl>
## 1 Consumer Discretionary 10419.
## 2 Consumer Staples 2712.
## 3 Energy 1852.
## 4 Financials 6056.
## 5 Health Care 8083.
## 6 Industrials 7831.
## 7 Information Technology 8347
## 8 Materials 2560.
## 9 Real Estate 2928.
## 10 Telecommunication Services 101.
## 11 Utilities 1545.

Simply supplying the variable name within the group_by function is enough to group the observations by the values of the selected variable. The summarize function is the engine where the data is operated upon using the function of your choice. In the example above, I have chosen the sum function, which is applied over the grouped observations. This method of aggregation is known as split-apply-combine: first the observations are split into groups, then a function is applied to each group, and finally the results are combined for presentation.


If you were observant, then you may have noticed a warning as well. This warning is a hint to remind you how you want to control the grouping of data. You can suppress this warning by using options(dplyr.summarise.inform = FALSE). For more information, please click this link.
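An alternative to the global option is to set the .groups argument directly in summarize (a sketch assuming dplyr 1.0 or later):

financials %>%
  group_by(Sector) %>%
  summarize(total.price = sum(Price), .groups = "drop")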


Also, did you notice how the results are printed above? 


Nothing is wrong with that; it is because the return value is a tibble. You can use the print.data.frame() function to print it in a nicer format. Here is an example.


financials %>% group_by(Sector) %>% summarize(total.price = sum(Price)) %>% print.data.frame()
## `summarise()` ungrouping output (override with `.groups` argument)
##                        Sector total.price
## 1 Consumer Discretionary 10418.90
## 2 Consumer Staples 2711.98
## 3 Energy 1852.40
## 4 Financials 6055.81
## 5 Health Care 8083.46
## 6 Industrials 7831.47
## 7 Information Technology 8347.00
## 8 Materials 2559.67
## 9 Real Estate 2927.52
## 10 Telecommunication Services 100.81
## 11 Utilities 1545.45


Lastly, did you notice that the summarised variable is now named total.price? Simply putting the new name in front of the aggregation function does the trick. It is not mandatory, but I am sure you wouldn't like the default name.

Aggregation by group_by and summarize functions with multiple variables and functions

Let's progress with our scenario and find out the total price, minimum 52-week low, and maximum 52-week high price of all shares by all sectors. The aim here is to show the use of multiple variables with multiple aggregation functions.


financials %>% group_by(Sector) %>% summarize(total.price = sum(Price), min.52.week.low = min(X52.Week.Low), max.52.week.high = max(X52.Week.High)) %>% print.data.frame()
## `summarise()` ungrouping output (override with `.groups` argument)
##                        Sector total.price min.52.week.low max.52.week.high
## 1 Consumer Discretionary 10418.90 13.480 1589.0000
## 2 Consumer Staples 2711.98 21.175 152.0100
## 3 Energy 1852.40 6.590 125.4600
## 4 Financials 6055.81 16.530 368.0000
## 5 Health Care 8083.46 29.930 459.3400
## 6 Industrials 7831.47 30.590 256.4000
## 7 Information Technology 8347.00 15.650 824.3000
## 8 Materials 2559.67 20.250 302.0101
## 9 Real Estate 2927.52 21.530 361.9000
## 10 Telecommunication Services 100.81 27.610 42.8000
## 11 Utilities 1545.45 12.050 124.1800

Also, we can supply more than one variable in the group_by function if you want to group data by multiple variables.

Aggregation using group_by and summarize functions with a range of variables

You can use the summarize function in a variety of ways. If you have several variables to aggregate, then the following type of command would help. 


financials %>% group_by(Sector) %>% summarize(across(Price:X52.Week.High, ~sum(.x))) %>% print.data.frame()
## `summarise()` ungrouping output (override with `.groups` argument)
##                        Sector    Price X52.Week.Low X52.Week.High
## 1 Consumer Discretionary 10418.90 12342.240 8071.759
## 2 Consumer Staples 2711.98 3156.298 2343.601
## 3 Energy 1852.40 2322.870 1540.519
## 4 Financials 6055.81 6923.886 4943.224
## 5 Health Care 8083.46 9806.271 6326.874
## 6 Industrials 7831.47 9016.825 6086.081
## 7 Information Technology 8347.00 9714.505 6432.399
## 8 Materials 2559.67 2950.971 2139.581
## 9 Real Estate 2927.52 3648.165 2734.977
## 10 Telecommunication Services 100.81 125.080 88.511
## 11 Utilities 1545.45 1917.925 1478.465

Aggregation using group_by and summarize_if function

Or, if you want to aggregate all of the numeric variables in your data frame, then the following variant of the summarize function helps. (Note that the _if, _at and _all variants have been superseded by across in recent versions of dplyr, but they still work.)

financials %>% group_by(Sector) %>% summarize_if(is.numeric,sum) %>% print.data.frame()
##                        Sector    Price X52.Week.Low X52.Week.High
## 1 Consumer Discretionary 10418.90 12342.240 8071.759
## 2 Consumer Staples 2711.98 3156.298 2343.601
## 3 Energy 1852.40 2322.870 1540.519
## 4 Financials 6055.81 6923.886 4943.224
## 5 Health Care 8083.46 9806.271 6326.874
## 6 Industrials 7831.47 9016.825 6086.081
## 7 Information Technology 8347.00 9714.505 6432.399
## 8 Materials 2559.67 2950.971 2139.581
## 9 Real Estate 2927.52 3648.165 2734.977
## 10 Telecommunication Services 100.81 125.080 88.511
## 11 Utilities 1545.45 1917.925 1478.465

Aggregation using group_by and summarize_at function

summarize_at, on the other hand, helps you aggregate a specific selection of variables in one go. The requirement is that the variables are supplied as a character vector or wrapped in vars(); an equivalent vars() call is sketched after the output below.

financials %>% group_by(Sector) %>% summarize_at(c("Price","X52.Week.High"),sum) %>% print.data.frame()
##                        Sector    Price X52.Week.High
## 1 Consumer Discretionary 10418.90 8071.759
## 2 Consumer Staples 2711.98 2343.601
## 3 Energy 1852.40 1540.519
## 4 Financials 6055.81 4943.224
## 5 Health Care 8083.46 6326.874
## 6 Industrials 7831.47 6086.081
## 7 Information Technology 8347.00 6432.399
## 8 Materials 2559.67 2139.581
## 9 Real Estate 2927.52 2734.977
## 10 Telecommunication Services 100.81 88.511
## 11 Utilities 1545.45 1478.465
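
As mentioned above, the variables can also be wrapped in vars() instead of being passed as a character vector; a minimal equivalent sketch:

# Same aggregation as above, selecting the columns with vars()
financials %>% group_by(Sector) %>% summarize_at(vars(Price, X52.Week.High), sum) %>% print.data.frame()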

Aggregation using group_by and summarize_all function

Lastly, the summarize_all function aggregates all remaining variables in the data frame, and we can use multiple aggregation functions as well. Our data frame has two character variables, so let's drop Symbol before using summarize_all and then aggregate the data to show the total and minimum values of every numeric variable.

financials %>% select(-Symbol) %>% group_by(Sector) %>% summarize_all(c(sum,min))
## # A tibble: 11 x 7
## Sector Price_fn1 X52.Week.Low_fn1 X52.Week.High_f… Price_fn2 X52.Week.Low_fn2
##
## 1 Consu… 10419. 12342. 8072. 10.4 13.5
## 2 Consu… 2712. 3156. 2344. 20.0 21.2
## 3 Energy 1852. 2323. 1541. 2.82 6.59
## 4 Finan… 6056. 6924. 4943. 13.4 16.5
## 5 Healt… 8083. 9806. 6327. 25.2 29.9
## 6 Indus… 7831. 9017. 6086. 14.4 30.6
## 7 Infor… 8347 9715. 6432. 11.2 15.6
## 8 Mater… 2560. 2951. 2140. 17.2 20.2
## 9 Real … 2928. 3648. 2735. 14.0 21.5
## 10 Telec… 101. 125. 88.5 16.2 27.6
## 11 Utili… 1545. 1918. 1478. 10.1 12.0
## # … with 1 more variable: X52.Week.High_fn2
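
A minimal variation on the call above: supplying the functions as a named list gives more readable suffixes (Price_sum, Price_min) than the default _fn1/_fn2.

# Named list of functions -> named suffixes in the output columns
financials %>% select(-Symbol) %>% group_by(Sector) %>% summarize_all(list(sum = sum, min = min))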

Aggregation using data.table

Converting data frame to data table

Aggregation using data.table goes a step further than dplyr in terms of brevity: we write the aggregation code as if we were subsetting the data frame. Before we progress, let's convert our data frame into a data table.

financials <- setDT(financials)
class(financials)
## [1] "data.table" "data.frame"

Aggregating single variable using data table

Let's look at an example where the scenario is to find out the total price of all shares by all sectors.

financials[,sum(Price), by=Sector]
##                         Sector       V1
## 1: Industrials 7831.47
## 2: Health Care 8083.46
## 3: Information Technology 8347.00
## 4: Consumer Discretionary 10418.90
## 5: Utilities 1545.45
## 6: Financials 6055.81
## 7: Materials 2559.67
## 8: Real Estate 2927.52
## 9: Consumer Staples 2711.98
## 10: Energy 1852.40
## 11: Telecommunication Services 100.81

Assigning a custom name to an aggregated variable in data table

The result above shows the aggregated value by the selected variable. However, the aggregated variable's name is not friendly. Let's give it a proper name.

financials[,list(total.price=sum(Price)), by=Sector]

Or

financials[,.(total.price=sum(Price)), by=Sector]
##                         Sector total.price
## 1: Industrials 7831.47
## 2: Health Care 8083.46
## 3: Information Technology 8347.00
## 4: Consumer Discretionary 10418.90
## 5: Utilities 1545.45
## 6: Financials 6055.81
## 7: Materials 2559.67
## 8: Real Estate 2927.52
## 9: Consumer Staples 2711.98
## 10: Energy 1852.40
## 11: Telecommunication Services 100.81

Aggregating multiple variables using multiple aggregation functions in data table

What we have seen until now is one custom-named variable aggregated by another variable. Let's take it further and see if we can apply more than one aggregation function over more than one variable. The scenario is to find the total price and minimum 52-week low price of all shares by sector.


financials[,.(total.price=sum(Price), min.52.week.low=min(X52.Week.Low)), by=Sector]
Or

financials[,list(total.price=sum(Price), min.52.week.low=min(X52.Week.Low)), by=Sector]
##                         Sector total.price min.52.week.low
## 1: Industrials 7831.47 30.590
## 2: Health Care 8083.46 29.930
## 3: Information Technology 8347.00 15.650
## 4: Consumer Discretionary 10418.90 13.480
## 5: Utilities 1545.45 12.050
## 6: Financials 6055.81 16.530
## 7: Materials 2559.67 20.250
## 8: Real Estate 2927.52 21.530
## 9: Consumer Staples 2711.98 21.175
## 10: Energy 1852.40 6.590
## 11: Telecommunication Services 100.81 27.610

Aggregating variables as well as filtering observations in data table

What if we want to aggregate as well as filter the observations by supplying specific conditions? The data table offers a solution to this too. Here is an example.

financials[Sector == "Industrials",list(total.price=sum(Price), min.52.week.low=min(X52.Week.Low)), by=Sector]
##         Sector total.price min.52.week.low
## 1: Industrials 7831.47 30.59

Counting the number of observations within a group in data table

Although we can use different aggregation functions, what if we want to count the observations within a group? We can do this using the special symbol .N.

financials[,.N, by=Sector]
##                         Sector  N
## 1: Industrials 67
## 2: Health Care 61
## 3: Information Technology 70
## 4: Consumer Discretionary 84
## 5: Utilities 28
## 6: Financials 68
## 7: Materials 25
## 8: Real Estate 33
## 9: Consumer Staples 34
## 10: Energy 32
## 11: Telecommunication Services 3
If the requirement is to group observations using multiple variables, then the following syntax can be used for the by argument, as sketched below.
by=.(variable to group 1, variable to group 2)
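
A minimal sketch of that syntax, again assuming a hypothetical second grouping column (say, Exchange):

# One group per Sector/Exchange combination (Exchange is hypothetical)
financials[,.(total.price=sum(Price)), by=.(Sector, Exchange)]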

Aggregation using sqldf

I love sqldf; it makes things easy, but if you are not familiar with SQL it may be a little challenging. Aggregation in sqldf requires us to understand SQL, so I will list just one example here. SQL is an easy language to learn, and I will soon be posting articles on it.

Let's define our scenario: find the total price and minimum 52-week low price of all shares by sector, keeping only sectors whose aggregated total price is greater than 5000. This scenario uses more than one variable, more than one aggregation function, and a filter criterion over the grouped observations.

sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector having sum(Price) > 5000')
##                   Sector sum(Price) min("X52.Week.Low")
## 1 Consumer Discretionary 10418.90 13.48
## 2 Financials 6055.81 16.53
## 3 Health Care 8083.46 29.93
## 4 Industrials 7831.47 30.59
## 5 Information Technology 8347.00 15.65
Let's give the aggregated columns custom names.

sqldf('select Sector, sum(Price) as "total.price", min("X52.Week.Low") as "min.52.week.low" from financials group by Sector having sum(Price) > 5000')
##                   Sector total.price min.52.week.low
## 1 Consumer Discretionary 10418.90 13.48
## 2 Financials 6055.81 16.53
## 3 Health Care 8083.46 29.93
## 4 Industrials 7831.47 30.59
## 5 Information Technology 8347.00 15.65
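
For comparison, a minimal dplyr sketch of the same HAVING-style logic: filter on the aggregated value after summarising.

# Sectors with an aggregated total price above 5000
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price), min.52.week.low = min(X52.Week.Low)) %>% filter(total.price > 5000)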

Sorting in R

Why sort data?

The human brain looks for patterns, and identifying patterns is crucial when performing analysis; sorted data helps a lot in finding them. Sorting helps in understanding and profiling the data: it puts observations into an order that can surface information useful to your analysis.

Sorting data using base R

Sorting data using order function

In base R, the order function helps sort the data, and its usage is straightforward. Note that because financials was converted to a data.table above, we can refer to the column Symbol directly inside the brackets; with a plain data frame we would write financials[order(financials$Symbol), c("Symbol","Price")]. Here is an example.

financials[order(Symbol),c("Symbol","Price")]
##      Symbol  Price
## 1: A 65.05
## 2: AAL 48.60
## 3: AAP 109.63
## 4: AAPL 155.15
## 5: ABBV 108.48
## ---
## 501: XYL 70.24
## 502: YUM 76.30
## 503: ZBH 115.53
## 504: ZION 50.71
## 505: ZTS 71.51

Sorting data in descending order using order function

We are subsetting the variables Symbol and Price from financials and sorting the output by the values of Symbol. If we want to sort the data in descending order, the desc function (from dplyr) comes to our rescue; a base R alternative is sketched after the output below.

financials[order(desc(Symbol)),c("Symbol","Price")]
##      Symbol  Price
## 1: ZTS 71.51
## 2: ZION 50.71
## 3: ZBH 115.53
## 4: YUM 76.30
## 5: XYL 70.24
## ---
## 501: ABBV 108.48
## 502: AAPL 155.15
## 503: AAP 109.63
## 504: AAL 48.60
## 505: A 65.05
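
A minimal base R alternative that does not rely on dplyr's desc: order() sorts in descending order via decreasing = TRUE, with the column referenced through the $ operator as you would on a plain data frame.

# Descending sort without desc()
financials[order(financials$Symbol, decreasing = TRUE), c("Symbol","Price")]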

Sorting data using dplyr

Sorting data using the arrange function

The dplyr package has a function arrange which helps sort data. Here is an example.

financials %>% group_by(Sector) %>% summarize(sum(Price)) %>% arrange(Sector) %>% print.data.frame()
## `summarise()` ungrouping output (override with `.groups` argument)
##                        Sector sum(Price)
## 1 Consumer Discretionary 10418.90
## 2 Consumer Staples 2711.98
## 3 Energy 1852.40
## 4 Financials 6055.81
## 5 Health Care 8083.46
## 6 Industrials 7831.47
## 7 Information Technology 8347.00
## 8 Materials 2559.67
## 9 Real Estate 2927.52
## 10 Telecommunication Services 100.81
## 11 Utilities 1545.45

Sorting data in descending order using arrange function

Let's reorder the data in descending order.

financials %>% group_by(Sector) %>% summarize(sum(Price)) %>% arrange(desc(Sector)) %>% print.data.frame()
## `summarise()` ungrouping output (override with `.groups` argument)
##                        Sector sum(Price)
## 1 Utilities 1545.45
## 2 Telecommunication Services 100.81
## 3 Real Estate 2927.52
## 4 Materials 2559.67
## 5 Information Technology 8347.00
## 6 Industrials 7831.47
## 7 Health Care 8083.46
## 8 Financials 6055.81
## 9 Energy 1852.40
## 10 Consumer Staples 2711.98
## 11 Consumer Discretionary 10418.90
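
A minimal extension of the call above: naming the aggregated column and arranging by it, rather than by the grouping variable, which is often what an analysis needs.

# Sectors ordered from highest to lowest total price
financials %>% group_by(Sector) %>% summarize(total.price = sum(Price)) %>% arrange(desc(total.price)) %>% print.data.frame()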

Sorting data using data table

Sorting data using order function

The data.table package has two functions, order and setorder, to help sort the data. Here is an example; first convert the data frame to a data table (if it isn't one already) and then sort.
financials <- setDT(financials)
financials[order(Sector),sum(Price), by=Sector]
##                         Sector       V1
## 1: Consumer Discretionary 10418.90
## 2: Consumer Staples 2711.98
## 3: Energy 1852.40
## 4: Financials 6055.81
## 5: Health Care 8083.46
## 6: Industrials 7831.47
## 7: Information Technology 8347.00
## 8: Materials 2559.67
## 9: Real Estate 2927.52
## 10: Telecommunication Services 100.81
## 11: Utilities 1545.45

Sorting data in descending order using order function

Let's reverse the order now using desc function.
financials[order(desc(Sector)),sum(Price), by=Sector]
##                         Sector       V1
## 1: Utilities 1545.45
## 2: Telecommunication Services 100.81
## 3: Real Estate 2927.52
## 4: Materials 2559.67
## 5: Information Technology 8347.00
## 6: Industrials 7831.47
## 7: Health Care 8083.46
## 8: Financials 6055.81
## 9: Energy 1852.40
## 10: Consumer Staples 2711.98
## 11: Consumer Discretionary 10418.90

Sorting data using setorder function

Now let's look at the sorting using setorder function.
order.data <- setorder(financials[,c("Symbol","Price")],Symbol)
order.data
##      Symbol  Price
## 1: A 65.05
## 2: AAL 48.60
## 3: AAP 109.63
## 4: AAPL 155.15
## 5: ABBV 108.48
## ---
## 501: XYL 70.24
## 502: YUM 76.30
## 503: ZBH 115.53
## 504: ZION 50.71
## 505: ZTS 71.51

Sorting data in descending order using setorder function

The same setorder function can be used to reverse the order of the data by simply putting a minus sign in front of the column name. Here is an example.
order.data <- setorder(financials[,c("Symbol","Price")],-Symbol)
order.data
##      Symbol  Price
## 1: ZTS 71.51
## 2: ZION 50.71
## 3: ZBH 115.53
## 4: YUM 76.30
## 5: XYL 70.24
## ---
## 501: ABBV 108.48
## 502: AAPL 155.15
## 503: AAP 109.63
## 504: AAL 48.60
## 505: A 65.05
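
Note that setorder() sorts by reference; a minimal sketch showing it applied to the full table in place, without creating a copy:

setorder(financials, Symbol)    # ascending by Symbol, modifies financials in place
setorder(financials, -Symbol)   # descending by Symbol, also in place
head(financials[, c("Symbol","Price")])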

Sorting data using sqldf

Data sorting using sqldf is simply achieved using the "order by" clause in the SQL command. 
financials <- read.csv("constituents-financials_csv.csv")
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector order by Sector')
##                        Sector sum(Price) min("X52.Week.Low")
## 1 Consumer Discretionary 10418.90 13.480
## 2 Consumer Staples 2711.98 21.175
## 3 Energy 1852.40 6.590
## 4 Financials 6055.81 16.530
## 5 Health Care 8083.46 29.930
## 6 Industrials 7831.47 30.590
## 7 Information Technology 8347.00 15.650
## 8 Materials 2559.67 20.250
## 9 Real Estate 2927.52 21.530
## 10 Telecommunication Services 100.81 27.610
## 11 Utilities 1545.45 12.050
The order can be reversed using the desc keyword. Here is an example.
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector order by Sector desc')
##                        Sector sum(Price) min("X52.Week.Low")
## 1 Utilities 1545.45 12.050
## 2 Telecommunication Services 100.81 27.610
## 3 Real Estate 2927.52 21.530
## 4 Materials 2559.67 20.250
## 5 Information Technology 8347.00 15.650
## 6 Industrials 7831.47 30.590
## 7 Health Care 8083.46 29.930
## 8 Financials 6055.81 16.530
## 9 Energy 1852.40 6.590
## 10 Consumer Staples 2711.98 21.175
## 11 Consumer Discretionary 10418.90 13.480
In closing, you may have noticed that we have not discussed NA values at all. Almost all aggregation functions accept an na.rm argument; a small illustration follows. Please try it and comment if you find that any particular function does not support it.
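
A minimal illustration of the na.rm argument on a toy vector: NA values are dropped from the aggregation instead of propagating.

x <- c(1, 2, NA, 4)
sum(x)               # returns NA
sum(x, na.rm = TRUE) # returns 7
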
Well done, you've made it to the end. Thank you for reading this article; I hope you've liked it. Please share it with your friends, and whether you agree or disagree with any point above, please leave a comment in the comment box below.

To leave a comment for the author, please follow the link and comment on their blog: dataENQ.


The post Data manipulation in r using data frames - an extensive article of basics part2 - aggregation and sorting first appeared on R-bloggers.


An Interesting Aspect of the Omitted Variable Bias

Posted: 31 Aug 2020 03:00 AM PDT

[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model:

[ y = \beta x + \gamma z + \varepsilon_y ]
[ x = \alpha z + \varepsilon_x ]
Assume we want to estimate the causal effect beta of x on y. However, we have an unobserved confounder z that affects both x and y. If we don't add the confounder z as a control variable in the regression of y on x, the OLS estimator of beta will be biased. That is the so-called omitted variable bias.

Let's simulate a data set and illustrate the omitted variable bias:

n = 10000
alpha = beta = gamma = 1

z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)

x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y

# Estimate short regression with z omitted
coef(lm(y~x))[2]
##        x
## 1.486573

While the true causal effect beta is equal to 1, our OLS estimator where we omit z is around 1.5. This means it has a positive bias of roughly 0.5.

Before we continue, let's have a quiz (click here if the Google form quiz is not correctly shown.):

Let's see what happens if we increase the impact of the confounder z on x, say to alpha=1000.

alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]
##        x
## 1.000983

The bias is almost gone!

This result surprised me at first. I previously had the following intuition: An omitted variable is only a problem if it affects both y and x. Thus the omitted variable bias probably becomes worse if the confounder z affects y or x more strongly. While this intuition is correct for small alpha, it is wrong once alpha is sufficiently large.

For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:

[asy. \; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}]
(From now on, I use Mathjax. If you read on a blog aggregator where Mathjax is not well rendered click here.)

Let's plot the bias for different values of $\alpha$:

Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z / (alpha^2*Var.z + Var.eps.x)
plot(alpha,asym.bias)

For small $\alpha$ the bias of $\hat \beta$ first quickly increases in $\alpha$. But it decreases in $\alpha$ once $\alpha$ is larger than 1. Indeed the bias then slowly converges back to 0.

Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder mainly affects $y$ through $x$. The larger is $\alpha$, the relatively less important is therefore the direct effect of $z$ on $y$. The direct effect from $z$ on $y$ will thus bias the OLS estimator $\hat \beta$ of the short regression less and less.
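
A minimal numeric check of this intuition (an addition to the post's code, reusing the variables simulated above): re-running the short regression for several values of $\alpha$ shows the estimated bias rising and then shrinking again.

# z, eps.x, eps.y, beta and gamma come from the simulation above
for (a in c(0.5, 1, 2, 5, 10)) {
  x = a*z + eps.x
  y = beta*x + gamma*z + eps.y
  cat("alpha =", a, " estimated bias =", coef(lm(y~x))[2] - beta, "\n")
}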

Typical presentation of the omitted variable bias formula

Note that the omitted variable bias formula is usually presented as follows:

[
Bias(\hat \beta) = \gamma \hat \delta
]
where $\hat \delta$ is the OLS estimate of the linear regression
[
z = const + \delta x + u
]

(This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows us to compute the bias itself, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as
[
x=\tilde{const} + \frac 1 \delta z + \tilde u
]
suggesting $\alpha \approx \frac 1 \delta$ and thus an approximate bias of $\frac \gamma \alpha$. (This argumentation is just suggestive but not fully correct. The effects of swapping the y and x in a simple linear regression can be a bit surprising, see my previous post.)

If we look at our previous formula for the asymptotic bias and consider in the limit of no exogenous variation of $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get
[\lim_{Var(\varepsilon_x)\rightarrow 0 } asy. \; Bias(\hat \beta) = \frac \gamma\alpha]

However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


The post An Interesting Aspect of the Omitted Variable Bias first appeared on R-bloggers.


Course Launch: High-Performance Time Series Forecasting in 7 Days!

Posted: 30 Aug 2020 09:04 PM PDT

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

High-Performance Time Series Forecasting Course is an amazing course designed to teach Data Scientists and Business Analysts the business forecasting tools and techniques that win time series competitions and deliver your organization results! We've combined an innovative program with a clear-cut path to forecasting. You'll undergo a complete transformation learning the most in-demand skills that organizations need right now. Time to accelerate your career!


Create a free account to get the launch discount in 7 Days!

Start Here: Before you go any further, create a free account at Business Science University – our online educational platform. We will send you a discount link in 7 days (at course launch).

Sign Up Here


High-Performance Time Series
Improve your forecasting abilities and accelerate your career

Watch our 2-minute course video that describes the innovative, 3-part roadmap for learning high-performance time series forecasting.

Watch Your Journey (2-Minute Video) – A Streamlined Path to High-Performance Forecasting

Learn from Time Series Competitions
Analyze the winning submissions



High-Performance Time Series Forecasting is a state-of-the-art course designed to teach data scientists and business analysts how to apply the latest forecasting techniques to their businesses by learning from the strategies that won 4 Key Time Series Competitions.

A Project-Based Course
Learn By Completing Real Business Projects with Real Business Data



Project-Based Learning – Learn by Solving 2 Real-World Business Forecasting Problems

This is a project-based course, and here's the scenario. You are forecasting data related to key performance indicators for an e-learning company. Your primary goals are to improve sales demand forecasting by 10% or more and to set up a system for forecasting email subscriptions.

To complete these projects, you will need to learn the state-of-the-art time series forecasting techniques along with time-series feature engineering and working with external data sources. You'll learn 3 key software libraries along the way!

The Curriculum
Learn an End-To-End Forecasting Process



Learn the most advanced technology: Modeltime, Timetk, and GluonTS

To complete the projects, you will need to learn:

  • Feature Engineering, Visualization, Data Wrangling, and Transformations: timetk (R)
  • Forecasting with Machine Learning: modeltime (R)
  • Forecasting with Deep Learning: gluonts (Python)

Your 3-Part Journey to High-Performance Forecasting
The innovative system is divided into three parts



The 3-Part Learning Path for High-Performance Forecasting

Part 1 – Feature Engineering

You begin by learning the core data science skills needed to process, transform, visualize, and feature engineer time series data in preparation for machine learning.

  • Competition Overview – 4 competitions and strategies that won – 30 min
  • TS Jumpstart – 1+ Hours
  • Visualization – 1.5 hours
  • Data Wrangling – 1.5 Hours
  • Transformations – 1.5 Hours
  • Challenge 1 – Investigating Revenue and Google Analytics Data – 30 min
  • Feature Eng – Intro & Advanced – 3 hours
  • Challenge 2 – Feature Engineering Revenue, Events, and Lagged Regressors – 1.5 Hours

Part 2 – Forecasting with Machine Learning

You learn how to use many different forecasting and machine learning algorithms for time series.

  • ARIMA – 1.5 hours
  • Prophet – 45 Min
  • ETS, TBATS, & Seasonal Decomp – 1 hour
  • Challenge 3 – Forecasting Revenue with ARIMA, Prophet, and ETS/TBATS – 1.5 hours
  • Machine Learning – GLMNet, KNN, RF, XGBoost, Rule Based – 2 Hours
  • Boosted Algorithms – Prophet & ARIMA – 30 min
  • Hyper Parameter Tuning & Cross Validation – 1+ Hours
  • Scalable Modeling & Time Series Groups – 1+ Hours
  • Ensemble Approaches – Averaging & Stacking – 1+ Hours
  • Challenge 4 – Forecasting Revenue with Ensemble Learning – 1.5 hours

Part 3 – Forecasting with Deep Learning

You learn a state-of-the-art Python library called GluonTS for time series deep learning.

You learn how to perform deep learning for time series:

  • DeepAR

  • DeepVAR

  • NBeats

  • and more!

Next Steps – Create a free account to get the launch discount in 7 Days!

Career Acceleration Begins in 7 Days: I'm super excited to help you apply the most advanced business forecasting techniques in your organization. Create a free account at Business Science University – our online educational platform.

Sign Up Here!

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


The post Course Launch: High-Performance Time Series Forecasting in 7 Days! first appeared on R-bloggers.


Crowd Counting Consortium Crowd Data and Shiny Dashboard

Posted: 30 Aug 2020 05:00 PM PDT

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, part of the Carr Center for Human Rights Policy at the Harvard Kennedy School. He has used R to work at the intersection of social science and data science for nearly two decades.

Where are people in the United States protesting in 2020, and what are they protesting about? How large have those crowds been? How many protesters have been arrested or injured? And how does this year's groundswell of protest activity compare to the past several years, which had already produced some of the largest single-day gatherings in U.S. history?

These are the kinds of questions the Crowd Counting Consortium (CCC) Crowd Dataset helps answer. Begun after the 2017 Women's March by Professors Erica Chenoweth (Harvard University) and Jeremy Pressman (University of Connecticut), the CCC's database on political crowds has grown into one of the most comprehensive open sources of near-real time information on protests, marches, demonstrations, strikes, and similar political gatherings in the contemporary United States. At the start of August 2020, the database included nearly 50,000 events. These data have been used in numerous academic and press pieces, including a recent New York Times story on the historic scale of this year's Black Lives Matter uprising.

As rich as the data are, they have been a challenge to use. The CCC shares its data on political crowds via a stack of monthly Google Sheets with formats that can vary from sheet to sheet in small but confounding ways. Column names don't always match, and certain columns have been added or dropped over time. Some sheets include separate tabs for specific macro-events or campaigns (e.g., a coordinated climate strike), while others group everything in a single sheet. And, of course, typos happen.

To make this tremendous resource more accessible to researchers, activists, journalists, and data scientists, the Nonviolent Action Lab at Harvard's Carr Center for Human Rights Policy—a new venture started by CCC co-founder Chenoweth—has created a GitHub repository to host a compiled, cleaned, and augmented version of the CCC's database.

In addition to all the information contained in the monthly sheets, the compiled version adds two big feature sets.

  • Geolocation. After compiling the data, the Lab uses the googleway package to run the cities and towns in which the events took place through Google Maps's Geolocation API and extracts geocoordinates from the results, along with clean versions of the locality, county, and state names associated with them.

  • Issue Tags. In the original data, protesters' claims—in other words, what the protest is about—are recorded by human coders in a loosely structured way. To allow researchers to group or filter the data by theme, the Nonviolent Action Lab maintains a dictionary of a few dozen major political issues in the U.S. (e.g., "guns", "migration", "racism") and keyword– and keyphrase-based regular expressions associated with them. By mapping this dictionary over the claim strings, we generate a set of binary issue tags that gets added to the compiled database as a vector of semicolon-separated strings; a toy sketch of the idea follows this list.
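
Here is that toy sketch; the dictionary entries, claim strings, and code are hypothetical illustrations, not the Lab's actual dictionary or pipeline:

library(stringr)

# A tiny, made-up issue dictionary: issue name -> regular expression
issue_dict <- c(
  guns      = "gun|firearm|second amendment",
  migration = "immigra|border|refugee",
  racism    = "racis|black lives matter"
)

# A few made-up claim strings of the kind coders record
claims <- c("against police racism", "for gun control", "solidarity with refugees")

# Match every issue pattern against each claim and collapse the hits
sapply(claims, function(claim) {
  hits <- names(issue_dict)[str_detect(tolower(claim), issue_dict)]
  paste(hits, collapse = "; ")   # semicolon-separated tags, as in the compiled data
})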

To make the CCC data more accessible to a wider audience, the Lab has also built a Shiny dashboard that lets users filter events in various ways and then map and plot the results. Users can filter by date range, year, or campaign as well as issue and political valence (pro-Trump, anti-Trump, or neither).

The dashboard has two main tabs. The first uses the leaflet package to map the events with markers that contain summary details and links to the source(s) the human coders used to research them.

The second tab uses plotly and the streamgraph html widget package to render interactive plots of trends over time in the occurrence of the selected events, the number of participants in them, and the political issues associated with them.

The point of the Nonviolent Action Lab's repository and dashboard is to make the Crowd Counting Consortium's data more accessible and more useful to as wide an audience as possible. If you use either of these resources and find bugs or errors or have suggestions on how to improve them, please let us know.

  • To provide feedback on the compiled data set or the Shiny dashboard, please open an issue on the GitHub repository, here.

  • If you think you see an error in the CCC's data or know about an event that it doesn't cover, please use this form to submit a correction or record via the CCC website.

To leave a comment for the author, please follow the link and comment on their blog: R Views.


The post Crowd Counting Consortium Crowd Data and Shiny Dashboard first appeared on R-bloggers.


Task 1 – Retail Strategy and Analytics

Posted: 30 Aug 2020 05:00 PM PDT

[This article was first published on R Blogs, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Hello everyone! It's been a while since I posted something new.
I completed this Virtual Experience Program a month back and thought I could write about the solutions to all three tasks for anyone who might be looking for them.

This post is specifically about Task 1 – Data Preparation and Customer Analytics

You are initially provided with solution modules, but some parts of the code are still a bit tough to crack. I did it my way, but there are surely other valid solutions to the same problems.

So, let's dive straight into the solutions.

Load the required libraries

library(data.table)
library(ggplot2)
library(ggmosaic)
library(readr)

Examine transaction data

Creating local dataset

We inspect whether the columns are in their expected formats, e.g. whether the date column is actually stored as a date.

We also look at the structure of both datasets.

purchase_beahviour <- as.data.table(read.csv("QVI_purchase_behaviour.csv"))
transaction_data <- as.data.table(readxl::read_xlsx("QVI_transaction_data.xlsx"))

str(purchase_beahviour)
## Classes 'data.table' and 'data.frame':   72637 obs. of  3 variables:  ##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...  ##  $ LIFESTAGE       : Factor w/ 7 levels "MIDAGE SINGLES/COUPLES",..: 7 7 6 4 1 7 2 7 4 3 ...  ##  $ PREMIUM_CUSTOMER: Factor w/ 3 levels "Budget","Mainstream",..: 3 2 1 2 2 1 3 2 2 2 ...  ##  - attr(*, ".internal.selfref")=
str(transaction_data)
## Classes 'data.table' and 'data.frame':   264836 obs. of  8 variables:  ##  $ DATE          : num  43390 43599 43605 43329 43330 ...  ##  $ STORE_NBR     : num  1 1 1 2 2 4 4 4 5 7 ...  ##  $ LYLTY_CARD_NBR: num  1000 1307 1343 2373 2426 ...  ##  $ TXN_ID        : num  1 348 383 974 1038 ...  ##  $ PROD_NBR      : num  5 66 61 69 108 57 16 24 42 52 ...  ##  $ PROD_NAME     : chr  "Natural Chip        Compny SeaSalt175g" "CCs Nacho Cheese    175g" "Smiths Crinkle Cut  Chips Chicken 170g" "Smiths Chip Thinly  S/Cream&Onion 175g" ...  ##  $ PROD_QTY      : num  2 3 2 5 3 1 1 1 1 2 ...  ##  $ TOT_SALES     : num  6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...  ##  - attr(*, ".internal.selfref")=

EDA (done already using str())

We saw that the DATE column is stored as a number, which is wrong, so we convert it to date format as shown below.

transaction_data$DATE <- as.Date(transaction_data$DATE,origin = "1899-12-30")

Now the dates are in the right format.

Examine PROD_NAME

Generating a summary of the PROD_NAME column

#head(transaction_data$PROD_NAME)
transaction_data[, .N, PROD_NAME]
##                                     PROD_NAME    N  ##   1:   Natural Chip        Compny SeaSalt175g 1468  ##   2:                 CCs Nacho Cheese    175g 1498  ##   3:   Smiths Crinkle Cut  Chips Chicken 170g 1484  ##   4:   Smiths Chip Thinly  S/Cream&Onion 175g 1473  ##   5: Kettle Tortilla ChpsHny&Jlpno Chili 150g 3296  ##  ---                                                ## 110:    Red Rock Deli Chikn&Garlic Aioli 150g 1434  ## 111:      RRD SR Slow Rst     Pork Belly 150g 1526  ## 112:                 RRD Pc Sea Salt     165g 1431  ## 113:       Smith Crinkle Cut   Bolognese 150g 1451  ## 114:                 Doritos Salsa Mild  300g 1472

Warning: most of the products in PROD_NAME are chips, but other product types may also be present.

Looks like we are definitely looking at potato chips, but how can we check that these are all chips? We can do some basic text analysis by summarising the individual words in the product names.

#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips
productWords <- data.table(unlist(strsplit(unique(transaction_data[, PROD_NAME]), "  ")))
setnames(productWords, 'words')

As we are only interested in words that tell us whether the product is chips or not, let's remove all words with digits and special characters such as '&' from our set of product words. The module suggests grepl(); below I use stringr's str_replace_all() to do the cleaning.

library(stringr)
library(stringi)

#### Removing special characters
productWords$words <- str_replace_all(productWords$words,"[[:punct:]]"," ")

#### Removing digits
productWords$words <- str_replace_all(productWords$words,"[0-9]"," ")
productWords$words <- str_replace_all(productWords$words,"[gG]"," ")

#### Let's look at the most common words by counting the number of times a word appears and
#### sorting them by this frequency in order of highest to lowest frequency
wordsSep <- strsplit(productWords$words," ")
words.freq <- table(unlist(wordsSep))

words.freq <- as.data.frame(words.freq)
words.freq <- words.freq[order(words.freq$Freq, decreasing = T),]

words.freq
##                Var1 Freq  ## 1                    732  ## 37            Chips   21  ## 164          Smiths   16  ## 55          Crinkle   14  ## 62              Cut   14  ## 92           Kettle   13  ## 26           Cheese   12  ## 152            Salt   12  ## 122             Ori   11  ## 85             inal   10  ## 34             Chip    9  ## 68          Doritos    9  ## 151           Salsa    9  ## 29          Chicken    8  ## 52             Corn    8  ## 54            Cream    8  ## 93              les    8  ## 135            Prin    8  ## 147             RRD    8  ## 32           Chilli    7  ## 204            Vine    7  ## 209              WW    7  ## 4                ar    6  ## 156             Sea    6  ## 168            Sour    6  ## 57           Crisps    5  ## 192          Thinly    5  ## 193           Thins    5  ## 38           Chives    4  ## 65             Deli    4  ## 86        Infuzions    4  ## 95             Lime    4  ## 110         Natural    4  ## 139             Red    4  ## 146            Rock    4  ## 185         Supreme    4  ## 186           Sweet    4  ## 13              BBQ    3  ## 22              CCs    3  ## 49             Cobs    3  ## 66              Dip    3  ## 69               El    3  ## 94               Li    3  ## 103            Mild    3  ## 116             Old    3  ## 118           Onion    3  ## 124            Paso    3  ## 129            Popd    3  ## 159      Sensations    3  ## 171             Soy    3  ## 188             Swt    3  ## 196          Tomato    3  ## 197        Tortilla    3  ## 198        Tostitos    3  ## 200        Twisties    3  ## 208      Woolworths    3  ## 3               And    2  ## 6             arlic    2  ## 19              Bur    2  ## 27          Cheetos    2  ## 28         Cheezels    2  ## 35           ChipCo    2  ## 46              Chs    2  ## 70               er    2  ## 74           French    2  ## 78            Honey    2  ## 84             htly    2  ## 100          Medium    2  ## 109           Nacho    2  ## 132          Potato    2  ## 137               r    2  ## 138            rain    2  ## 142             Rin    2  ## 143              rn    2  ## 149               s    2  ## 150               S    2  ## 154          Salted    2  ## 163           Smith    2  ## 169       SourCream    2  ## 178              SR    2  ## 189             Tan    2  ## 191            Thai    2  ## 201        Tyrrells    2  ## 205           Waves    2  ## 210               y    2  ## 2             Aioli    1  ## 5             arden    1  ## 7                Ba    1  ## 8             Bacon    1  ## 9             Balls    1  ## 10         Barbecue    1  ## 11         Barbeque    1  ## 12            Basil    1  ## 14            Belly    1  ## 15               Bi    1  ## 16             Bolo    1  ## 17              Box    1  ## 18           Btroot    1  ## 20        Camembert    1  ## 21           camole    1  ## 23            Chckn    1  ## 24             Ched    1  ## 25           Cheddr    1  ## 30            Chikn    1  ## 31            Chili    1  ## 33      Chimuchurri    1  ## 36         Chipotle    1  ## 39             Chli    1  ## 40            Chlli    1  ## 41            Chnky    1  ## 42              Chp    1  ## 43       ChpsBtroot    1  ## 44         ChpsFeta    1  ## 45          ChpsHny    1  ## 47           Chutny    1  ## 48               Co    1  ## 50          Coconut    1  ## 51           Compny    1  ## 53         Crackers    1  ## 56            Crips    1  ## 58              Crm    1  ## 59            
  Crn    1  ## 60         Crnchers    1  ## 61           Crnkle    1  ## 63          CutSalt    1  ## 64                D    1  ## 67           Dorito    1  ## 71               Fi    1  ## 72          Flavour    1  ## 73             Frch    1  ## 75     FriedChicken    1  ## 76            Fries    1  ## 77            Herbs    1  ## 79             Hony    1  ## 80              Hot    1  ## 81              Hrb    1  ## 82               ht    1  ## 83               Ht    1  ## 87           Infzns    1  ## 88              inl    1  ## 89         Jalapeno    1  ## 90              Jam    1  ## 91            Jlpno    1  ## 96              Mac    1  ## 97              Man    1  ## 98            Maple    1  ## 99              Med    1  ## 101         Mexican    1  ## 102        Mexicana    1  ## 104      Mozzarella    1  ## 105           Mstrd    1  ## 106         Mystery    1  ## 107         Mzzrlla    1  ## 108               N    1  ## 111             NCC    1  ## 112            nese    1  ## 113              nl    1  ## 114               o    1  ## 115              Of    1  ## 117            Onin    1  ## 119        OnionDip    1  ## 120    OnionStacked    1  ## 121              Or    1  ## 123        Papadums    1  ## 125              Pc    1  ## 126          Pepper    1  ## 127           Pesto    1  ## 128            Plus    1  ## 130            Pork    1  ## 131             Pot    1  ## 133       PotatoMix    1  ## 134           Prawn    1  ## 136           Puffs    1  ## 140             Rib    1  ## 141         Ricotta    1  ## 144          rnWves    1  ## 145           Roast    1  ## 148             Rst    1  ## 153           saltd    1  ## 155           Sauce    1  ## 157         SeaSalt    1  ## 158 Seasonedchicken    1  ## 160         Siracha    1  ## 161            Slow    1  ## 162             Slt    1  ## 165          Smoked    1  ## 166             Sna    1  ## 167           Snbts    1  ## 170        Southern    1  ## 172              Sp    1  ## 173            Spce    1  ## 174            Spcy    1  ## 175           Spicy    1  ## 176          Splash    1  ## 177              Sr    1  ## 179         Stacked    1  ## 180           Steak    1  ## 181           Sthrn    1  ## 182           Strws    1  ## 183           Style    1  ## 184        Sunbites    1  ## 187      SweetChili    1  ## 190           Tasty    1  ## 194           Tmato    1  ## 195             Tom    1  ## 199         Truffle    1  ## 202              Ve    1  ## 203             Vin    1  ## 206             Whl    1  ## 207            Whle    1

We see that whitespace appears most often (an artifact of the splitting and cleaning), and the most frequent real word is Chips.

Remove salsa products

There are salsa products in the dataset, but we are only interested in the chips
category, so let's remove them.

transaction_data[, SALSA := grepl("salsa", tolower(PROD_NAME))]
transaction_data <- transaction_data[SALSA == FALSE, ][, SALSA := NULL]

Next, we can use summary() to check summary statistics such as mean, min and max
values for each feature to see if there are any obvious outliers in the data and if
there are any nulls in any of the columns (NA's : number of nulls will appear in
the output if there are any nulls).

summary(transaction_data)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID         ##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1    ##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569    ##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135183    ##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135531   Mean   : 135131    ##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203084   3rd Qu.: 202654    ##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841    ##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES        ##  Min.   :  1.00   Length:246742      Min.   :  1.000   Min.   :  1.700    ##  1st Qu.: 26.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.800    ##  Median : 53.00   Mode  :character   Median :  2.000   Median :  7.400    ##  Mean   : 56.35                      Mean   :  1.908   Mean   :  7.321    ##  3rd Qu.: 87.00                      3rd Qu.:  2.000   3rd Qu.:  8.800    ##  Max.   :114.00                      Max.   :200.000   Max.   :650.000

There are no nulls in the columns, but product quantity appears to have an outlier
which we should investigate further. Let's look at the case where 200 packets of
chips are bought in one transaction.

library(tidyverse)
library(dplyr)

prod_qty_200 <- transaction_data %>% filter(PROD_QTY == 200)

There are two transactions where 200 packets of chips are bought in one transaction
and both of these transactions were by the same customer.

Let's see if the customer has had other transactions

is_it_same_customer <- transaction_data %>% filter(LYLTY_CARD_NBR == 226000) 

It looks like this customer has only had the two transactions over the year and is
not an ordinary retail customer. The customer might be buying chips for commercial purposes instead. We'll remove this loyalty card number from further analysis.

Removing this customer from the list

transaction_data <- transaction_data[!(transaction_data$LYLTY_CARD_NBR == 226000)]

Re-examine transaction data

summary(transaction_data)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID         ##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1    ##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569    ##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135182    ##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135530   Mean   : 135130    ##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203083   3rd Qu.: 202652    ##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841    ##     PROD_NBR       PROD_NAME            PROD_QTY       TOT_SALES       ##  Min.   :  1.00   Length:246740      Min.   :1.000   Min.   : 1.700    ##  1st Qu.: 26.00   Class :character   1st Qu.:2.000   1st Qu.: 5.800    ##  Median : 53.00   Mode  :character   Median :2.000   Median : 7.400    ##  Mean   : 56.35                      Mean   :1.906   Mean   : 7.316    ##  3rd Qu.: 87.00                      3rd Qu.:2.000   3rd Qu.: 8.800    ##  Max.   :114.00                      Max.   :5.000   Max.   :29.500

That's better. Now, let's look at the number of transaction lines over time to see
if there are any obvious data issues such as missing data.

Count the number of transactions by date

countByDate <- count(transaction_data, transaction_data$DATE)
countByDate
##      transaction_data$DATE   n  ##   1:            2018-07-01 663  ##   2:            2018-07-02 650  ##   3:            2018-07-03 674  ##   4:            2018-07-04 669  ##   5:            2018-07-05 660  ##  ---                            ## 360:            2019-06-26 657  ## 361:            2019-06-27 669  ## 362:            2019-06-28 673  ## 363:            2019-06-29 703  ## 364:            2019-06-30 704
nrow(countByDate)
## [1] 364
summary(countByDate)
##  transaction_data$DATE       n          ##  Min.   :2018-07-01    Min.   :607.0    ##  1st Qu.:2018-09-29    1st Qu.:658.0    ##  Median :2018-12-30    Median :674.0    ##  Mean   :2018-12-30    Mean   :677.9    ##  3rd Qu.:2019-03-31    3rd Qu.:694.2    ##  Max.   :2019-06-30    Max.   :865.0

There are only 364 rows, meaning only 364 dates, which indicates a missing date. Let's
create a sequence of dates from 1 Jul 2018 to 30 Jun 2019 and use this to create a
chart of the number of transactions over time to find the missing date.

Create a sequence of dates and join it to the count of transactions by date

transaction_by_day <- transaction_data[order(DATE),]
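
The ordering above does not by itself reveal the missing date, so here is a minimal sketch of the join described earlier (an addition to the module code, using the columns of countByDate):

#### Full date sequence, left-joined to the daily counts
allDates <- data.table(DATE = seq(as.Date("2018-07-01"), as.Date("2019-06-30"), by = "day"))
daily <- data.table(DATE = countByDate[[1]], N = countByDate[[2]])
transactions_by_day <- merge(allDates, daily, by = "DATE", all.x = TRUE)
transactions_by_day[is.na(N)]   # the single date with no transactions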

Setting plot themes to format graphs

theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))

Plot transactions over time

transOverTime <- ggplot(countByDate, aes(x = countByDate$`transaction_data$DATE`, y = countByDate$n)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
  scale_x_date(breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

transOverTime

We can see that there is an increase in purchases in December and a break in late
December. Let's zoom in on this.

Filter to December and look at individual days

filterData <- countByDate[countByDate$`transaction_data$DATE` >= "2018-12-01" & countByDate$`transaction_data$DATE` <= "2018-12-31"]

ggplot(filterData, aes(x = filterData$`transaction_data$DATE`, y = filterData$n)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions in December") +
  scale_x_date(breaks = "1 day") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

We can see that the increase in sales occurs in the lead-up to Christmas and that
there are zero sales on Christmas day itself. This is due to shops being closed on
Christmas day.

Now that we are satisfied that the data no longer has outliers, we can move on to
creating other features such as brand of chips or pack size from PROD_NAME. We will
start with pack size.

Pack size

A new column, PACK_SIZE, is added to the data frame transaction_data.

#### We can work this out by taking the digits that are in PROD_NAME
transaction_data[, PACK_SIZE := parse_number(PROD_NAME)]

#### Let's check if the pack sizes look sensible
df_packSizeVsTransactions <- transaction_data[, .N, PACK_SIZE][order(PACK_SIZE)]

df_packSizeVsTransactions
##     PACK_SIZE     N  ##  1:        70  1507  ##  2:        90  3008  ##  3:       110 22387  ##  4:       125  1454  ##  5:       134 25102  ##  6:       135  3257  ##  7:       150 40203  ##  8:       160  2970  ##  9:       165 15297  ## 10:       170 19983  ## 11:       175 66390  ## 12:       180  1468  ## 13:       190  2995  ## 14:       200  4473  ## 15:       210  6272  ## 16:       220  1564  ## 17:       250  3169  ## 18:       270  6285  ## 19:       330 12540  ## 20:       380  6416

The largest size is 380g and the smallest size is 70g – seems sensible!

Let's plot a histogram of PACK_SIZE since we know that it is a categorical variable and not a continuous variable even though it is numeric.

#ggplot(df_packSizeVsTransactions, aes(x = df_packSizeVsTransactions$PACK_SIZE, y = df_packSizeVsTransactions$N)) +
#  geom_line() +
#  labs(x = "Pack Sizes", y = "Number of transactions", title = "Transactions over time") +
#  scale_x_continuous(breaks = seq(70,390,20)) +
#  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

hist(transaction_data[, PACK_SIZE])

Pack sizes created look reasonable.

Brands

Now to create brands, we can use the first word in PROD_NAME to work out the brand
name…

# Create a column which contains the brand of the product, by extracting it from the product name.
transaction_data$BRAND <- gsub("([A-Za-z]+).*", "\\1", transaction_data$PROD_NAME)

transaction_data[, .N, by = BRAND][order(-N)]
##          BRAND     N  ##  1:     Kettle 41288  ##  2:     Smiths 27390  ##  3:   Pringles 25102  ##  4:    Doritos 22041  ##  5:      Thins 14075  ##  6:        RRD 11894  ##  7:  Infuzions 11057  ##  8:         WW 10320  ##  9:       Cobs  9693  ## 10:   Tostitos  9471  ## 11:   Twisties  9454  ## 12:   Tyrrells  6442  ## 13:      Grain  6272  ## 14:    Natural  6050  ## 15:   Cheezels  4603  ## 16:        CCs  4551  ## 17:        Red  4427  ## 18:     Dorito  3183  ## 19:     Infzns  3144  ## 20:      Smith  2963  ## 21:    Cheetos  2927  ## 22:      Snbts  1576  ## 23:     Burger  1564  ## 24: Woolworths  1516  ## 25:    GrnWves  1468  ## 26:   Sunbites  1432  ## 27:        NCC  1419  ## 28:     French  1418  ##          BRAND     N

Some of the brand names look like they belong to the same brand – such as Red and
RRD, which are both Red Rock Deli chips. Let's combine these together.

Clean brand names

#### Note: the matching is case-sensitive, so both the values being compared and the
#### replacements must match the spellings in BRAND above (e.g. "Red", not "RED").
transaction_data[BRAND == "Red", BRAND := "RRD"]
transaction_data[BRAND == "Snbts", BRAND := "Sunbites"]
transaction_data[BRAND == "Infzns", BRAND := "Infuzions"]
transaction_data[BRAND == "WW", BRAND := "Woolworths"]
transaction_data[BRAND == "Smith", BRAND := "Smiths"]
transaction_data[BRAND == "NCC", BRAND := "Natural"]
transaction_data[BRAND == "Dorito", BRAND := "Doritos"]
transaction_data[BRAND == "Grain", BRAND := "GrnWves"]

### Check again
transaction_data[, .N, by = BRAND][order(BRAND)]
##          BRAND     N
##  1:     Burger  1564
##  2:        CCs  4551
##  3:    Cheetos  2927
##  4:   Cheezels  4603
##  5:       Cobs  9693
##  6:    Doritos 25224
##  7:     French  1418
##  8:    GrnWves  7740
##  9:  Infuzions 14201
## 10:     Kettle 41288
## 11:    Natural  7469
## 12:   Pringles 25102
## 13:        RRD 16321
## 14:     Smiths 30353
## 15:   Sunbites  3008
## 16:      Thins 14075
## 17:   Tostitos  9471
## 18:   Twisties  9454
## 19:   Tyrrells  6442
## 20: Woolworths 11836

Examining customer data

Now that we are happy with the transaction dataset, let's have a look at the
customer dataset.

summary(purchase_beahviour)
##  LYLTY_CARD_NBR                     LIFESTAGE       PREMIUM_CUSTOMER  ##  Min.   :   1000   MIDAGE SINGLES/COUPLES: 7275   Budget    :24470    ##  1st Qu.:  66202   NEW FAMILIES          : 2549   Mainstream:29245    ##  Median : 134040   OLDER FAMILIES        : 9780   Premium   :18922    ##  Mean   : 136186   OLDER SINGLES/COUPLES :14609                       ##  3rd Qu.: 203375   RETIREES              :14805                       ##  Max.   :2373711   YOUNG FAMILIES        : 9178                       ##                    YOUNG SINGLES/COUPLES :14441

Let's have a closer look at the LIFESTAGE and PREMIUM_CUSTOMER columns.

#### Examining the values of lifestage and premium_customer
purchase_beahviour[, .N, by = LIFESTAGE][order(-N)]
##                 LIFESTAGE     N  ## 1:               RETIREES 14805  ## 2:  OLDER SINGLES/COUPLES 14609  ## 3:  YOUNG SINGLES/COUPLES 14441  ## 4:         OLDER FAMILIES  9780  ## 5:         YOUNG FAMILIES  9178  ## 6: MIDAGE SINGLES/COUPLES  7275  ## 7:           NEW FAMILIES  2549
purchase_beahviour[, .N, by = PREMIUM_CUSTOMER][order(-N)]
##    PREMIUM_CUSTOMER     N  ## 1:       Mainstream 29245  ## 2:           Budget 24470  ## 3:          Premium 18922

Merge transaction data to customer data

data <- merge(transaction_data, purchase_beahviour, all.x = TRUE)

As the number of rows in data is the same as that of transaction_data, we can be
sure that no duplicates were created. This is because we created data by setting
all.x = TRUE (in other words, a left join), which means we take all the rows in
transaction_data, find rows with matching values in the shared columns, and then
join the details in those rows to the x, or first mentioned, table.
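
A one-line check of that claim (an addition to the module code):

nrow(data) == nrow(transaction_data)   # should be TRUE for a clean left join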

Let's also check if some customers were not matched on by checking for nulls.

apply(data, 2, function(x) any(is.na(x)))
##   LYLTY_CARD_NBR             DATE        STORE_NBR           TXN_ID   ##            FALSE            FALSE            FALSE            FALSE   ##         PROD_NBR        PROD_NAME         PROD_QTY        TOT_SALES   ##            FALSE            FALSE            FALSE            FALSE   ##        PACK_SIZE            BRAND        LIFESTAGE PREMIUM_CUSTOMER   ##            FALSE            FALSE            FALSE            FALSE

Great, there are no nulls! So all the customers in the transaction data have been
accounted for in the customer dataset.

For Task 2, we write this dataset into a csv file

write.csv(data,"QVI_data.csv")

** Data exploration is now complete! **

Data analysis on customer segments

Now that the data is ready for analysis, we can define some metrics of interest to
the client:
– Who spends the most on chips (total sales), describing customers by lifestage and
how premium their general purchasing behaviour is
– How many customers are in each segment
– How many chips are bought per customer by segment
– What's the average chip price by customer segment
We could also ask our data team for more information. Examples are:
– The customer's total spend over the period and total spend for each transaction
to understand what proportion of their grocery spend is on chips
– Proportion of customers in each customer segment overall to compare against the
mix of customers who purchase chips
Let's start with calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and
plotting the split between these segments.

Total sales by LIFESTAGE and PREMIUM_CUSTOMER

total_sales <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
pf.total_sales <- summarise(total_sales, sales_count = sum(TOT_SALES))

summary(pf.total_sales)

#### Create plot
p <- ggplot(pf.total_sales) +
  geom_mosaic(aes(weight = sales_count, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
  labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of sales") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p + geom_text(data = ggplot_build(p)$data[[1]],
              aes(x = (xmin + xmax)/2, y = (ymin + ymax)/2,
                  label = as.character(paste(round(.wt/sum(.wt), 3)*100, '%'))),
              inherit.aes = F)

Sales are coming mainly from Budget – older families, Mainstream – young
singles/couples, and Mainstream – retirees

Number of customers by LIFESTAGE and PREMIUM_CUSTOMER

Let's see if the higher sales are due to there being more customers who buy chips.

total_sales <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
no_of_customers <- summarise(total_sales, customer_count = length(unique(LYLTY_CARD_NBR)))
summary(no_of_customers)

#### Create plot
p <- ggplot(data = no_of_customers) +
  geom_mosaic(aes(weight = customer_count, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
  labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of customers") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

# Build p first, then overlay the labels so the percentages come from this plot's data
p + geom_text(data = ggplot_build(p)$data[[1]],
              aes(x = (xmin + xmax)/2, y = (ymin + ymax)/2,
                  label = as.character(paste(round(.wt/sum(.wt), 3)*100, '%'))),
              inherit.aes = FALSE)

There are more Mainstream – young singles/couples and Mainstream – retirees who buy chips. This contributes to there being more sales to these customer segments but this is not a major driver for the Budget – Older families segment.

Higher sales may also be driven by more units of chips being bought per customer.

Let's have a look at this next.

Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER

total_sales_1 <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
units <- summarise(total_sales_1, units_count = sum(PROD_QTY)/uniqueN(LYLTY_CARD_NBR))
summary(units)

### Create plot
ggplot(data = units, aes(weight = units_count, x = LIFESTAGE, fill = PREMIUM_CUSTOMER)) +
  geom_bar(position = position_dodge()) +
  labs(x = "Lifestage", y = "Avg units per customer", title = "Units per customer") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

# Rank segments by average units per customer
check <- units[order(units$units_count, decreasing = TRUE), ]

Older families and young families in general buy more chips per customer.

Average price per unit by LIFESTAGE and PREMIUM_CUSTOMER

Let's also investigate the average price per unit of chips bought in each customer segment, as this is another driver of total sales.

total_sales_2 <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
pricePerUnit <- summarise(total_sales_2, price_per_unit = sum(TOT_SALES)/sum(PROD_QTY))

#### Create plot
ggplot(data = pricePerUnit, aes(weight = price_per_unit, x = LIFESTAGE, fill = PREMIUM_CUSTOMER)) +
  geom_bar(position = position_dodge()) +
  labs(x = "Lifestage", y = "Avg price per unit", title = "Price per unit") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Mainstream midage and young singles/couples are willing to pay more per packet of chips than their budget and premium counterparts. This may be because premium shoppers are more likely to buy healthy snacks and, when they do buy chips, it is mainly for entertaining rather than their own consumption. This is also supported by there being fewer premium than mainstream midage and young singles/couples buying chips.

As the difference in average price per unit isn't large, let's check whether it is statistically significant.

We perform an independent t-test between mainstream versus premium and budget midage and young singles/couples.

# If the p-value is above .05, there is no significant difference in price per unit between the groups.
pricePerUnit <- data[, price := TOT_SALES/PROD_QTY]
t.test(data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER == "Mainstream", price],
       data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER != "Mainstream", price],
       alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER == "Mainstream", price] and data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER != "Mainstream", price]
## t = 37.624, df = 54791, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.3187234       Inf
## sample estimates:
## mean of x mean of y 
##  4.039786  3.706491

The t-test gives a p-value < 2.2e-16, i.e. the unit price paid by mainstream young and midage singles/couples is significantly higher than that paid by budget or premium young and midage singles/couples.

Deep dive into specific customer segments for insights

We have found quite a few interesting insights that we can dive deeper into.
We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let's look at Mainstream – young singles/couples. For instance, let's find out if they tend to buy a particular brand of chips.

Deep dive into Mainstream, young singles/couples

#### Deep dive into Mainstream, young singles/couples
segment1 <- data[LIFESTAGE == "YOUNG SINGLES/COUPLES" & PREMIUM_CUSTOMER == "Mainstream", ]
other <- data[!(LIFESTAGE == "YOUNG SINGLES/COUPLES" & PREMIUM_CUSTOMER == "Mainstream"), ]

#### Brand affinity compared to the rest of the population
quantity_segment1 <- segment1[, sum(PROD_QTY)]
quantity_other <- other[, sum(PROD_QTY)]

quantity_segment1_by_brand <- segment1[, .(targetSegment = sum(PROD_QTY)/quantity_segment1), by = BRAND]
quantity_other_by_brand <- other[, .(other = sum(PROD_QTY)/quantity_other), by = BRAND]

brand_proportions <- merge(quantity_segment1_by_brand, quantity_other_by_brand)[, affinityToBrand := targetSegment/other]
brand_proportions[order(-affinityToBrand)]

ggplot(brand_proportions, aes(x = BRAND, y = affinityToBrand)) +
  geom_bar(stat = "identity", fill = "yellow") +
  labs(x = "Brand", y = "Customers Affinity to Brand", title = "Favorite brands of Customers") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

We can see that:

• Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the
rest of the population
• Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest
of the population
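
To make those percentages concrete, here is a small sketch (my own illustration using the brand_proportions table created above): an affinity ratio of 1.23 corresponds to being 23% more likely, and 0.44 to being 56% less likely.

# Express the affinity ratio as a percentage difference vs. the rest of the population
brand_proportions[, pct_diff := round((affinityToBrand - 1) * 100, 1)]
brand_proportions[order(-pct_diff)]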

Let's also find out if our target segment tends to buy larger packs of chips.

Preferred pack size compared to the rest of the population

quantity_segment1_by_pack <- segment1[, .(targetSegment = sum(PROD_QTY)/quantity_segment1), by = PACK_SIZE]
quantity_other_by_pack <- other[, .(other = sum(PROD_QTY)/quantity_other), by = PACK_SIZE]

pack_proportions <- merge(quantity_segment1_by_pack, quantity_other_by_pack)[, affinityToPack := targetSegment/other]
pack_proportions[order(-affinityToPack)]

We can see that the preferred PACK_SIZE is 270g.

data[PACK_SIZE == 270, unique(PROD_NAME)]
## [1] "Twisties Cheese     270g" "Twisties Chicken270g"

Conclusion

Let's recap what we've found!
Sales have mainly come from Budget – older families, Mainstream – young singles/couples, and Mainstream – retirees shoppers. We found that the high spend on chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers. Mainstream midage and young singles/couples are also more likely to pay more per packet of chips, which is indicative of impulse buying behaviour.

We've also found that Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category's performance by off-locating some Tyrrells and smaller packs of chips in discretionary space in areas of the store that young singles and couples frequent more often, to increase visibility and impulse purchases.

Quantium can help the Category Manager with recommendations of where these segments are and further help them with measuring the impact of the changed placement. We'll work on measuring the impact of trials in the next task and putting all these together in the third task.

Outro

Task 1 was basic but crucial. All the basic analysis and visualizations were made here. The general patterns and trends among customers' choices were explored through visualizations.

The solutions to the other tasks will be uploaded soon. Till then, any feedback, queries or recommendations are appreciated on any of my social media handles.

Refer to my GitHub profile.

Stay tuned for more tutorials!
Thank You!

To leave a comment for the author, please follow the link and comment on their blog: R Blogs.

The post Task 1 - Retail Strategy and Analytics first appeared on R-bloggers.

Handling errors using purrr’s possibly() and safely()

Posted: 30 Aug 2020 05:00 PM PDT

[This article was first published on Very statisticious, and kindly contributed to R-bloggers.]

One topic I haven't discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we have unanticipated errors a map() or lapply() loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can feel pretty frustrating, since the whole process has to be restarted.

How to deal with errors? Using functions try() or tryCatch() when building a function is the traditional way to catch and address potential errors. In the past I've struggled to remember how to use these, though, and functions possibly() and safely() from package purrr are convenient alternatives that I find a little easier to use.

In this post I'll show examples on how to use these two functions for handling errors. I'll also demonstrate the use of the related function quietly() to capture other types of output, such as warnings and messages.

R packages

The functions I'm highlighting today are from package purrr. I'll also use lme4 for fitting simulated data to linear mixed models.

library(purrr) # v. 0.3.4
library(lme4)  # v. 1.1-23

Using possibly() to return values instead of errors

When doing a repetitive task like fitting many models with a map() loop, an error in one of the models will shut down the whole process. We can anticipate this issue and bypass it by defining a value to return if a model errors out via possibly().

I created the very small dataset below to demonstrate the issue. The goal is to fit a linear model of y vs x for each group. I made exactly two groups here, a and b, to make it easy to see what goes wrong and why. Usually we have many more groups and potential problems can be harder to spot.

dat = structure(list(group = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                     x = c("A", "A", "A", "B", "B", "B", "A", "A", "A"),
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)),
                class = "data.frame",
                row.names = c(NA, -9L))
dat

#   group x    y
# 1     a A 10.9
# 2     a A 11.1
# 3     a A 10.5
# 4     a B  9.7
# 5     a B 10.5
# 6     a B 10.9
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3

I'll first split the dataset by group to get a list of data.frames to loop through.

dat_split = split(dat, dat$group)

Then I'll loop through each dataset in the list with map() and fit a linear model with lm(). Instead of getting output, though, I get an error.

map(dat_split, ~lm(y ~ x, data = .x) )
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts  can be applied only to factors with 2 or more levels

What's going on? If you look at the dataset again, you'll see that x for group b contains only a single value. Once you know that, you can see the error is actually telling us what the problem is: we can't use a factor with only one level.

Model a fits fine, since x has two values.

lm(y ~ x, data = dat, subset = group == "a")
# 
# Call:
# lm(formula = y ~ x, data = dat, subset = group == "a")
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667

It is the b model that fails.

lm(y ~ x, data = dat, subset = group == "b")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts  can be applied only to factors with 2 or more levels

You can imagine that the problem of having only a single value of the factor in some groups could be easy to miss if you're working with a large number of groups. This is where possibly() can help, allowing us to keep going through all groups regardless of errors. We can then find and explore the problem groups.

Wrapping a function with possibly()

The possibly() function is a wrapper function. It wraps around an existing function. Other than defining the function to wrap, the main argument of interest is otherwise. In otherwise we define what value to return if we get an error from the function we are wrapping.

I make a new wrapped function called posslm1(), which wraps lm() and returns "Error" if an error occurs when fitting the model.

posslm1 = possibly(.f = lm, otherwise = "Error")

When I use posslm1() in my model fitting loop, you can see that loop now finishes. Model b contains the string "Error" instead of a model.

map(dat_split, ~posslm1(y ~ x, data = .x) )
# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# [1] "Error"

Here's another example of possibly() wrapped around lm(), this time using otherwise = NULL. Depending on what we plan to do with the output, using NULL or NA as the return value can be useful when using possibly().

Now group b is NULL in the output.

posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )

# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# NULL
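
For comparison, a wrapper that returns NA instead would look like this (a small sketch of my own; posslm3 is just an illustrative name, not from the original post):

posslm3 = possibly(.f = lm, otherwise = NA)
map(dat_split, ~posslm3(y ~ x, data = .x) )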

Finding the groups with errors

Once the loop is done, we can examine the groups that had errors when fitting models. For example, I can use purrr::keep() to keep only the results that are NULL.

mods %>%
     keep(~is.null(.x) )

# $b
# NULL

This allows me to pull out the names for the groups that had errors. Getting the names in this way is one reason I like that split() returns named lists.

group_errs = mods %>%
     keep(~is.null(.x) ) %>%
     names()
group_errs
# [1] "b"

Once I have the names of the groups with errors, I can pull any problematic groups out of the original dataset or the split list to examine them more closely. (I use %in% here in case group_errs is a vector.)

dat[dat$group %in% group_errs, ]
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3

dat_split[group_errs]

# $b
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3

Using compact() to remove empty elements

You may come to a point where you've looked at the problem groups and decide that the models with errors shouldn't be used in further analysis. In that case, if all the groups with errors are NULL, you can use purrr::compact() to remove the empty elements from the list. This can make subsequent loops to get output more straightforward in some cases.

compact(mods)
# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667

Using safely() to capture results and errors

Rather than replacing the errors with values, safely() returns both the results and the errors in a list. This function is also a wrapper function. It defaults to using otherwise = NULL, and I generally haven't had reason to change away from that default.

Here's an example, wrapping lm() in safely() and then using the wrapped function safelm() to fit the models.

safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )

The output for each group is now a list with two elements, one for results (if there was no error) and the other for the error (if there was an error).

Here's what this looks like for model a, which doesn't have an error. The output contains a result but no error.

mods2[[1]]
# $result
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $error
# NULL

Model b didn't work, of course, so the results are NULL but the error was captured in error.

mods2[[2]]
# $result
# NULL
# 
# $error
# <simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>

Exploring the errors

One reason to save the errors using safely() is so we can take a look at what the errors were for each group. This is most useful with informative errors like the one in my example.

Errors can be extracted with a map() loop, pulling out the "error" element from each group.

map(mods2, "error")
# $a
# NULL
# 
# $b
# <simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>

Extracting results

Results can be extracted similarly, and, if relevant, NULL results can be removed via compact().

mods2 %>%
     map("result") %>%
     compact()

# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667

Using quietly() to capture messages

The quietly() function doesn't handle errors, but instead captures other types of output such as warnings and messages along with any results. This is useful for exploring what kinds of warnings come up when doing simulations, for example.

A few years ago I wrote a post showing a simulation for a linear mixed model. I use the following function, pulled from that earlier post.

twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}

One thing I skipped discussing in that post was the messages returned for some simulations. However, I can certainly picture scenarios where it would be interesting and important to capture warnings and messages to see, e.g., how often they occur even when we know the data come from the model.

Here I'll set the seed so the results are reproducible and then run the function 10 times. You see I get two messages, indicating that two of the ten models returned a message. In this case, the message indicates that the random effect variance is estimated to be exactly 0 in the model.

set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )

# boundary (singular) fit: see ?isSingular
# boundary (singular) fit: see ?isSingular

It turns out that the second model in the output list is one with a message. You can see at the bottom of the model output below that there is 1 lme4 warning.

sims[[2]]
# Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings

The lme4 package stores warnings and messages in the model object, so I can pull the message out of the model object.

sims[[2]]@optinfo$conv$lme4
# $messages
# [1] "boundary (singular) fit: see ?isSingular"

But I think quietly() is more convenient for this task. This is another wrapper function, and I'm going to wrap it around lmer(). I do this because I'm focusing specifically on messages that happen when I fit the model. However, I could have wrapped twolevel_fun() and captured any messages from the entire simulation process.
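
For completeness, wrapping the whole simulation function instead would look roughly like this (a sketch of that alternative; the object names are mine):

# Capture messages from the entire simulation process, not just the lmer() call
qtwolevel_all = quietly(twolevel_fun)
sims_all = replicate(10, qtwolevel_all(), simplify = FALSE)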

I use my new function qlmer() inside my simulation function.

qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}

I set the seed back to 16 so I get the same models and then run the function using qlmer() 10 times. Note this is considered quiet because the messages are now captured in the output by quietly() instead of printed.

The wrapped function returns a list with 4 elements, including the results, any printed output, warnings, and messages. You can see this for the second model here.

set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]

# $result
# Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
#    Data: ..2
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings 
# 
# $output
# [1] ""
# 
# $warnings
# character(0)
# 
# $messages
# [1] "boundary (singular) fit: see ?isSingular\n"

In a simulation setting, I think seeing how many times different messages and warnings come up could be pretty interesting. It might inform how problematic a message is. If a message is common in simulation we may feel more confident that such a message from a model fit to our real data is not a big issue.

For example, I could pull out all the messages and then put the results into a vector with unlist().

sims2 %>%
     map("messages") %>%
     unlist()

# [1] "boundary (singular) fit: see ?isSingular\n"
# [2] "boundary (singular) fit: see ?isSingular\n"
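
Building on that, here is a small sketch (my own addition) of how one could tally how often each distinct message occurred:

# Count the occurrences of each distinct message across the 10 fits
sims2 %>%
     map("messages") %>%
     unlist() %>%
     table()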

If I wanted to extract multiple parts of the output, such as keeping both messages and warnings, I can use the extract brackets in map().

These results don't look much different compared to the output above since there are no warnings in my example. However, note the result is now in a named vector so I could potentially keep track of which are messages and which are warnings if I needed to.

sims2 %>%
     map(`[`, c("messages", "warnings") ) %>%
     unlist()

#                                     messages 
# "boundary (singular) fit: see ?isSingular\n" 
#                                     messages 
# "boundary (singular) fit: see ?isSingular\n"

I showed only fairly simple ways to use these three functions. However, you certainly may find yourself using them for more complex tasks. For example, I've been in situations in the past where I wanted to keep only models that didn't have errors when building parametric bootstrap confidence intervals. If they had existed at the time, I could have used possibly() or safely() in a while() loop, where the bootstrap data would be redrawn until a model fit without error. Very useful! 😉
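
To sketch that idea (my own illustration, not code from the post): for a parametric bootstrap with lme4 one could simulate a new response from a fitted model and refit it, redrawing whenever the refit errors out. This assumes a fitted lmerMod such as sims[[1]] from above.

possrefit = possibly(.f = refit, otherwise = NULL)

fit_one_boot = function(model) {
     fit = NULL
     # Keep drawing parametric bootstrap responses until a refit succeeds
     while (is.null(fit)) {
          newresp = simulate(model)[[1]]
          fit = possrefit(model, newresp = newresp)
     }
     fit
}

# e.g. boot_fit = fit_one_boot(sims[[1]])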

Just the code, please

Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(purrr) # v. 0.3.4
library(lme4)  # v. 1.1-23

dat = structure(list(group = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                     x = c("A", "A", "A", "B", "B", "B", "A", "A", "A"),
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)),
                class = "data.frame",
                row.names = c(NA, -9L))
dat

dat_split = split(dat, dat$group)
map(dat_split, ~lm(y ~ x, data = .x) )

lm(y ~ x, data = dat, subset = group == "a")
lm(y ~ x, data = dat, subset = group == "b")

posslm1 = possibly(.f = lm, otherwise = "Error")
map(dat_split, ~posslm1(y ~ x, data = .x) )

posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )

mods %>%
     keep(~is.null(.x) )

group_errs = mods %>%
     keep(~is.null(.x) ) %>%
     names()
group_errs

dat[dat$group %in% group_errs, ]
dat_split[group_errs]

compact(mods)

safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )
mods2[[1]]
mods2[[2]]

map(mods2, "error")

mods2 %>%
     map("result") %>%
     compact()

twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )
sims[[2]]
sims[[2]]@optinfo$conv$lme4

qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]

sims2 %>%
     map("messages") %>%
     unlist()

sims2 %>%
     map(`[`, c("messages", "warnings") ) %>%
     unlist()

To leave a comment for the author, please follow the link and comment on their blog: Very statisticious on Very statisticious.

The post Handling errors using purrr's possibly() and safely() first appeared on R-bloggers.

Transitioning a Soil Mechanics Course to R/exams

Posted: 30 Aug 2020 03:00 PM PDT

[This article was first published on R/exams, and kindly contributed to R-bloggers.]

Experiences with transitioning to R/exams for multiple automated and randomized online exams in a soil mechanics course at Universität Innsbruck (UIBK).

Transitioning a Soil Mechanics Course to R/exams

Guest post by Wolfgang Fellin (Universität Innsbruck, Division of Geotechnical and Tunnel Engineering).

Background

Our exercises in soil mechanics are attended by about 120 students every year. Students receive
their grades based on a final exam, with a multiple-choice part and additional small exercises with
calculations. Due to the COVID-19 restrictions we were forced to switch from the written exams to
online exams at rather short notice, for which we decided to use R/exams to randomize the exercises.
The possibility that the resulting exercises can be reused in the following (hopefully COVID-free) years as part of a blended
learning course made the decision to invest the (not so small) effort of switching to R/exams an easy one.

Transition of multiple-choice exercises

To leverage the roughly 100 multiple-choice exercises from our old question pool, we had to convert them to
the R/exams format. The pool was stored in a MySQL database
including LaTeX text for the questions, answer alternatives, images for some exercises (in EPS format),
and some additional meta-information regarding the topic category and the difficulty. Due to
the standardized format of the question pool – and the kind support by Achim Zeileis and Christiane Ernst –
it was possible to extract all questions, embed them into the same template for R/LaTeX questions (.Rnw),
and convert all EPS figures to both SVG and PDF format.
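
For readers curious what such an extraction step could look like, here is a rough sketch (the table and column names are invented for illustration; only the DBI/RMariaDB calls are real):

library(DBI)

# Hypothetical connection and table layout
con <- dbConnect(RMariaDB::MariaDB(), dbname = "questionpool",
                 host = "localhost", user = "reader", password = "secret")
pool <- dbGetQuery(con, "SELECT id, question_tex, answers_tex, category, difficulty FROM mc_questions")
dbDisconnect(con)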
A typical example is shown below (in a screen shot from OpenOLAT,
the learning management system at UIBK).

grundwasser-1-068.png

The corresponding R/LaTeX exercise file is grundwasser-1-068.Rnw which we also converted to
R/Markdown for this blog post: grundwasser-1-068.Rmd. Both rely on the corresponding
image file being available in the same directory or a sub-directory. Different graphics formats are used for
different exam output formats: image68_1.svg, image68_1.pdf,
image68_1.eps.

Essentially, the exercises consist of:

  • some R code for including the right graphics file (SVG vs. PDF) depending on the type of output format (HTML vs. PDF),
  • the (static) question text and answer alternatives in LaTeX format,
  • meta-information including exextra tags for our custom topic category and difficulty level,
  • the exshuffle tag is used for randomly subsampling the available answer alternatives.
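
To illustrate the first point, the format-switching R code in such an exercise might look roughly like this (a sketch only; match_exams_device() and include_supplement() are helpers from the exams package, but the author's exercises may handle this differently):

## Pick the PDF figure for PDF/LaTeX output and the SVG figure otherwise
img <- if (match_exams_device() == "pdf") "image68_1.pdf" else "image68_1.svg"
include_supplement(img)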

New numeric exercises

While the subsampling approach in the multiple-choice exercises provides a certain level of
randomization, I wanted to set up new numeric exercises in which all students have to calculate
the correct solution(s) based on their personalized parameters. Therefore, I set up new numeric
exercises in cloze format where several quantities have to be computed based on the same
question text. A typical example is shown below.

num-setzung-1-steinbrenner.png

Again, the corresponding exercise files are available as:
num-setzung-1-steinbrenner.Rnw,
num-setzung-1-steinbrenner.Rmd.
And the graphics files are:
steinbrenner_fdmt.svg,
steinbrenner_fdmt.pdf.

Although I was a newcomer to both R and R/exams, setting up these exercises was quite easy
because I have been using MATLAB for a long time and could quickly grasp basic R code.
The exercise templates available online also helped, as did the
friendly e-mail support by Achim for resolving many of the small "details" along the way.

Exam implementation

Following the R/exams tutorial on summative online exams in OpenOLAT
I prepared the online exam using the exams2openolat() function. The R code I used along with a couple
of comments is provided below.

## load package
library("exams")

## exercises for the exam
## (the actual exam had 16 exercises overall, not just 2)
exm <- c("grundwasser-1-068.Rnw", "num-setzung-1-steinbrenner.Rnw")

## date used as seed and file name
dat <- 20200528
set.seed(dat)

rxm <- exams2openolat(exm,
  edir = "/path/to/exercise/folder/",
  n = 200,                ## number of random versions for each exercise
  name = paste("BmGb1-Onlinetest", dat, sep = "_"),
  points = 1,             ## all exercises yield 1 point
  cutvalue = 8,           ## threshold for passing the exam

  ## evaluation rule for multiple-choice exercises (standalone and within cloze):
  ## every wrong answer cancels a correct answer
  mchoice = list(eval = exams_eval(partial = TRUE, negative = FALSE, rule = "true")),
  cloze   = list(eval = exams_eval(partial = TRUE, negative = FALSE, rule = "true")),

  solutionswitch = FALSE, ## do not show full solution explanation
  maxattempts = 1,
  shufflesections = TRUE, ## sequence of exercises is shuffled randomly
  navigation = "linear",  ## disable switching back and forth between exercises
  duration = 120,         ## 2 hours
  stitle = "Exercise",    ## section title (for group of exercises)
  ititle = "Question"     ## item title (for individual exercises)
)

Challenges and outlook

The main challenge in producing randomized numeric exercises was to find limits for the input parameters so that
the output is physically reasonable. Especially for newly created exercises, it is absolutely necessary
that other colleagues act as test calculators, so that errors can be detected before students find
them in their exams. The stress testing facilities in
R/exams can also help.
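
For example, a quick stress test might look like this (a sketch; stresstest_exercise() is part of the exams package, but I have not tuned the arguments for this particular exercise):

library("exams")
## Draw the exercise many times and inspect the generated parameters/solutions for implausible values
st <- stresstest_exercise("num-setzung-1-steinbrenner.Rnw", n = 200)
plot(st)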

Since the course will continue to be online in 2020/21 due to ongoing COVID-19 restrictions,
I now use R/exams also for the practice exercises which students have to calculate during the semester.
Thus, I produced about 70 new exercises, which are offered in OpenOLAT in 12 weekly online tests.
After COVID-19 I will use these exercises in addition to written ones, so that longer and more complex
tasks can also be trained.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.

The post Transitioning a Soil Mechanics Course to R/exams first appeared on R-bloggers.

Easiest flowcharts eveR?

Posted: 30 Aug 2020 04:29 AM PDT

[This article was first published on R – G-Forge, and kindly contributed to R-bloggers.]

A guide to flowcharts using my Gmisc package. The image is CC by Michael Staats.

A flowchart is a type of diagram that represents a workflow or process. The building blocks are boxes and the arrows that connect them. If you have submitted a research paper in the last 10 years, you have almost inevitably been asked to produce a flowchart showing how you generated your data. While there are excellent click-and-draw tools, I have always found it much nicer to code my charts. In this post I will go through some of the abilities in my Gmisc package that make this process smoother.

Main example

Let's start with the end result of using Gmisc's boxGrob() together with connectGrob(), so that you know whether you should continue reading.

Traditional flow chart

The code for this is in the vignette; a slightly adapted version looks like this:

library(Gmisc)
library(magrittr)
library(glue)

# The key boxes that we want to plot
org_cohort <- boxGrob(glue("Stockholm population",
                           "n = {pop}",
                           pop = txtInt(1632798),
                           .sep = "\n"))
eligible <- boxGrob(glue("Eligible",
                         "n = {pop}",
                         pop = txtInt(10032),
                         .sep = "\n"))
included <- boxGrob(glue("Randomized",
                         "n = {incl}",
                         incl = txtInt(122),
                         .sep = "\n"))
grp_a <- boxGrob(glue("Treatment A",
                      "n = {recr}",
                      recr = txtInt(43),
                      .sep = "\n"))

grp_b <- boxGrob(glue("Treatment B",
                      "n = {recr}",
                      recr = txtInt(122 - 43 - 30),
                      .sep = "\n"))

excluded <- boxGrob(glue("Excluded (n = {tot}):",
                         " - not interested: {uninterested}",
                         " - contra-indicated: {contra}",
                         tot = 30,
                         uninterested = 12,
                         contra = 30 - 12,
                         .sep = "\n"),
                    just = "left")

# Move boxes to where we want them
vert <- spreadVertical(org_cohort,
                       eligible = eligible,
                       included = included,
                       grps = grp_a)
grps <- alignVertical(reference = vert$grps,
                      grp_a, grp_b) %>%
  spreadHorizontal()
vert$grps <- NULL

y <- coords(vert$included)$top +
  distance(vert$eligible, vert$included, half = TRUE, center = FALSE)
excluded <- moveBox(excluded,
                    x = .8,
                    y = y)

# Connect vertical arrows, skip last box
for (i in 1:(length(vert) - 1)) {
  connectGrob(vert[[i]], vert[[i + 1]], type = "vert") %>%
    print
}

# Connect included to the two last boxes
connectGrob(vert$included, grps[[1]], type = "N")
connectGrob(vert$included, grps[[2]], type = "N")

# Add a connection to the exclusions
connectGrob(vert$eligible, excluded, type = "L")

# Print boxes
vert
grps
excluded

Example breakdown

There is a lot happening and it may seem overwhelming, but if we break it down into smaller chunks it probably makes more sense.

Packages

The first section is just the packages that I use. Apart from the Gmisc package with the main functions, I load magrittr for the %>% pipe and glue, which is convenient for text generation because it lets us use interpolated string literals, a feature that has become standard tooling in many languages (e.g. JavaScript and Python).

Basic boxGrob() calls

After loading the packages we create each box that we want to output. Note that we save each box into a variable:

org_cohort <- boxGrob(glue("Stockholm population",
                           "n = {pop}",
                           pop = txtInt(1632798),
                           .sep = "\n"))

This avoids plotting the box, which is the default action when print is called (R calls print on any object that is output to the terminal). By storing the box in a variable we can manipulate it prior to output. If we had instead written the code below:

boxGrob(glue("Stockholm population",
             "n = {pop}",
             pop = txtInt(1632798),
             .sep = "\n"))

Just a single box in the center

Note that the box is generated in the middle of the image (also known as the main viewport in the grid system). We can choose how to place the box by specifying the position parameters.

boxGrob("top left", y = 1, x = 0, bjust = c(0, 1))
boxGrob("center", y = 0.5, x = 0.5, bjust = c(0.5, 0.5))
boxGrob("bottom right", y = 0, x = 1, bjust = c(1, 0))

General position options for a box

Spreading and moving the boxes

While we can position the boxes exactly where we want, I have found it even more useful to move them relative to the available space. The section below does exactly this.

vert <- spreadVertical(org_cohort,
                       eligible = eligible,
                       included = included,
                       grps = grp_a)
grps <- alignVertical(reference = vert$grps,
                      grp_a, grp_b) %>%
  spreadHorizontal()
vert$grps <- NULL

excluded <- moveBox(excluded,
                    x = .8,
                    y = coords(vert$included)$top +
                      distance(vert$eligible, vert$included, half = TRUE, center = FALSE))

The spreadVertical() function takes each element and calculates its position so that the first is aligned at the top and the last at the bottom. The elements in between are spread evenly throughout the available space. There are some options for how to spread the objects: the default is to make the space between the boxes identical, but there is also the option of spreading the centers of the boxes evenly (see the .type parameter).

The alignVertical() function aligns the elements in relation to the reference object. In this case we chose to find the bottom alignment using a "fake" grp_a box. As we only have this box so that we can use it for alignment, we drop it afterwards with vert$grps <- NULL.

Note that all of the align/spread functions return a list with the boxes in the new positions (if you print the original boxes they will not have moved). Thus make sure you print the returned elements if you want to see the objects, just as we do at the end of the code block.

vert
grps
excluded

Moving a box

In the example we want the exclusion box to be spaced equally between the eligible and included boxes, which we can do using moveBox(), as it allows us to change any of the coordinates of the original box. Just as previously, we save the box back into the original variable, or the box would appear not to have moved once we try to print it.

Here we also make use of the coords() function that gives us access to the coordinates of a box and the distance() that gives us the distance between boxes (the center = FALSE is for retrieving the distance between the boxes edges and not from the center point).

y <- coords(vert$included)$top +
  distance(vert$eligible, vert$included, half = TRUE, center = FALSE)
excluded <- moveBox(excluded,
                    x = .8,
                    y = y)

Generating the connecting arrows

Once we are done positioning all the boxes, we need to connect them with arrows using connectGrob(). The function accepts two boxes and draws a line between them. The appearance of the line is decided by the type argument. My apologies if the allowed arguments "vertical", "horizontal", "L", "-", "Z", "N" are not super intuitive; finding good names is hard. Feel free to suggest better options/explanations. Anyway, below we simply loop through them all and plot the arrows using print(). Note that we only need to call print() within the for loop, as the others are automatically printed.

for (i in 1:(length(vert) - 1)) {
  connectGrob(vert[[i]], vert[[i + 1]], type = "vert") %>%
    print
}
connectGrob(vert$included, grps[[1]], type = "N")
connectGrob(vert$included, grps[[2]], type = "N")

connectGrob(vert$eligible, excluded, type = "L")

Short summary

So the basic workflow is

  • generate boxes,
  • position them,
  • connect arrows to them, and
  • print

Practical tips

I have found that generating the boxes at the same time as I actually exclude the cases in my data set keeps the risk of invalid counts to a minimum. Sometimes I generate a list that I later convert to a flowchart, but the principle is the same: make sure the graph is closely tied to your data.
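
A minimal sketch of that tip (my own toy data, not from the post): create each box right where the corresponding filtering happens, so the displayed counts cannot drift away from the data.

library(Gmisc)
library(glue)

raw <- data.frame(id = 1:100, eligible = rep(c(TRUE, FALSE), c(80, 20)))

box_screened <- boxGrob(glue("Screened\nn = {nrow(raw)}"))
kept <- raw[raw$eligible, ]
box_eligible <- boxGrob(glue("Eligible\nn = {nrow(kept)}"))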

If you want to style your boxes you can set the options(boxGrobTxt=..., boxGrob=...) and all boxes will have the same styling. The fancy boxPropGrob() allows you to show data splits and has even more options that you may want to check out, although usually you don't have more than one boxPropGrob().

To leave a comment for the author, please follow the link and comment on their blog: R – G-Forge.

The post Easiest flowcharts eveR? first appeared on R-bloggers.

tsmp is going big!

Posted: 28 Aug 2020 05:00 PM PDT

[This article was first published on R Bloggers on Francisco Bischoff, and kindly contributed to R-bloggers.]

Since the beginning of the tsmp package, it was evident that a series of algorithms around the Matrix Profile would pop up sooner or later.

After the creation of the Matrix Profile Foundation (MPF), the tsmp package has doubled its number of monthly downloads, and that is a good thing!

The current version of tsmp, as shown in the previous post, added the new Pan-Matrix Profile and introduced the Matrix Profile API that aims to standardize high-level tools across multiple programming languages.

Now my focus will be on the speed and robustness of the package. Currently I can say that the speed improvement is 25% for STAMP and STOMP.

The tsmp package will keep the current structure, but the "core" algorithms will be moved to a new dependency package called matrixprofiler.

This package will be specialized in low-level (Rcpp) implementations of MASS, STAMP, STOMP, etc. This will let tsmp more flexibly implement algorithms that use the Matrix Profile as the base for application tasks such as "motif search", "snippet search", or other algorithms that may appear.

matrixprofiler will tend to be updated less often and to be very stable. Those who just want the basics will be able to work with this package alone, but it is safer to install tsmp and have the former as a dependency.

Stay tuned for the first release.

To leave a comment for the author, please follow the link and comment on their blog: R Bloggers on Francisco Bischoff.

The post tsmp is going big! first appeared on R-bloggers.