[This article was first published on George J. Mount, and kindly contributed to R-bloggers].
I'm a proud liberal arts graduate myself who, with some fumbling, ended up in the world of data analytics. It may sound odd, but I never fancied myself much of a "math person," and I still love to explore the world through the arts and qualitative methods.
I still hold that it's because of these interests that I decided to dig in on data analytics: I wanted to be as efficient, productive and automated as possible in using quantitative data, so that I had the time and energy to explore it qualitatively.
This distinction holds a truth that we can use to unite data analytics with the liberal arts: computers are good at processing data, and humans are good at making sense of it.
The good news about computers is that they do what you tell them to do. The bad news is that they do what you tell them to do.
Ted Nelson
Why the liberal arts in an age of big data and technology? The liberal arts teach you to have a perspective and make sense of the world. As I said in my post on creating the future with the liberal arts, "The liberal arts student appreciates the grit, fragility and triumph of the human condition. We can only see where we are going if we know where we've been."
Humans have a gift for contextualizing large amounts of long-term information, but we are terrible at holding things in short-term memory. Most of us can hold only about seven items at a time in working memory.
(If you don't believe me about how bad we are at short-term memory, take this test.)
Computers, on the other hand, thrive at executing orders that require processing lots of defined chunks of data. Here is where computational thinking and data literacy hold so much promise for liberal arts students.
Here is a basic two-semester course sequence that I proposed to a liberal arts professor friend of mine. The professor had mentioned the typical scenario of liberal arts grads being ill-prepared for the job market, and that the school was looking for ways to better position its students for career success.
I know this scenario well — I was one myself, but I've come not to regret receiving my education in the first place. Instead, I've looked for ways to integrate data analytics education with liberal arts education, specifically so that liberal arts students, who have so much value to provide employers, are seen in a compelling light.
To do so, I've condensed a coding-bootcamp-like experience into a three-credit-hour college course. Students will learn how to build spreadsheet models, read information from databases, and script data collection and analysis processes. These skills will make liberal arts graduates more competitive on the job market.
My second course is geared toward liberal arts students who wish to pursue graduate school and careers in academia. These students can benefit from computational processes as they are increasingly being used in those fields. The course is heavy on text analysis and natural language processing — again, to provide a method for comparing many bits of short-term memory at scale.
I hope the below learning guide can benefit your liberal arts organization. Universities must adapt, but I do not believe the best way to do so is by abandoning liberal arts programs. There is a way to augment this field of study with computational and data analytics methods.
I imagine this as a two-semester sequence for upperclassmen. The first course, "Intro to Data Analytics" is more of a general survey of data analytics with an emphasis on applications in the workforce. The second, "Data Analytics for the Liberal Arts" is geared toward students interested in graduate school in the humanities or similar. However, these topics would be of use to others as natural language processing has become popular in industry.
Intro to Data Analytics
This is a full-semester, 3-credit hour course. I consider this as a sort of "data analytics bootcamp" conducted as a typical semester-long college course.
Learning outcomes:
Student can use multiple data analytics technologies in building a data analysis project
Student can create visualizations and other data-backed assets using information design principles
Student can conduct compelling presentations based on data analysis
Topics:
What is data analytics? Why is it so popular? Should it be so popular? What can liberal arts students bring to the table? (This can be a bit contrarian and liberal arts triumphalist.)
Basic spreadsheet modeling: be able to build a break-even pricing model or a staffing model. Almost anyone in business will have to build a budget at some point and it will probably be in Excel.
Databases: Understand the basic architecture that powers modern business and how it's changed in the last 15 years. Be able to write basic commands in structured query language (SQL) to read from databases. Understand the difference between relational and non-relational databases, structured and unstructured data.
Information design: Much of this is based on cognitive science. Learn when to use which chart versus tables, etc. How to design effective slide decks, visualizations and dashboards and use these assets to present sound data-backed presentations and recommendations. Pairs nicely with the liberal arts emphasis on sound communication.
Coding: Perform basic task automation and data analysis with R.
Assessments:
Build a break-even model for an ecommerce store. Present findings to management.
Build an interactive dashboard investigating sales performance over time
Perform basic data exploration and analysis of a relational database
Conduct reproducible data analysis exercise and presentation using a series of data analytics tools
Data Analytics for the Liberal Arts
This is a full-semester, 3-credit hour course. The text could be Text Analysis with R for Students of Literature with some supplemental material. It may be wise to make "Intro to Data Analytics" a pre-req along with a basic stats course.
This course will be conducted entirely in R. The "target demo" would be students interested in grad school for the humanities or related, but the appeal could easily extend as natural language processing is becoming quite popular in industry.
Learning outcomes:
Student can conduct exploratory data analysis of unstructured data sets
Student can build reproducible data analysis projects in line with the CRISP-DM or similar methodology
Student can create compelling visualizations to aid in their findings
Topics:
Advantages and disadvantages of data analytics in the liberal arts
Basic exploratory data analysis: describing, summarizing, visualizing one and multiple variables
Unstructured data collection: APIs, web scraping
What makes data "tidy" and why it matters
Structured vs unstructured data
Word frequency analysis
Document similarity
Document clustering & classification
Topic modeling
Network analysis
Assessments:
Collect data from an API or the Web, explore and prepare it for quantitative analysis
Visualize word frequencies and occurrences across the corpus of one or more authors
Build a model to classify documents based on predicted original author
Conduct network analysis on historical or literary figures
If helping your undergraduate liberal arts population succeed in their careers and in grad school through data analytics is something you're wrestling with, please set up a free call with me and we can talk.
This download is part of my resource library. For exclusive free access, subscribe below.
[This article was first published on R-posts.com, and kindly contributed to R-bloggers].
Introduction
Risk ratios (RR) and odds ratios (OR) are both used to analyze factors associated with increased risk of disease. They differ in that a risk ratio is the ratio of two risks and an odds ratio is the ratio of two odds. Risks and odds are calculated differently. For example, with one toss of a fair die, the risk of rolling a 1 is 1/6 = 0.167. The odds of rolling a 1 are 1/5 = 0.200. In textbook terms, risk is successes divided by total trials and odds are successes divided by failures. In this example rolling a 1 is a success and rolling 2 through 6 is a failure.
In R, glm (generalized linear model) is used to compute adjusted risk ratios (ARR's) and adjusted odds ratios (AOR's). Adjusting eliminates the influence of correlations between factors on the ratios. In statistical terms, adjusting controls for confounding. In addition to confounding, the RR's and OR's can be influenced by interaction which is also called effect modification. This occurs when the ratio for one group is different from the ratio for another group. For example, non-smokers who work with asbestos might have 4.0 times the risk of getting lung cancer compared to non-smokers who do not work with asbestos. This is for non-smokers. For smokers, this risk ratio might be much larger at 8.0, meaning smokers who work with asbestos have 8.0 times the risk of getting lung cancer compared to smokers who do not work with asbestos. In this case there is interaction between smoking and asbestos exposure because the RR for non-smokers is 4.0 and the RR for smokers is 8.0. Alternatively, it could be said that smoking modifies the effect of the asbestos exposure on lung cancer.
For a given data set, the AOR's and ARR's will almost always agree in terms of statistical significance. The AOR's and ARR's will be different but the p values will be close. This is not true for interaction terms. Often, using the same data, an analysis of interaction using OR's will indicate significant interaction while the RR's will indicate no interaction. This phenomenon is explained very well in the article "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz. (Epidemiologic Perspectives & Innovations 2005, 2:1 doi:10.1186/1742-5573-2-1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/.
The authors described the problem for RR's and OR's and included mathematical explanations. The purpose of this article is to show how this problem occurs when using adjusted RR's and adjusted OR's with the R glm function to assess interaction.
Creating a test data set
The first step is to create a test data set. The R code below produces a data set of 1,000,000 lines with simulated variables for two risk factors, a grouping variable and an outcome variable. This is designed to simulate a situation where there are two groups of people that have different risks of disease. There are also two risk factors and these are designed to have the same risk ratios for the disease for simulated people in both groups.
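The data-generation code is not reproduced at this point in the text; a minimal sketch consistent with the proportions described below (0.20 for rf1, 0.30 for rf2, 0.50 for grp) might look like this. The exact construction is an assumption; only the names rf1, rf2, grp, dep1 and d1 are taken from the code shown later in the post.

# Sketch of the setup (assumed form): 1,000,000 simulated persons with
# two risk factors, a group indicator and a uniform draw used later to assign disease.
n <- 1000000
rf1 <- ifelse(runif(n) < 0.20, 1, 0)
rf2 <- ifelse(runif(n) < 0.30, 1, 0)
grp <- ifelse(runif(n) < 0.50, 1, 0)
dep1 <- runif(n)
d1 <- data.frame(rf1, rf2, grp, dep1)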
The R output above creates the two risk factor variables (rf1 and rf2) and the group variable (grp). The proportions are 0.20, for rf1, 0.30 for rf2 and 0.50 for grp 0 and grp 1. This simulates a scenario where the risk factors are not too rare and the two groups are similar to a male/female grouping. The proportions will vary slightly due to randomizing with the runif R function.
The block of output above creates a new variable, xg12f, that combines the three variables, grp, rf1 and rf2. This makes it easier to create the outcome variable named disease. Each value of xg12f defines a unique combination of the variables grp, rf1 and rf2. The disease proportion (disprop) is initially set to 0.01 and then the series of ifelse statements increases the disease proportion by multiples of disprop based on the risk factors and group. This simulates a situation where the disease is rare for people with no risk factors (xg12f = 0) and then increases with various combinations of risk factors and group. For example, a simulated person in group 0 with risk factor 1 but not risk factor 2 would have a value for xg12f of 10. The third ifelse statement sets the disease proportion for simulated persons in this category to 1.7 times the disprop value. In general, simulated persons with risk factors have higher disease proportions and simulated persons in group 1 have higher disease proportions. Simulated persons in group 1 with both risk factors (xg12f = 111) have the highest multiple at 5.1 times the proportion of disease for simulated persons in group 0 with no risk factors (xg12f = 0). As shown above, the overall proportion of simulated persons with the disease is 0.019679.
The R output above uses the R function glm to produce ARR's with 95% confidence intervals and p values. In this glm the family is binomial with link = log. This produces coefficients that are the logs of the ARR's and these are used to produce the table of ARR's and 95% CI that appears below the glm output. It should be noted that the intercept is the log of the disease proportion for simulated persons that have no risk factors. In the table of ARR's the data for the intercept is not the ARR but rather the proportion of disease for simulated persons with no risk factors, 0.01.
As expected, the ARR's are statistically significant for rf1, rf2 and grp. Also as expected, the interaction terms rf1:rf2, rf1:grp and rf2:grp, are not statistically significant.
Next are the AOR's.
> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+        family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp + rf2 * grp,
    family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.3265  -0.2000  -0.2000  -0.1423   3.0326

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -4.588289   0.018205 -252.037   <2e-16 ***
rf1          0.505839   0.030181   16.760   <2e-16 ***
rf2          0.405558   0.027242   14.887   <2e-16 ***
grp          0.686728   0.021929   31.316   <2e-16 ***
rf1:rf2      0.002334   0.032418    0.072    0.943
rf1:grp      0.042153   0.033593    1.255    0.210
rf2:grp      0.040418   0.031327    1.290    0.197
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 193574  on 999999  degrees of freedom
Residual deviance: 189361  on 999993  degrees of freedom
AIC: 189375

Number of Fisher Scoring iterations: 7

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.01  0.01   0.01
rf1         1.66  1.56   1.76
rf2         1.50  1.42   1.58
grp         1.99  1.90   2.07
rf1:rf2     1.00  0.94   1.07
rf1:grp     1.04  0.98   1.11
rf2:grp     1.04  0.98   1.11

The R output above is similar to the previous glm output, but in this glm the family is binomial with link = logit instead of log. This produces coefficients that are the logs of the AOR's, and these are used to produce the table of AOR's and 95% CI's. The line for the intercept in the table is not an AOR but the odds of disease for simulated persons with no risk factors. As with the ARR's, the AOR's are statistically significant for rf1, rf2 and grp. The interaction terms rf1:rf2, rf1:grp and rf2:grp are not statistically significant, which also agrees with the glm output for the ARR's.
Second Simulation
> #sim 2
> disprop=0.05
> disease=ifelse(xg12f == 0 & dep1 < disprop,1,0)
> disease=ifelse(xg12f == 1 & dep1 < (1.5*disprop),1,disease)
> disease=ifelse(xg12f == 10 & dep1 < (1.7*disprop),1,disease)
> disease=ifelse(xg12f == 11 & dep1 < (1.5*1.7*disprop),1,disease)
> disease=ifelse(xg12f == 100 & dep1 < (2*disprop),1,disease)
> disease=ifelse(xg12f == 101 & dep1 < (3*disprop),1,disease)
> disease=ifelse(xg12f == 110 & dep1 < (3.4*disprop),1,disease)
> disease=ifelse(xg12f == 111 & dep1 < (5.1*disprop),1,disease)
> sum(disease)/nrow(d1)
[1] 0.09847

The R output above uses the same data file, but the disprop variable is increased from 0.01 to 0.05. This reflects a situation where the disease proportion is higher. As shown above, the overall proportion of simulated persons with the disease is 0.09847.

> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+        family=binomial(link=log))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp + rf2 * grp,
    family = binomial(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7718  -0.4588  -0.4168  -0.3207   2.4468

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept) -2.9935084  0.0078756 -380.099   <2e-16 ***
rf1          0.5071062  0.0127189   39.870   <2e-16 ***
rf2          0.4029193  0.0115901   34.764   <2e-16 ***
grp          0.6901061  0.0093582   73.743   <2e-16 ***
rf1:rf2      0.0007967  0.0130359    0.061    0.951
rf1:grp      0.0155518  0.0139306    1.116    0.264
rf2:grp      0.0206406  0.0131138    1.574    0.115
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 6

> x1=exp(confint.default(r1))
> rr=exp(coef(r1))
> round(cbind(rr,x1),digits=2)
              rr 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.66  1.62   1.70
rf2         1.50  1.46   1.53
grp         1.99  1.96   2.03
rf1:rf2     1.00  0.98   1.03
rf1:grp     1.02  0.99   1.04
rf2:grp     1.02  0.99   1.05

The R output above is glm output with family = binomial and link = log. As before, the coefficients are the logs of the ARR's and are used to produce the table of ARR's. Also as before, the line for the intercept in the table is not an ARR but the prevalence of disease for simulated persons with no risk factors; in this case it is 0.05, in contrast to the earlier glm result with the previous data file, which was 0.01. As expected, this output shows statistically significant ARR's for the risk factors and no statistically significant results for the interaction terms.

> r1=glm(disease~rf1+rf2+grp+rf1*rf2+rf1*grp+rf2*grp,
+        family=binomial(link=logit))
> summary(r1)

Call:
glm(formula = disease ~ rf1 + rf2 + grp + rf1 * rf2 + rf1 * grp + rf2 * grp,
    family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7701  -0.4586  -0.4157  -0.3210   2.4459

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -2.939696   0.008344 -352.305  < 2e-16 ***
rf1          0.534243   0.014054   38.012  < 2e-16 ***
rf2          0.423218   0.012628   33.515  < 2e-16 ***
grp          0.740304   0.010119   73.158  < 2e-16 ***
rf1:rf2      0.042163   0.015615    2.700  0.00693 **
rf1:grp      0.072082   0.015824    4.555 5.23e-06 ***
rf2:grp      0.063938   0.014672    4.358 1.31e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 643416  on 999999  degrees of freedom
Residual deviance: 620356  on 999993  degrees of freedom
AIC: 620370

Number of Fisher Scoring iterations: 5

> x1=exp(confint.default(r1))
> or=exp(coef(r1))
> round(cbind(or,x1),digits=2)
              or 2.5 % 97.5 %
(Intercept) 0.05  0.05   0.05
rf1         1.71  1.66   1.75
rf2         1.53  1.49   1.57
grp         2.10  2.06   2.14
rf1:rf2     1.04  1.01   1.08
rf1:grp     1.07  1.04   1.11
rf2:grp     1.07  1.04   1.10

The R output above is glm output with family = binomial and link = logit. The coefficients are the logs of the AOR's and are used to produce the table of AOR's. The line for the intercept in the table is not an AOR but the odds of disease for simulated persons with no risk factors. As expected, this output shows statistically significant AOR's for the risk factors. However, in contrast to the results with the ARR's, this glm output shows statistically significant results for the interaction terms. In short, with this data set, the interaction terms are not statistically significant when using ARR's but they are statistically significant when using AOR's.
Lastly, the value of the disprop variable is increased again, this time to 0.10. The output below shows the creation of the new disease variable and shows the overall proportion of simulated persons with the disease is increased to 0.196368.
The output below is the by now familiar glm output for ARR's and AOR's. With the increased proportion of disease, the ARR's still show the interaction terms as not statistically significant. But the AOR's show the interaction terms as having p values less than 2e-16, which is extremely small. Once again, the ARR's show no interaction while the AOR's indicate strong interaction.
These simulations show that, given the same data, different generalized linear models (glm) will sometimes lead to different conclusions about interaction between the independent variables. When the disease is rare, adjusted odds ratios and adjusted risk ratios tend to agree regarding the statistical significance of the interaction terms. However, as the prevalence of the disease increases, the adjusted odds ratios and adjusted risk ratios tend to produce conflicting results for the interaction terms. This is discussed and explained mathematically in the article referred to above: "Distributional interaction: Interpretational problems when using incidence odds ratios to assess interaction" by Ulka B Campbell, Nicolle M Gatto and Sharon Schwartz (Epidemiologic Perspectives & Innovations 2005, 2:1, doi:10.1186/1742-5573-2-1, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079911/). The purpose of this paper was to illustrate how this works with adjusted risk ratios and adjusted odds ratios using the R glm function.
For practicing statisticians and epidemiologists, it might be good advice to use risk ratios whenever denominators are available, and to use odds ratios only when denominators are not available, such as in case-control studies.
For example, with case-control studies, data is collected for selected persons with a disease (cases) and also for a specific number of persons without the disease (controls). With this data there are no denominators for persons at risk of getting the disease and disease rates cannot be calculated. Without disease rates, rate ratios cannot be calculated, but case control data can be used to calculate odds ratios even though the rates are unavailable. For more on this see "Statistical Methods in Cancer Research Volume I: The Analysis of Case-Control Studies", IARC Scientific Publication No. 32, Authors: Breslow NE, Day NE at: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Scientific-Publications/Statistical-Methods-In-Cancer-Research-Volume-I-The-Analysis-Of-Case-Control-Studies-1980
Here is the R program used in this analysis. It should run when copied and pasted into the R window.
[This article was first published on dataENQ, and kindly contributed to R-bloggers].
Welcome to the second part of this two-part series on data manipulation in R. This article aims to present the reader with different ways of data aggregation and sorting. Here is the composition of this article.
Aggregation in R
Why aggregate data?
Data used in this article
Aggregation using base R
Aggregation using the aggregate function
Assigning a name to an aggregated column in aggregate function in R
Aggregating multiple columns using an aggregate function in R
Aggregating multiple columns data by multiple functions using the aggregate function in R
Aggregation using by() function
Aggregation using sweep() function
Aggregation using dplyr
Aggregation using group_by and summarize functions
Aggregation by group_by and summarize functions with multiple variables and functions
Aggregation using group_by and summarize functions with a range of variables
Aggregation using group_by and summarize_if function
Aggregation using group_by and summarize_at function
Aggregation using group_by and summarize_all function
Aggregation using data.table
Converting data frame to data table
Aggregating single variable using data table
Assigning a custom name to aggregated variable in data table
Aggregating multiple variables using multiple aggregation functions in data table
Aggregating variables as well as filtering observations in data table
Counting the number of observations within a group in data table
Aggregation using sqldf
Sorting in R
Why sort data?
Sorting using base R
Sorting data using order function
Sorting data in descending order using order function
Sorting using dplyr
Sorting data using arrange function
Sorting data in descending order using arrange function
Sorting using data.table
Sorting data using order function
Sorting data in descending order using order function
Sorting data using setorder function
Sorting data in descending order using setorder function
Sorting using sqldf
Aggregation in R
To begin with, let's look at the term aggregation. As per the Cambridge dictionary, "the process of combining things or amounts into a single group or total" is aggregation.
Why aggregate data?
Data generally comes in two types: qualitative and quantitative. These are statistical terms. Qualitative data describes the characteristics of the data; labels, properties, attributes, and categories are all examples of qualitative data. As the name suggests, data that expresses a quality is qualitative data. Quantitative data, on the other hand, represents numbers. In other disciplines, like data warehousing or business intelligence, qualitative data corresponds to dimensions and quantitative data to measures.
Data analysis is a complex process and involves several steps. Some of these steps may require examining the data by its qualitative attributes. Usually, qualitative data sits at a higher granularity, where granularity means the level of detail.
For example, a dataset that contains all countries of the world may have multiple variables describing qualitative data. The name of the country would be at the most granular level, as no two countries have the same name. The data becomes less granular (more aggregated) as we look at countries by continent and then by hemisphere.
Similarly, in an analysis, data is examined at various levels by its different qualities. This is where aggregation comes into the picture. It is required whenever you want to explore quantitative data by a qualitative attribute that sits above the most granular level.
In this article, we will practice different aggregation functions and options offered by base R and by packages like dplyr and data.table. Note that this article does not aim to list every function that can be used to aggregate data; the aim is to explain the concept of data aggregation. R is a rich language that offers different ways of doing the same thing.
Data used in this article
We will use the data that we have prepared in our previous post How to prepare data for analysis in R. Click on this link to download the original data. This data is available under the PDDL license. However, to save space and make it cleaner, we will remove a few columns from the data frame. It's an opportunity for us to revise how to remove a column from the data frame. Let's first create the data frame and understand the data.
##    Symbol                 Sector  Price X52.Week.Low X52.Week.High
## 1     MMM            Industrials 222.89      259.770       175.490
## 2     AOS            Industrials  60.24       68.390        48.925
## 3     ABT            Health Care  56.27       64.600        42.280
## 4    ABBV            Health Care 108.48      125.860        60.050
## 5     ACN Information Technology 150.51      162.600       114.820
## 6    ATVI Information Technology  65.83       74.945        38.930
## 7     AYI            Industrials 145.41      225.360       142.000
## 8    ADBE Information Technology 185.16      204.450       114.451
## 9     AAP Consumer Discretionary 109.63      169.550        78.810
## 10    AMD Information Technology  11.22       15.650         9.700
Our final data now has five variables and 505 observations.
Aggregation using base R
Aggregation using the aggregate function
When talking about aggregation, the aggregate function is the most obvious choice. Aggregate is a generic function that can be used with both data frames and time series. If we are working with a data frame, we can be specific by writing aggregate.data.frame; both aggregate and aggregate.data.frame will produce the same result.
We will look at this function through different scenarios. Before any aggregation, a use case must exist, that is, a reason why you want to aggregate the data. This information is crucial, as it guides you to extract correctly summarized data that answers your question.
So the scenario here is to "find out the total price of all shares by all sectors". If you recall from the section above on qualitative and quantitative data, the variables sector and price represent qualitative and quantitative data, respectively. This information is crucial to put the variables into their correct place within the function.
aggregate(financials$Price, by = list(financials$Sector), FUN = sum)
##                       Group.1        x
## 1      Consumer Discretionary 10418.90
## 2            Consumer Staples  2711.98
## 3                      Energy  1852.40
## 4                  Financials  6055.81
## 5                 Health Care  8083.46
## 6                 Industrials  7831.47
## 7      Information Technology  8347.00
## 8                   Materials  2559.67
## 9                 Real Estate  2927.52
## 10 Telecommunication Services   100.81
## 11                  Utilities  1545.45
Assigning a name to an aggregated column in aggregate function in R
The result above shows that the aggregate function split the original data into small subsets by the sector variable and applied the sum function over the price variable. The result shows x as the name of the summarised column. If you want to rename it, a little tweak to the code using the setNames function will do the trick. Here is an example.
setNames(aggregate(financials$Price, by = list(financials$Sector), FUN = sum),c("Sector","Total.Price"))
Aggregating multiple columns using the aggregate function in R
Now, to apply the same function to multiple variables, all you have to do is supply an expression that subsets the required columns from the data frame as an argument. Here is an example; we have used the setNames function on top to assign meaningful names to the variables.
The scenario here is to "find out the average price, average 52-weeks low price, and the average 52-week high price of all shares by all sectors".
setNames(aggregate(financials[,c("Price","X52.Week.Low","X52.Week.High")], by = list(financials$Sector), FUN = mean), c("Sector","Average.Price","Average.Low","Average.High"))
Aggregating multiple columns data by multiple functions using the aggregate function in R
Moving on to the next level of the scenario, we would like to apply multiple functions over different variables. This is possible by defining a custom function. The requirement now is to "find out the minimum and maximum values of the price and 52-week low price of all shares by all sectors". Here is an example.
aggregate(financials[c("Price","X52.Week.Low")], by = list(financials$Sector), FUN = function(x) c(min(x),max(x)))
With the aggregate function, the possibilities are endless; what we have seen in the sections above is just a part of it. Here are some other arguments you can add to the aggregate function based on your requirements.
na.action: Specify whether your data contain NA or missing values and how you want to handle them.
simplify: If TRUE, results are simplified to a vector or matrix where possible.
formula: Depending on the subsetting requirement, formulas like y ~ x or cbind(y1, y2) ~ x1 + x2 can be used.
data: The data frame containing the variables to subset and aggregate.
drop: Use if you want to drop unused combinations of grouping values.
subset: Use to limit the operation to certain observations.
Lastly, to name a few, you can use sum, count, first, last, mean, median, min, max, and sd functions with the aggregate function. If you think this list should have more function names, then please add a comment in the comment box below.
Aggregation using by() function
The by function is different from aggregate in that it is a wrapper for the tapply function, meaning it encapsulates tapply's functionality. The return value from by depends on the simplify argument: if simplify is TRUE it returns a list or array, otherwise it always returns a list. Here is an example of the by function, in which we find the sum of the price variable by sector.
All other arguments of this function are self-explanatory except INDICES, which is a factor or a list of factors.
by(data=financials$Price, INDICES = financials$Sector, FUN = sum, simplify = TRUE)
There is one more function I would like to mention here that you can use to aggregate data. Function sweep returns an array by sweeping out summary statistics. This specific function may not fit into all kinds of aggregation requirements but worth mentioning if you want to do some operation while summarising data. This function wouldn't be my first choice though it may fit into yours. Here is an example.
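The exact call is not reproduced here; based on the description that follows (STATS set to 0 and sum as the function, applied to Price), it was presumably along these lines:

# Presumed form of the sweep example discussed below: with STATS = 0 and
# FUN = sum, this simply returns the total of the Price variable.
sweep(financials["Price"], MARGIN = 2, STATS = 0, FUN = sum)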
The function above returns the sum of the variable Price. Over the years, I haven't used the sweep function in my work; if you know an aggregation scenario where this function would be most useful, then please comment in the comment box below. All other arguments are obvious except STATS. My understanding of this argument is that it is the value applied to the variable in conjunction with the function argument. So, I have set STATS to 0 and used sum as the function, which results in the total of the Price variable. If I set STATS to 1, then every value of the Price variable would be increased by one, and the total would be different.
Though the sweep function doesn't produce grouped or subsetted data like the others, I thought it was worth mentioning. Similarly, other functions from the apply family can be used to aggregate data, but you may run into limitations around grouping or performance. You can achieve aggregation results using these functions, but there are better and faster options available, and from this point onwards we will concentrate on those.
Aggregation using dplyr
Aggregation using group_by and summarize functions
The three functions we have seen above may fit different aggregation scenarios, but in terms of ease of use I prefer the group_by and summarize functions of the dplyr package. The performance of these functions is subjective and depends on the size of the data and the operation. I have personally never benchmarked them, though wherever I have read about it, group_by is praised for its performance. Let's look at an example of aggregating data using the group_by function.
The scenario here is to "find out the total price of all shares by all sectors".
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    Sector                      total.price
##    <chr>                             <dbl>
##  1 Consumer Discretionary           10419.
##  2 Consumer Staples                  2712.
##  3 Energy                            1852.
##  4 Financials                        6056.
##  5 Health Care                       8083.
##  6 Industrials                       7831.
##  7 Information Technology            8347
##  8 Materials                         2560.
##  9 Real Estate                       2928.
## 10 Telecommunication Services         101.
## 11 Utilities                         1545.
Simply supplying the variable name within the group_by function is enough to set the grouping of observations by the value of the selected variable. The summarize function is the engine where the data is operated upon using the function of your choice. In the example above, I have chosen the sum function which is applied over grouped observation. This strategy or method of aggregation is known as split-apply-combine. First, we are splitting the observations into groups followed by applying a function and then combining results to present.
If you were observant, you may have noticed a warning as well. This warning is a hint to remind you how you want to control the grouping of the output. You can suppress it with options(dplyr.summarise.inform = FALSE). For more information, please click this link.
Also, did you notice how the results are printed above?
Nothing is wrong with that; it is because the return value is a tibble. You can use the print.data.frame() function to print it in a nicer format. Here is an example.
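One possible form of such a call:

financials %>%
  group_by(Sector) %>%
  summarize(total.price = sum(Price)) %>%
  print.data.frame()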
Lastly, did you notice that the summarised variable is now named total.price? Simply putting the new name in front of the aggregation function does the trick. It is not mandatory, but I am sure you wouldn't like the default name.
Aggregation by group_by and summarize functions with multiple variables and functions
Let's progress with our scenario and find out the total price, minimum 52-week low price and maximum 52-week high price of all shares by all sectors. The aim here is to show the use of multiple variables with multiple aggregation functions, as sketched below.
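One way this might be written (the exact code may differ from the original):

financials %>%
  group_by(Sector) %>%
  summarize(total.price = sum(Price),
            min.52.week.low = min(X52.Week.Low),
            max.52.week.high = max(X52.Week.High))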
Aggregation using group_by and summarize_at function
summarize_at, on the other hand, helps you aggregate more than one variable in one go. The requirement is that the variables are supplied as a vector of names or via vars().
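A sketch of such a call (one possible form):

financials %>%
  group_by(Sector) %>%
  summarize_at(vars(Price, X52.Week.Low, X52.Week.High), sum)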
Aggregation using group_by and summarize_all function
Lastly, the summarize_all function helps aggregate all the variables in the data frame, and we can use multiple aggregation functions as well. Note that our data frame has two character variables, so let's prepare the data before using summarize_all to show the total and minimum values of every variable across the data frame; a sketch follows.
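One possible form, dropping the non-grouping character variable first:

financials %>%
  select(-Symbol) %>%
  group_by(Sector) %>%
  summarize_all(list(total = sum, min = min))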
Aggregation using data.table

Aggregation using the data table is a step further from dplyr in terms of simplicity. We write the aggregation code as if we are subsetting the data frame. Before we progress, let's convert our data frame into a data table.
financials <- setDT(financials)
class(financials)
## [1] "data.table" "data.frame"
Aggregating single variable using data table
Let's look at an example where the scenario is to find out the total price of all shares by all sectors.
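A sketch of the data.table call for this scenario:

financials[, sum(Price), by = Sector]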
Assigning a custom name to an aggregated variable in data table
The result above shows the aggregated value by the selected variable. However, the aggregated variable's name is not friendly. Let's give it a proper name.
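One way to do this, wrapping the aggregation in .() so the resulting column can be named:

financials[, .(Total.Price = sum(Price)), by = Sector]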
Aggregating multiple variables using multiple aggregation functions in data table
What we have seen until now is one custom-named variable aggregated by another variable. Let's take it further and see if we can apply more than one aggregation function over more than one variable. The scenario is to find out the total price and minimum 52-week low price of all shares by all sectors.
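A sketch for this scenario:

financials[, .(Total.Price = sum(Price), Min.52.Week.Low = min(X52.Week.Low)), by = Sector]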
Aggregating variables as well as filtering observations in data table
What if we want to aggregate as well as filter the observations by supplying specific conditions? data.table offers a solution to this. Here is an example.
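The exact condition used in the original is not shown; as an illustration, the i part of the call does the filtering:

financials[Price > 100, .(Total.Price = sum(Price)), by = Sector]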
Counting the number of observations within a group in data table
Although we can use different aggregation functions, what if we want to count the observations within a group? We can do this using the .N special variable.
financials[,.N, by=Sector]
##                         Sector  N
##  1:                Industrials 67
##  2:                Health Care 61
##  3:     Information Technology 70
##  4:     Consumer Discretionary 84
##  5:                  Utilities 28
##  6:                 Financials 68
##  7:                  Materials 25
##  8:                Real Estate 33
##  9:           Consumer Staples 34
## 10:                     Energy 32
## 11: Telecommunication Services  3
If the requirement is to group observations using multiple variables, then the following syntax can be used for by argument.
by=.(variable to group 1, variable to group 2)
Aggregation using sqldf
I love sqldf; it makes things easy, but if you are not familiar with SQL it may be a little challenging. Aggregation in sqldf requires us to understand SQL, so I will list just one example here. SQL is an easy language to learn, and I will soon be posting articles on SQL.
Let's define our scenario, find out the total price and minimum 52-week low price from all shares by all sectors where aggregated total price is greater than 5000. This scenario includes more than one variable with the use of more than one aggregation function as well as filtration criterion over grouped observations.
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector having sum(Price) > 5000')
Sorting in R

Why sort data?

It is a fact that the human brain looks for patterns. Identifying patterns is crucial while performing analysis, and sorted data helps a lot in finding them. Sorting helps in understanding the data and, moreover, in profiling it. Sorting puts data into an order that can provide information to support your analysis.
Sorting data using base R
Sorting data using order function
In base R, the order function helps sort the data. Its usage is straightforward. Here is an example.
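A sketch of the kind of call described in the next paragraph (subsetting Symbol and Price and ordering by Symbol):

head(financials[order(financials$Symbol), c("Symbol", "Price")])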
Sorting data in descending order using order function
We are subsetting the variables Symbol and Price from the data frame financials and sorting the output by the Symbol variable's values. If we want to sort the data in descending order, the desc function comes to our rescue, as sketched below.
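One possible form; note that desc() comes from dplyr, so the package needs to be loaded:

library(dplyr)
head(financials[order(desc(financials$Symbol)), c("Symbol", "Price")])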
Sorting using data.table

data.table has two functions, order and setorder, to help sort the data. Here is an example; but first convert the data frame to a data table and then sort.
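A sketch of both approaches:

financials <- setDT(financials)
# order() inside [] returns a sorted copy of the data
head(financials[order(Symbol), c("Symbol", "Price")])
# setorder() sorts the data table in place, by reference
setorder(financials, Symbol)
# for descending order, prefix the column with a minus sign
setorder(financials, -Symbol)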
Sorting using sqldf

Data sorting in sqldf is achieved simply by using the "order by" clause in the SQL command.
financials <- read.csv("constituents-financials_csv.csv")
sqldf('select Sector, sum(Price), min("X52.Week.Low") from financials group by Sector order by Sector')
In closing, you may have noticed that we have not discussed NA values at all. Almost all aggregation functions accept an na.rm argument. Please try it out and comment if you find a particular function that does not support it.
Well done, you've made it to the end. Thank you for reading this article; I hope you've liked it. Please share it with your friends, and if you agree or disagree with any point above then please comment in the comment box below.
[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers].
Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model:
Assume we want to estimate the causal effect beta of x on y. However, we have an unobserved confounder z that affects both x and y. If we don't add the confounder z as a control variable in the regression of y on x, the OLS estimator of beta will be biased. That is the so-called omitted variable bias.
Let's simulate a data set and illustrate the omitted variable bias:
n = 10000
alpha = beta = gamma = 1
z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
# Estimate short regression with z omitted
coef(lm(y~x))[2]
##        x
## 1.486573
While the true causal effect beta is equal to 1, our OLS estimator where we omit z is around 1.5. This means it has a positive bias of roughly 0.5.
Before we continue, let's have a quiz (click here if the Google form quiz is not correctly shown.):
Let's see what happens if we increase the impact of the confounder z on x, say to alpha=1000.
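The post shows this as runnable output; a sketch of the rerun (the exact estimate will vary with the random draw) is:

alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
# Short regression with z omitted again
coef(lm(y~x))[2]
# With alpha this large, the estimate lands very close to the true beta = 1,
# since the bias is now only roughly gamma/alpha = 0.001.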
This result surprised me at first. I previously had the following intuition: An omitted variable is only a problem if it affects both y and x. Thus the omitted variable bias probably becomes worse if the confounder z affects y or x more strongly. While this intuition is correct for small alpha, it is wrong once alpha is sufficiently large.
For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:
[asy. \; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}] (From now on, I use Mathjax. If you read on a blog aggregator where Mathjax is not well rendered click here.)
Let's plot the bias for different values of $\alpha$:
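The post renders the result as a figure; a sketch of how the curve can be computed directly from the asymptotic bias formula above (assuming Var(z) = Var(eps_x) = 1 and gamma = 1, as in the simulation) is:

alpha.grid = seq(0, 10, by = 0.1)
gamma = 1; var.z = 1; var.eps.x = 1
bias = gamma * alpha.grid * var.z / (alpha.grid^2 * var.z + var.eps.x)
plot(alpha.grid, bias, type = "l",
     xlab = "alpha", ylab = "asymptotic bias of the OLS estimator of beta")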
For small $\alpha$ the bias of $\hat \beta$ first quickly increases in $\alpha$. But it decreases in $\alpha$ once $\alpha$ is larger than 1. Indeed the bias then slowly converges back to 0.
Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder mainly affects $y$ through $x$. The larger is $\alpha$, the relatively less important is therefore the direct effect of $z$ on $y$. The direct effect from $z$ on $y$ will thus bias the OLS estimator $\hat \beta$ of the short regression less and less.
Typical presentation of the omitted variable bias formula
Note that the omitted variable bias formula is usually presented as follows:
[ Bias(\hat \beta) = \gamma \hat \delta] where $\hat \delta$ is the OLS estimate of the linear regression [z = const + \delta x + u ]
(This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows one to compute the bias, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as [ x=\tilde{const} + \frac 1 \delta z + \tilde u ] suggesting $\alpha \approx \frac 1 \delta$ and thus an approximate bias of $\frac \gamma \alpha$. (This argumentation is just suggestive and not fully correct. The effects of swapping y and x in a simple linear regression can be a bit surprising; see my previous post.)
If we look at our previous formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get [\lim_{Var(\varepsilon_x)\rightarrow 0 } asy. \; Bias(\hat \beta) = \frac \gamma\alpha]
However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.
[This article was first published on business-science.io, and kindly contributed to R-bloggers].
High-Performance Time Series Forecasting Course is an amazing course designed to teach Data Scientists and Business Analysts the business forecasting tools and techniques that win time series competitions and deliver your organization results! We've combined an innovative program with a clear-cut path to forecasting. You'll undergo a complete transformation learning the most in-demand skills that organizations need right now. Time to accelerate your career!
Create a free account to get the launch discount in 7 Days!
High-Performance Time Series Improve your forecasting abilities and accelerate your career
Watch our 2-minute course video that describes the innovative, 3-part roadmap for learning high-performance time series forecasting.
Watch Your Journey (2-Minute Video) – A Streamlined Path to High-Performance Forecasting
Learn from Time Series Competitions Analyze the winning submissions
High-Performance Time Series Forecasting is a state-of-the-art course designed to teach data scientists and business analysts how to apply the latest forecasting techniques to their businesses by learning from the strategies that won 4 Key Time Series Competitions.
A Project-Based Course Learn By Completing Real Business Projects with Real Business Data
This is a project-based course, and here's the scenario. You are forecasting data related to key performance indicators for an e-learning company. Your primary goals are to improve sales demand forecasting by 10% or more and to set up a system for forecasting email subscriptions.
To complete these projects, you will need to learn the state-of-the-art time series forecasting techniques along with time-series feature engineering and working with external data sources. You'll learn 3 key software libraries along the way!
The Curriculum: Learn An End-To-End Forecasting Process
You begin by learning the core data science skills needed to process, transform, visualize, and feature engineer time series data in preparation for machine learning.
Competition Overview – 4 competitions and strategies that won – 30 min
TS Jumpstart – 1+ Hours
Visualization – 1.5 hours
Data Wrangling – 1.5 Hours
Transformations – 1.5 Hours
Challenge 1 – Investigating Revenue and Google Analytics Data – 30 min
[This article was first published on R Views, and kindly contributed to R-bloggers].
Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, part of the Carr Center for Human Rights Policy at the Harvard Kennedy School. He has used R to work at the intersection of social science and data science for nearly two decades.
Where are people in the United States protesting in 2020, and what are they protesting about? How large have those crowds been? How many protesters have been arrested or injured? And how does this year's groundswell of protest activity compare to the past several years, which had already produced some of the largest single-day gatherings in U.S. history?
These are the kinds of questions the Crowd Counting Consortium (CCC) Crowd Dataset helps answer. Begun after the 2017 Women's March by Professors Erica Chenoweth (Harvard University) and Jeremy Pressman (University of Connecticut), the CCC's database on political crowds has grown into one of the most comprehensive open sources of near-real time information on protests, marches, demonstrations, strikes, and similar political gatherings in the contemporary United States. At the start of August 2020, the database included nearly 50,000 events. These data have been used in numerous academic and press pieces, including a recent New York Times story on the historic scale of this year's Black Lives Matter uprising.
As rich as the data are, they have been a challenge to use. The CCC shares its data on political crowds via a stack of monthly Google Sheets with formats that can vary from sheet to sheet in small but confounding ways. Column names don't always match, and certain columns have been added or dropped over time. Some sheets include separate tabs for specific macro-events or campaigns (e.g., a coordinated climate strike), while others group everything in a single sheet. And, of course, typos happen.
To make this tremendous resource more accessible to researchers, activists, journalists, and data scientists, the Nonviolent Action Lab at Harvard's Carr Center for Human Rights Policy—a new venture started by CCC co-founder Chenoweth—has created a GitHub repository to host a compiled, cleaned, and augmented version of the CCC's database.
In addition to all the information contained in the monthly sheets, the compiled version adds two big feature sets.
Geolocation. After compiling the data, the Lab uses the googleway package to run the cities and towns in which the events took place through Google Maps's Geolocation API and extracts geocoordinates from the results, along with clean versions of the locality, county, and state names associated with them.
Issue Tags. In the original data, protesters' claims—in other words, what the protest is about—are recorded by human coders in a loosely structured way. To allow researchers to group or filter the data by theme, the Nonviolent Action Lab maintains a dictionary of a few dozen major political issues in the U.S. (e.g., "guns", "migration", "racism") and keyword– and keyphrase-based regular expressions associated with them. By mapping this dictionary over the claim strings, we generate a set of binary issue tags that gets added to the compiled database as a vector of semicolon-separated strings.
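As an illustration only (the Lab's actual dictionary and matching code are not shown here, so the patterns and claims below are made up), mapping such a dictionary over claim strings might look roughly like this:

# Hypothetical, much-simplified version of the issue-tagging step
issue_dict <- list(
  guns      = "\\bgun(s)?\\b|firearm|second amendment",
  migration = "immigra|migrant|border|asylum",
  racism    = "racis|black lives matter|\\bblm\\b"
)
claims <- c("march against gun violence", "rally for asylum seekers")
sapply(claims, function(claim) {
  hits <- names(issue_dict)[sapply(issue_dict, grepl, x = tolower(claim))]
  paste(hits, collapse = "; ")   # semicolon-separated tags, as in the compiled data
})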
To make the CCC data more accessible to a wider audience, the Lab has also built a Shiny dashboard that lets users filter events in various ways and then map and plot the results. Users can filter by date range, year, or campaign as well as issue and political valence (pro-Trump, anti-Trump, or neither).
The dashboard has two main tabs. The first uses the leaflet package to map the events with markers that contain summary details and links to the source(s) the human coders used to research them.
The second tab uses plotly and the streamgraph html widget package to render interactive plots of trends over time in the occurrence of the selected events, the number of participants in them, and the political issues associated with them.
The point of the Nonviolent Action Lab's repository and dashboard is to make the Crowd Counting Consortium's data more accessible and more useful to as wide an audience as possible. If you use either of these resources and find bugs or errors or have suggestions on how to improve them, please let us know.
To provide feedback on the compiled data set or the Shiny dashboard, please open an issue on the GitHub repository, here.
If you think you see an error in the CCC's data or know about an event that it doesn't cover, please use this form to submit a correction or record via the CCC website.
[This article was first published on R Blogs, and kindly contributed to R-bloggers].
Introduction
Hello Everyone! It's been a while since I posted something new. I completed this Virtual Experience Program a month back and thought I would write up the solutions to all three tasks for anyone who might be looking for the same.
This post is specifically about Task 1 – Data Preparation and Customer Analytics
You are initially provided with solution modules, but some parts of the code are still a bit tough to crack. I did it my way, but there could surely be various other solutions to the same problem.
##                                     PROD_NAME    N
##   1:      Natural Chip Compny SeaSalt175g       1468
##   2:      CCs Nacho Cheese 175g                 1498
##   3:      Smiths Crinkle Cut Chips Chicken 170g 1484
##   4:      Smiths Chip Thinly S/Cream&Onion 175g 1473
##   5:      Kettle Tortilla ChpsHny&Jlpno Chili 150g 3296
##  ---
## 110:      Red Rock Deli Chikn&Garlic Aioli 150g 1434
## 111:      RRD SR Slow Rst Pork Belly 150g       1526
## 112:      RRD Pc Sea Salt 165g                  1431
## 113:      Smith Crinkle Cut Bolognese 150g      1451
## 114:      Doritos Salsa Mild 300g               1472
A word of caution: most of the products in PROD_NAME are chips, but there may be some that are not.
Looks like we are definitely looking at potato chips, but how can we check that these are all chips? We can do some basic text analysis by summarising the individual words in the product names.
#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips
productWords <- data.table(unlist(strsplit(unique(transaction_data[, PROD_NAME]), " ")))
setnames(productWords, 'words')
As we are only interested in words that will tell us if the product is chips or not, let's remove all words with digits and special characters such as '&' from our set of product words. We can do this using grepl().
library(stringr)
library(stringi)
#### Removing special characters
productWords$words <- str_replace_all(productWords$words,"[[:punct:]]"," ")
#### Removing digits
productWords$words <- str_replace_all(productWords$words,"[0-9]"," ")
productWords$words <- str_replace_all(productWords$words,"[gG]"," ")
#### Let's look at the most common words by counting the number of times a word appears and
#### sorting them by this frequency in order of highest to lowest frequency
wordsSep <- strsplit(productWords$words," ")
words.freq <- table(unlist(wordsSep))
words.freq <- as.data.frame(words.freq)
words.freq <- words.freq[order(words.freq$Freq, decreasing = T),]
words.freq
Next, we can use summary() to check summary statistics such as mean, min and max values for each feature to see if there are any obvious outliers in the data and if there are any nulls in any of the columns (NA's : number of nulls will appear in the output if there are any nulls).
summary(transaction_data)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID
##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1
##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569
##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135183
##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135531   Mean   : 135131
##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203084   3rd Qu.: 202654
##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841
##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES
##  Min.   :  1.00   Length:246742      Min.   :  1.000   Min.   :  1.700
##  1st Qu.: 26.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.800
##  Median : 53.00   Mode  :character   Median :  2.000   Median :  7.400
##  Mean   : 56.35                      Mean   :  1.908   Mean   :  7.321
##  3rd Qu.: 87.00                      3rd Qu.:  2.000   3rd Qu.:  8.800
##  Max.   :114.00                      Max.   :200.000   Max.   :650.000
There are no nulls in the columns, but product quantity appears to have an outlier which we should investigate. Let's look at the case where 200 packets of chips are bought in one transaction.
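The filtering step itself is not shown here; a sketch of what it might look like, using the data.table syntax used elsewhere in this analysis:

#### Inspect the outlier transaction(s) with 200 packets
transaction_data[PROD_QTY == 200]
#### Look at all transactions for that loyalty card
outlier_card <- transaction_data[PROD_QTY == 200, unique(LYLTY_CARD_NBR)]
transaction_data[LYLTY_CARD_NBR %in% outlier_card]
#### Remove this customer from further analysis
transaction_data <- transaction_data[!(LYLTY_CARD_NBR %in% outlier_card)]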
It looks like this customer has had only two transactions over the whole year and is not an ordinary retail customer; they might be buying chips for commercial purposes instead. We'll remove this loyalty card number from further analysis and then re-count the transactions by date.
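A sketch of the removal and the daily count that produces the summary below (countByDate is the name used in the later plotting code; the dplyr::count() call is my assumption, chosen because it reproduces the odd column name `transaction_data$DATE` seen in the output):
#### Remove the commercial customer identified above
transaction_data <- transaction_data[!(LYLTY_CARD_NBR %in% outlier_card), ]
#### Count the number of transactions per day
countByDate <- data.table(dplyr::count(transaction_data, transaction_data$DATE))
summary(countByDate)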
## transaction_data$DATE n
## Min. :2018-07-01 Min. :607.0
## 1st Qu.:2018-09-29 1st Qu.:658.0
## Median :2018-12-30 Median :674.0
## Mean :2018-12-30 Mean :677.9
## 3rd Qu.:2019-03-31 3rd Qu.:694.2
## Max. :2019-06-30 Max. :865.0
There are only 364 rows, i.e. only 364 distinct dates, which indicates a missing date. Let's create a sequence of dates from 1 Jul 2018 to 30 Jun 2019 and use it to chart the number of transactions over time and find the missing date.
Create a sequence of dates and join it to the count of transactions by date (sketched below).
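A sketch of that join (allDates is a helper name I'm introducing; renaming its column to the odd `transaction_data$DATE` keeps it compatible with the plotting code below):
#### All dates from 1 Jul 2018 to 30 Jun 2019
allDates <- data.table(seq(as.Date("2018-07-01"), as.Date("2019-06-30"), by = "day"))
setnames(allDates, "transaction_data$DATE")
#### Left join so the missing date shows up as an NA count
countByDate <- merge(allDates, countByDate, all.x = TRUE)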
transOverTime <- ggplot(countByDate, aes(x = countByDate$`transaction_data$DATE`, y = countByDate$n)) +
     geom_line() +
     labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
     scale_x_date(breaks = "1 month") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
transOverTime
We can see that there is an increase in purchases in December and a break in late December. Let's zoom in on this.
Filter to December and look at individual days
filterData <- countByDate[countByDate$`transaction_data$DATE` >= "2018-12-01" &
                               countByDate$`transaction_data$DATE` <= "2018-12-31"]
ggplot(filterData, aes(x = filterData$`transaction_data$DATE`, y = filterData$n)) +
     geom_line() +
     labs(x = "Day", y = "Number of transactions", title = "Transactions in December") +
     scale_x_date(breaks = "1 day") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
We can see that the increase in sales occurs in the lead-up to Christmas and that there are zero sales on Christmas day itself. This is due to shops being closed on Christmas day.
Now that we are satisfied that the data no longer has outliers, we can move on to creating other features such as brand of chips or pack size from PROD_NAME. We will start with pack size.
Pack size
A new column PACK_SIZE is added to transaction_data.
library(readr) # for parse_number()
#### We can work this out by taking the digits that are in PROD_NAME
transaction_data[, PACK_SIZE := parse_number(PROD_NAME)]
#### Let's check if the pack sizes look sensible
df_packSizeVsTransactions <- transaction_data[, .N, PACK_SIZE][order(PACK_SIZE)]
df_packSizeVsTransactions
The largest size is 380g and the smallest size is 70g – seems sensible!
Let's plot a histogram of PACK_SIZE since we know that it is a categorical variable and not a continuous variable even though it is numeric.
# Alternative line chart of transactions by pack size:
# ggplot(df_packSizeVsTransactions, aes(x = PACK_SIZE, y = N)) +
#      geom_line() +
#      labs(x = "Pack Sizes", y = "Number of transactions", title = "Transactions by pack size") +
#      scale_x_continuous(breaks = seq(70, 390, 20)) +
#      theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
hist(transaction_data[, PACK_SIZE])
Pack sizes created look reasonable.
Brands
Now to create brands, we can use the first word in PROD_NAME to work out the brand name…
# Create a column which contains the brand of the product, by extracting it from the product name.
transaction_data$BRAND <- gsub("([A-Za-z]+).*", "\\1", transaction_data$PROD_NAME)
transaction_data[, .N, by = BRAND][order(-N)]
data <- merge(transaction_data, purchase_beahviour, all.x = TRUE)
As the number of rows in data is the same as in transaction_data, we can be sure that no duplicates were created. We built data with all.x = TRUE (i.e. a left join): every row of transaction_data is kept, and matching rows from the customer dataset are joined to it on the shared columns.
Let's also check whether any customers were not matched, by looking for nulls in the joined columns (a quick check is sketched below).
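A minimal check, assuming the customer columns are LIFESTAGE and PREMIUM_CUSTOMER as used later in the analysis:
#### Zero counts mean every transaction matched a loyalty card
sum(is.na(data$LIFESTAGE))
sum(is.na(data$PREMIUM_CUSTOMER))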
Great, there are no nulls! All customers in the transaction data are accounted for in the customer dataset.
For Task 2, we write this dataset into a csv file
write.csv(data,"QVI_data.csv")
Data exploration is now complete!
Data analysis on customer segments
Now that the data is ready for analysis, we can define some metrics of interest to the client:
– Who spends the most on chips (total sales), describing customers by lifestage and how premium their general purchasing behaviour is
– How many customers are in each segment
– How many chips are bought per customer by segment
– What the average chip price is by customer segment
We could also ask our data team for more information. Examples are:
– The customer's total spend over the period and total spend for each transaction, to understand what proportion of their grocery spend is on chips
– The proportion of customers in each customer segment overall, to compare against the mix of customers who purchase chips
Let's start by calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and plotting the split.
Total sales by LIFESTAGE and PREMIUM_CUSTOMER
library(dplyr)
library(ggmosaic)

total_sales <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
pf.total_sales <- summarise(total_sales, sales_count = sum(TOT_SALES))
summary(pf.total_sales)
#### Create plot
p <- ggplot(pf.total_sales) +
     geom_mosaic(aes(weight = sales_count, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
     labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of sales") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
#### Add percentage labels using the positions computed for the mosaic
p + geom_text(data = ggplot_build(p)$data[[1]],
              aes(x = (xmin + xmax)/2, y = (ymin + ymax)/2,
                  label = as.character(paste(round(.wt/sum(.wt), 3)*100, '%'))),
              inherit.aes = FALSE)
Sales come mainly from Budget – older families, Mainstream – young singles/couples, and Mainstream – retirees.
Number of customers by LIFESTAGE and PREMIUM_CUSTOMER
Let's see if the higher sales are due to there being more customers who buy chips.
total_sales <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
no_of_customers <- summarise(total_sales, customer_count = length(unique(LYLTY_CARD_NBR)))
summary(no_of_customers)
#### Create plot (build p first, then use its computed positions for the labels)
p <- ggplot(data = no_of_customers) +
     geom_mosaic(aes(weight = customer_count, x = product(PREMIUM_CUSTOMER, LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
     labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of customers") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p + geom_text(data = ggplot_build(p)$data[[1]],
              aes(x = (xmin + xmax)/2, y = (ymin + ymax)/2,
                  label = as.character(paste(round(.wt/sum(.wt), 3)*100, '%'))),
              inherit.aes = FALSE)
There are more Mainstream – young singles/couples and Mainstream – retirees who buy chips. This contributes to there being more sales to these customer segments but this is not a major driver for the Budget – Older families segment.
Higher sales may also be driven by more units of chips being bought per customer.
Let's have a look at this next.
Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
total_sales_1 <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
units <- summarise(total_sales_1, units_count = sum(PROD_QTY)/uniqueN(LYLTY_CARD_NBR))
summary(units)
#### Create plot
ggplot(data = units, aes(weight = units_count, x = LIFESTAGE, fill = PREMIUM_CUSTOMER)) +
     geom_bar(position = position_dodge()) +
     labs(x = "Lifestage", y = "Avg units per customer", title = "Units per customer") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Older families and young families in general buy more chips per customer.
Average price per unit by LIFESTAGE and PREMIUM_CUSTOMER
Let's also investigate the average price per unit chips bought for each customer segment as this is also a driver of total sales.
total_sales_2 <- data %>% group_by(LIFESTAGE, PREMIUM_CUSTOMER)
pricePerUnit <- summarise(total_sales_2, price_per_unit = sum(TOT_SALES)/sum(PROD_QTY))
#### Plot
ggplot(data = pricePerUnit, aes(weight = price_per_unit, x = LIFESTAGE, fill = PREMIUM_CUSTOMER)) +
     geom_bar(position = position_dodge()) +
     labs(x = "Lifestage", y = "Avg price per unit", title = "Price per unit") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Mainstream midage and young singles/couples are willing to pay more per packet of chips than their budget and premium counterparts. This may be because premium shoppers are more likely to buy healthy snacks, and when they do buy chips it is mainly for entertaining rather than for their own consumption. This is also supported by there being fewer premium midage and young singles/couples buying chips than their mainstream counterparts.
As the difference in average price per unit isn't large, let's check whether it is statistically significant.
Perform an independent t-test between mainstream vs premium and budget midage and young singles and couples
# If this p-value is above .05, there is no significant difference in unit price between the groups.
pricePerUnit <- data[, price := TOT_SALES/PROD_QTY]
t.test(data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") &
                 PREMIUM_CUSTOMER == "Mainstream", price],
       data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") &
                 PREMIUM_CUSTOMER != "Mainstream", price],
       alternative = "greater")
##
## Welch Two Sample t-test
##
## data: data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER == "Mainstream", price] and
##       data[LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES") & PREMIUM_CUSTOMER != "Mainstream", price]
## t = 37.624, df = 54791, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.3187234 Inf
## sample estimates:
## mean of x mean of y
## 4.039786 3.706491
The t-test results in a p-value < 2.2e-16, i.e. the unit price for mainstream young and midage singles and couples is significantly higher than that of budget or premium young and midage singles and couples.
Deep dive into specific customer segments for insights
We have found quite a few interesting insights that we can dive deeper into. We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let's look at Mainstream – young singles/couples. For instance, let's find out if they tend to buy a particular brand of chips.
Deep dive into Mainstream, young singles/couples
#### Deep dive into Mainstream, young singles/couples
segment1 <- data[LIFESTAGE == "YOUNG SINGLES/COUPLES" & PREMIUM_CUSTOMER == "Mainstream", ]
other <- data[!(LIFESTAGE == "YOUNG SINGLES/COUPLES" & PREMIUM_CUSTOMER == "Mainstream"), ]
#### Brand affinity compared to the rest of the population
quantity_segment1 <- segment1[, sum(PROD_QTY)]
quantity_other <- other[, sum(PROD_QTY)]
quantity_segment1_by_brand <- segment1[, .(targetSegment = sum(PROD_QTY)/quantity_segment1), by = BRAND]
quantity_other_by_brand <- other[, .(other = sum(PROD_QTY)/quantity_other), by = BRAND]
brand_proportions <- merge(quantity_segment1_by_brand, quantity_other_by_brand)[, affinityToBrand := targetSegment/other]
brand_proportions[order(-affinityToBrand)]
ggplot(brand_proportions, aes(BRAND, affinityToBrand)) +
     geom_bar(stat = "identity", fill = "yellow") +
     labs(x = "Brand", y = "Customers Affinity to Brand", title = "Favorite brands of Customers") +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
We can see that:
• Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population
• Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest of the population
Let's also find out if our target segment tends to buy larger packs of chips.
Preferred pack size compared to the rest of the population
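The code for this section isn't included in the write-up; a sketch analogous to the brand-affinity calculation above, reusing segment1, other and the quantity totals (pack_proportions and affinityToPack are names I'm introducing):
quantity_segment1_by_pack <- segment1[, .(targetSegment = sum(PROD_QTY)/quantity_segment1), by = PACK_SIZE]
quantity_other_by_pack <- other[, .(other = sum(PROD_QTY)/quantity_other), by = PACK_SIZE]
pack_proportions <- merge(quantity_segment1_by_pack, quantity_other_by_pack)[, affinityToPack := targetSegment/other]
pack_proportions[order(-affinityToPack)]
An affinity above 1 means the target segment buys that pack size proportionally more often than the rest of the population.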
Let's recap what we've found! Sales have mainly come from Budget – older families, Mainstream – young singles/couples, and Mainstream – retirees. The high spend on chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers. Mainstream midage and young singles and couples are also more likely to pay more per packet of chips, which is indicative of impulse buying behaviour.
We've also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category's performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near areas that young singles and couples frequent more often, to increase visibility and impulse purchases.
Quantium can help the Category Manager with recommendations of where these segments are and further help them with measuring the impact of the changed placement. We'll work on measuring the impact of trials in the next task and putting all these together in the third task.
Outro
Task 1 was basic but crucial: all the foundational analysis and visualizations were built here, and the general patterns and trends in customers' choices were explored through those visualizations.
The solutions to the other tasks will be uploaded soon. Until then, any feedback, queries or recommendations are appreciated on any of my social media handles.
One topic I haven't discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we have unanticipated errors a map() or lapply() loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can feel pretty frustrating, since the whole process has to be restarted.
How to deal with errors? Using functions try() or tryCatch() when building a function is the traditional way to catch and address potential errors. In the past I've struggled to remember how to use these, though, and functions possibly() and safely() from package purrr are convenient alternatives that I find a little easier to use.
In this post I'll show examples on how to use these two functions for handling errors. I'll also demonstrate the use of the related function quietly() to capture other types of output, such as warnings and messages.
The functions I'm highlighting today are from package purrr. I'll also use lme4 for fitting simulated data to linear mixed models.
library(purrr) # v. 0.3.4
library(lme4)  # v. 1.1-23
Using possibly() to return values instead of errors
When doing a repetitive task like fitting many models with a map() loop, an error in one of the models will shut down the whole process. We can anticipate this issue and bypass it by defining a value to return if a model errors out via possibly().
I created the very small dataset below to demonstrate the issue. The goal is to fit a linear model of y vs x for each group. I made exactly two groups here, a and b, to make it easy to see what goes wrong and why. Usually we have many more groups and potential problems can be harder to spot.
dat = structure(list(group = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                     x = c("A", "A", "A", "B", "B", "B", "A", "A", "A"),
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)),
                class = "data.frame", row.names = c(NA, -9L))
dat
#   group x    y
# 1     a A 10.9
# 2     a A 11.1
# 3     a A 10.5
# 4     a B  9.7
# 5     a B 10.5
# 6     a B 10.9
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3
I'll first split the dataset by group to get a list of data.frames to loop through.
dat_split = split(dat, dat$group)
Then I'll loop through each dataset in the list with map() and fit a linear model with lm(). Instead of getting output, though, I get an error.
map(dat_split, ~lm(y ~ x, data = .x) )
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
What's going on? If you look at the dataset again, you'll see that x in group b contains only a single value. Once you know that, you can see the error is actually telling us what the problem is: we can't use a factor with only one level.
Model a fits fine, since x has two values.
lm(y ~ x, data = dat, subset = group == "a")
#
# Call:
# lm(formula = y ~ x, data = dat, subset = group == "a")
#
# Coefficients:
# (Intercept)          xB
#     10.8333     -0.4667
It is the b model that fails.
lm(y ~ x, data = dat, subset = group == "b")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
You can imagine that the problem of having only a single value of the factor in some groups could be easy to miss if you're working with a large number of groups. This is where possibly() can help, allowing us to keep going through all groups regardless of errors. We can then find and explore the problem groups.
Wrapping a function with possibly()
The possibly() function is a wrapper function. It wraps around an existing function. Other than defining the function to wrap, the main argument of interest is otherwise. In otherwise we define what value to return if we get an error from the function we are wrapping.
I make a new wrapped function called posslm1(), which wraps lm() and returns "Error" if an error occurs when fitting the model.
posslm1 = possibly(.f = lm, otherwise = "Error")
When I use posslm1() in my model-fitting loop, the loop now finishes (see the sketch below). Model b contains the string "Error" instead of a model.
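The loop itself isn't shown above; it looks something like this (a sketch reusing dat_split and posslm1; mods matches the name used further down):
mods = map(dat_split, ~posslm1(y ~ x, data = .x) )
mods$b
# [1] "Error"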
Here's another example of possibly() wrapped around lm(), this time using otherwise = NULL. Depending on what we plan to do with the output, using NULL or NA as the return value can be useful when using possibly().
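For instance (posslm2 is a name I'm using here; the otherwise value is the only change, and mods is reassigned so the later keep()/compact() examples operate on this version):
posslm2 = possibly(.f = lm, otherwise = NULL)
mods = map(dat_split, ~posslm2(y ~ x, data = .x) )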
Once the loop is done, we can examine the groups that had errors when fitting models. For example, I can use purrr::keep() to keep only the results that are NULL.
mods %>% keep(~is.null(.x) )
# $b
# NULL
This allows me to pull out the names for the groups that had errors. Getting the names in this way is one reason I like that split() returns named lists.
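Something like the following, saving the names into group_errs as used below:
group_errs = mods %>%
     keep(~is.null(.x) ) %>%
     names()
group_errs
# [1] "b"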
Once I have the names of the groups with errors, I can pull any problematic groups out of the original dataset or the split list to examine them more closely. (I use %in% here in case group_errs is a vector.)
dat[dat$group %in% group_errs, ]
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3
dat_split[group_errs]
# $b
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3
Using compact() to remove empty elements
You may come to a point where you've looked at the problem groups and decide that the models with errors shouldn't be used in further analysis. In that case, if all the groups with errors are NULL, you can use purrr::compact() to remove the empty elements from the list. This can make subsequent loops to get output more straightforward in some cases.
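For example, with the NULL-returning version of mods from above (mods_clean is a name I'm introducing here):
mods_clean = compact(mods)
names(mods_clean)
# [1] "a"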
Using safely() to capture both results and errors
Rather than replacing the errors with values, safely() returns both the results and the errors in a list. This function is also a wrapper function. It defaults to using otherwise = NULL, and I generally haven't had reason to change away from that default.
Here's an example, wrapping lm() in safely() and then using the wrapped function safelm() to fit the models.
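The wrapping and fitting step looks like this (a sketch; safelm and mods2 match the names used below):
safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )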
The output for each group is now a list with two elements, one for results (if there was no error) and the other for the error (if there was an error).
Here's what this looks like for model a, which doesn't have an error. The output contains a result but no error.
Model b didn't work, of course, so the results are NULL but the error was captured in error.
mods2[[2]]
# $result
# NULL
#
# $error
# <simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>
Exploring the errors
One reason to save the errors using safely() is so we can take a look at what the errors were for each group. This is most useful with informative errors like the one in my example.
Errors can be extracted with a map() loop, pulling out the "error" element from each group.
map(mods2, "error")
# $a
# NULL
#
# $b
# <simpleError in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels>
Extracting results
Results can be extracted similarly, and, if relevant, NULL results can be removed via compact().
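A sketch of that extraction:
map(mods2, "result") %>%
     compact()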
Using quietly() to capture messages
The quietly() function doesn't handle errors, but instead captures other types of output such as warnings and messages along with any results. This is useful for exploring what kinds of warnings come up when doing simulations, for example.
A few years ago I wrote a post showing a simulation for a linear mixed model. I use the following function, pulled from that earlier post.
twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}
One thing I skipped discussing in that post was the messages returned for some simulations. However, I can certainly picture scenarios where it would be interesting and important to capture warnings and messages to see, e.g., how often they occur even when we know the data come from the model.
Here I'll set the seed so the results are reproducible and then run the function 10 times. You see I get two messages, indicating that two of the ten models returned a message. In this case, the message indicates that the random effect variance is estimated to be exactly 0 in the model.
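The call isn't shown above; based on the description (and the seed of 16 mentioned later in the text), it looks something like:
set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )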
# boundary (singular) fit: see ?isSingular
# boundary (singular) fit: see ?isSingular
It turns out that the second model in the output list is one with a message. You can see at the bottom of the model output below that there is 1 lme4 warning.
sims[[2]]
# Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000
#  Residual             0.7469
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)
#       10.92
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings
The lme4 package stores warnings and messages in the model object, so I can pull the message out of the model object.
sims[[2]]@optinfo$conv$lme4
# $messages
# [1] "boundary (singular) fit: see ?isSingular"
But I think quietly() is more convenient for this task. This is another wrapper function, and I'm going to wrap it around lmer(). I do this because I'm focusing specifically on messages that happen when I fit the model. However, I could have wrapped twolevel_fun() and captured any messages from the entire simulation process.
I use my new function qlmer() inside my simulation function.
qlmer = quietly(.f = lmer)

qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}
I set the seed back to 16 so I get the same models and then run the function using qlmer() 10 times. Note this is considered quiet because the messages are now captured in the output by quietly() instead of printed.
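Again the call is implied rather than shown; a sketch (sims2 matches the name used below):
set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE )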
The wrapped function returns a list with 4 elements, including the results, any printed output, warnings, and messages. You can see this for the second model here.
# $result
# Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
#    Data: ..2
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000
#  Residual             0.7469
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)
#       10.92
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings
#
# $output
# [1] ""
#
# $warnings
# character(0)
#
# $messages
# [1] "boundary (singular) fit: see ?isSingular\n"
In a simulation setting, I think seeing how many times different messages and warnings come up could be pretty interesting. It might inform how problematic a message is. If a message is common in simulation we may feel more confident that such a message from a model fit to our real data is not a big issue.
For example, I could pull out all the messages and then put the results into a vector with unlist().
sims2 %>% map("messages") %>% unlist()
# [1] "boundary (singular) fit: see ?isSingular\n" # [2] "boundary (singular) fit: see ?isSingular\n"
If I wanted to extract multiple parts of the output, such as keeping both messages and warnings, I can use the extract brackets in map().
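A sketch of what that looks like:
sims2 %>%
     map(`[`, c("messages", "warnings") ) %>%
     unlist()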
These results don't look much different compared to the output above since there are no warnings in my example. However, note the result is now in a named vector so I could potentially keep track of which are messages and which are warnings if I needed to.
# messages
# "boundary (singular) fit: see ?isSingular\n"
# messages
# "boundary (singular) fit: see ?isSingular\n"
I've shown only fairly simple ways to use these three functions, but you may well find yourself using them for more complex tasks. For example, I've been in situations in the past where I wanted to keep only models that didn't have errors when building parametric bootstrap confidence intervals. If they had existed at the time, I could have used possibly() or safely() in a while() loop, where the bootstrap data would be redrawn until a model fit without error. Very useful!
Just the code, please
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.
[This article was first published on R/exams, and kindly contributed to R-bloggers.]
Experiences with transitioning to R/exams for multiple automated and randomized online exams in a soil mechanics course at Universität Innsbruck (UIBK).
Guest post by Wolfgang Fellin (Universität Innsbruck, Division of Geotechnical and Tunnel Engineering).
Background
Our exercises in soil mechanics are attended by about 120 students every year. Students receive their grades based on a final exam, with a multiple-choice part and additional small calculation exercises. Due to the COVID-19 restrictions we were forced to switch from written exams to online exams at rather short notice, for which we decided to use R/exams to randomize the exercises. The possibility that the exercises produced can be reused in the following (hopefully COVID-free) years as part of a blended learning course made the decision for the (not so small) effort of switching to R/exams easy.
Transition of multiple-choice exercises
To leverage the roughly 100 multiple-choice exercises from our old question pool, we had to convert them to the R/exams format. The pool was stored in a MySQL database including LaTeX text for the questions, answer alternatives, images for some exercises (in EPS format), and some additional meta-information regarding topic category and difficulty. Due to the standardized format of the question pool – and the kind support of Achim Zeileis and Christiane Ernst – it was possible to extract all questions, embed them into the same template for R/LaTeX questions (.Rnw), and convert all EPS figures to both SVG and PDF format. A typical example is shown below (in a screenshot from OpenOLAT, the learning management system at UIBK).
The corresponding R/LaTeX exercise file is grundwasser-1-068.Rnw which we also converted to R/Markdown for this blog post: grundwasser-1-068.Rmd. Both rely on the corresponding image file being available in the same directory or a sub-directory. Different graphics formats are used for different exam output formats: image68_1.svg, image68_1.pdf, image68_1.eps.
Essentially, the exercises consist of:
some R code for including the right graphics file (SVG vs. PDF) depending on the type of output format (HTML vs. PDF),
the (static) question text and answer alternatives in LaTeX format,
meta-information including exextra tags for our custom topic category and difficulty level,
an exshuffle tag for randomly subsampling the available answer alternatives.
New numeric exercises
While the subsampling approach in the multiple-choice exercises provides a certain level of randomization, I wanted to set up new numeric exercises in which all students have to calculate the correct solution(s) based on their personalized parameters. Therefore, I set up new numeric exercises in cloze format where several quantities have to be computed based on the same question text. A typical example is shown below.
Despite being a newcomer to both R and R/exams, setting up these exercises was quite easy because I have been using MATLAB for a long time and could quickly grasp basic R code. Also the exercise templates available online helped as did the friendly e-mail support by Achim for resolving many of the small "details" along the way.
Exam implementation
Following the R/exams tutorial on summative online exams in OpenOLAT I prepared the online exam using the exams2openolat() function. The R code I used along with a couple of comments is provided below.
## load package
library("exams")

## exercises for the exam
## (the actual exam had 16 exercises overall, not just 2)
exm <- c("grundwasser-1-068.Rnw", "num-setzung-1-steinbrenner.Rnw")

## date used as seed and file name
dat <- 20200528
set.seed(dat)

rxm <- exams2openolat(exm,
  edir = "/path/to/exercise/folder/",
  n = 200,                 ## number of random versions for each exercise
  name = paste("BmGb1-Onlinetest", dat, sep = "_"),
  points = 1,              ## all exercises yield 1 point
  cutvalue = 8,            ## threshold for passing the exam
  ## evaluation rule for multiple-choice exercises (standalone and within cloze):
  ## every wrong answer cancels a correct answer
  mchoice = list(eval = exams_eval(partial = TRUE, negative = FALSE, rule = "true")),
  cloze = list(eval = exams_eval(partial = TRUE, negative = FALSE, rule = "true")),
  solutionswitch = FALSE,  ## do not show full solution explanation
  maxattempts = 1,
  shufflesections = TRUE,  ## sequence of exercises is shuffled randomly
  navigation = "linear",   ## disable switching back and forth between exercises
  duration = 120,          ## 2 hours
  stitle = "Exercise",     ## section title (for group of exercises)
  ititle = "Question"      ## item title (for individual exercises)
)
Challenges and outlook
The main problem in producing randomized numeric exercises was finding limits for the inputs so that the outputs remain physically reasonable. Especially for newly created exercises, it is essential that colleagues act as test calculators so that errors are detected before students find them in their exams. The stress-testing facilities in R/exams can also help (sketched below).
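A minimal sketch of such a stress test, using the stresstest_exercise() function from the exams package as I understand it from the R/exams documentation (treat the exact arguments and the plot method as assumptions):
library("exams")
## re-generate one exercise many times and collect the random inputs/solutions
stress <- stresstest_exercise("num-setzung-1-steinbrenner.Rnw", n = 100)
plot(stress)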
Since the course will remain online in 2020/21 due to ongoing COVID-19 restrictions, I now also use R/exams for the practice exercises that students have to work through during the semester. For this I produced about 70 new exercises, offered in OpenOLAT as 12 weekly online tests. After COVID-19 I will use these exercises in addition to written ones, so that longer and more complex tasks can be trained.
[This article was first published on R – G-Forge, and kindly contributed to R-bloggers.]
A guide to flowcharts using my Gmisc package. The image is CC by Michael Staats.
A flowchart is a type of diagram that represents a workflow or process. The building blocks are boxes and the arrows that connect them. If you have submitted a research paper in the last 10 years, you have almost inevitably been asked to produce a flowchart of how you generated your data. While there are excellent click-and-draw tools, I have always found it much nicer to code my charts. In this post I will go through some of the abilities in my Gmisc package that make this process smoother.
Main example
Let's start with the end result of using Gmisc's boxGrob() together with connectGrob(), so that you know whether you should continue to read.
The code for this is in the vignette; a slightly adapted version looks like this:
library(Gmisc)
library(magrittr)
library(glue)

# The key boxes that we want to plot
org_cohort <- boxGrob(glue("Stockholm population",
                           "n = {pop}",
                           pop = txtInt(1632798),
                           .sep = "\n"))
eligible <- boxGrob(glue("Eligible",
                         "n = {pop}",
                         pop = txtInt(10032),
                         .sep = "\n"))
included <- boxGrob(glue("Randomized",
                         "n = {incl}",
                         incl = txtInt(122),
                         .sep = "\n"))
grp_a <- boxGrob(glue("Treatment A",
                      "n = {recr}",
                      recr = txtInt(43),
                      .sep = "\n"))
grp_b <- boxGrob(glue("Treatment B",
                      "n = {recr}",
                      recr = txtInt(122 - 43 - 30),
                      .sep = "\n"))
excluded <- boxGrob(glue("Excluded (n = {tot}):",
                         " - not interested: {uninterested}",
                         " - contra-indicated: {contra}",
                         tot = 30,
                         uninterested = 12,
                         contra = 30 - 12,
                         .sep = "\n"),
                    just = "left")

# Move boxes to where we want them
vert <- spreadVertical(org_cohort,
                       eligible = eligible,
                       included = included,
                       grps = grp_a)
grps <- alignVertical(reference = vert$grps, grp_a, grp_b) %>%
  spreadHorizontal()
vert$grps <- NULL

y <- coords(vert$included)$top +
  distance(vert$eligible, vert$included, half = TRUE, center = FALSE)
excluded <- moveBox(excluded, x = .8, y = y)

# Connect vertical arrows, skip last box
for (i in 1:(length(vert) - 1)) {
  connectGrob(vert[[i]], vert[[i + 1]], type = "vert") %>%
    print
}

# Connect included to the two last boxes
connectGrob(vert$included, grps[[1]], type = "N")
connectGrob(vert$included, grps[[2]], type = "N")

# Add a connection to the exclusions
connectGrob(vert$eligible, excluded, type = "L")

# Print boxes
vert
grps
excluded
Example breakdown
There is a lot happening and it may seem overwhelming, but if we break it down into smaller chunks it makes much more sense.
Packages
The first section is just the packages that I use. Apart from the Gmisc package with the main functions, I load magrittr for the %>% pipe, and glue, which is convenient for text generation and gives us interpolated string literals of the kind that have become standard tooling in many languages (e.g. JavaScript and Python).
Basic boxGrob() calls
After loading the packages we create each box that we want to output. Note that we save each box into a variable:
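For example, excerpting the first box from the full example above:
org_cohort <- boxGrob(glue("Stockholm population",
                           "n = {pop}",
                           pop = txtInt(1632798),
                           .sep = "\n"))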
This avoids plotting the box, which is the default action when print is called (R calls print on any object that is output to the terminal). By storing the box in a variable we can manipulate it prior to output. If we had instead written the following:
boxGrob(glue("Stockholm population", "n = {pop}", pop = txtInt(1632798), .sep="\n"))
Note that the box is generated in the middle of the image (also known as the main viewport in the grid system). We can choose how to place the box by specifying the position parameters.
boxGrob("top left", y =1, x =0, bjust =c(0, 1)) boxGrob("center", y =0.5, x =0.5, bjust =c(0.5, 0.5)) boxGrob("bottom right", y =0, x =1, bjust =c(1, 0))
Spreading and moving the boxes
While we can position the boxes exactly where we want, I have found it even more useful to move them relative to the available space. The section below does exactly this.
vert <- spreadVertical(org_cohort,
                       eligible = eligible,
                       included = included,
                       grps = grp_a)
grps <- alignVertical(reference = vert$grps, grp_a, grp_b) %>%
  spreadHorizontal()
vert$grps <- NULL

excluded <- moveBox(excluded,
                    x = .8,
                    y = coords(vert$included)$top +
                      distance(vert$eligible, vert$included, half = TRUE, center = FALSE))
The spreadVertical() call takes each element and calculates its position, aligning the first at the top and the last at the bottom. The elements in between are spread evenly throughout the available space. There are some options for how to spread the objects: the default is for the space between the boxes to be identical, but you can also have the center of each box evenly spread (see the .type parameter).
The alignVertical() call aligns the elements in relation to the reference object. In this case we find the bottom alignment using a "fake" grp_a box. As we only created this box for the alignment, we drop it afterwards with vert$grps <- NULL.
Note that all of the align/spread functions return a list with the boxes in the new positions (if you print the original boxes they will not have moved). Thus make sure you print the returned elements if you want to see the objects, just as we do at the end of the code block.
vert
grps
excluded
Moving a box
In the example we want the exclusions to be equally spaced between the eligible and included boxes, which we can do using moveBox(); it allows us to change any of the coordinates of the original box. Just as before, we save the result back into the original variable, or the box would appear not to have moved when we print it.
Here we also make use of the coords() function, which gives us access to the coordinates of a box, and distance(), which gives us the distance between boxes (center = FALSE retrieves the distance between the box edges rather than between their center points).
y <- coords(vert$included)$top +
  distance(vert$eligible, vert$included, half = TRUE, center = FALSE)
excluded <- moveBox(excluded, x = .8, y = y)
Generating the connecting arrows
Once we are done positioning all the boxes, we need to connect them with arrows using connectGrob(). The function accepts two boxes and draws a line between them. The appearance of the line is decided by the type argument. My apologies if the allowed arguments "vertical", "horizontal", "L", "-", "Z", "N" are not super intuitive; finding good names is hard. Feel free to suggest better options/explanations. Anyway, below we simply loop through the vertical boxes and plot the arrows using print(). Note that we only need to call print() within the for loop, as the others are automatically printed.
for (i in 1:(length(vert) - 1)) {
  connectGrob(vert[[i]], vert[[i + 1]], type = "vert") %>%
    print
}

connectGrob(vert$included, grps[[1]], type = "N")
connectGrob(vert$included, grps[[2]], type = "N")
connectGrob(vert$eligible, excluded, type = "L")
Short summary
So the basic workflow is
generate boxes,
position them,
connect arrows to them, and
print
Practical tips
I have found that generating the boxes at the same time as I actually perform the exclusions in my dataset keeps the risk of invalid counts to a minimum. Sometimes I generate a list that I later convert to a flowchart, but the principle is the same: make sure the graph is closely tied to your data.
If you want to style your boxes you can set options(boxGrobTxt = ..., boxGrob = ...) and all boxes will share the same styling (a sketch follows below). The fancier boxPropGrob() lets you show data splits and has even more options you may want to check out, although you usually don't need more than one boxPropGrob().
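A minimal styling sketch: the option names come from the text above, while the specific gpar() values (and the assumption that these options accept gpar objects, mirroring boxGrob's txt_gp/box_gp arguments) are my own illustration.
library(grid)  # for gpar()
options(boxGrobTxt = gpar(col = "darkblue", cex = 0.9),
        boxGrob = gpar(fill = "lightgrey"))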
[This article was first published on R Bloggers on Francisco Bischoff, and kindly contributed to R-bloggers.]
Since the beginning of the tsmp package, it was evident that a series of algorithms around the Matrix Profile would pop-up sooner or later.
After the creation of the Matrix Profile Foundation (MPF), the tsmp package had doubled the number of monthly downloads, and that is a good thing!
The current version of tsmp, as shown in the previous post, added the new Pan-Matrix Profile and introduced the Matrix Profile API, which aims to standardize high-level tools across multiple programming languages.
Now my focus will be on speed and robustness of the package. Currently I can say that the speed improvement is 25% for STAMP and STOMP.
The tsmp package will keep the current structure, but the "core" algorithms will be moved to a new dependency package called matrixprofiler.
This package will specialize in low-level (Rcpp) implementations of MASS, STAMP, STOMP, etc. This will give tsmp more flexibility to implement algorithms that use the Matrix Profile as the base for application tasks such as motif search, snippet search, or other algorithms that may appear.
matrixprofiler will be updated less often and stay very stable. Those who just want the basics can work with this package alone, but it is safer to install tsmp and have the former as a dependency.