What is the daily correlation of confirmed versus death cases in Covid-19? In other words, for the people who have passed away, how many days earlier, on average, were they reported (i.e. "Confirmed") as new Covid-19 cases?
To answer this question, we can take the correlation between daily confirmed cases and daily deaths, trying different lag values for the confirmed cases, since the assumption is that it takes some days for someone to pass away after being diagnosed with Covid-19.
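As a rough sketch of the idea (assuming a daily data frame df for Italy with columns Confirmed and Deaths; the post's actual code may differ):

library(dplyr)

# Correlation between daily deaths and daily confirmed cases lagged by k days
lag_correlations <- sapply(0:20, function(k) {
  cor(df$Deaths, dplyr::lag(df$Confirmed, k), use = "complete.obs")
})

# The lag with the strongest correlation
which.max(lag_correlations) - 1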
The problem is that the data are affected by the number of tests, and on some days, such as weekends, not all cases are reported. This means our analysis is not strictly valid, but we will try it and see what we get. We will analyze Italy.
Italy: Correlation Between Confirmed Cases and Deaths
As we see, the argmax of the correlation is at k=6, which implies (if the data were accurate) that most of the people who passed away were diagnosed with Covid-19 six days earlier.
Italy: Correlation Between Confirmed Cases and Deaths SMA 5
Let's do the same analysis, but this time by taking into consideration the Simple Moving Average of 5 days.
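A sketch of the same calculation on 5-day simple moving averages, e.g. with zoo::rollmean() (continuing the assumed df from the sketch above):

library(dplyr)
library(zoo)

# 5-day simple moving averages of confirmed cases and deaths
df$Confirmed_SMA5 <- rollmean(df$Confirmed, k = 5, fill = NA, align = "right")
df$Deaths_SMA5    <- rollmean(df$Deaths,    k = 5, fill = NA, align = "right")

# Repeat the lagged correlation on the smoothed series
sma_correlations <- sapply(0:20, function(k) {
  cor(df$Deaths_SMA5, dplyr::lag(df$Confirmed_SMA5, k), use = "complete.obs")
})
which.max(sma_correlations) - 1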
Again, the maximum correlation is observed on the 6th day.
Discussion
I would like to stress that this analysis is not valid because we lack much of the information about how the data were collected and reported. However, it is clear that there is a lag between confirmed cases and deaths, even if we cannot specify its length accurately.
Introduction
Hey everyone! I'm back with another tutorial, on Task 2 of the same virtual experience program. I completed this Virtual Experience Program a month back and have posted the solutions to Task 1. You should visit that before continuing with this tutorial.
This post is specifically about Task 2 – Experimentation and uplift testing. You can view this Virtual Experience Program and enrol in it yourself.
Throughout Task 1, I learnt how simple and efficient their solution module is compared with my own way of writing code. So for Task 2 I adopted their short, simple and efficient coding style instead.
The client has selected store numbers 77, 86 and 88 as trial stores, and wants control stores to be established stores that were operational for the entire observation period. We want to match each trial store to a control store that is similar to it prior to the trial period of Feb 2019 in terms of:
Monthly overall sales revenue
Monthly number of customers
Monthly number of transactions per customer
Let's first create the metrics of interest and filter to stores that are present throughout the pre-trial period.
#### Calculate these measures over time for each store
#### Add a new month ID column in the data with the format yyyymm.
library(lubridate)
library(tidyverse)
library(dplyr)
library(data.table)

monthYear <- format(as.Date(data$DATE), "%Y%m")
data[, YEARMONTH := monthYear]
data$YEARMONTH <- as.numeric(as.character(data$YEARMONTH))

#### Next, we define the measure calculations to use during the analysis.
#### For each store and month, calculate total sales, number of customers,
#### transactions per customer, chips per customer and the average price per unit.
measureOverTime <- data %>%
  group_by(STORE_NBR, YEARMONTH) %>%
  summarise(totSales = sum(TOT_SALES),
            nCustomers = uniqueN(LYLTY_CARD_NBR),
            nTxnPerCust = uniqueN(TXN_ID)/uniqueN(LYLTY_CARD_NBR),
            nChipsPerTxn = sum(PROD_QTY)/uniqueN(TXN_ID),
            avgPricePerUnit = sum(TOT_SALES)/sum(PROD_QTY))

#### Filter to the pre-trial period and stores with full observation periods
storesWithFullObs <- as.data.table(table(measureOverTime$STORE_NBR))
storesWithFullObs <- storesWithFullObs %>% filter(N == 12)
storesWithFullObs <- setNames(storesWithFullObs, c("STORE_NBR", "N"))
preTrialMeasures <- measureOverTime %>%
  filter(YEARMONTH < 201902, STORE_NBR %in% storesWithFullObs$STORE_NBR)
Now we need to work out a way of ranking how similar each potential control store is to the trial store. We can calculate how correlated the performance of each store is to the trial store.
Let's write a function for this so that we don't have to calculate this for each trial store and control store pair.
Apart from correlation, we can also calculate a standardised metric based on the absolute difference between the trial store's performance and each control store's performance.
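The helper functions called in the next snippet (calCorr(), calculateCorrelation(), calculateMagnitudeDistance1/2() and standMag1/2()) are not reproduced in full here; the sketch below is an assumed, illustrative implementation of the idea (correlation of the trial store's monthly measures against every other store, plus a standardised magnitude-distance score), so the exact original bodies may differ.

library(dplyr)

# Correlation of monthly total sales between the trial store and every other store
calCorr <- function(inputTable, trialStore, trialStoreNbr) {
  out <- data.frame(Store1 = numeric(), Store2 = numeric(), corr_measure = numeric())
  for (st in unique(inputTable$STORE_NBR)) {
    out <- rbind(out, data.frame(
      Store1 = trialStoreNbr,
      Store2 = st,
      corr_measure = cor(trialStore$totSales,
                         inputTable$totSales[inputTable$STORE_NBR == st])))
  }
  out
}

# Absolute difference in monthly total sales between the trial store and each store
calculateMagnitudeDistance1 <- function(inputTable, trialStore, trialStoreNbr) {
  out <- data.frame(Store1 = numeric(), Store2 = numeric(),
                    YEARMONTH = numeric(), measure = numeric())
  for (st in unique(inputTable$STORE_NBR)) {
    out <- rbind(out, data.frame(
      Store1 = trialStoreNbr,
      Store2 = st,
      YEARMONTH = trialStore$YEARMONTH,
      measure = abs(trialStore$totSales -
                    inputTable$totSales[inputTable$STORE_NBR == st])))
  }
  out
}

# Standardise the distances to a 0-1 score, where 1 means most similar
standMag1 <- function(magnitudeTable) {
  magnitudeTable %>%
    group_by(YEARMONTH) %>%
    mutate(magN_measure = 1 - (measure - min(measure)) / (max(measure) - min(measure))) %>%
    group_by(Store1, Store2) %>%
    summarise(magN_measure = mean(magN_measure), .groups = "drop")
}

# calculateCorrelation(), calculateMagnitudeDistance2() and standMag2()
# follow exactly the same pattern, using nCustomers instead of totSales.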
Now let's use the functions to find the control stores! We'll select control stores based on how similar their monthly total sales (in dollars) and monthly number of customers are to the trial stores. So we will need to use our functions to get four scores: two each for total sales and total customers.
#### Use the functions created above to calculate correlations against store 77
#### using total sales and number of customers.
trial_store <- 77

#### Extract the trial store's pre-trial sales and customer measures
trialStore_sales <- preTrialMeasures %>% filter(STORE_NBR == 77)
trialStore_sales <- trialStore_sales %>% select(STORE_NBR, YEARMONTH, totSales, nCustomers)

corr_nSales <- calCorr(preTrialMeasures, trialStore_sales, trial_store)
corr_nSales <- unique(corr_nSales)
corr_nCustomers <- calculateCorrelation(preTrialMeasures, trialStore_sales, trial_store)
corr_nCustomers <- unique(corr_nCustomers)

#### Use the functions for calculating magnitude
magnitude_nSales <- calculateMagnitudeDistance1(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nSales <- standMag1(magnitude_nSales)
magnitude_nCustomers <- calculateMagnitudeDistance2(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nCustomers <- standMag2(magnitude_nCustomers)
We'll need to combine all of the scores calculated by our functions into a composite score to rank on.
Let's take a simple average of the correlation and magnitude scores for each driver. Note that if we consider it more important for the trend of the drivers to be similar, we can increase the weight of the correlation score (a simple average gives the correlation a weight of 0.5), or if we consider the absolute size of the drivers to be more important, we can lower the weight of the correlation score.
The store with the highest score is then selected as the control store since it is most similar to the trial store.
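For store 77, a sketch of this combination step (mirroring the code used later in this post for stores 86 and 88):

#### Combine correlation and magnitude into a composite score for each driver
corr_weight <- 0.5
score_nSales <- merge(corr_nSales, magnitude_nSales, by = c("Store1", "Store2"))
score_nSales <- score_nSales %>%
  mutate(scoreNSales = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))
score_nCustomers <- merge(corr_nCustomers, magnitude_nCustomers, by = c("Store1", "Store2"))
score_nCustomers <- score_nCustomers %>%
  mutate(scoreNCust = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))

#### Combine scores across the drivers using a simple average
score_Control <- merge(score_nSales, score_nCustomers, by = c("Store1", "Store2"))
score_Control <- score_Control %>%
  mutate(finalControlScore = (scoreNSales * 0.5) + (scoreNCust * 0.5))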
#### Select the control store based on the highest matching store
#### (closest to 1 but not the store itself, i.e. the second-ranked store)
control_store <- score_Control[order(-finalControlScore), ]
control_store <- control_store$Store2
control_store <- control_store[2]
Now that we have found a control store, let's check visually if the drivers are indeed similar in the period before the trial.
#### Visual checks on trends based on the drivers
measureOverTimeSales <- as.data.table(measureOverTime)
pastSales <- measureOverTimeSales[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                    ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, totSales := mean(totSales), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][YEARMONTH < 201903, ]

## Visualize
ggplot(pastSales, aes(TransactionMonth, totSales, color = Store_type)) +
  geom_line() +
  labs(x = "Month of Operation", y = "Total Sales", title = "Total Sales by Month")
Next, the number of customers.
Conduct visual checks on customer count trends by comparing the trial store to the control store and other stores.
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, numberCustomers := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][YEARMONTH < 201903]

### Visualize
ggplot(pastCustomers, aes(TransactionMonth, numberCustomers, color = Store_type)) +
  geom_line() +
  labs(x = "Month of Operation", y = "Total Number of Customers",
       title = "Total Number of Customers by Month")
Assessment of Trial
The trial period goes from the start of February 2019 to April 2019. We now want to see if there has been an uplift in overall chip sales.
We'll start by scaling the control store's sales to the trial store's level, to control for any differences between the two stores outside of the trial period.
Now that we have comparable sales figures for the control store, we can calculate the percentage difference between the scaled control sales and the trial store's sales during the trial period.
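For store 77 this scaling and percentage-difference step mirrors the code used later in this post for the other trial stores; a sketch, assuming preTrialMeasures and measureOverTime as defined above:

#### Scale pre-trial control store sales to match pre-trial trial store sales
preTrialMeasures <- as.data.table(preTrialMeasures)   # ensure data.table syntax works
scalingFactorForControlSales <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(totSales)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(totSales)]

#### Apply the scaling factor
measureOverTimeSales <- as.data.table(measureOverTime)
scaledControlSales <- measureOverTimeSales[STORE_NBR == control_store, ][,
  controlSales := totSales * scalingFactorForControlSales]

#### Percentage difference between scaled control sales and trial store sales
measureOverTime <- as.data.table(measureOverTime)
percentageDiff <- merge(scaledControlSales[, c("YEARMONTH", "controlSales")],
                        measureOverTime[STORE_NBR == trial_store, c("totSales", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlSales - totSales)/controlSales]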
#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])

#### Note that there are 8 months in the pre-trial period,
#### hence 8 - 1 = 7 degrees of freedom
degreesOfFreedom <- 7

#### We will test with a null hypothesis of there being 0 difference between trial and control stores.
#### Calculate the t-values for the trial months.
percentageDiff[, tvalue := (percentageDiff - 0)/stdDev
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][YEARMONTH < 201905 & YEARMONTH > 201901, .(TransactionMonth, tvalue)]

#### Also, find the 95th percentile of the t distribution with the appropriate degrees
#### of freedom to check whether the hypothesis test is statistically significant.
qt(0.95, df = degreesOfFreedom)
We can observe that the t-value is much larger than the 95th percentile value of the t-distribution for March and April i.e. the increase in sales in the trial store in March and April is statistically greater than in the control store.
Let's create a more visual version of this by plotting the sales of the control store, the sales of the trial stores and the 95th percentile value of sales of the control store.
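A sketch of that plot for store 77, following the same pattern used later for stores 86 and 88 (assuming stdDev and the objects defined in the previous steps):

#### Trial and control store total sales with control-store confidence bands
measureOverTimeSales <- as.data.table(measureOverTime)
pastSales <- measureOverTimeSales[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                    ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, totSales := mean(totSales), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th and 5th percentiles
pastSales_Controls95 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]
pastSales_Controls5 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]
trialAssessment <- rbind(pastSales, pastSales_Controls95, pastSales_Controls5)

ggplot(trialAssessment, aes(TransactionMonth, totSales, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line(aes(linetype = Store_type)) +
  labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")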
The results show that the trial in store 77 is significantly different to its control store in the trial period, as the trial store's performance lies outside the 5% to 95% confidence interval of the control store in two of the three trial months.
Let's have a look at assessing this for number of customers as well.
Let's again see if the difference is significant visually!
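Before computing the test statistic, the control store's customer numbers are scaled and the percentage difference calculated, mirroring the sales version above; a sketch:

#### Scale pre-trial control customers to match pre-trial trial store customers
scalingFactorForControlCust <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(nCustomers)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(nCustomers)]

#### Apply the scaling factor and calculate the percentage difference
measureOverTimeCusts <- as.data.table(measureOverTime)
scaledControlCustomers <- measureOverTimeCusts[STORE_NBR == control_store, ][,
  controlCustomers := nCustomers * scalingFactorForControlCust]
percentageDiff <- merge(scaledControlCustomers[, c("YEARMONTH", "controlCustomers")],
                        measureOverTime[STORE_NBR == trial_store, c("nCustomers", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlCustomers - nCustomers)/controlCustomers]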
#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])
degreesOfFreedom <- 7

#### Trial and control store number of customers
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, nCusts := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

### Control 95th percentile
pastCustomers_Control95 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]

### Control 5th percentile (note the minus sign for the lower bound)
pastCustomers_Control5 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]

trialAssessment <- rbind(pastCustomers, pastCustomers_Control95, pastCustomers_Control5)

### Visualize
ggplot(trialAssessment, aes(TransactionMonth, nCusts, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line(aes(linetype = Store_type)) +
  labs(x = "Month Of Operation", y = "Total Number of Customers",
       title = "Total Number of Customers by Month")
It looks like the number of customers is significantly higher in all three months. This seems to suggest that the trial had a significant impact on increasing the number of customers in the trial store but, as we saw, sales were not significantly higher. We should check with the Category Manager if there were special deals in the trial store that may have resulted in lower prices, impacting the results.
Let's repeat finding the control store and assessing the impact of the trial for each of the other two trial stores.
Trial Store 86
data <- as.data.table(data)
measureOverTime <- data[, .(totSales = sum(TOT_SALES),
                            nCustomers = uniqueN(LYLTY_CARD_NBR),
                            nTxnPerCust = uniqueN(TXN_ID)/uniqueN(LYLTY_CARD_NBR),
                            nChipsPerTxn = sum(PROD_QTY)/uniqueN(TXN_ID),
                            avgPricePerUnit = sum(TOT_SALES)/sum(PROD_QTY)),
                        by = c("STORE_NBR", "YEARMONTH")][order(STORE_NBR, YEARMONTH)]

#### Use the functions for calculating correlation
trial_store <- 86
trialStore_sales <- preTrialMeasures %>% filter(STORE_NBR == 86)
trialStore_sales <- trialStore_sales %>% select(STORE_NBR, YEARMONTH, totSales, nCustomers)

corr_nSales <- calCorr(preTrialMeasures, trialStore_sales, trial_store)
corr_nSales <- unique(corr_nSales)
corr_nCustomers <- calculateCorrelation(preTrialMeasures, trialStore_sales, trial_store)
corr_nCustomers <- unique(corr_nCustomers)

#### Use the functions for calculating magnitude
magnitude_nSales <- calculateMagnitudeDistance1(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nSales <- standMag1(magnitude_nSales)
magnitude_nCustomers <- calculateMagnitudeDistance2(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nCustomers <- standMag2(magnitude_nCustomers)

#### Now, create a combined score composed of correlation and magnitude
corr_weight <- 0.5
score_nSales <- merge(corr_nSales, magnitude_nSales, by = c("Store1", "Store2"))
score_nSales <- score_nSales %>%
  mutate(scoreNSales = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))
score_nCustomers <- merge(corr_nCustomers, magnitude_nCustomers, by = c("Store1", "Store2"))
score_nCustomers <- score_nCustomers %>%
  mutate(scoreNCust = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))

#### Finally, combine scores across the drivers using a simple average
score_Control <- merge(score_nSales, score_nCustomers, by = c("Store1", "Store2"))
score_Control <- score_Control %>%
  mutate(finalControlScore = (scoreNSales * 0.5) + (scoreNCust * 0.5))

#### Select the control store based on the highest matching store
#### (closest to 1 but not the store itself, i.e. the second-ranked store)
control_store <- score_Control[order(-finalControlScore), ]
control_store <- control_store$Store2
control_store <- control_store[2]
Looks like store 155 will be a control store for trial store 86.
Again, let's check visually if the drivers are indeed similar in the period before the trial.
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, nCusts := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][YEARMONTH < 201903, ]

### Visualize
ggplot(pastCustomers, aes(TransactionMonth, nCusts, color = Store_type)) +
  geom_line() +
  labs(x = "Month of operation", y = "Total number of customers",
       title = "Total number of customers by month")
Good, the trend in number of customers is also similar.
Let's now assess the impact of the trial on sales.
#### Scale pre-trial control sales to match pre-trial trial store sales
scalingFactorForControlSales <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(totSales)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(totSales)]

#### Apply the scaling factor
measureOverTimeSales <- as.data.table(measureOverTime)
scaledControlSales <- measureOverTimeSales[STORE_NBR == control_store, ][,
  controlSales := totSales * scalingFactorForControlSales]

### Calculate the percentage difference between scaled control sales and trial sales
measureOverTime <- as.data.table(measureOverTime)
percentageDiff <- merge(scaledControlSales[, c("YEARMONTH", "controlSales")],
                        measureOverTime[STORE_NBR == trial_store, c("totSales", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlSales - totSales)/controlSales]

#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])
degreesOfFreedom <- 7

#### Trial and control store total sales
measureOverTimeSales <- as.data.table(measureOverTime)
pastSales <- measureOverTimeSales[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                    ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, totSales := mean(totSales), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastSales_Controls95 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastSales_Controls5 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]

trialAssessment <- rbind(pastSales, pastSales_Controls95, pastSales_Controls5)

#### Plotting these in one nice graph
ggplot(trialAssessment, aes(TransactionMonth, totSales, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line(aes(linetype = Store_type)) +
  labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")
The results show that the trial in store 86 is significantly different to its control store in the trial period as the trial store performance lies outside of the 5% to 95% confidence interval of the control store in two of the three trial months.
Let's have a look at assessing this for number of customers as well.
scalingFactorForControlCust <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(nCustomers)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(nCustomers)]

#### Apply the scaling factor
measureOverTimeCusts <- as.data.table(measureOverTime)
scaledControlCustomers <- measureOverTimeCusts[STORE_NBR == control_store, ][,
  controlCustomers := nCustomers * scalingFactorForControlCust
  ][, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                           ifelse(STORE_NBR == control_store, "Control", "Other stores"))]

#### Calculate the percentage difference between scaled control customers and trial customers
percentageDiff <- merge(scaledControlCustomers[, c("YEARMONTH", "controlCustomers")],
                        measureOverTime[STORE_NBR == trial_store, c("nCustomers", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlCustomers - nCustomers)/controlCustomers]

#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])
degreesOfFreedom <- 7  # there are 8 months in the pre-trial period, hence 8 - 1 = 7 degrees of freedom

#### Trial and control store number of customers
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, nCusts := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastCustomers_Controls95 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastCustomers_Controls5 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]

trialAssessment <- rbind(pastCustomers, pastCustomers_Controls95, pastCustomers_Controls5)

#### Visualize
ggplot(trialAssessment, aes(TransactionMonth, nCusts, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line() +
  labs(x = "Month of operation", y = "Total number of customers",
       title = "Total number of customers by month")
Trial Store 88
data <- as.data.table(data)
measureOverTime <- data[, .(totSales = sum(TOT_SALES),
                            nCustomers = uniqueN(LYLTY_CARD_NBR),
                            nTxnPerCust = uniqueN(TXN_ID)/uniqueN(LYLTY_CARD_NBR),
                            nChipsPerTxn = sum(PROD_QTY)/uniqueN(TXN_ID),
                            avgPricePerUnit = sum(TOT_SALES)/sum(PROD_QTY)),
                        by = c("STORE_NBR", "YEARMONTH")][order(STORE_NBR, YEARMONTH)]

#### Use the functions for calculating correlation
trial_store <- 88
trialStore_sales <- preTrialMeasures %>% filter(STORE_NBR == 88)
trialStore_sales <- trialStore_sales %>% select(STORE_NBR, YEARMONTH, totSales, nCustomers)

corr_nSales <- calCorr(preTrialMeasures, trialStore_sales, trial_store)
corr_nSales <- unique(corr_nSales)
corr_nCustomers <- calculateCorrelation(preTrialMeasures, trialStore_sales, trial_store)
corr_nCustomers <- unique(corr_nCustomers)

#### Use the functions for calculating magnitude
magnitude_nSales <- calculateMagnitudeDistance1(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nSales <- standMag1(magnitude_nSales)
magnitude_nCustomers <- calculateMagnitudeDistance2(preTrialMeasures, trialStore_sales, trial_store)
magnitude_nCustomers <- standMag2(magnitude_nCustomers)

#### Now, create a combined score composed of correlation and magnitude
corr_weight <- 0.5
score_nSales <- merge(corr_nSales, magnitude_nSales, by = c("Store1", "Store2"))
score_nSales <- score_nSales %>%
  mutate(scoreNSales = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))
score_nCustomers <- merge(corr_nCustomers, magnitude_nCustomers, by = c("Store1", "Store2"))
score_nCustomers <- score_nCustomers %>%
  mutate(scoreNCust = (corr_measure * corr_weight) + (magN_measure * (1 - corr_weight)))

#### Finally, combine scores across the drivers using a simple average
score_Control <- merge(score_nSales, score_nCustomers, by = c("Store1", "Store2"))
score_Control <- score_Control %>%
  mutate(finalControlScore = (scoreNSales * 0.5) + (scoreNCust * 0.5))

#### Select the control store based on the highest matching store
#### (closest to 1 but not the store itself, i.e. the second-ranked store)
control_store <- score_Control[order(-finalControlScore), ]
control_store <- control_store$Store2
control_store <- control_store[2]
Looks like store 178 will be a control store for trial store 88.
Again, let's check visually if the drivers are indeed similar in the period before the trial.
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, nCusts := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][YEARMONTH < 201903, ]

### Visualize
ggplot(pastCustomers, aes(TransactionMonth, nCusts, color = Store_type)) +
  geom_line() +
  labs(x = "Month of operation", y = "Total number of customers",
       title = "Total number of customers by month")
Let's now assess the impact of the trial on sales.
#### Scale pre-trial control sales to match pre-trial trial store sales
scalingFactorForControlSales <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(totSales)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(totSales)]

#### Apply the scaling factor
measureOverTimeSales <- as.data.table(measureOverTime)
scaledControlSales <- measureOverTimeSales[STORE_NBR == control_store, ][,
  controlSales := totSales * scalingFactorForControlSales]

### Calculate the percentage difference between scaled control sales and trial sales
measureOverTime <- as.data.table(measureOverTime)
percentageDiff <- merge(scaledControlSales[, c("YEARMONTH", "controlSales")],
                        measureOverTime[STORE_NBR == trial_store, c("totSales", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlSales - totSales)/controlSales]

#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])
degreesOfFreedom <- 7

#### Trial and control store total sales
measureOverTimeSales <- as.data.table(measureOverTime)
pastSales <- measureOverTimeSales[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                    ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, totSales := mean(totSales), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastSales_Controls95 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastSales_Controls5 <- pastSales[Store_type == "Control", ][,
  totSales := totSales * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]

trialAssessment <- rbind(pastSales, pastSales_Controls95, pastSales_Controls5)

#### Plotting these in one nice graph
ggplot(trialAssessment, aes(TransactionMonth, totSales, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line(aes(linetype = Store_type)) +
  labs(x = "Month of operation", y = "Total sales", title = "Total sales by month")
Let's have a look at assessing this for number of customers as well.
scalingFactorForControlCust <- preTrialMeasures[STORE_NBR == trial_store & YEARMONTH < 201902, sum(nCustomers)] /
  preTrialMeasures[STORE_NBR == control_store & YEARMONTH < 201902, sum(nCustomers)]

#### Apply the scaling factor
measureOverTimeCusts <- as.data.table(measureOverTime)
scaledControlCustomers <- measureOverTimeCusts[STORE_NBR == control_store, ][,
  controlCustomers := nCustomers * scalingFactorForControlCust
  ][, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                           ifelse(STORE_NBR == control_store, "Control", "Other stores"))]

#### Calculate the percentage difference between scaled control customers and trial customers
percentageDiff <- merge(scaledControlCustomers[, c("YEARMONTH", "controlCustomers")],
                        measureOverTime[STORE_NBR == trial_store, c("nCustomers", "YEARMONTH")],
                        by = "YEARMONTH")[, percentageDiff := abs(controlCustomers - nCustomers)/controlCustomers]

#### As our null hypothesis is that the trial period is the same as the pre-trial period,
#### take the standard deviation based on the scaled percentage difference in the pre-trial period
stdDev <- sd(percentageDiff[YEARMONTH < 201902, percentageDiff])
degreesOfFreedom <- 7  # there are 8 months in the pre-trial period, hence 8 - 1 = 7 degrees of freedom

#### Trial and control store number of customers
measureOverTimeCusts <- as.data.table(measureOverTime)
pastCustomers <- measureOverTimeCusts[, Store_type := ifelse(STORE_NBR == trial_store, "Trial",
                                        ifelse(STORE_NBR == control_store, "Control", "Other stores"))
  ][, nCusts := mean(nCustomers), by = c("YEARMONTH", "Store_type")
  ][, TransactionMonth := as.Date(paste(YEARMONTH %/% 100, YEARMONTH %% 100, 1, sep = "-"), "%Y-%m-%d")
  ][Store_type %in% c("Trial", "Control"), ]

#### Control store 95th percentile
pastCustomers_Controls95 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 + stdDev * 2)][, Store_type := "Control 95th % confidence interval"]

#### Control store 5th percentile
pastCustomers_Controls5 <- pastCustomers[Store_type == "Control", ][,
  nCusts := nCusts * (1 - stdDev * 2)][, Store_type := "Control 5th % confidence interval"]

trialAssessment <- rbind(pastCustomers, pastCustomers_Controls95, pastCustomers_Controls5)

#### Visualize
ggplot(trialAssessment, aes(TransactionMonth, nCusts, color = Store_type)) +
  geom_rect(data = trialAssessment[YEARMONTH < 201905 & YEARMONTH > 201901, ],
            aes(xmin = min(TransactionMonth), xmax = max(TransactionMonth),
                ymin = 0, ymax = Inf, color = NULL), show.legend = FALSE) +
  geom_line() +
  labs(x = "Month of operation", y = "Total number of customers",
       title = "Total number of customers by month")
Total number of customers in the trial period for the trial store is significantly higher than the control store for two out of three months, which indicates a positive trial effect.
Conclusion
Good work! We've found control stores 233, 155 and 178 for trial stores 77, 86 and 88 respectively. The results for trial stores 77 and 86 during the trial period show a significant difference in at least two of the three trial months, but this is not the case for trial store 88. We can check with the client whether the implementation of the trial was different in store 88 but, overall, the trial shows a significant increase in sales. Now that we have finished our analysis, we can prepare our presentation for the Category Manager.
Outro
Task 2 was a crucial step in the analysis: identifying benchmark stores against which to test the impact of the trial store layouts on customer sales. We then collated and summarised the findings for each store to provide a recommendation outlining the impact on sales during the trial period.
Task 3 is about creating a presentation of all the findings gathered through our analysis in Tasks 1 and 2. I used Google Slides to create mine; you can use the same, any other tool, or even the module provided. Task 3 is quite easy, but on request I can still upload the steps for creating the Task 3 presentation.
Get in touch using any of my social media handles or mail me your queries!
Till then, any feedback, queries or recommendations are appreciated on any of my social media handles.
Not long ago, I published a post "Three ggplot2 Themes Optimized for the Web", where I made some tweaks to my three favorite ggplot2 themes – theme_bw(), theme_classic(), and theme_void() – to make them more readable and generally look better in graphics posted online, particularly in blog posts and similar publications.
I was happy to see that some people liked those and suggested that I should make a package. I tended to view packages as large collections of code and functions, but as Sébastien Rochette wisely put it, "If you have one function, create a package! If this simplifies your life, why not?" And since I will be frequently using these themes in subsequent posts, I'd like to make it as convenient as possible for the reader to install and use them.
So here is the ggwebthemes package! It has the same three themes, which I have tweaked and improved some more.
The package is not yet on CRAN. You can install ggwebthemes from GitLab:
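A typical GitLab installation with the remotes package looks like the following; the account path below is a placeholder, so check the package's GitLab repository for the exact location:

# install.packages("remotes")  # if not already installed
remotes::install_gitlab("<account>/ggwebthemes")  # placeholder path -- see the GitLab repo
library(ggwebthemes)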
Please report bugs and/or submit feature requests here. Since I am currently using WordPress (but thinking about switching to a static site), I am particularly interested in how these themes would behave on Hugo and in blogs created with R Markdown/blogdown, so if you have any feedback, it will be most appreciated.
Note: To avoid confusing the readers, I will be removing the original post "Three ggplot2 Themes Optimized for the Web", which contains early versions of these themes. You can still find it on R-Bloggers in case you need it.
Caution / who should read this – although this blog post uses some data on reporting times for Covid-19 deaths in Florida, it's not really about that issue at all. It's a quick dip into how to model data that comes from a mixture of two or more different probability distributions.
Mixture distributions
I have thoughts about mixture distributions. Sometimes we get data that could be modelled simply with classical techniques if we make assumptions about its distribution which are mostly right. But the distribution is 'contaminated' with another distribution. Very small departures from your assumptions can cause major challenges which are not always visually obvious.
Consider this example, drawn from Rand Wilcox's excellent Modern Statistics for the Social and Behavioral Sciences. A standard normal distribution has been contaminated with 10% of its data coming from a second distribution, also normal but with a much higher variance. Without very large samples and/or a particularly alert approach to the modelling, this sort of data can cause serious pitfalls for inference.
Exactly what the pitfalls are, I don't have time to talk about right now, but they are serious and involve long tails and difficulties in estimating variance.
That example was drawn with this code, which also sets us up for applying the same approach later in this post to a real-life example that crossed my path a week ago:
library(tidyverse)
library(scales)
library(ggrepel)
library(glue)
library(ggtext)
library(rstan)

#---------Example adapted from Wilcox, Modern Statistics for the Social and Behavioral Sciences---------
# Wilcox's Figure 3.8
set.seed(123)

egd <- tibble(diet = rep(c("Diet", "No diet"), c(10000, 90000))) %>%
  mutate(weight_loss = ifelse(diet == "Diet", rnorm(n(), 0, 10), rnorm(n(), 0, 1)))

ggplot(egd, aes(x = weight_loss)) +
  geom_density(fill = "grey", alpha = 0.5, colour = "darkblue") +
  stat_function(fun = dnorm, colour = "darkgreen", n = 1000) +
  coord_cartesian(xlim = c(-3, 3)) +
  annotate("text", x = 1, y = 0.3, label = "sigma^2==1", parse = TRUE,
           colour = "darkgreen", hjust = 0) +
  annotate("text", x = -1.1, y = 0.17,
           label = glue("sigma^2=={round(var(egd$weight_loss), 1)}"),
           parse = TRUE, colour = "darkblue", hjust = 0) +
  annotate("text", x = -1.1, y = 0.13,
           label = "Mixture of standard normal (90%) and\nnormal with sd=10 (10%)",
           colour = "darkblue", hjust = 0, size = 3) +
  labs(caption = "Adapted from Figure 3.8 of Wilcox, Modern Statistics for the Social and Behavioral Sciences",
       x = "Weight loss",
       title = "The perils of a mixed or 'contaminated' distribution",
       subtitle = "Comparison of a standard normal and a mixed normal distribution") +
  theme(plot.subtitle = element_textbox())
Distribution of reporting delays for deaths in Florida
The real-life example came from a Twitter thread by Jason Salemi on 28 August 2020, about the distribution of delays in reporting Covid-19 deaths in Florida. I can't find the actual Twitter thread now, but the discussion mentioned in passing the interesting fact that the mean delay was noticeably smaller than the median. This is unusual for data that has a strict lower bound (a report isn't possible before the actual death) and no upper bound, because typically a long rightwards tail and a few outliers will drag the mean to a noticeably higher value than the median.
The example has extra interest for me in thinking about the data generating process for this sort of reporting on deaths. Recently in Victoria it has become clear that there are (at least) two different ways Covid-19 deaths are being reported – via aged care, and everything else. The aged care deaths are characterised by appearing in the official statistics only at some delay (up to a month or more), and in big clumps. This looks like the sort of process that needs to be modelled by a mixture distribution.
Dr Salemi kindly forwarded me the actual Florida data, which looks like this:
The mean delay is 23.25 days and the median is 31. I've converted one frequency of minus one in the original data into zero to make my life simpler. That chart was produced with this R code:
#--------Florida delay in death reporting data-----------
# From Jason Salemi 28 August 2020 via Twitter
orig_d <- tibble(freq = c(11,18,2,2,1,5,2,3,0,1,2,1,3,1,0,3,1,1,0,2,0,0,1,3,0,0,1,0,0,1,
                          2,2,2,7,8,4,7,6,6,3,3,6,0,2,2,1,3,0,2,3,0,0,1,0,0,1)) %>%
  mutate(delay = 1:n() - 1) %>%
  mutate(cumulative_freq = cumsum(freq),
         quantile = cumulative_freq / sum(freq))

orig_x <- with(orig_d, rep(delay, freq))
mean(orig_x)
median(orig_x)

ggplot(orig_d, aes(x = as.ordered(delay), y = freq)) +
  geom_vline(xintercept = mean(orig_x), colour = "blue", size = 2) +
  geom_vline(xintercept = median(orig_x), colour = "red", size = 2) +
  geom_col(width = 1, colour = "grey50", alpha = 0.9) +
  geom_text(aes(label = ifelse(freq > 0, freq, ""), y = freq + 1), size = 3, colour = "steelblue") +
  scale_x_discrete(breaks = 0:5 * 10) +
  labs(x = "Time elapsed from death occurring to appearing in confirmed statistics",
       y = "Frequency",
       title = "Reporting lags in Covid-19 deaths in Florida",
       caption = "Data from Jason L. Salemi",
       subtitle = "An interesting example of a mixed distribution; with two modes, and mean < median.") +
  theme(plot.subtitle = element_textbox())
It seemed to me from this visualisation that this was likely to be from a mixture of two distributions, but I was still surprised that the median and mean worked out this way. This prompted me to consider "is it easy to construct a mixture distribution, given just the overall median and mean?"
Brute force – fitting a mixture of two Poisson distributions based on just mean and median
It wasn't hard to do this if I was prepared to assume both distributions contributing to the mixture are Poisson distributions. This means I have just three parameters to estimate – the lambda (mean and variance) of each Poisson distribution, and the proportion of the data that comes from the first of the two distributions.
If you have the mean of the first distribution and the proportion of data coming from it, you can calculate the mean of the second distribution that is needed to produce the given overall mean. Because:

$$\mu = \pi_1 \lambda_1 + (1 - \pi_1) \lambda_2$$

So:

$$\lambda_2 = \frac{\mu - \pi_1 \lambda_1}{1 - \pi_1}$$

So really we only have two degrees of freedom to worry about. By brute force we can make a grid of possible values of $\pi_1$ and $\lambda_1$, calculate the $\lambda_2$ as above so that the overall mean is 23.25, and then use the probability mass functions of Poisson distributions to calculate the median of the resulting mixture directly. Then we just need to see which combinations of the parameters give the correct median (31.0).
It turns out there are not one but many mixtures of two Poisson distributions that will give you this combination of mean and median. Here is a summary of which combinations of parameters can give you this:
… and here are four examples of the actual fitted distributions. These all have the same mean and median, but differ in other characteristics.
To do all this, I created two functions in R:
dpois2() calculates the density of a mixture of two Poisson distributions
median_pois2() returns the median of a mixture of two Poisson distributions
And then just calculated the median for all combinations of a grid of possible parameters as described above. Here's the code that does that and produces those two charts:
#' Density of a mixture of two Poisson distributions
#'
#' @param x vector of non-negative integers
#' @param lambda1 mean of first Poisson distribution
#' @param lambda2 mean of second Poisson distribution
#' @param p1 proportion of values that come from the first distribution
dpois2 <- function(x, lambda1, lambda2, p1){
  d <- dpois(x, lambda1) * p1 + dpois(x, lambda2) * (1 - p1)
  return(d)
}

#' Approximate the median of a mixture of two Poisson distributions
#'
#' @param lambda1 mean of first Poisson distribution
#' @param lambda2 mean of second Poisson distribution
#' @param p1 proportion of values that come from the first distribution
#' @param maxx highest value data to take into account
median_pois2 <- function(lambda1, lambda2, p1, maxx = 1000){
  df <- data.frame(x = 0:maxx)
  df$dens <- dpois2(df$x, lambda1, lambda2, p1)
  df$F <- cumsum(df$dens)
  which_row <- which(abs(df$F - 0.5) == min(abs(df$F - 0.5)))[1]
  med <- df[which_row, ]$x - 1
  return(med)
}

trial_grid <- expand.grid(p1 = 0:200 / 200, lambda1 = 1:30 / 6) %>%
  as_tibble() %>%
  mutate(lambda2 = (23.25 - p1 * lambda1) / (1 - p1)) %>%
  mutate(med = Vectorize(median_pois2)(lambda1, lambda2, p1))

# Visualisation of trade-off between different parameters
set.seed(42)
trial_grid %>%
  filter(round(med, 2) == 31.0) %>%
  mutate(mu2_lab = ifelse(runif(n()) > 0.9, round(lambda2, 1), "")) %>%
  ggplot(aes(x = p1, y = lambda1)) +
  geom_smooth(se = FALSE, colour = "lightblue") +
  geom_point(size = 2, colour = "grey") +
  geom_text_repel(aes(label = mu2_lab), colour = "steelblue") +
  labs(x = "Proportion of variables from first Poisson distribution",
       y = "Mean of first Poisson distribution",
       title = "Different mixtures to give you the same mean and median",
       subtitle = "Parameters of a mixture of two Poisson distributions that have a mean of 23.25 and median of 31.") +
  annotate("text", x = 0.378, y = 2,
           label = "Labels show mean of second Poisson\ndistribution (selected points only).",
           hjust = 0, colour = "steelblue")

# Four example distributions:
top_params <- trial_grid %>%
  arrange(abs(med - 31), runif(n())) %>%
  slice(1:4)

dens_results <- list()
x <- 0:70

# Get the actual densities:
for (i in 1:nrow(top_params)){
  d <- top_params[i, ]
  dens_results[[i]] <- with(d, data.frame(x = x,
                                          dens = dpois2(x, lambda1, lambda2, p1),
                                          lambda1 = lambda1,
                                          lambda2 = lambda2,
                                          p1 = p1,
                                          parameter_set = i))
}

# Not sure why I have to define my own labeller function, but I can't see
# otherwise how I pass multi_line through to label_parsed in facet_wrap()!
lp <- function(labels, multi_line = FALSE){
  label_parsed(labels, multi_line = multi_line)
}

# Draw the chart for the four example distributions:
bind_rows(dens_results) %>%
  as_tibble() %>%
  mutate(par_lab1 = glue("lambda[1]=={round(lambda1, 1)}"),
         par_lab2 = glue("lambda[2]=={round(lambda2, 2)}"),
         par_lab3 = glue("pi[1]=={p1}")) %>%
  ggplot(aes(x = x, y = dens)) +
  facet_wrap(~par_lab1 + par_lab2 + par_lab3, labeller = lp) +
  geom_vline(xintercept = 31, colour = "red", size = 2) +
  geom_vline(xintercept = 23.5, colour = "blue", size = 2) +
  geom_col() +
  theme(panel.spacing = unit(1.5, "lines"),
        plot.title = element_textbox()) +
  xlim(0, 60) +
  labs(x = "x", y = "Density",
       title = "Different mixtures to give you the same mean and median",
       subtitle = "Parameters of a mixture of two Poisson distributions that (taken together) have a mean of 23.25 and median of 31.")

#---------------check I'm not going mad and that I do get the right results!-------------------
n <- 1e5
xsim <- c(rpois(n * top_params[1, ]$p1, top_params[1, ]$lambda1),
          rpois(n * (1 - top_params[1, ]$p1), top_params[1, ]$lambda2))
c(mean(xsim), median(xsim))
Better approach – fitting mixtures in Stan using all the data
There are two serious shortfalls with the approach to modelling above:
Because I am using only the mean and the median summary statistics, I am discarding a lot of extra information in the original data
While Poisson distributions are easy to use because they are fully characterised by a single parameter, there is no particular reason to think they are a good model of the number of days by which the reporting of a death is delayed.
Shortfall 1 suggests I need to find a way to estimate a good mixture of distributions that uses the full original dataset. Shortfall 2 suggests using a different distribution, but still one that works with integers; negative binomial is my usual go-to here.
To address my challenges in easy steps, I started by fitting a mixture of two Poisson distributions to my delays data, using the full data rather than just the summary statistics. Here's the simple program in Stan that sets out this model.
Running this from R is a one-liner, and it quickly comes back with an estimate that the data could come from two Poisson distributions with means of 3.4 and 35.4, with 38% of the data coming from the first of these. This isn't hugely different from my cruder estimates based on just the mean and median, but noticeably the mean of the first distribution is now quite a bit higher. Results from Stan are just pasted into the code below as comments, for reference.
ms1 <- stan("0192-poisson.stan", data = list(x = orig_x, N = length(orig_x), K = 2))
ms1

# Inference for Stan model: 0192-poisson.
# 4 chains, each with iter=2000; warmup=1000; thin=1;
# post-warmup draws per chain=1000, total post-warmup draws=4000.
#
#              mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
# lambda[1]    3.42    0.01 0.30    2.84    3.21    3.41    3.61    4.02  2849    1
# lambda[2]   35.41    0.01 0.68   34.09   34.94   35.41   35.88   36.71  4107    1
# p[1]         0.38    0.00 0.04    0.30    0.35    0.38    0.41    0.47  3532    1
# p[2]         0.62    0.00 0.04    0.53    0.59    0.62    0.65    0.70  3532    1
# lp__      -580.55    0.03 1.29 -584.02 -581.09 -580.22 -579.62 -579.11  2029    1
#
# Samples were drawn using NUTS(diag_e) at Sun Sep 06 11:36:48 2020.
# For each parameter, n_eff is a crude measure of effective sample size,
# and Rhat is the potential scale reduction factor on split chains (at
# convergence, Rhat=1).
What about addressing shortfall 2? It's straightforward to adapt my Stan program to work with a mixture of K negative binomial distributions. Note that there are two popular ways to characterise a negative binomial distribution; I am choosing the one that specifies the mean and a dispersion parameter:
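In R terms, this is the difference between the (size, prob) and the (mu, size) parameterisations of dnbinom()/rnbinom(); a quick illustration using rounded values from the fit below:

# Two equivalent ways to parameterise the same negative binomial in R:
# via size and prob, or via the mean (mu) and the dispersion ("size") parameter
size <- 13.7
mu   <- 35.8
prob <- size / (size + mu)

dnbinom(30, size = size, prob = prob)
dnbinom(30, size = size, mu = mu)   # same density, mean-dispersion parameterisation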
Fitting this gives me a mixture with 40% of the data from a negative binomial distribution with (mean, size) parameters of (4.8, 0.77); and 60% from a distribution with (mean, size) of (35.8, 13.7). Closing the circle, if I simulate data in R from that mixture, this is what it looks like:
It's noticeably not a perfect fit to the original Florida data, suggesting that even this model is perhaps under-fit and more complex things are going on. Also that more data (as always) would be helpful. But I think I'm onto a good approach. As an aside, I find it extraordinary how well Stan works in this situation.
Here's the final R code for controlling the Stan program with the negative binomial mixture, and simulating data from the result. Again, results from Stan are just pasted into the code below as comments, for reference.
ms2 <- stan("0192-negbin.stan", data = list(x = orig_x, N = length(orig_x), K = 2),
            cores = 4, chains = 4)
ms2

# Inference for Stan model: 0192-negbin.
# 4 chains, each with iter=2000; warmup=1000; thin=1;
# post-warmup draws per chain=1000, total post-warmup draws=4000.
#
#            mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
# mu[1]      4.83    0.04 1.64    2.52    3.70    4.49    5.58    9.04  1423    1
# mu[2]     35.82    0.02 1.50   32.87   34.86   35.80   36.81   38.79  4234    1
# phi[1]     0.77    0.00 0.23    0.44    0.61    0.74    0.89    1.30  2190    1
# phi[2]    13.71    0.06 3.22    8.18   11.39   13.48   15.71   20.74  3032    1
# p[1]       0.40    0.00 0.05    0.31    0.37    0.40    0.44    0.51  2152    1
# p[2]       0.60    0.00 0.05    0.49    0.56    0.60    0.63    0.69  2152    1
# lp__    -526.41    0.04 1.64 -530.64 -527.20 -526.08 -525.21 -524.25  1524    1
#
# Samples were drawn using NUTS(diag_e) at Sun Sep 06 12:04:16 2020.
# For each parameter, n_eff is a crude measure of effective sample size,
# and Rhat is the potential scale reduction factor on split chains (at
# convergence, Rhat=1).

#---------------simulated data from that fitted mixed negative binomial distribution-------------------
n <- 1e5
xsim <- c(rnbinom(n * 0.4, mu = 4.83, size = 0.77),
          rnbinom(n * 0.6, mu = 35.82, size = 13.71))

# Plot of simulations
data.frame(x = xsim) %>%
  ggplot(aes(x = x)) +
  geom_vline(xintercept = mean(xsim), colour = "blue", size = 2) +
  geom_vline(xintercept = median(xsim), colour = "red", size = 2) +
  geom_histogram(binwidth = 1, aes(y = ..density..), colour = "grey50", alpha = 0.9) +
  scale_y_continuous(label = comma) +
  coord_cartesian(xlim = range(orig_d$delay)) +
  theme(plot.subtitle = element_textbox_simple()) +
  labs(y = "Relative frequency",
       title = "Simulations from fitted mixture of two negative binomial distributions",
       subtitle = "Fit is ok and median > mean, but not by as much as in original.")
TL;DR
We built an R Shiny app to improve access to information on Hong Kong's local politicians. This is so that voters can make more informed choices. The app shows basic information on each politician, alongside a live feed of their Facebook page and illustrative maps of their district. We took advantage of this project to test out a range of R packages and techniques and to implement some DevOps best practices, which we will discuss in this post.
This project is an attempt to help make a difference with R programming. It's an opportunity for us to learn, to code, to have fun, and to make a difference.
Our project was mainly motivated by an observation that the engagement of the Hong Kong public with their local politicians was very low.1 Historically, the work of Hong Kong's District Councillors (DCs) has been neither widely known nor closely scrutinised by the public media. Until recently, most District Councillors did not use webpages or Facebook pages to share their work, instead favouring the distribution of physical copies of 'work reports' via direct mail. A hypothesis is that this changed significantly with the 2019 District Council election, where turnout jumped to 71% (from 47% in 2015). For context, Hong Kong's District Councils are the most local level of government, and the only level in which there is full universal suffrage for all candidates.
As of the summer of 2020, we identified that 96% (434) of the 452 District Councillors elected in 2019 actually have a dedicated Facebook page for delivering updates to and engaging with local residents. However, these Facebook pages have to be manually searched for online, and there is not a readily available tool where people can quickly map a District to a District Councillor and to their Facebook feeds.
As a wise person once said, "If you can solve a problem effectively in R, why the hell not?". We tackled this problem by creating a Shiny app in R, which brings the Facebook feeds and constituency information for Hong Kong's district councillors in one place. In this way, people will be able to access the currently disparately stored information in a single web app.
Don't forget to also provide some feedback to the Shiny app here!
Whether you are more of an R enthusiast or simply someone who has an interest in Hong Kong politics (hopefully both!), we hope this post will bring you some inspiration on how you can use R not just as a great tool for data analysis, but also as an enabler for you to do something tangible for your community and contribute to causes you care about.
What is in the app?
The Shiny app is built like a dashboard which combines information about each district councillor alongside their Facebook page posts (if it exists) and the district they serve, illustrated on an interactive map. By using the District and Constituency dropdown lists, you can retrieve information about the District Councillor and their Facebook feed.
Specifically, there are several key components that were used on top of the incredible shiny package:
We understood that our users, primarily HK citizens, frequently use mobiles. Thus, to ensure this app was useful to them, we centred our design on how the app looked on their mobile browsers.
googlesheets4: For seamless access to Google Sheets.
We understood that our users are not all technical so we stored the core data in a format and platform familiar and accessible to most people, Google Sheets.
At a later stage of the app development, we migrated to storing the data in an R package we wrote, called hkdatasets, as we sought to keep the data in one place. However, the Google Sheets implementation worked very well, and the app could be deployed with no impact on performance or user experience.
sf and leaflet: For importing geographic data and creating interactive maps.
We understood that our users may want to explore other parts of Hong Kong but may not know the names of each constituency. Thus, we provided a map functionality to improve the ease they can learn more about different parts of Hong Kong.
We understood that our users are not necessarily keen to read pages of instructions on how to use the app, especially if they are on mobile. Thus, we implemented a dynamic feature that visually walks them through each component of the app.
How was the data collected?
Since there was no existing single data source on the DCs, we had to put this together ourselves. All the data on each DC (their constituency, the party they belong to, and their Facebook page) was collected manually through a combination of Wikipedia and Facebook. The data was initially housed on Google Sheets, for multiple reasons:
Using Google Sheets made it easy for multiple people to collaborate on data entry.
Keeping the data outside of the repo has the advantage of keeping the memory size minimal, in line with best practices.
By storing the data in Google Sheets, non-technical users would also be able to access the data too.
Most of all, it was easy to access the Google Sheets data with the {googlesheets4} package! For editing the data during pre-processing, a key function is googlesheets4::gs4_auth(), which directs the developer to a web browser, where they are asked to sign in to their Google account and grant googlesheets4 permission to operate on their behalf with Google Sheets. We then set up the main Google Sheet (the nicely formatted version intended for the app to ingest) to provide read-only access to anyone with the link, and used googlesheets4::gs4_deauth() to access the public Google Sheet in a de-authorised state. The Shiny app itself does not have any Google credentials stored alongside it (which it shouldn't, for security reasons), and this workflow allows (i) collaborators/developers to edit the data from R and (ii) the app to access the Google Sheet data without any need for users to log in.
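As a rough sketch of that read-only workflow (the sheet URL below is a placeholder, not the app's real sheet):

library(googlesheets4)

# Developers authenticate interactively when they need to edit the sheet:
# gs4_auth()

# The app itself only needs read access to the public sheet, so it runs in a
# de-authorised state and stores no Google credentials
gs4_deauth()
dc_data <- read_sheet("https://docs.google.com/spreadsheets/d/<SHEET-ID>")  # placeholder URL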
Creating a map with constituency boundaries also required additional data. Boundaries for each constituency were obtained through a Freedom of Information (FOI) request by a member of the public here (see discussion of shapefiles below).
This was pretty much Phase #1 of data collection, where we had single Google Sheet with basic information about the District Councillors and their Facebook feeds, which enabled us to create a proof of concept of the Shiny app, i.e. making sure that we can set up a mechanism where the user can select a constituency and the app returns the corresponding Facebook feed of the District Councillor.
Based on user feedback, we started Phase #2 of data collection, which involved a web-scraping exercise on the official Hong Kong District Council website and the HK01 News Page on the 2019 District Council elections to get extra data points, such as:
– Contact email address
– Contact number
– Office address
– Number of votes, and share of votes won in 2019
A function that was extremely helpful for figuring out the URLs of the District Councillors' individual official pages is described below. What it does is run a Bing search against the https://www.districtcouncils.gov.hk website and scrape from the search results any links that match what we want (based on what the URL string looks like). Although this doesn't always work, it got us a long way with the 452 District Councillors.
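The original function isn't reproduced here, but a minimal sketch of the idea using rvest might look like the following (the function name, the Bing query format and the link filter are illustrative assumptions):

library(rvest)

# Search Bing for a councillor's page on the official District Council website
# and return any result links that point at districtcouncils.gov.hk
find_dc_url <- function(councillor_name) {
  query <- paste0("site:www.districtcouncils.gov.hk ", councillor_name)
  search_url <- paste0("https://www.bing.com/search?q=", URLencode(query))
  links <- read_html(search_url) %>%
    html_elements("a") %>%
    html_attr("href")
  links[grepl("districtcouncils.gov.hk", links)]
}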
One key thing to note is that all of the above data we compiled is available and accessible in the public domain, where we simply took an extra step to improve the accessibility.2 The Phase #2 data was used in the final app to provide more information to the user when a particular constituency or District Councillor is selected.
Creating a data package
Our data R package, hkdatasets, is to some extent a spin-off of this project. We decided to migrate from Google Sheets to an R data package approach, for the following reasons:
An R data package could allow us to provide more detailed documentation and tracking of how the data would change over time. If we choose to expand the dataset in the future, we can easily add this to the package release notes.
An R data package would fit well with our broader ambition to work on other Hong Kong-themed, open-source projects. From sharing our project with friends, we were approached to help with another project to visualise Hong Kong traffic collision data (the repo is here). As part of this, we obtained the traffic collision data via an FOI request, and that data is also available through hkdatasets.
An R data package would make it easier for learners and students in the R community to practise with the datasets we've put together, without having to learn the googlesheets4 package. Our thinking is that this would benefit others in the same way that data packages like nycflights13 and babynames benefitted us as we learned R.
hkdatasets is currently only available on GitHub, and our aim is to release it on CRAN in the future so that more R users can take advantage of it. Check out our GitHub repo to find out more.
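For example, installing the package from GitHub and listing the bundled datasets (the repo path below is assumed from our GitHub organisation and may change):

```r
# Not yet on CRAN, so install from GitHub with {remotes}
# install.packages("remotes")
remotes::install_github("Hong-Kong-Districts-Info/hkdatasets")

library(hkdatasets)

# List the datasets bundled with the package
data(package = "hkdatasets")
```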
Linking our Shiny App to Facebook
When we first conceptualised this project, our aim was always to make the Facebook Page content the centrepiece of the app. This was contingent on using some form of Facebook API to access content on the District Councillors' Public Pages, which we initially thought would be easy: Public Page content is 'out there', and shouldn't require any additional permissions or approvals.
It turns out that reading public posts from Facebook Pages we do not have admin access to requires a permission called Page Public Content Access, which in turn requires us to submit our app to Facebook for review. Reading several threads online (such as this one) soon convinced us that this would be a fairly challenging process, as we would effectively need to submit a proposal on why we had to request this permission. To add to the difficulty, we understood that the App Review process had been put on pause at the time, due to the re-allocation of resources during COVID-19.
This drove us to search for a workaround, and this is where we stumbled across iframes as a solution. An iframe is basically a frame that enables you to embed an HTML document within another HTML document (they've existed for a long time; I recall using them in really early GeoCities and Xanga websites).
The iframe concept roughly works as follows: all the Facebook Page URLs are saved in a vector called FacebookURL, and each is "wrapped" with some HTML markup that enables it to be rendered within the Shiny app as a UI component (using shiny::uiOutput()).
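A simplified, self-contained sketch of this pattern is shown below. This is not the app's exact code: the page URL is hypothetical, and Facebook's Page Plugin endpoint is used to render the feed inside the iframe.

```r
library(shiny)

# Illustrative only: one hard-coded page URL instead of the FacebookURL vector
FacebookURL <- "https://www.facebook.com/examplepage"  # hypothetical page

ui <- fluidPage(
  uiOutput("fb_feed")
)

server <- function(input, output, session) {
  output$fb_feed <- renderUI({
    # The Page Plugin takes the page URL via the `href` query parameter
    src <- paste0(
      "https://www.facebook.com/plugins/page.php?tabs=timeline&href=",
      utils::URLencode(FacebookURL, reserved = TRUE)
    )
    tags$iframe(src = src, width = "100%", height = "600",
                style = "border:none; overflow:hidden",
                scrolling = "no", frameborder = "0")
  })
}

shinyApp(ui, server)
```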
Although the iframe solution comes with its own challenges, such as the difficulty of making it truly responsive / mobile-optimised, it was nonetheless an expedient and effective workaround that allowed us to produce a proof of concept; the alternative was to splash around in Facebook's API documentation and discussion boards for at least another month to get through App Review (bearing in mind that we were working on this in our own free time, with limited resources).
Visualising the shapefiles
The first rule of optimisation is you don't.
— Michael A. Jackson
We acquired shapefiles from AccessInfo.HK in order to visualise the individual districts on a map. A shapefile is, according to the ArcGIS website:
… a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas).
These shapefiles slotted easily into a ggplot2 workflow: we used geom_sf() to rapidly build a proof of concept and quickly visualise the districts and how they would look in relation to the Shiny app.
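A minimal sketch of that workflow, assuming the boundaries have been saved as a local shapefile (the file path is a placeholder):

```r
library(sf)
library(ggplot2)

# Read the constituency boundaries (placeholder path) and plot them
constituencies <- st_read("data/dc_constituency_boundaries.shp")

ggplot(constituencies) +
  geom_sf(fill = "grey90", colour = "grey40") +
  theme_void()
```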
Once we settled on how the map looked in relation to the Shiny app, we spent some additional time and effort investigating leaflet. The reason for moving to leaflet maps was their interactivity: we understood our users would want to explore the HK map interactively to find the constituency they belong to, or one that was otherwise of interest. This was because people may know what region they live in but not the name of their constituency.
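Rendering the same sf object in leaflet is similarly compact; a sketch is below (the label column is a hypothetical attribute of the shapefile, not necessarily its real field name):

```r
library(leaflet)

# Render the constituencies as interactive polygons with hover labels
leaflet(constituencies) %>%
  addTiles() %>%
  addPolygons(
    weight = 1, color = "#444444", fillOpacity = 0.2,
    label = ~CONSTITUENCY_NAME,   # hypothetical attribute column
    highlightOptions = highlightOptions(weight = 3, fillOpacity = 0.4)
  )
```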
What are our next steps?
There were some cool features that we would have liked to implement, but have not yet been able to:
precommit hooks: Those familiar with Python may be aware of pre-commit hooks as a way to automatically detect whether your repo contains anything sensitive, like a .secrets file. Setting this up would give us automated checks each time we make a commit, to ensure we follow specified standards.
Unfortunately, we named our repo with a hyphen so the pre-commit hooks won't work.
codecov: Allows us to robustly test the functions in our code so that they work under a multitude of scenarios such as when users encounter problems.
modularise shiny code: Ensures our Shiny code is chunked so that individual pieces of logic are isolated. This makes the overall code easier to follow, as it separates the objects that are connected from those that are not, and it makes testing easier because each isolated chunk can be tested on its own (see the sketch after this list).
language selection: Currently the app is a smorgasbord of English and Chinese, which looks messy. We want to give the user the ability to choose which language they want to see the app in, with the app's language updating accordingly.
Release to alpha testers to get early feedback.
More of our enhancements / spikes are listed here on GitHub.
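As a flavour of the modularisation point above, a minimal Shiny module might look like the sketch below (names and content are illustrative, not our production code):

```r
library(shiny)

# Illustrative module: an isolated piece of UI + server logic for
# displaying a selected councillor's feed (all names are hypothetical)
feedUI <- function(id) {
  ns <- NS(id)
  uiOutput(ns("feed"))
}

feedServer <- function(id, page_url) {
  moduleServer(id, function(input, output, session) {
    output$feed <- renderUI({
      tags$iframe(src = page_url(), width = "100%", height = "500")
    })
  })
}

# In the main app:
# ui <- fluidPage(feedUI("dc_feed"))
# server <- function(input, output, session) {
#   feedServer("dc_feed",
#              page_url = reactive("https://www.facebook.com/examplepage"))
# }
```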
One of the things we wanted to try with this open-source project was to adhere to some DevOps best practices, yet unfortunately some of these would either have been easier to set up from the beginning, or require more time and knowledge (on our part) to set up. As we develop a V2 of this Shiny app and work on other projects, we hope to find the opportunity to implement more of the features above.
Other features in the app
There were also a number of features that we have implemented but have not detailed in this post. For instance:
Adding a searchable DataTable with information on the District Councillors, with the DT package
Embedding a user survey within the Shiny app
Adding a tutorial to go through features of the Shiny app, using the rintrojs package
Adding loader animations with shinycssloaders
We will cover those in more detail in Part 2 of this blog, so watch this space!
Who is behind this?
Multiple people contributed to this work. Avision Ho is a data scientist who wrote the majority of the Shiny app and who was previously interviewed on this blog; Avision is a co-author of this post. Ocean Cheung came up with the original idea of this app, and made it all possible with his knowledge of, and network with, District Councillors. We would also like to credit Justin Yim, Tiffany Chau, and Gabriel Tam for their feedback and advice on the scope and direction of this app. We are currently working on a number of other projects, which you can find out more about on our website: https://hong-kong-districts-info.github.io/.
(Disclaimer! We are not affiliated with any political individuals or movements. We are simply some people who'd like to contribute to society through code and open-source projects.)
Want to get involved?
We're looking for collaborators and reviewers, so please send us an email (hkdistricts.info@gmail.com) or comment down below if you are interested! We would also appreciate any feedback or questions, which you can either post as a comment below or send via our in-app survey. You can also get an idea of what we are planning to work on through our Trello board here.
When we first started out, we were just a couple of people who wanted to learn and practise a new skill (e.g. building a Shiny app, implementing best practices), and wanted a meaningful open-source project that we could work on. Read more about our Vision Statement here.
There are many reasons for this, and arguably a similar phenomenon can be observed in most local elections in other countries. See Lee, F. L., & Chan, J. M. (2008). Making sense of participation: The political culture of pro-democracy demonstrators in Hong Kong. The China Quarterly, 84-101.
This is in compliance with the ICO's description of the 'public domain', i.e. that information is only in the public domain if it is realistically accessible to a member of the general public at the time of the request. It must be available in practice, not just in theory.
To leave a comment for the author, please follow the link and comment on their blog: Musings on R.
[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data Science Conference Austria 2020
Data Science Conference (DSC) Austria is knocking on YOUR door – and it is all for free!
DSC Austria will happen on September 8-9th, and during the event you will get the chance to listen to over 15 high-quality talks and 8 tech tutorials on the topics of AI & ML, Data-Driven Decisions, and Data & AI Literacy – but that is not all!
With the DSC Austria ticket you will get:
Full access to DSC Austria 2020 talks and sessions
On September 8th you are going to listen to 2 Tech Tutorials & 3 Data Discussions. You will hear Use Julia for your Scientific Computing Jobs! by Przemyslaw Szufel from Nunatak Capital and Recommender Systems using Deep Graph Library and Apache MXNet by Cyrus Vahid from AWS. You will also get the chance to listen to the following data discussions: Are Robo Bankers on our Doorstep?, May AI be Profitable and Ethical at the Same Time? and How AI is Fostering Dehumanization of Decision Making?. Our panelists will be Martin Moessler, Craig Matthews, Georg Koldorfer, Aleksandra Przegalinska & Wolfgang Kienreich.
On September 9th you will be able to listen to 7 excellent talks and participate in 2 networking sessions. We will start the day with the keynote talk Automated Machine Learning for Fast Experiments and Prototypes, delivered by Philipp Singer & Dmitry Gordeev from h2o.ai. After that, you will be able to listen to experts such as Dragan Pleskonjic, Ronald Hochreiter, Valentina Djordjevic and others.