[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]
Intro: the need for microdata, and the risk of disclosure
Survey and administrative data are essential for scientific research; however, accessing such datasets can be very tricky, or even impossible. In my previous job I was responsible for getting access to such "scientific micro-datasets" from institutions like Eurostat. In general, getting access to these microdatasets was only a question of filling out some forms and signing NDAs. But this was true only because my previous employer was an accredited research entity. Companies from the private sector, or unaffiliated individual researchers, cannot get access to the microdatasets. This is because the institutions that produce such datasets absolutely do not want any type of personal information to be disclosed.
For instance, with the labour force survey, a National Statistical Institute (NSI) collects information about wages, family structure, educational attainment and much more. If, say, a politician answered the survey and his answers leaked to the public, that would be disastrous for NSIs. This is why access is restricted to accredited research institutions. You may be asking yourself, "how could the politician's answers leak? The data is anonymized!" Indeed it is, but in some cases that may not be enough to ensure that information does not get disclosed. Suppose that the dataset contains enough information to let you identify said politician with certainty: assume that this politician is a 43-year-old man, has two children, holds a PhD in theology and lives in Strassen, one of Luxembourg City's nicest neighbourhoods. It would be quite easy to find him in the dataset and then find out his wage.
To avoid this, researchers are required to perform output checking, which means going through the set of outputs (summary tables, graphs, tables with regression coefficients…) and making sure that it is not possible to identify individuals. For instance, in Luxembourg there are two companies in the tobacco industry. Luxembourg's NSI cannot release the total turnover of the industry, because then company A could subtract its own turnover from the total and find out its competitor's turnover. Now these are all hypothetical examples, and we might argue that the risk of leakage is quite low, especially if NSIs make sure to lower the precision of the variables, for example by providing age categories instead of the exact age, or by capping wages that exceed a certain fixed amount.

In any case, for now most NSIs don't release microdata to the public, and this poses some challenges for research. First of all, even for researchers, it would be great if the data were freely accessible. It would allow them to go straight to data analysis and look at the structure of the data before applying for access, instead of applying first and risking ending up with useless data. And of course it would be great for the public at large to be able to freely access such data, for educational purposes at the very least. It would also increase competition between research institutions and the private sector when it comes to conducting studies using such data. Free access to the microdata would level the playing field.

Now, some NSIs do release microdata to the public; see for instance Eustat, the NSI of the Basque Country, an autonomous region of Spain. It is not clear to me whether they also have more detailed data that is only accessible to researchers, but the data they offer is already quite interesting.
A middle ground between only releasing data to researchers and making it completely freely accessible is to create a synthetic dataset, which does not contain any of the original records but which still allows meaningful analyses to be performed.
I'm not yet very familiar with the details of the procedure, but in this blog post I'll use Eustat's microdata to generate a synthetic dataset. I'll then perform the same analysis on both the original dataset and the synthetic dataset. The dataset I'll be using can be found here, and is called Population with relation to activity (PRA):
The Survey on the Population in Relation to Activity operation is a continuous source of information on the characteristics and dynamics of the labour force of the Basque Country. It records the relation to productive activity of the population resident in family households, as well as the changes produced in labour situations; it produces indicators of conjunctural variations in the evolution of the active population; it also estimates the degree of participation of the population in economically non-productive activities. It offers information on the province and capital level.
I'll then compare the results of the analyses performed on the two datasets which will hopefully be very similar. To create the synthetic dataset, I'll be using the {synthpop} package. You can read the detailed vignette here – pdf warning –. First, let me perform some cleaning steps. There are four datasets included in the archive. Let's load them:
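The loading code is not part of this extract. A minimal sketch of what it could look like follows; the file names, and the assumption that the microdata ships as a delimited text file with an accompanying dictionary spreadsheet, are mine and should be adjusted to the actual contents of the downloaded archive.

library(tidyverse)
library(readxl)

# Hypothetical file names -- only the two objects used later (dataset, dictionary) are shown
dataset    <- read_csv2("MICRODATOS_PRA_2019.csv")                  # the microdata itself
dictionary <- read_excel("diccionario_PRA.xlsx", col_names = FALSE) # variable dictionary, columns named ...1, ...2, etc.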
The columns are labeled in Spanish, so I copy-pasted the labels into Google Translate and pasted the translations back into my script. I saved the English names into the english_col_names.rds object for posterity. These steps are detailed in the next lines:
col_names <- dictionary %>%
  filter(!is.na(...1)) %>%
  dplyr::select(1:2)

# copy to clipboard, paste to google translate
# couldn't be bothered to use an api and google cloud or whatever
#clipr::write_clip(col_names$`...2`)
#english <- clipr::read_clip()

english <- readRDS("english_col_names.rds")

col_names$english <- english

colnames(dataset) <- col_names$english

dataset <- janitor::clean_names(dataset)
I now create a function that will perform the cleaning steps:
logit_model <- glm(job_search ~ ., data = pra, family = binomial())

# Create a tidy dataset with the results of the regression
tidy_logit_model <- tidy(logit_model, conf.int = TRUE) %>%
  mutate(dataset = "true")
Let's now take a look at the coefficients, by plotting their value along with their confidence intervals:
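The coefficient plot appears as a figure in the original post. A minimal sketch of the kind of ggplot2 code that would produce it from tidy_logit_model (created above) is:

library(ggplot2)

ggplot(tidy_logit_model, aes(x = term, y = estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +  # point estimate with its confidence interval
  coord_flip() +                                             # flip axes so the coefficient names are readable
  labs(x = NULL, y = "Estimate (with 95% confidence interval)")

The call that actually generates the synthetic data is also not included in this extract. A minimal sketch with {synthpop}, assuming the cleaned data frame is called pra and using the package defaults, would be:

library(synthpop)

# syn() fits a model to each variable in turn (CART by default) and draws synthetic records;
# the seed is an arbitrary choice for reproducibility
synthetic_data <- syn(pra, seed = 1234)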
As you can see, synthetic_data is a list with several elements. The data is inside the syn element. Let's extract it, and perform the same analysis from before:
syn_pra <- synthetic_data$syn

head(syn_pra)
##   territory capital sex place_of_birth age nationality
## 1 48 9 1 1 06 1
## 2 01 9 6 3 09 1
## 3 48 3 1 1 08 1
## 4 48 9 6 1 11 1
## 5 20 2 6 1 09 1
## 6 48 1 6 1 11 1
##   level_of_studies_completed job_search main_occupation type_of_contract hours
## 1 3 N 2 1 40
## 2 1 S 9 6 10
## 3 1 N 6 32
## 4 2 N 4 1 32
## 5 3 N 5 40
## 6 1 S 7 NA
This is quite interesting: generally, there is considerable overlap between the results on the synthetic data and on the real data! There are some differences though; for instance, main_occupation6 is significant with the synthetic data, but is not with the real data. It is also possible to generate more than one synthetic dataset, which would very likely reduce the noise.
Putting on my data scientist hat
Now let's suppose that I am only interested in prediction. For this, I am going to split my dataset into a training and a testing set, run a logistic regression and a random forest, and assess the models' performance with 10-fold cross-validation. I'll do this on both the real and the synthetic data. To perform the analysis, I'll be using the {tidymodels} framework; I'm not going to explain the code that follows line by line, because I'll very likely write a blog post focusing on {tidymodels} soon.
So, let's write a function that does exactly what I explained above:
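The function itself is not reproduced in this extract. Below is a minimal sketch of one way it could be written with {tidymodels}; apart from the outcome job_search and the two model types, everything here (the preprocessing, the handling of missing values, the engine choices) is an assumption and may differ from the original.

library(tidymodels)

# A sketch only -- assumes dropping rows with missing values is acceptable for illustration,
# and that the ranger package is installed for the random forest engine.
cv_performance <- function(data) {
  data <- tidyr::drop_na(data) %>%
    mutate(job_search = as.factor(job_search))         # make sure the outcome is a factor

  folds <- vfold_cv(data, v = 10)                      # 10-fold cross-validation

  rec <- recipe(job_search ~ ., data = data) %>%
    step_dummy(all_nominal(), -all_outcomes())         # dummy-code the categorical predictors

  logit_spec  <- logistic_reg() %>% set_engine("glm")
  forest_spec <- rand_forest(mode = "classification") %>% set_engine("ranger")

  fit_one <- function(spec, label) {
    workflow() %>%
      add_recipe(rec) %>%
      add_model(spec) %>%
      fit_resamples(resamples = folds,
                    metrics = metric_set(accuracy, roc_auc)) %>%
      collect_metrics() %>%
      mutate(model = label)
  }

  bind_rows(fit_one(logit_spec, "logit"),
            fit_one(forest_spec, "forest"))
}

true_data_performance <- cv_performance(pra)
syn_data_performance  <- cv_performance(syn_pra)

true_data_performance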
## # A tibble: 4 x 4
##   .metric  mean   sd      model
## 1 accuracy 0.882  0.00816 logit
## 2 roc_auc  0.708  0.0172  logit
## 3 accuracy 0.907  0.00619 forest
## 4 roc_auc  0.879  0.0123  forest
syn_data_performance
## # A tibble: 4 x 4
##   .metric  mean   sd      model
## 1 accuracy 0.882  0.00758 logit
## 2 roc_auc  0.691  0.0182  logit
## 3 accuracy 0.899  0.00615 forest
## 4 roc_auc  0.857  0.0124  forest
The performance is pretty much the same!
Generating synthetic data is a very promising approach that I will certainly be using more. I think that such approaches can also be very interesting in the private sector (not only to ease access to microdata for researchers), especially within large companies. For instance, it can happen that the data owners of, say, an insurance company are not very keen on sharing sensitive client information with their data scientists. However, by generating a synthetic dataset and sharing it with their data science team, the data owners avoid any chance of disclosure of sensitive information, while at the same time allowing their data scientists to develop interesting analyses or applications on the data!
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and watch my youtube channel. If you want to support my blog and channel, you could buy me an espresso or paypal.me, or buy my ebook on Leanpub.
[This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers.]
In my last post, I examined the significance of the sector on the salary for different occupational groups using statistics from different regions. In previous posts I have shown a correlation between salary and experience, and also between salary and education. In this post, I will examine the correlation between salary and sector using statistics for age.
The F-value from the Anova table is used as the single value to indicate how strongly sector and salary are associated. For exploratory analysis, the Anova F-value seems good enough.
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(broom)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(sjPlot)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car
##   influence.merMod                car
##   dfbeta.influence.merMod         car
##   dfbetas.influence.merMod        car
I have renamed the file to 000000D2_sector.csv because the filename 000000D2.csv was used in a previous post.
The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by sector, occupational group (SSYK 2012), sex and age. Year 2014 – 2018. Monthly salary. Sectors 1-3: public sector; sectors 4-5: private sector.
In the plot and tables, you can also find information on how the increase in salaries per year for each occupational group is affected when the interactions are taken into account.
tb <- readfile("000000D2_sector.csv") %>% rowwise() %>% mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>% rowwise() %>% mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>% mutate(age_n = (age_l + age_h) / 2) summary_table = 0 anova_table = 0 for (i in unique(tb$`occuptional (SSYK 2012)`)){ temp <- filter(tb, `occuptional (SSYK 2012)` == i) if (dim(temp)[1] > 90){ model <- lm(log(salary) ~ poly(age_n, 3) + sex + year_n + sector, data = temp) summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "none")) anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "none")) model <- lm(log(salary) ~ poly(age_n, 3) * sector + sex + year_n, data = temp) summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and age")) anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and age")) model <- lm(log(salary) ~ poly(age_n, 3) + sector * sex + year_n, data = temp) summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and sex")) anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and sex")) model <- lm(log(salary) ~ poly(age_n, 3) + year_n * sector + sex, data = temp) summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and year")) anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and year")) model <- lm(log(salary) ~ poly(age_n, 3) * sector * sex * year_n, data = temp) summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector, year, age and sex")) anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector, year, age and sex")) } }
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
Figure 1: The significance of the sector on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018
merge(summary_table, anova_table, by = c("ssyk", "interaction"), all = TRUE) %>%
  filter(term.x == "year_n") %>%
  filter(contcol.y > 0) %>%
  # only look at the interactions between all four variables in the case with interaction sector, year, age and sex
  filter(!(contcol.y < 3 & interaction == "sector, year, age and sex")) %>%
  mutate(estimate = (exp(estimate) - 1) * 100) %>%
  ggplot() +
  geom_point(mapping = aes(x = estimate, y = statistic.y, colour = interaction)) +
  labs(
    x = "Increase in salaries (% / year)",
    y = "F-value for interaction"
  )
Figure 2: The significance of the interaction between sector, age, year and sex on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018
The tables with all occupational groups sorted by F-value in descending order.
[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]
And a new version of digest is getting onto CRAN now, and to Debian shortly.
digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, and spookyhash algorithms) permitting easy comparison of R language objects. It is a fairly widely-used package (currently listed at 889k monthly downloads with 255 direct reverse dependencies and 7340 indirect reverse dependencies) as many tasks may involve caching of objects for which it provides convenient general-purpose hash key generation.
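As a quick illustration of the package's basic API (my own example, not part of the release note):

library(digest)

digest(mtcars)                                             # md5 hash of the serialised R object (the default)
digest(mtcars, algo = "sha256")                            # same object, different algorithm
digest("hello world", algo = "crc32", serialize = FALSE)   # hash a raw string instead of a serialised object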
This release is a one-issue fix. Aaron Lun noticed some issues when spookyhash is used in streaming mode. Kendon Bell, who also contributed spookyhash, quickly found the issue, which was a simple oversight. This was worth addressing in a new release, so I pushed 0.6.25.
[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]
BMI has an expectations management problem
Body Mass Index (BMI) is an attempt to give a quick back-of-envelope answer to the question "if someone weighs W kg, is that a lot or not very much?" Clearly the answer to that question has to take into account at a minimum the person's height; in general, whatever may constitute a healthy weight, it has to be higher for tall people. So BMI – the weight (in kilograms) divided by the height (in metres) squared – gives a scaled version of weight that aims to be roughly comparable for people of different heights. It is actually extremely successful in this regard; certainly it tells you a lot more about someone than either the weight or height individually.
However, BMI is one of those single dimensional metrics that gets widespread criticism on account of being a single dimensional metric. Criticisms of these types of measures typically come in three varieties:
reality is much more complex than any single measure M could represent;
…but if you tweaked it (by method Y) it would somehow be much better;
…anyway, I think Z is really important. M is a really bad measure of Z.
Gross Domestic Product wins the award for attracting the biggest collection of criticisms in this pattern, but BMI would be high up in the list of also-rans. I won't bother linking to individual articles, because googling "why is BMI a bad measure" comes up with plenty to choose from.
The first of the criticisms listed above is always correct, but uninteresting given we need to somehow model the world into a simpler form to grasp it. A one-to-one scale map is no use for navigation.
The third criticism is often (not always) a red herring. In particular, it can introduce a straw man fallacy, implicitly or explicitly accusing the metric of not being something it never claimed to be. This can divert discussion into technical areas when the real disagreement is about something more substantive.
The second criticism ("if you tweaked it, it would be better") can be interesting and fruitful, but in my experience is often standing as a proxy for one of the other two criticisms, which can cause more confusion and talking-past-one-another.
BMI attracts criticisms in all three categories. Clearly, a person's state of health is much more complex than can be represented by a single number and "BMI doesn't tell the whole story" (criticism one with a smattering of three). Equally clearly, sometimes – perhaps always – whether someone is obese, normal weight or underweight is not the most important thing to know about them (criticism three). BMI is an inaccurate measure of body fat content (well, yes, of course) and a 'terrible measure of health' (still the straw man of criticism three).
On average and controlling for other factors, obesity and high BMI are associated with poor health outcomes, but in some unusual circumstances obesity seems to be a protective factor (criticism three – again, not clear why this is seen as a critique of BMI as a measure rather than an interesting fact about how that measure relates to outcomes). Some very well-muscled and clearly healthy individuals have high BMIs that would put them in the 'overweight' category by WHO standards (I think this is a variant of criticism one with a dusting of three). Or BMI is just calculated the wrong way – it should have height to the power of 2.5, not 2 (criticism two).
I want to pass on most of this debate, not because it's not important and interesting (it is clearly both) but because I don't have enough time and space just now. I do want to zoom in on criticism two however, and in particular the widespread claim that "BMI overexaggerates weight in tall people and underexaggerates weight in short people". Relatedly, I have wondered (as I'm sure many many people have before me), why divide weight by height to the power of 2? Why 2? Why not 3, given that if people of different heights had the same proportions we would expect weight to be proportionate to height cubed? It turns out that this is because people's horizontal dimensions don't increase as much as their height does; but by how much less?
You don't have to follow your nose far in the discussions on BMI to come across references to Oxford University Professor of Numerical Analysis Nick Trefethen's letter to the Economist on 5 January 2013:
SIR – The body-mass index that you (and the National Health Service) count on to assess obesity is a bizarre measure. We live in a three-dimensional world, yet the BMI is defined as weight divided by height squared. It was invented in the 1840s, before calculators, when a formula had to be very simple to be usable. As a consequence of this ill-founded definition, millions of short people think they are thinner than they are, and millions of tall people think they are fatter.
Trefethen elsewhere expands on his idea of a BMI for the post-calculator age, providing the archetypal exemplar of my second criticism category. His proposed approach is to put height to the power of 2.5, and scale the result by multiplying it by 1.3 so the numbers are similar to the current familiar range:
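To make the direction of the claimed distortion concrete, here is a small illustration of my own (not from Trefethen or the original post) comparing the two formulas for a hypothetical tall person and a hypothetical short person:

# Standard BMI and Trefethen's proposed alternative
bmi_old <- function(weight, height) weight / height^2
bmi_new <- function(weight, height) 1.3 * weight / height^2.5

# A hypothetical 1.90 m, 90 kg person: about 24.9 under the old formula, 23.5 under the new one
c(old = bmi_old(90, 1.90), new = bmi_new(90, 1.90))

# A hypothetical 1.60 m, 63 kg person: about 24.6 under the old formula, 25.3 under the new one
c(old = bmi_old(63, 1.60), new = bmi_new(63, 1.60))

Under the proposed formula the tall person looks thinner and the short person heavier, which is exactly the adjustment the letter argues for.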
The scaling (which is designed to make the new BMI identical to the old for someone who is 1.69 metres tall) seems an unnecessary complication. After all, BMI is a completely artificial measure and we can put the boundaries for underweight, healthy, overweight etc wherever we wish; if we want a "new BMI" measure there is no reason for it to appear similar in magnitude to the old one (and in fact some good arguments against).
Trefethen's views are mentioned in most post-2013 critiques of BMI, nearly always uncritically, for example in the subheading "BMI exaggerates thinness in short people and fatness in tall people" in this MedicalNewsToday piece, which quotes Trefethen as saying "Certainly if you plot typical weights of people against their heights, the result comes out closer to height ^ 2.5 than height ^ 2" (I haven't been able to find the original source of this, but it certainly is consistent with what I have been able to find written by Trefethen). But despite the widespread acceptance of Trefethen's critique everywhere from the popular press to Reddit sites for tall people, I can't see any actual evidence for the use of 2.5. So are "short people fatter than they think", as The Telegraph reported Oxford University researchers to have found?
How do height and weight relate in a real population?
Very well. Let us look at some data. The US Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS) is an extraordinary on-going survey of around 400,000 adult interviews each year. It claims to be the largest continuously conducted health survey system in the world, and I imagine it is one of the largest continuous surveys of any purpose. It includes questions on (self-reported) height and weight as well as standard demographics.
Let's start by looking at the relationship between weight and height, without any calculated summary of the two:
There are a few interesting things here. It is clear that the relationship between weight and height varies with sex, ethnicity and age. In particular, the relationship between height and weight is stronger for men (as in a steeper line) in all races and age groups. Also, the average values of these variables differ across groups: women are shorter and weigh less than men; people with an Asian ethnic background are shorter and lighter than the general population; and so on.
I've applied as a layer of grey the healthy weights for a person of given height implied by the World Health Organization's BMI guidelines. This makes it clear that most US Americans weigh more than is recommended for their height, which of course is well known.
What does this mean for BMI?
My next step is to convert those weight and height measures into BMI. If BMI is a "fair" measure of how far an individual is from the average weight at their height, we would expect to see no relationship between height and BMI. That is, BMI should capture all the structural information about the norm for weight available in height. The essence of Trefethen's argument is that this is not the case, and that traditional BMI is systematically higher for taller people because the impact of height is shrunk by choosing a power of only 2 rather than 2.5.
Let us see:
In fact, we see that there is a very marked negative relationship between height and BMI, particularly for women, but also present for men. This is the precise opposite of Trefethen's claim, which is that tall people get exaggeratedly high BMIs in the current standard calculation. Based on sheer empiricism, we could make the case that BMI understates the scaled weight of tall people.
(There's an important nuance here, which is that I'm only looking at a 'positive' relationship. We can't deduce a normative statement from this. But I think that's a question for another day. I'm also disregarding for now measurement error, a survey in one time and place, and other restrictions on making general inferences from this sample. After all, this is only a blog.)
Here's an interesting point for those interested in gender relations and the development of science and measurement. The slopes in the chart above for men are nearly horizontal, certainly more so than for women; and most horizontal for people described as 'white, non-Hispanic'. One interpretation for this would be that:
"the current BMI calculation is a fair representation of the relationship between height and weight for white, non-Hispanic men, but underplays the weight of taller people (and exaggerates the weight of shorter) of other racial backgrounds and of women in general."
What power for height is weight proportional to?
I provided a second-hand quote of Trefethen earlier in which he argued that "if you plot typical weights of people against their heights, the result comes out closer to height ^ 2.5 than height ^ 2". Is this the case? Actually, there's nothing visual to tell at all, in my opinion:
We need to calculate. Luckily, if weight is expected to be proportionate to height to the power of something, then the logarithms of the two will have a linear relationship and the height coefficient of a regression of log(weight) on log(height) will tell us what the power is.
An interesting statistical issue arises here – should we do this regression for the full population of Americans, or should we let that coefficient vary by race, sex and age? From a modelling perspective, it definitely is a different value for different race, sex and age.
I decided fairly arbitrarily that I wanted a single exponent to raise height to, but that I wanted that exponent calculated after controlling for other obviously important characteristics. Fitting a regression of the type
log(weight) ~ log(height) + sex + race + age
gets me a slope for log(height) of 1.55. In other words, after controlling for sex, race and age, Americans' weight varies in proportion to their height to the power of 1.55.
Alternatively, if we treat the population as one and fit the regression:
log(weight) ~ log(height)
We get 1.87.
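For those who want to jump ahead of the full code listing at the end of the post, the two fits boil down to the following; the survey design object is defined the same way there, and the exponent is simply the coefficient on log(height):

library(survey)

# Survey design as set up in the full code chunk below
llcp_svy <- svydesign(~psu, weights = ~survey_weight, data = llcp_all)

coef(svyglm(log(weight) ~ log(height), design = llcp_svy))["log(height)"]                    # about 1.87
coef(svyglm(log(weight) ~ log(height) + sex + race + age, design = llcp_svy))["log(height)"] # about 1.55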
I imagine I would get different values again with different populations or times.
So what do I conclude? Frankly, that "2" is good enough. I think the Trefethen critique of BMI is at the higher level an unnecessary distraction, and at the level of detail he almost certainly gets the direction of misrepresentation wrong (it is short people who are made to look fatter than they are by the standard calculation, not tall people).
Overall, having poked and prodded this measure a bit, I think that it meets many of the criteria of a useful measure. It's easy to calculate, easy to explain, lends itself to simple benchmarking, doesn't need specialist skills for data collection, and is clearly closely related to major outcomes of interest.
I doubt BMI is the best single number to represent someone's health, but I don't think this has often been claimed for it. It probably isn't even the best single number regarding the cluster of issues relating to weight, shape, size and body fat content, but it's not a bad starting point. It might well be the most cost-effective such measure in many situations. So I'm happy to say that as a metric, BMI isn't a bad one so long as you don't expect it to do more than is reasonable.
Population versus individuals
I want to make a final point about one strand of the BMI debate. We often see BMI critiqued (or defended) from the angle that "BMI is meant to be a population measure about averages, not something that is meaningful for individuals". As sometimes expressed, this seems to me completely incoherent. If the average BMI tells us anything about a population, it is because it is meaningful for individuals; the population effect is the sum of a bunch of small individual effects. I think this argument is a misguided extrapolation from the wise caution that one shouldn't try to diagnose the health of an individual from just their BMI. That seems absolutely fair. But this doesn't mean BMI tells us nothing about the individual, just that the BMI alone is not determinative of outcomes.
Code
Here's all the R code for the above in a single chunk. It's not terribly exciting, but there are some interesting visualisation challenges associated with:
the lumping of reported height and weight (I jitter points to make them more distinguishable)
taking care to convert measurements which can be either in metric or medieval measurements to a common unit
unequal survey weights
large sample sizes (with nearly half a million points, how to represent them on the screen? I use a high degree of transparency to get around this).
When it comes to the actual modelling that treats the survey weights and the sampling structure appropriately, Thomas Lumley's survey package rules supreme in terms of ease of use.
library(tidyverse)
library(scales)
library(patchwork)
library(MASS)
library(rio)
library(survey)
library(Cairo)

#--------------CDC annual Behavioral Risk Factor Surveillance System (BRFSS) survey-
# 69.5 MB download, so only uncomment this and run it once:
# download.file("https://www.cdc.gov/brfss/annual_data/2018/files/LLCP2018XPT.zip",
#               destfile = "LLCP2018XPT.zip", mode = "wb")
#
# unzip("LLCP2018XPT.zip")

llcp <- import("LLCP2018.XPT")
dim(llcp) # 437,436 respondents and 275 questions!

# Survey weights. These are very unbalanced so take care:
summary(llcp$"_LLCPWT")

# Codebook:
# https://www.cdc.gov/brfss/annual_data/2018/pdf/codebook18_llcp-v2-508.pdf

# Function to clean and tidy up variables from the original
code_vars <- function(d){
  d %>%
    # height in metres:
    mutate(height = case_when(
      HEIGHT3 >= 9000 & HEIGHT3 <= 9998 ~ as.numeric(str_sub(HEIGHT3, 2, 4)) / 100,
      HEIGHT3 < 800 ~ (as.numeric(str_sub(HEIGHT3, 1, 1)) * 12 + as.numeric(str_sub(HEIGHT3, 2, 3))) * 2.54 / 100,
      TRUE ~ NA_real_)) %>%
    # weight in kg:
    mutate(weight = case_when(
      WEIGHT2 < 1000 ~ WEIGHT2 * 0.453592,
      WEIGHT2 >= 9000 & WEIGHT2 <= 9998 ~ as.numeric(str_sub(WEIGHT2, 2, 4)),
      TRUE ~ NA_real_),
      bmi = weight / height^2,
      trefethen = 1.3 * weight / height^2.5,
      # BMI cutoff points from https://www.who.int/gho/ncd/risk_factors/bmi_text/en/
      who_cat = cut(bmi, c(0, 18.5, 25, 30, Inf), labels = c("Underweight", "Healthy", "Overweight", "Obese"))) %>%
    mutate(sex = case_when(
      SEX1 == 1 ~ "Male",
      SEX1 == 2 ~ "Female",
      TRUE ~ NA_character_)) %>%
    mutate(race = case_when(
      `_HISPANC` == 1 ~ "Hispanic, Latino/a, or Spanish origin",
      `_PRACE1` == 1 ~ "White, non-Hispanic",
      `_PRACE1` == 2 ~ "Black or African American",
      `_PRACE1` == 3 ~ "American Indian or Alaskan Native",
      `_PRACE1` == 4 ~ "Asian",
      `_PRACE1` == 5 ~ "Pacific Islander",
      `_PRACE1` == 6 ~ "Other",
      TRUE ~ NA_character_),
      race = fct_relevel(race, "White, non-Hispanic")) %>%
    mutate(race2 = fct_collapse(race, "Other" = c("Pacific Islander", "Other", "American Indian or Alaskan Native")),
           race2 = str_wrap(race2, 20),
           race2 = fct_reorder(race2, `_LLCPWT`, .fun = sum)) %>%
    mutate(age = case_when(
      `_AGE65YR` == 1 ~ "Age 18 to 64",
      `_AGE65YR` == 2 ~ "Age 65 or older",
      NA ~ NA_character_)) %>%
    mutate(race3 = str_wrap(race2, 20),
           race3 = fct_reorder(race3, `_LLCPWT`, .fun = sum)) %>%
    mutate(hhinc = case_when(
      INCOME2 == 1 ~ "Less than $10k",
      INCOME2 == 2 ~ "$10k to $15k",
      INCOME2 == 3 ~ "$15k to $20k",
      INCOME2 == 4 ~ "$20k to $25k",
      INCOME2 == 5 ~ "$25k to $35k",
      INCOME2 == 6 ~ "$35k to $50k",
      INCOME2 == 7 ~ "$50k to $75k",
      INCOME2 == 8 ~ "More than $75k",
      TRUE ~ NA_character_),
      hhinc = fct_reorder(hhinc, INCOME2)) %>%
    mutate(education = case_when(
      EDUCA == 1 ~ "Never attended school or only kindergarten",
      EDUCA == 2 ~ "Grades 1 through 8",
      EDUCA == 3 ~ "Grades 9 through 11",
      EDUCA == 4 ~ "Grade 12 GED",
      EDUCA == 5 ~ "College 1 to 3 years",
      EDUCA == 6 ~ "College 4 years or more",
      TRUE ~ NA_character_),
      education = fct_reorder(education, EDUCA)) %>%
    # remove people taller than the world record height:
    filter(height < 2.72) %>%
    rename(psu = `_PSU`, survey_weight = `_LLCPWT`) %>%
    dplyr::select(height, weight, sex, race, race2, race3, age, hhinc, education, bmi, psu, survey_weight)
}

the_caption = "Source: http://freerangestats.info analysis of the US CDC Behavioral Risk Factor Surveillance System Survey for 2018"

# small subsampled version of the data used during dev, not referred to in blog:
set.seed(123)
llcp_small <- llcp %>%
  sample_n(20000, weight = `_LLCPWT`, replace = TRUE) %>%
  code_vars()

nrow(distinct(llcp_small)) # 17,370 - remembering higher weight observations likely to be resampled twice, and dropping NAs

# full data, with tidied variables
llcp_all <- llcp %>%
  code_vars()

# Make a data frame for shading of healthy BMI, to get around problems with rectangles and log axes. See
# https://stackoverflow.com/questions/52007187/background-fill-via-geom-rect-is-removed-after-scale-y-log10-transformation
healthy <- tibble(xmin = 0.9, xmax = 2.6, ymin = 18.5, ymax = 25,
                  race3 = unique(llcp_small$race3)) %>%
  drop_na()

# data frame to use for shading of healthy weight implied by BMI
healthy_by_height <- healthy %>%
  dplyr::select(ymin, ymax, race3) %>%
  gather(variable, bmi, -race3) %>%
  mutate(link = 1) %>%
  full_join(tibble(height = seq(from = 0.9, to = 2.3, length.out = 100), link = 1), by = "link") %>%
  mutate(weight = bmi * height^2) %>%
  dplyr::select(-link, -bmi) %>%
  spread(variable, weight)

p1b <- llcp_all %>%
  dplyr::select(height, weight, sex, age, race3, survey_weight) %>%
  drop_na() %>%
  ggplot(aes(x = height)) +
  facet_grid(age ~ race3) +
  geom_ribbon(data = healthy_by_height, aes(ymin = ymin, ymax = ymax), colour = NA, fill = "grey90") +
  geom_jitter(alpha = 0.02, aes(y = weight, colour = sex, size = survey_weight)) +
  geom_smooth(method = "rlm", se = FALSE, aes(y = weight, colour = sex)) +
  scale_size_area(label = comma) +
  theme(panel.grid.minor = element_blank()) +
  labs(title = "Relationship between self-reported weight and height in USA adults",
       subtitle = "Men's weight is more dependent on their height than is the case for women. Grey ribbon shows WHO-recommended healthy weight. Straight lines show empirical relationship of weight and height.",
       x = "Height (m)",
       y = "Weight (kg)",
       caption = the_caption,
       colour = "",
       size = "Each point represents X Americans:") +
  guides(size = guide_legend(override.aes = list(alpha = 1)))

print(p1b)

p3 <- llcp_all %>%
  drop_na() %>%
  ggplot(aes(x = height, y = bmi, colour = sex)) +
  facet_grid(age ~ race3) +
  geom_rect(data = healthy, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            colour = NA, fill = "grey90", inherit.aes = FALSE) +
  geom_jitter(alpha = 0.02, aes(size = survey_weight)) +
  geom_smooth(method = "rlm", se = FALSE, size = 0.5) +
  scale_size_area(label = comma) +
  scale_x_log10(breaks = c(0.9, 1.1, 1.4, 1.8, 2.3)) +
  scale_y_log10() +
  theme(panel.grid.minor = element_blank()) +
  labs(title = "Relationship between BMI and height in USA adults",
       subtitle = bquote("Taller people tend to have lower BMIs because on average weight is proportionate to less than the square of height (in fact, "~weight%prop%height^1.6~")"),
       x = "Height (metres), logarithmic scale",
       y = bquote("Body Mass Index: "~frac(weight,height^2)~", logarithmic scale"),
       caption = the_caption,
       colour = "",
       size = "Each point represents X Americans:") +
  guides(size = guide_legend(override.aes = list(alpha = 1)))

print(p3)

#------------
p4 <- llcp_all %>%
  mutate(h2 = height^2, h2.5 = height^2.5) %>%
  dplyr::select(h2, h2.5, weight, survey_weight) %>%
  drop_na() %>%
  gather(variable, value, -weight, -survey_weight) %>%
  mutate(variable = gsub("h2", expression(height^2), variable)) %>%
  ggplot(aes(x = value, y = weight)) +
  facet_wrap(~variable, scales = "free_x") +
  geom_jitter(alpha = 0.02, aes(size = survey_weight)) +
  geom_smooth(method = 'gam', formula = y ~ s(x, k = 5)) +
  scale_size_area(label = comma) +
  labs(title = bquote("Weight compared to "~height^2~"and"~height^2.5),
       subtitle = "It's not possible to choose visually which of the two better expresses the relationship.",
       y = "Weight (kg)",
       x = "Height to the power of 2.0 or 2.5",
       size = "Each point represents X Americans:") +
  guides(size = guide_legend(override.aes = list(alpha = 1)))

print(p4)

#-------Models---------------
llcp_svy <- svydesign(~psu, weights = ~survey_weight, data = llcp_all)

model0 <- svyglm(log(weight) ~ log(height), design = llcp_svy)
model1 <- svyglm(log(weight) ~ log(height) + sex + race + age, design = llcp_svy)

anova(model1)
summary(model1)
# best exponent for height is 1.55, not 2 (this is the raw coefficient for height)
# men of the same race, age and height weigh 6% more than women (exp(0.056))
# Asians weight 89% of white non-hispanic of same sex and age:
exp(coef(model1))
[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers.]
Introduction
It is almost the 29th of February 2020! A day that is very interesting for R, because it marks 20 years since the release of R v1.0.0, the first official public release of the R programming language.
In this post, we will look back on the 20 years of R with a bit of history and three interesting perspectives: how much faster R has become over the years, how many R packages have been released since 2000, and how the number of package downloads has grown.
The first official public release of R happened on the 29th of February, 2000. In the release announcement, Peter Dalgaard notes:
"The release of a current major version indicates that we believe that R has reached a level of stability and maturity that makes it suitable for production use. Also, the release of 1.0.0 marks that the base language and the API for extension writers will remain stable for the foreseeable future. In addition we have taken the opportunity to tie up as many loose ends as we could."
Today, 20 years later, it is quite amazing how true the statement around the API remaining stable has proven. The original release announcement and full release statement are still available online.
You can also still download the very first public version of R. For instance, for Windows you can find it on the Previous Releases of R for Windows page. And it is quite runnable, even under Windows 10.
Further down in history, to 1977
Now to give R justice in terms of age, we need to go even further into history. In the full release statement of R v1.0.0, we can find that
R implements a dialect of the award-winning language S, developed at Bell Laboratories by John Chambers et al.
With some digging we can use the Wayback Machine Internet Archive to find interesting notes on Version 1 of S itself written by John Chambers, where he writes:
Over the summer of 1976, some actual implementation began. The paper record has a gap over this period (maybe we were too busy coding to write things down). My recollection is that by early autumn, a language was available for local use on the Honeywell system in use at Murray Hill. Certainly by early 1977 there was software and a first version of a user's manual.
As we can see, the ideas and principles behind R are actually much older than 20 years, and even 40 years. If you are interested in the history, I recommend watching the very interesting 40 years of S talk from useR! 2016.
Faster – How performant is R today versus 20 years ago?
With the 20th birthday of R approaching, I was curious how much faster the implementation of R has become with successive versions. I wrote a very simple benchmarking code to solve the Longest Collatz sequence problem for the first 1 million numbers with a brute-force-ish algorithm.
I then executed it on the same hardware using 20 different versions of R, starting with the very original 1.0, through 2.0 and 3.0, all the way to today's development version.
Benchmarking code
Below is the code snippet with the implementation to be benchmarked:
col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0) {
      n <- n / 2
    } else {
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

res <- lapply(
  1:10,
  function(i) {
    gc()
    system.time(
      max(sapply(seq(from = 1, to = 999999), col_len))
    )
  }
)
Results
Now to the interesting part, the results – the below chart shows the boxplots of time required to execute the code in seconds, with R versions on the horizontal axis.
We can see that the median time to execute the above code to find the longest Collatz sequence amongst the first million numbers was:
February 2000: More than 17 minutes with the first R version, 1.0.0
January 2002: A large performance boost came already with the 1.4.1 release, decreasing the time by almost 4x, to around 4.5 minutes
October 2004: Even more interestingly, my measurements have seen another big improvement with version 2.0.0 – to just 168 seconds, less than 3 minutes. I was not however able to get such good results for any of the later 2.x versions
April 2014 – Another speed improvement came 10 years later, with version 3.1 decreasing the time to around 145 seconds
April 2017 – Finally, the 3.4 release has seen another significant performance boost, from this version on the time needed to perform this calculation is less than 30 seconds.
Some details and notes
The above is by no means a proper benchmarking solution and was run purely out of interest. The benchmarks were run:
on a Windows-based PC with an Intel Core (TM) i5-4590 processor and 8 GB of DDR3 1600 MHz RAM
using 32-bit versions of R, with no additional packages installed
with the following options used for R 1.0.0: --vsize=900M --nsize=20000k
Some interesting notes on running the same code with a 20-year-old version of R:
There was no message() function available
Integer literals using the L suffix were not accepted
The function do.call() needed a character function name as the first argument
Did not accept = for assignment. It did accept _ though
Other than that, the code ran with no issues across all the tested versions.
Stronger – How many packages were released over the years?
The power of R comes by no small part from the fact that it is easily extensible and the extensions are easily accessible using The Comprehensive R Archive Network, known to most simply as CRAN.
Next on the list of interesting numbers was to look at how CRAN has grown into the powerhouse of more than 15 000 available packages it is today. Namely, I looked at the numbers of new packages (first releases to CRAN) and of total releases (including newer versions of existing packages) over the years, using the pkgsearch package.
Results
Once again, the numbers speak for themselves
In 2000-2004 the number of newly released packages was less than 100
In 2010 CRAN saw more than 400 new packages
In 2014 more than 1000 packages had their first release
In 2017 over 2000 new packages were added to CRAN
In 2018 and 2019, the number of total CRAN releases was more than 10 000
I would like to take this opportunity to thank the team behind CRAN for making this amazing growth possible.
Bigger – How did downloads of R packages grow?
The size of the user and developer bases of programming languages is difficult to estimate, but we can use a simple proxy to get a picture in terms of growth. RStudio's CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 7 years:
Note that the numbers above represent just one of many CRAN mirrors, so the true number of package downloads is much higher. The informational value of the chart is mostly in the growth, which is quite impressive.
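The chart itself appears as a figure in the original post. For readers who want to reproduce something similar, here is a sketch using the {cranlogs} package, which wraps the same cranlogs.r-pkg.org API; the package choice and the date range are my assumptions, not necessarily what the author used.

library(cranlogs)
library(dplyr)
library(ggplot2)

# Daily download counts across all CRAN packages, from the RStudio mirror logs
daily <- cran_downloads(from = "2013-01-01", to = "2019-12-31")

monthly <- daily %>%
  mutate(month = as.Date(format(date, "%Y-%m-01"))) %>%
  group_by(month) %>%
  summarise(downloads = sum(count))

ggplot(monthly, aes(x = month, y = downloads)) +
  geom_line() +
  labs(title = "Monthly package downloads from RStudio's CRAN mirror",
       x = NULL, y = "Downloads per month")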
[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers.]
RAthena 1.7.1 and noctua 1.5.1 have now been released to CRAN. They both bring several improvements to the connection to AWS Athena, notably in performance, along with several creature comforts.
These packages have both been designed to reflect one another, even down to how they connect to AWS Athena. This means that all features going forward will exist in both packages. I will refer to these packages as one, as they basically work in the same way.
Performance improvements:
Initially the packages utilised AWS Athena SQL queries for everything. This was to achieve all the functional requirements of the DBI package framework. However, the packages would always send a SQL query to AWS Athena, which in turn would have to lift a flat file from AWS S3 before returning the final result to R. This meant the performance of the packages was limited and fairly slow compared to other database backends.
The biggest change is the adoption of more functionality from the AWS SDKs (software development kits). The key component that has been adopted is AWS Glue. AWS Glue contains all of AWS Athena's table DDLs, which means that instead of going to AWS Athena for this information, AWS Glue can be used instead.
By utilising AWS Glue, the table metadata (column names, column types, schema hierarchy etc.) can easily be retrieved, at a fraction of the time it would have taken to query AWS Athena. Previously, the DBI function dbListTables would send a query to AWS Athena to retrieve all the tables listed in all schemas, which would take over 3 seconds. Now, using AWS Glue to retrieve the same data, it takes less than half a second.
dplyr
When AWS Glue is used to collect metadata about a table in AWS Athena, a performance improvement in dplyr::tbl can be achieved. I would like to say thanks to @OssiLehtinen for developing the initial implementation, as this improvement would otherwise have been overlooked.
dplyr::tbl has two key methods when creating the initial object. The first is called SQL identifiers, and this is the method that benefits from the new AWS Glue functionality. Using SQL identifiers is fairly straightforward.
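The snippet itself is not included in this extract; based on the sub-query example further down, the identifier form of the call presumably looks like this:

ident_iris = tbl(con, "iris")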
dplyr can then identify the iris table within the connected schema. When a user uses the SQL identifier method in dplyr::tbl, AWS Glue is called to retrieve all the metadata for dplyr. This reduces the time taken to create the object from 3.66 to 0.29 seconds. The second method is called SQL sub query. This unfortunately won't benefit from the new feature and will run more slowly, at around 3.66 seconds:
subquery_iris = tbl(con, sql("select * from iris"))
I therefore recommend using the SQL identifier method when working with dplyr's interface.
Creature Comforts
AWS Athena Metadata
Following user feature requests, the packages now return more metadata about each query sent to AWS Athena. The most basic piece of metadata returned is the amount of data scanned by AWS Athena, formatted into a human-readable unit depending on how much data was scanned.
library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris")
res = dbExecute(con, "select * from iris")

# return query statistic
query_stats = dbStatistics(res)

# return query results
dbFetch(res)

# Free all resources
dbClearResult(res)
RJDBC inspired function
I have to give full credit to the RJDBC package for inspiring me to create this function. DBI has a good function called dbListTables that will list all the tables in AWS Athena, but it won't return which schema each individual table belongs to. To overcome this, RJDBC has an excellent function called dbGetTables, which returns all the tables from AWS Athena as a data.frame. This has the advantage of detailing schema, table and table type. With the new integration with AWS Glue, this can be returned quickly.
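Assumed usage, mirroring RJDBC's interface (the exact column names of the returned data.frame may differ):

dbGetTables(con)  # one row per table: schema, table name and table type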
This just makes it a little bit easier when working in different IDEs, for example Jupyter.
Backend option changes
This is not really a creature comfort, but it is still interesting and useful. Both packages depend on data.table to read data into R, because of the amazing speed data.table offers when reading files. However, a new package with equally impressive read speeds, vroom, has come onto the scene. As vroom has been designed only to read data into R (similarly to readr), data.table is still used for all of the other heavy lifting. If a user wishes to use vroom as the file parser, an *_options function has been created to enable this:
noctua_options(file_parser = c("data.table", "vroom"))
# Or
RAthena_options(file_parser = c("data.table", "vroom"))
By setting file_parser to "vroom", the backend changes so that vroom's file parser is used instead of data.table.
If you aren't sure whether to use vroom over data.table, I draw your attention to vroom boasting a whopping 1.40GB/sec throughput.
Statistics taken from vroom's github readme
package      version   time (sec)   speed-up   throughput
vroom        1.1.0     1.14         58.44      1.40 GB/sec
data.table   1.12.8    11.88        5.62       134.13 MB/sec
readr        1.3.1     29.02        2.30       54.92 MB/sec
read.delim   3.6.2     66.74        1.00       23.88 MB/sec
RStudio Interface!
Due to the ability of AWS Glue to retrieve metadata for AWS Athena at speed, it has now been possible to add the interface into RStudio's connection tab. When a connection is established:
library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())
The connection icon will then appear as follows:
The AWS region you are connecting to is reflected in the connection name (highlighted above in the red square). This is to help users who are able to connect to multiple AWS Athena instances in different regions.
Once you have connected to AWS Athena, the schema hierarchy will be displayed. In my example you can see some of the tables I have created when testing these packages.
To sum up, the latest versions of RAthena and noctua have been released to CRAN with all the new goodies they bring. As these packages are based on AWS SDKs, they are highly customisable, and features can easily be added to improve the experience of connecting to AWS Athena. So please raise any feature requests / bug issues at: https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua
[This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]
Correlation, often computed as part of descriptive statistics, is a statistical tool used to study the relationship between two variables, that is, whether and how strongly couples of variables are associated.
Correlations are measured between only 2 variables at a time. Therefore, for datasets with many variables, computing correlations can become quite cumbersome and time consuming.
Correlation matrix
A solution to this problem is to compute correlations and display them in a correlation matrix, which shows correlation coefficients for all possible combinations of two variables in the dataset.
For example, below is the correlation matrix for the dataset mtcars (which, as described by the help documentation of R, comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles):1
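The matrix itself is displayed as a table in the original article; it can be reproduced with one line of base R:

round(cor(mtcars), 2)  # correlation matrix of all mtcars variables, rounded to 2 digits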
Even after rounding the correlation coefficients to 2 digits, you will agree that this correlation matrix is not easily and quickly interpretable.
If you are using R Markdown, you can use the pander() function from the {pander} package to make it slightly more readable, but still, we must admit that this table is not optimal when it comes to visualizing correlations between several variables of a dataset, especially for large datasets.
Correlogram
To tackle this issue and make it much more insightful, let's transform the correlation matrix into a correlation plot. A correlation plot, also referred to as a correlogram, highlights the variables that are most (positively and negatively) correlated. Below is an example with the same dataset presented above:
The correlogram represents the correlations for all pairs of variables. Positive correlations are displayed in blue and negative correlations in red. The intensity of the color is proportional to the correlation coefficient so the stronger the correlation (i.e., the closer to -1 or 1), the darker the boxes. The color legend on the right hand side of the correlogram shows the correlation coefficients and the corresponding colors.
As a reminder, a negative correlation implies that the two variables under consideration vary in opposite directions, that is, if one variable increases the other decreases and vice versa. A positive correlation implies that the two variables under consideration vary in the same direction, that is, if one variable increases the other increases and if one variable decreases the other decreases as well. Furthermore, the stronger the correlation, the stronger the association between the two variables.
Correlation test
Finally, a white box in the correlogram indicates that the correlation is not significantly different from 0 at the specified significance level (in this example, at \(\alpha = 5\)%) for the couple of variables. A correlation not significantly different from 0 means that there is no linear relationship between the two variables considered (there could still be another, non-linear, kind of association).
To determine whether a specific correlation coefficient is significantly different from 0, a correlation test has been performed. Remind that the null and alternative hypotheses of this test are:
\(H_0\): \(\rho = 0\)
\(H_1\): \(\rho \ne 0\)
where \(\rho\) is the correlation coefficient. The correlation test is based on two factors: the number of observations and the correlation coefficient. The more observations and the stronger the correlation between 2 variables, the more likely it is to reject the null hypothesis of no correlation between these 2 variables.
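For reference, the test statistic behind this is the standard one for Pearson's correlation, \( t = r \sqrt{\frac{n-2}{1-r^2}} \), which under \(H_0\) follows a Student t distribution with \(n - 2\) degrees of freedom; both a larger sample size \(n\) and a larger sample correlation \(r\) push it further away from 0.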
In the context of our example, the correlogram above shows that the variables cyl (the number of cylinders) and hp (horsepower) are positively correlated, while the variables mpg (miles per gallon) and wt (weight) are negatively correlated (both correlations make sense if we think about it). Furthermore, the variables gear and carb are not correlated. Even if the correlation coefficient is 0.27 between the 2 variables, the correlation test has shown that we cannot reject the hypothesis of no correlation. This is the reason the box for these two variables is white.
Although this correlogram presents exactly the same information as the correlation matrix, it does so visually, allowing you to quickly scan through it to see which variables are correlated and which are not.
Code
For those interested in drawing this correlogram with their own data, here is the code of the function I adapted from the corrplot() function of the {corrplot} package (thanks again to all contributors of this package):
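The function itself is not reproduced in this extract. Below is a sketch of what such a wrapper might look like, built around corrplot::corrplot() and corrplot::cor.mtest(); it is an approximation of the idea rather than the author's exact code.

library(corrplot)

corrplot2 <- function(data,
                      method = "pearson",
                      sig.level = 0.05,
                      order = "original",
                      diag = FALSE,
                      type = "upper",
                      tl.srt = 90) {
  data <- data[complete.cases(data), ]          # missing values removed automatically
  mat <- cor(data, method = method)             # correlation matrix
  p.mat <- cor.mtest(data, method = method)$p   # matrix of p-values from pairwise correlation tests
  corrplot(mat,
           method = "color",
           type = type, order = order, diag = diag,
           p.mat = p.mat, sig.level = sig.level, insig = "blank",  # blank out non-significant correlations
           addCoef.col = "black",                                  # print the coefficients in the boxes
           tl.col = "black", tl.srt = tl.srt)                      # label colour and rotation
}

# Example call on the mtcars dataset
corrplot2(mtcars, method = "pearson", sig.level = 0.05,
          order = "original", diag = FALSE, type = "upper", tl.srt = 75)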
The main arguments in the corrplot2() function are the following:
data: name of your dataset
method: the correlation method to be computed, one of "pearson" (default), "kendall", or "spearman". As a rule of thumb, if your dataset contains quantitative continuous variables, you can keep the Pearson method; if you have qualitative ordinal variables, the Spearman method is more appropriate
sig.level: the significance level for the correlation test, default is 0.05
order: order of the variables, one of "original" (default), "AOE" (angular order of the eigenvectors), "FPC" (first principal component order), "hclust" (hierarchical clustering order), "alphabet" (alphabetical order)
diag: display the correlation coefficients on the diagonal? The default is FALSE
type: display the entire correlation matrix or simply the upper/lower part, one of "upper" (default), "lower", "full"
tl.srt: rotation of the variable labels
(note that missing values in the dataset are automatically removed)
You can also play with the arguments of the corrplot2 function and see the results thanks to this R Shiny app.
Thanks for reading. I hope this article will help you to visualize correlations between variables in a dataset and to make correlation matrices more insightful and more appealing.
As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.
The dataset mtcars is preloaded in R by default, so there is no need to import it into R. Check the article "How to import an Excel file in R" if you need help in importing your own dataset.