[R-bloggers] Synthetic micro-datasets: a promising middle ground between data privacy and data analysis (and 6 more aRticles)

Link to R-bloggers

Synthetic micro-datasets: a promising middle ground between data privacy and data analysis

Posted: 22 Feb 2020 04:00 PM PST

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Intro: the need for microdata, and the risk of disclosure

Survey and administrative data are essential for scientific research; however, accessing such datasets
can be very tricky, or even impossible. In my previous job I was responsible for getting access to
such "scientific micro-datasets" from institutions like Eurostat.
In general, getting access to these microdata sets was only a question of filling out some forms
and signing NDAs. But this was true only because my previous employer was an accredited research entity.
Companies from the private sector, or unaffiliated individual researchers, cannot get access to
the microdata sets. This is because institutions that produce such datasets absolutely do not want
any type of personal information to be disclosed.

For instance, with the labour force survey, a National Statistical Institute (NSI) collects
information about wages, family structure, educational attainment and much more.
If, say, a politician answered the survey and his answers leaked to the public, that would
be disastrous for the NSI. This is why access is restricted to accredited research institutions.
You may be asking yourself, "how could the politician's answers leak? The data is anonymized!"
Indeed it is, but in some cases that may not be enough to ensure that information does not get
disclosed. Suppose that the dataset contains enough information to let you identify said politician
with certainty: assume that this politician is a 43-year-old man, has two children,
a PhD in theology and lives in Strassen, one of Luxembourg City's nicest neighbourhoods. It would be quite easy to
find him in the dataset and then look up his wage.

To avoid this, researchers are required to perform output checking, which means going through the
set of outputs (summary tables, graphs, tables with regression coefficients…) and making sure that
it is not possible to identify individuals. For instance, in Luxembourg there are two companies in
the tobacco industry. Luxembourg's NSI cannot release the total turnover
of the industry, because then company A could subtract its own turnover from the total and find out its
competitor's turnover. Now these are all hypothetical examples, and we might argue that the risk of
leakage is quite low, especially if NSIs make sure to lower the precision of the variables, for example
by providing age categories instead of the exact age, or by capping wages that exceed a certain
fixed amount.
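To make that coarsening idea concrete, here is a minimal, purely illustrative sketch; the data frame and column names (survey_data, age, wage) and the capping threshold are made up and are not part of any NSI's actual procedure:

library(dplyr)

# Illustrative only: replace exact ages by age bands and top-code wages
coarsened <- survey_data %>%
  mutate(
    age_band    = cut(age, breaks = c(0, 25, 35, 45, 55, 65, Inf), right = FALSE),
    wage_capped = pmin(wage, 10000)   # cap wages at an arbitrary threshold
  ) %>%
  select(-age, -wage)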
In any case, for now most NSIs don't release microdata to the public, and this poses some challenges
for research. First of all, even for researchers, it would be great if the data were freely accessible.
It would allow researchers to go straight to data analysis and look at the structure of the data before
applying for access, instead of applying first and risking ending up with useless data.
And of course it would be great for the public at large to be able to freely access such data, for
educational purposes at the very least. It would also increase competition between research institutions
and the private sector when it comes to conducting studies using such data. Free access to the
microdata would level the playing field.
Now, some NSIs do release microdata to the public; see Eustat, the NSI of the Basque Country,
an autonomous region of Spain. It is not clear to me whether they also have more detailed data that is
only accessible to researchers, but the data they offer is already quite interesting.

A middle ground between only releasing data to researchers and making it completely freely accessible
is to create a synthetic dataset, which does not contain any of the original records, but which still
allows meaningful analyses to be performed.

I'm not yet very familiar with the details of the procedure, but in this blog post I'll use Eustat's
microdata to generate a synthetic dataset. I'll then perform the same analysis on both the original
dataset and the synthetic dataset. The dataset I'll be using can be found
here, and is called
Population with relation to activity (PRA):

The Survey on the Population in Relation to Activity operation is a continuous source of
information on the characteristics and dynamics of the labour force of the Basque Country.
It records the relation to productive activity of the population resident in family households,
as well as the changes produced in labour situations; it produces indicators of conjunctural
variations in the evolution of the active population; it also estimates the degree of participation
of the population in economically non-productive activities. It offers information on the province
and capital level.

I'll then compare the results of the analyses performed on the two datasets, which will hopefully be
very similar. To create the synthetic dataset, I'll be using the {synthpop} package. You can read
the detailed vignette here (PDF warning).
First, let me perform some cleaning steps. There are four datasets included in the
archive. Let's load them:

library(tidyverse)
library(tidymodels)
library(readxl)
library(synthpop)

list_data <- Sys.glob("MICRO*.csv")

dataset <- map(list_data, read_csv2) %>%
  bind_rows()

head(dataset)

The columns are labeled in Spanish, so I copy-pasted the labels into Google Translate and pasted the
translations back into my script. I saved the English names into the english.rds object for posterity.
These steps are detailed in the next lines:

dictionary <- read_xlsx("Microdatos_PRA_2019/diseño_registro_microdatos_pra.xlsx",
                        sheet = "Valores",
                        col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
col_names <- dictionary %>%
  filter(!is.na(...1)) %>%
  dplyr::select(1:2)

# copy to clipboard, paste to google translate
# couldn't be bothered to use an api and google cloud or whatever
#clipr::write_clip(col_names$`...2`)
#english <- clipr::read_clip()

english <- readRDS("english_col_names.rds")

col_names$english <- english

colnames(dataset) <- col_names$english

dataset <- janitor::clean_names(dataset)

I now create a function that will perform the cleaning steps:

basic_cleaning <- function(dataset){
  dataset %>%
    dplyr::filter(age %in% c("05", "06", "07", "08", "09", "10", "11")) %>%
    dplyr::filter(!is.na(job_search)) %>%
    dplyr::select(territory, capital, sex, place_of_birth, age, nationality,
                  level_of_studies_completed, job_search, main_occupation,
                  type_of_contract, hours) %>%
    dplyr::mutate_at(.vars = vars(-hours), .funs = as.factor)
}

Putting on my econometrician's hat

Let's now suppose that I'm only interested in running a logistic regression:

pra <- basic_cleaning(dataset)

head(pra)
## # A tibble: 6 x 11
##   territory capital sex   place_of_birth age   nationality level_of_studie…
## 1 48        9       6     1              09    1           1
## 2 48        9       1     1              09    1           2
## 3 48        1       1     1              11    1           3
## 4 48        1       6     1              10    1           3
## 5 48        9       6     1              07    1           3
## 6 48        9       1     1              09    1           1
## # … with 4 more variables: job_search, main_occupation,
## #   type_of_contract, hours
logit_model <- glm(job_search ~ ., data = pra, family = binomial())

# Create a tidy dataset with the results of the regression
tidy_logit_model <- tidy(logit_model, conf.int = TRUE) %>%
  mutate(dataset = "true")

Let's now take a look at the coefficients by plotting their values along with their confidence
intervals:

ggplot(tidy_logit_model, aes(x = term, y = estimate)) +
  geom_point(colour = "#82518c") +
  geom_hline(yintercept = 0, colour = "red") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), colour = "#657b83") +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Ok, so now, how would the results change if I ran the same analysis on the synthetic dataset? First,
I need to generate this synthetic dataset:

my_seed <- 1234

synthetic_data <- syn(pra, seed = my_seed)
## Synthesis
## -----------
##  territory capital sex place_of_birth age nationality level_of_studies_completed job_search main_occupation type_of_contract
##  hours

The synthetic data is generated by a single call to the syn() function included in the {synthpop}
package. Let's take a look at the generated object:

synthetic_data
## Call:
## ($call) syn(data = pra, seed = my_seed)
## 
## Number of synthesised data sets: 
## ($m)  1 
## 
## First rows of synthesised data set: 
## ($syn)
##   territory capital sex place_of_birth age nationality
## 1        48       9   1              1  06           1
## 2        01       9   6              3  09           1
## 3        48       3   1              1  08           1
## 4        48       9   6              1  11           1
## 5        20       2   6              1  09           1
## 6        48       1   6              1  11           1
##   level_of_studies_completed job_search main_occupation type_of_contract hours
## 1                          3          N               2                1    40
## 2                          1          S               9                6    10
## 3                          1          N               6                      32
## 4                          2          N               4                1    32
## 5                          3          N               5                      40
## 6                          1          S               7                      NA
## ...
## 
## Synthesising methods: 
## ($method)
##                  territory                    capital 
##                   "sample"                     "cart" 
##                        sex             place_of_birth 
##                     "cart"                     "cart" 
##                        age                nationality 
##                     "cart"                     "cart" 
## level_of_studies_completed                 job_search 
##                     "cart"                     "cart" 
##            main_occupation           type_of_contract 
##                     "cart"                     "cart" 
##                      hours 
##                     "cart" 
## 
## Order of synthesis: 
## ($visit.sequence)
##                  territory                    capital 
##                          1                          2 
##                        sex             place_of_birth 
##                          3                          4 
##                        age                nationality 
##                          5                          6 
## level_of_studies_completed                 job_search 
##                          7                          8 
##            main_occupation           type_of_contract 
##                          9                         10 
##                      hours 
##                         11 
## 
## Matrix of predictors: 
## ($predictor.matrix)
##                            territory capital sex place_of_birth age nationality
## territory                          0       0   0              0   0           0
## capital                            1       0   0              0   0           0
## sex                                1       1   0              0   0           0
## place_of_birth                     1       1   1              0   0           0
## age                                1       1   1              1   0           0
## nationality                        1       1   1              1   1           0
## level_of_studies_completed         1       1   1              1   1           1
## job_search                         1       1   1              1   1           1
## main_occupation                    1       1   1              1   1           1
## type_of_contract                   1       1   1              1   1           1
## hours                              1       1   1              1   1           1
##                            level_of_studies_completed job_search
## territory                                           0          0
## capital                                             0          0
## sex                                                 0          0
## place_of_birth                                      0          0
## age                                                 0          0
## nationality                                         0          0
## level_of_studies_completed                          0          0
## job_search                                          1          0
## main_occupation                                     1          1
## type_of_contract                                    1          1
## hours                                               1          1
##                            main_occupation type_of_contract hours
## territory                                0                0     0
## capital                                  0                0     0
## sex                                      0                0     0
## place_of_birth                           0                0     0
## age                                      0                0     0
## nationality                              0                0     0
## level_of_studies_completed               0                0     0
## job_search                               0                0     0
## main_occupation                          0                0     0
## type_of_contract                         1                0     0
## hours                                    1                1     0

As you can see, synthetic_data is a list with several elements. The data itself is inside the syn element.
Let's extract it, and perform the same analysis as before:

syn_pra <- synthetic_data$syn

head(syn_pra)
##   territory capital sex place_of_birth age nationality
## 1        48       9   1              1  06           1
## 2        01       9   6              3  09           1
## 3        48       3   1              1  08           1
## 4        48       9   6              1  11           1
## 5        20       2   6              1  09           1
## 6        48       1   6              1  11           1
##   level_of_studies_completed job_search main_occupation type_of_contract hours
## 1                          3          N               2                1    40
## 2                          1          S               9                6    10
## 3                          1          N               6                      32
## 4                          2          N               4                1    32
## 5                          3          N               5                      40
## 6                          1          S               7                      NA
syn_pra <- basic_cleaning(syn_pra)

logit_model_syn <- glm(job_search ~ ., data = syn_pra, family = binomial())

tidy_logit_syn <- tidy(logit_model_syn, conf.int = TRUE) %>%
  mutate(dataset = "syn")

ggplot(tidy_logit_syn, aes(x = term, y = estimate)) +
  geom_point(colour = "#82518c") +
  geom_hline(yintercept = 0, colour = "red") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), colour = "#657b83") +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To ease the comparison between the coefficients of the model, let's create a single graph:

coeff_models <- bind_rows(list(tidy_logit_model, tidy_logit_syn))

ggplot(coeff_models, aes(x = term, y = estimate, colour = dataset)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This is quite interesting: overall, there is considerable overlap between the coefficients estimated
on the synthetic data and on the real data! There are some differences though; for instance, main_occupation6 is significant
with the synthetic data, but not with the real data. It is also possible to generate more
than one synthetic dataset, which would very likely reduce the noise (see the sketch below).
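Here is a minimal sketch of what that could look like; I have not run this for the post, and I am assuming the {synthpop} helpers glm.synds() and compare() behave as described in the package vignette:

# Generate m = 5 synthetic copies instead of one
synthetic_data_m <- syn(pra, seed = my_seed, m = 5)

# glm.synds() fits the logit on each synthetic copy and summary() pools the
# estimates across the copies
logit_syn_m <- glm.synds(job_search ~ ., data = synthetic_data_m, family = "binomial")
summary(logit_syn_m)

# compare() contrasts the pooled synthetic estimates with those obtained on
# the original, confidential data
compare(logit_syn_m, pra)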

Putting on my data scientist hat

Now let's suppose that I am only interested in prediction. For this, I am going to split my dataset
into a training and a testing set, then run a logistic regression and a random forest, and assess the
models' performance with 10-fold cross-validation. I'll do this on both the real and the synthetic
data. To perform the analysis, I'll be using the {tidymodels} framework; I'm not going to explain
the code that follows line by line, because I'll very likely write a blog post focusing on {tidymodels}
soon.

So, let's write a function that does exactly what I explained above:

training_and_evaluating <- function(dataset){

  pra_split <- initial_split(dataset, prop = 0.8)

  pra_train <- training(pra_split)
  pra_test <- testing(pra_split)

  pra_cv_splits <- vfold_cv(pra_train, v = 10)

  preprocess <- recipe(job_search ~ ., data = pra) %>%
    step_knnimpute(all_predictors())

  logit_pra <- logistic_reg() %>%
    set_engine("glm")

  fitted_logit <- fit_resamples(preprocess,
                                model = logit_pra,
                                resamples = pra_cv_splits,
                                control = control_resamples(save_pred = TRUE))

  metric_logit <- fitted_logit$.metrics %>%
    bind_rows() %>%
    group_by(.metric) %>%
    summarise_at(.vars = vars(.estimate), .funs = lst(mean, sd)) %>%
    mutate(model = "logit")

  rf_pra <- rand_forest(mode = "classification") %>%
    set_engine(engine = "ranger")

  fitted_forest <- fit_resamples(preprocess,
                                 model = rf_pra,
                                 resamples = pra_cv_splits,
                                 control = control_resamples(save_pred = TRUE))

  metric_forest <- fitted_forest$.metrics %>%
    bind_rows() %>%
    group_by(.metric) %>%
    summarise_at(.vars = vars(.estimate), .funs = lst(mean, sd)) %>%
    mutate(model = "forest")

  bind_rows(list(metric_logit, metric_forest))
}

Now I can run this function on both the real and the synthetic data, and look at the performance
of the logistic regression and of the random forest:

true_data_performance <- training_and_evaluating(pra)

syn_data_performance <- training_and_evaluating(syn_pra)
true_data_performance
## # A tibble: 4 x 4
##   .metric   mean      sd model 
## 1 accuracy 0.882 0.00816 logit 
## 2 roc_auc  0.708 0.0172  logit 
## 3 accuracy 0.907 0.00619 forest
## 4 roc_auc  0.879 0.0123  forest
syn_data_performance
## # A tibble: 4 x 4
##   .metric   mean      sd model 
## 1 accuracy 0.882 0.00758 logit 
## 2 roc_auc  0.691 0.0182  logit 
## 3 accuracy 0.899 0.00615 forest
## 4 roc_auc  0.857 0.0124  forest

The performance is pretty much the same!

Generating synthetic data is a very promising approach that I will certainly be using more; I think
that such approaches can also be very interesting in the private sector (not only to ease access
to microdata for researchers), especially within large
companies. For instance, it can happen that the data owners at, say, an insurance company, are
not very keen on sharing sensitive client information with their data scientists. However, by generating
a synthetic dataset and sharing the synthetic data with their data science team, the data owners
avoid any chance of disclosure of sensitive information, while at the same time allowing their
data scientists to develop interesting analyses or applications on the data!

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on Twitter for blog post updates and watch my
YouTube channel. If you want to support
my blog and channel, you could buy me an espresso, donate via
paypal.me, or buy my ebook on Leanpub.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


The significance of the sector on the salary in Sweden, a comparison between different occupational groups, part 2

Posted: 22 Feb 2020 04:00 PM PST

[This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post, I examined the significance of the sector on the salary for different occupational groups using statistics from different regions. In previous posts I have shown a correlation between salary and experience, and also between salary and education. In this post, I will examine the correlation between salary and sector using statistics broken down by age.

The F-value from the ANOVA table is used as the single value to gauge how strongly sector and salary are correlated. For exploratory analysis, the ANOVA F-value seems good enough.
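As a schematic illustration of what that means in code (this assumes the {car} and {broom} libraries loaded just below, and the data table tb that is read in further down; the full loop over all occupational groups follows later):

# Schematic sketch: the F-value for the sector term is the `statistic` column
# of a type II ANOVA table from car::Anova(), tidied with broom::tidy()
model <- lm(log(salary) ~ poly(age_n, 3) + sex + year_n + sector, data = tb)

tidy(Anova(model, type = 2)) %>%
  filter(term == "sector") %>%
  pull(statistic)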

First, define libraries and functions.

library (tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom)
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library (sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car  
##   influence.merMod                car  
##   dfbeta.influence.merMod         car  
##   dfbetas.influence.merMod        car
readfile <- function (file1){
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"),
            na = c("..", "NA")) %>%
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%
    drop_na() %>%
    mutate (year_n = parse_number (year))
}

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without a heading, 000000D2.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

I have renamed the file to 000000D2_sector.csv because the filename 000000D2.csv was used in a previous post.

The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by sector, occupational group (SSYK 2012), sex and age, year 2014–2018. The selection used here is monthly salary, with sectors 1–3 (public sector) and 4–5 (private sector).

In the plot and tables, you can also find information on how the increase in salaries per year for each occupational group is affected when the interactions are taken into account.

tb <- readfile("000000D2_sector.csv") %>%
  rowwise() %>%
  mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>%
  rowwise() %>%
  mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>%
  mutate(age_n = (age_l + age_h) / 2)

summary_table = 0
anova_table = 0

for (i in unique(tb$`occuptional  (SSYK 2012)`)){
  temp <- filter(tb, `occuptional  (SSYK 2012)` == i)
  if (dim(temp)[1] > 90){
    model <- lm(log(salary) ~ poly(age_n, 3) + sex + year_n + sector, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "none"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "none"))

    model <- lm(log(salary) ~ poly(age_n, 3) * sector + sex + year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and age"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and age"))

    model <- lm(log(salary) ~ poly(age_n, 3) + sector * sex + year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and sex"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and sex"))

    model <- lm(log(salary) ~ poly(age_n, 3) + year_n * sector + sex, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and year"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and year"))

    model <- lm(log(salary) ~ poly(age_n, 3) * sector * sex * year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector, year, age and sex"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector, year, age and sex"))
  }
}
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
anova_table <- anova_table %>% rowwise() %>% mutate(contcol = str_count(term, ":"))

summary_table <- summary_table %>% rowwise() %>% mutate(contcol = str_count(term, ":"))

merge(summary_table, anova_table, by = c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (term.y == "sector") %>%
  filter (interaction == "none") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  ggplot () +
    geom_point (mapping = aes(x = estimate, y = statistic.y, colour = interaction)) +
    labs(
      x = "Increase in salaries (% / year)",
      y = "F-value for sector"
    )


Figure 1: The significance of the sector on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

merge(summary_table, anova_table, by = c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  # only look at the interactions between all four variables in the case with interaction sector, year, age and sex
  filter (!(contcol.y < 3 & interaction == "sector, year, age and sex")) %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  ggplot () +
    geom_point (mapping = aes(x = estimate, y = statistic.y, colour = interaction)) +
    labs(
      x = "Increase in salaries (% / year)",
      y = "F-value for interaction"
    )


Figure 2: The significance of the interaction between sector, age, year and sex on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

The tables below show all occupational groups sorted by F-value in descending order.

merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (term.y == "sector") %>%
  filter (interaction == "none") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector) and the yearly increase in salaries')
Table 1: Correlation for F-value (sector) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
218 Specialists within environmental and health protection 2.3966875 456.4704739 none
261 Legal professionals 2.6130221 304.1919940 none
411 Office assistants and other secretaries 2.3419860 285.0349753 none
222 Nursing professionals 4.6578230 280.4433905 none
515 Building caretakers and related workers 2.2146338 270.0075297 none
819 Process control technicians 2.5350438 268.3445857 none
242 Organisation analysts, policy administrators and human resource specialists 2.0523696 259.5458010 none
251 ICT architects, systems analysts and test managers 2.3512692 241.0997355 none
321 Medical and pharmaceutical technicians 2.5801263 237.2341857 none
333 Business services agents 2.3936321 218.8431989 none
334 Administrative and specialized secretaries 2.0432078 197.0091850 none
243 Marketing and public relations professionals 1.7547524 188.6715597 none
335 Tax and related government associate professionals 2.4477931 179.6275210 none
342 Athletes, fitness instructors and recreational workers 1.8086152 178.9426205 none
962 Newspaper distributors, janitors and other service workers 1.9900356 178.5334093 none
213 Biologists, pharmacologists and specialists in agriculture and forestry 1.8912397 161.3670379 none
265 Creative and performing artists 2.0992262 156.9400225 none
241 Accountants, financial analysts and fund managers 2.4619072 134.7081954 none
351 ICT operations and user support technicians 2.7045039 117.0092304 none
331 Financial and accounting associate professionals 2.1803329 111.0703889 none
264 Authors, journalists and linguists 2.2865429 93.9690111 none
211 Physicists and chemists 2.0010438 65.4196943 none
911 Cleaners and helpers 1.9306757 64.9465316 none
231 University and higher education teachers 2.3987358 58.9926983 none
541 Other surveillance and security workers 2.2500326 54.4430468 none
815 Machine operators, textile, fur and leather products 1.3517419 48.6411876 none
723 Machinery mechanics and fitters 2.4251751 48.2416487 none
233 Secondary education teachers 3.0390165 38.5627689 none
422 Client information clerks 2.4686744 37.8718779 none
228 Specialists in health care not elsewhere classified 2.1589844 36.5601628 none
941 Fast-food workers, food preparation assistants 1.9960527 18.6514673 none
216 Architects and surveyors 2.8323556 15.8227312 none
235 Teaching professionals not elsewhere classified 1.6582824 14.3745577 none
532 Personal care workers in health services 3.0076073 12.2147525 none
311 Physical and engineering science technicians 2.7704145 12.1641422 none
234 Primary- and pre-school teachers 3.4774983 7.3900661 none
511 Cabin crew, guides and related workers -0.3443427 6.9533625 none
534 Attendants, personal assistants and related workers 2.0233419 5.1208491 none
533 Health care assistants 2.2271720 4.8058895 none
332 Insurance advisers, sales and purchasing agents 2.2044676 4.2832833 none
531 Child care workers and teachers aides 1.9105637 4.0479235 none
214 Engineering professionals 2.7995742 2.1850611 none
341 Social work and religious associate professionals 2.5651458 2.1164875 none
432 Stores and transport clerks 1.9337354 0.7577500 none
611 Market gardeners and crop growers 1.8041743 0.0643998 none
512 Cooks and cold-buffet managers 2.7942944 0.0197082 none
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and sex") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and sex) and the yearly increase in salaries')
Table 2: Correlation for F-value (sector and sex) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
911 Cleaners and helpers 1.9306757 109.1411280 sector and sex
335 Tax and related government associate professionals 2.4477931 70.7824355 sector and sex
815 Machine operators, textile, fur and leather products 1.3563289 31.8117555 sector and sex
333 Business services agents 2.2899853 30.4752880 sector and sex
331 Financial and accounting associate professionals 2.2023600 25.9296892 sector and sex
342 Athletes, fitness instructors and recreational workers 1.8086152 25.0203935 sector and sex
241 Accountants, financial analysts and fund managers 2.4786358 24.2528951 sector and sex
264 Authors, journalists and linguists 2.2453619 21.8740576 sector and sex
533 Health care assistants 2.2271720 21.5539071 sector and sex
243 Marketing and public relations professionals 1.7263442 19.6730093 sector and sex
332 Insurance advisers, sales and purchasing agents 2.2044676 19.3664844 sector and sex
611 Market gardeners and crop growers 1.7542849 15.6290927 sector and sex
532 Personal care workers in health services 3.0076073 12.5499477 sector and sex
213 Biologists, pharmacologists and specialists in agriculture and forestry 1.8973847 11.5413223 sector and sex
321 Medical and pharmaceutical technicians 2.5636636 9.5201918 sector and sex
242 Organisation analysts, policy administrators and human resource specialists 2.0388525 7.8030347 sector and sex
351 ICT operations and user support technicians 2.7045039 7.5568275 sector and sex
211 Physicists and chemists 1.9926264 7.5217104 sector and sex
235 Teaching professionals not elsewhere classified 1.6228416 5.7033533 sector and sex
512 Cooks and cold-buffet managers 2.7942944 5.4926902 sector and sex
941 Fast-food workers, food preparation assistants 1.9960527 3.8523689 sector and sex
432 Stores and transport clerks 1.9247904 3.3361868 sector and sex
218 Specialists within environmental and health protection 2.3902920 3.2726896 sector and sex
231 University and higher education teachers 2.4049429 3.1968905 sector and sex
222 Nursing professionals 4.6527974 2.8848508 sector and sex
265 Creative and performing artists 2.0908231 2.4335097 sector and sex
962 Newspaper distributors, janitors and other service workers 1.9900356 2.3091055 sector and sex
233 Secondary education teachers 3.0478812 1.6616950 sector and sex
228 Specialists in health care not elsewhere classified 2.2271410 1.4295653 sector and sex
819 Process control technicians 2.5350438 1.1646979 sector and sex
541 Other surveillance and security workers 2.2500326 1.1488532 sector and sex
534 Attendants, personal assistants and related workers 2.0233419 1.1344410 sector and sex
261 Legal professionals 2.6014900 1.1208624 sector and sex
531 Child care workers and teachers aides 1.9034289 1.0441681 sector and sex
411 Office assistants and other secretaries 2.3419860 0.8053533 sector and sex
214 Engineering professionals 2.7938330 0.7610648 sector and sex
341 Social work and religious associate professionals 2.5666541 0.6898733 sector and sex
422 Client information clerks 2.4707255 0.4188795 sector and sex
251 ICT architects, systems analysts and test managers 2.3512692 0.2019761 sector and sex
234 Primary- and pre-school teachers 3.4774983 0.1910697 sector and sex
334 Administrative and specialized secretaries 2.0473936 0.1673844 sector and sex
723 Machinery mechanics and fitters 2.4270025 0.0946975 sector and sex
216 Architects and surveyors 2.8323370 0.0480519 sector and sex
515 Building caretakers and related workers 2.2148711 0.0466927 sector and sex
311 Physical and engineering science technicians 2.7695781 0.0317138 sector and sex
511 Cabin crew, guides and related workers -0.3452668 0.0263848 sector and sex
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and age") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and age) and the yearly increase in salaries')
Table 3: Correlation for F-value (sector and age) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
251 ICT architects, systems analysts and test managers 2.3512692 81.293931 sector and age
534 Attendants, personal assistants and related workers 2.0233419 74.994233 sector and age
234 Primary- and pre-school teachers 3.4774983 50.554556 sector and age
511 Cabin crew, guides and related workers -0.3214703 49.726047 sector and age
331 Financial and accounting associate professionals 2.1760115 43.716282 sector and age
432 Stores and transport clerks 1.9372268 41.355837 sector and age
351 ICT operations and user support technicians 2.7045039 37.992660 sector and age
241 Accountants, financial analysts and fund managers 2.4925081 31.914364 sector and age
242 Organisation analysts, policy administrators and human resource specialists 2.0984548 31.372391 sector and age
422 Client information clerks 2.4763530 24.396462 sector and age
342 Athletes, fitness instructors and recreational workers 1.8086152 21.878170 sector and age
332 Insurance advisers, sales and purchasing agents 2.2044676 20.887588 sector and age
333 Business services agents 2.5711782 20.728956 sector and age
261 Legal professionals 2.6542145 19.114710 sector and age
233 Secondary education teachers 3.1089611 18.460571 sector and age
541 Other surveillance and security workers 2.2500326 18.166197 sector and age
243 Marketing and public relations professionals 1.8298739 16.372110 sector and age
231 University and higher education teachers 2.5000216 15.799814 sector and age
411 Office assistants and other secretaries 2.3419860 15.436701 sector and age
235 Teaching professionals not elsewhere classified 1.8467331 15.280487 sector and age
211 Physicists and chemists 1.9949354 11.545117 sector and age
213 Biologists, pharmacologists and specialists in agriculture and forestry 1.8835767 11.142152 sector and age
228 Specialists in health care not elsewhere classified 2.0243679 10.920972 sector and age
335 Tax and related government associate professionals 2.4477931 10.914843 sector and age
941 Fast-food workers, food preparation assistants 1.9960527 10.718505 sector and age
341 Social work and religious associate professionals 2.5816295 9.475763 sector and age
216 Architects and surveyors 2.7872555 9.229637 sector and age
321 Medical and pharmaceutical technicians 2.5872747 8.122166 sector and age
911 Cleaners and helpers 1.9306757 8.084293 sector and age
815 Machine operators, textile, fur and leather products 1.3751133 7.636766 sector and age
334 Administrative and specialized secretaries 2.1355883 7.153296 sector and age
218 Specialists within environmental and health protection 2.4620357 6.994422 sector and age
962 Newspaper distributors, janitors and other service workers 1.9900356 6.838603 sector and age
531 Child care workers and teachers aides 1.9226126 6.070516 sector and age
515 Building caretakers and related workers 2.2192760 6.005165 sector and age
311 Physical and engineering science technicians 2.7875316 5.421210 sector and age
723 Machinery mechanics and fitters 2.4360682 5.406378 sector and age
819 Process control technicians 2.5350438 5.043236 sector and age
264 Authors, journalists and linguists 2.2247505 4.044872 sector and age
512 Cooks and cold-buffet managers 2.7942944 3.928362 sector and age
533 Health care assistants 2.2271720 3.673529 sector and age
265 Creative and performing artists 2.0786906 2.538964 sector and age
214 Engineering professionals 2.7863137 2.497814 sector and age
532 Personal care workers in health services 3.0076073 2.339712 sector and age
222 Nursing professionals 4.6262827 2.287950 sector and age
611 Market gardeners and crop growers 1.8042107 1.716068 sector and age
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and year") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and year) and the yearly increase in salaries')
Table 4: Correlation for F-value (sector and year) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
222 Nursing professionals 3.2821623 52.9080079 sector and year
235 Teaching professionals not elsewhere classified 2.7102815 24.3392922 sector and year
334 Administrative and specialized secretaries 0.6824704 15.9484068 sector and year
531 Child care workers and teachers aides 2.3704462 13.8844930 sector and year
422 Client information clerks 3.2226939 12.4412439 sector and year
962 Newspaper distributors, janitors and other service workers 2.5465829 10.9472608 sector and year
534 Attendants, personal assistants and related workers 2.3719311 10.1007673 sector and year
532 Personal care workers in health services 2.7590988 9.0616925 sector and year
611 Market gardeners and crop growers 2.4285881 9.0355793 sector and year
233 Secondary education teachers 3.4681159 8.2187247 sector and year
264 Authors, journalists and linguists 3.2488108 6.2777560 sector and year
351 ICT operations and user support technicians 3.4272576 4.8822064 sector and year
214 Engineering professionals 3.4475215 4.3685520 sector and year
723 Machinery mechanics and fitters 2.7726800 4.0300820 sector and year
511 Cabin crew, guides and related workers -1.4636615 3.9854827 sector and year
243 Marketing and public relations professionals 2.4174309 3.9092911 sector and year
311 Physical and engineering science technicians 3.2975181 3.6534643 sector and year
533 Health care assistants 2.0202998 3.4258034 sector and year
216 Architects and surveyors 3.2766321 3.0286083 sector and year
512 Cooks and cold-buffet managers 2.4761796 2.1189296 sector and year
231 University and higher education teachers 2.7564149 1.9590690 sector and year
911 Cleaners and helpers 2.1080113 1.3375758 sector and year
265 Creative and performing artists 1.8252102 1.3073089 sector and year
341 Social work and religious associate professionals 2.7179780 1.1525106 sector and year
332 Insurance advisers, sales and purchasing agents 2.4698420 0.8846941 sector and year
515 Building caretakers and related workers 2.0896486 0.8721029 sector and year
251 ICT architects, systems analysts and test managers 2.1070969 0.8453000 sector and year
242 Organisation analysts, policy administrators and human resource specialists 2.3074810 0.7627131 sector and year
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.1515840 0.7217108 sector and year
815 Machine operators, textile, fur and leather products 1.5169216 0.7021796 sector and year
333 Business services agents 2.6770984 0.6948266 sector and year
411 Office assistants and other secretaries 2.4961111 0.6866667 sector and year
261 Legal professionals 2.9024857 0.6474976 sector and year
342 Athletes, fitness instructors and recreational workers 2.0356952 0.5210552 sector and year
541 Other surveillance and security workers 2.1319348 0.4764820 sector and year
941 Fast-food workers, food preparation assistants 2.1030085 0.4107102 sector and year
241 Accountants, financial analysts and fund managers 2.1837720 0.3653365 sector and year
335 Tax and related government associate professionals 2.2476351 0.3329319 sector and year
228 Specialists in health care not elsewhere classified 2.0215985 0.1994037 sector and year
321 Medical and pharmaceutical technicians 2.4986510 0.1694097 sector and year
234 Primary- and pre-school teachers 3.5227087 0.0566620 sector and year
211 Physicists and chemists 2.0509288 0.0256116 sector and year
819 Process control technicians 2.5525388 0.0089722 sector and year
432 Stores and transport clerks 1.9579284 0.0058151 sector and year
331 Financial and accounting associate professionals 2.1471390 0.0023497 sector and year
218 Specialists within environmental and health protection 2.3896946 0.0012880 sector and year
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 1) %>%
  filter (interaction == "sector, year, age and sex") %>%
  filter (!(contcol.y < 3 & interaction == "sector, year, age and sex")) %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector, year, age and sex) and the yearly increase in salaries')
Table 5: Correlation for F-value (sector, year, age and sex) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
214 Engineering professionals 2.4648855 11.6391393 sector, year, age and sex
342 Athletes, fitness instructors and recreational workers 2.0991670 8.9098726 sector, year, age and sex
216 Architects and surveyors 3.2301466 7.6322806 sector, year, age and sex
228 Specialists in health care not elsewhere classified 1.5176747 4.1043437 sector, year, age and sex
321 Medical and pharmaceutical technicians 2.2641978 3.9184148 sector, year, age and sex
218 Specialists within environmental and health protection 2.4573789 3.6168903 sector, year, age and sex
261 Legal professionals 2.7581752 3.2902539 sector, year, age and sex
235 Teaching professionals not elsewhere classified 2.4011513 3.0277439 sector, year, age and sex
335 Tax and related government associate professionals 2.0292874 2.6128635 sector, year, age and sex
341 Social work and religious associate professionals 2.6319084 2.6004044 sector, year, age and sex
233 Secondary education teachers 3.5273156 2.2615012 sector, year, age and sex
331 Financial and accounting associate professionals 2.4662860 1.9867325 sector, year, age and sex
941 Fast-food workers, food preparation assistants 1.9900178 1.9221016 sector, year, age and sex
512 Cooks and cold-buffet managers 2.5222219 1.9206611 sector, year, age and sex
962 Newspaper distributors, janitors and other service workers 2.6603985 1.8150814 sector, year, age and sex
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.3962841 1.7496638 sector, year, age and sex
911 Cleaners and helpers 2.1197350 1.5815397 sector, year, age and sex
511 Cabin crew, guides and related workers -0.6957497 1.5410098 sector, year, age and sex
211 Physicists and chemists 2.3310509 1.5257771 sector, year, age and sex
333 Business services agents 2.9306066 1.5169994 sector, year, age and sex
819 Process control technicians 2.5394398 1.4148181 sector, year, age and sex
723 Machinery mechanics and fitters 2.9145883 1.2913243 sector, year, age and sex
234 Primary- and pre-school teachers 3.5373214 1.2588294 sector, year, age and sex
533 Health care assistants 2.0561764 1.1641262 sector, year, age and sex
222 Nursing professionals 3.3141175 1.1593790 sector, year, age and sex
515 Building caretakers and related workers 2.3554368 1.0916712 sector, year, age and sex
334 Administrative and specialized secretaries -0.4841162 0.8627072 sector, year, age and sex
264 Authors, journalists and linguists 3.8766371 0.8178463 sector, year, age and sex
541 Other surveillance and security workers 2.1959143 0.8087407 sector, year, age and sex
531 Child care workers and teachers aides 2.4624435 0.6787061 sector, year, age and sex
815 Machine operators, textile, fur and leather products 1.2558500 0.6507638 sector, year, age and sex
242 Organisation analysts, policy administrators and human resource specialists 2.1412580 0.6425880 sector, year, age and sex
411 Office assistants and other secretaries 2.4063713 0.5077368 sector, year, age and sex
432 Stores and transport clerks 2.1618484 0.5005162 sector, year, age and sex
422 Client information clerks 3.3819183 0.4079113 sector, year, age and sex
265 Creative and performing artists 1.7171084 0.4046591 sector, year, age and sex
241 Accountants, financial analysts and fund managers 1.9650332 0.3867492 sector, year, age and sex
311 Physical and engineering science technicians 3.0661705 0.3303477 sector, year, age and sex
611 Market gardeners and crop growers 2.1758845 0.1794678 sector, year, age and sex
351 ICT operations and user support technicians 3.3472478 0.1238188 sector, year, age and sex
332 Insurance advisers, sales and purchasing agents 2.2161856 0.1214610 sector, year, age and sex
251 ICT architects, systems analysts and test managers 2.1371036 0.1080593 sector, year, age and sex
534 Attendants, personal assistants and related workers 2.3576411 0.0748416 sector, year, age and sex
532 Personal care workers in health services 2.6448524 0.0675532 sector, year, age and sex
231 University and higher education teachers 2.5212252 0.0231719 sector, year, age and sex
243 Marketing and public relations professionals 2.1626640 0.0083390 sector, year, age and sex

Let's check what we have found.

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "218 Specialists within environmental and health protection")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) + sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 3: Highest F-value sector, Specialists within environmental and health protection

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "512 Cooks and cold-buffet managers")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) + sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 4: Lowest F-value sector, Cooks and cold-buffet managers

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "911 Cleaners and helpers")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) + sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 5: Highest F-value interaction sector and gender, Cleaners and helpers

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "511 Cabin crew, guides and related workers")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) + sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 6: Lowest F-value interaction sector and gender, Cabin crew, guides and related workers

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "251 ICT architects, systems analysts and test managers")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) * sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 7: Highest F-value interaction sector and age, ICT architects, systems analysts and test managers

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "611 Market gardeners and crop growers")

model <- lm (log(salary) ~ year_n + poly(age_n, 3) * sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 8: Lowest F-value interaction sector and age, Market gardeners and crop growers

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "222 Nursing professionals")

model <- lm (log(salary) ~ year_n * sector + poly(age_n, 3) + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 9: Highest F-value interaction sector and year, Nursing professionals

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "218 Specialists within environmental and health protection")

model <- lm (log(salary) ~ year_n * sector + poly(age_n, 3) + sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 10: Lowest F-value interaction sector and year, Specialists within environmental and health protection

temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "214 Engineering professionals")

model <- lm (log(salary) ~ year_n * poly(age_n, 3) * sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "year_n", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 11: Highest F-value interaction sector, age, year and gender, Engineering professionals

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
temp <- tb %>%
  filter(`occuptional  (SSYK 2012)` == "243 Marketing and public relations professionals")

model <- lm (log(salary) ~ year_n * poly(age_n, 3) * sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("age_n", "year_n", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 12: Lowest F-value interaction sector, age, year and gender, Marketing and public relations professionals

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .


digest 0.6.25: Spookyhash bugfix

Posted: 22 Feb 2020 03:42 PM PST

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

And a new version of digest is getting onto CRAN now, and to Debian shortly.

digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, and spookyhash algorithms) permitting easy comparison of R language objects. It is a fairly widely-used package (currently listed at 889k monthly downloads with 255 direct reverse dependencies and 7340 indirect reverse dependencies) as many tasks may involve caching of objects for which it provides convenient general-purpose hash key generation.
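If you have not used the package before, here is a minimal sketch of it in action (assuming a recent version of digest with spookyhash support):

library(digest)

# Hash an arbitrary R object; compare hashes instead of the objects themselves
digest(mtcars, algo = "sha256")

# The spookyhash algorithm, the subject of this fix, is selected the same way
digest(mtcars, algo = "spookyhash")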

This release is a one-issue fix. Aaron Lun noticed some issues when spookyhash is used in streaming mode. Kendon Bell, who also contributed spookyhash, quickly found the cause, a simple oversight. This was worth addressing in a new release, so I pushed 0.6.25.

CRANberries provides the usual summary of changes to the previous version.

For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Body Mass Index by @ellis2013nz

Posted: 22 Feb 2020 05:00 AM PST

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

BMI has an expectations management problem

Body Mass Index (BMI) is an attempt to give a quick back-of-envelope answer to the question "if someone weighs W kg, is that a lot or not very much?" Clearly the answer to that question has to take into account at a minimum the person's height; in general, whatever may constitute a healthy weight, it has to be higher for tall people. So BMI – the weight (in kilograms) divided by the height (in metres) squared – gives a scaled version of weight that aims to be roughly comparable for people of different heights. It is actually extremely successful in this regard; certainly it tells you a lot more about someone than either the weight or height individually.

However, BMI is one of those single dimensional metrics that gets widespread criticism on account of being a single dimensional metric. Criticisms of these types of measures typically come in three varieties:

  1. reality is much more complex than any single measure M could represent;
  2. …but if you tweaked it (by method Y) it would somehow be much better;
  3. …anyway, I think Z is really important. M is a really bad measure of Z.

Gross Domestic Product wins the award for attracting the biggest collection of criticisms in this pattern, but BMI would be high up in the list of also-rans. I won't bother linking to individual articles, because googling "why is BMI a bad measure" comes up with plenty to choose from.

The first of the criticisms listed above is always correct, but uninteresting given we need to somehow model the world into a simpler form to grasp it. A one-to-one scale map is no use for navigation.

The third criticism is often (not always) a red herring. In particular, it can introduce a straw man fallacy, implicitly or explicitly accusing the metric of not being something it never claimed to be. This can divert discussion into technical areas when the real disagreement is about something more substantive.

The second criticism ("if you tweaked it, it would be better") can be interesting and fruitful, but in my experience is often standing as a proxy for one of the other two criticisms, which can cause more confusion and talking-past-one-another.

BMI attracts criticisms in all three categories. Clearly, a person's state of health is much more complex than can be represented by a single number and "BMI doesn't tell the whole story" (criticism one with a smattering of three). Equally clearly, sometimes – perhaps always – whether someone is obese, normal weight or underweight is not the most important thing to know about them (criticism three). BMI is an inaccurate measure of body fat content (well, yes, of course) and a 'terrible measure of health' (still the straw man of criticism three).

On average and controlling for other factors, obesity and high BMI is associated with poor health outcomes, but in some unusual circumstances obesity seems to be a protective factor (criticism three – again, not clear why this is seen as a critique of BMI as a measure rather than an interesting fact about how that measure relates to outcomes). Some very well-muscled and clearly healthy individuals have high BMIs that would put them in the 'overweight' category by WHO standards (I think this is a variant of criticism one with a dusting of three). Or BMI is just calculated the wrong way – it should have height to the factor of 2.5, not 2 (criticism two).

I want to pass on most of this debate, not because it's not important and interesting (it is clearly both) but because I don't have enough time and space just now. I do want to zoom in on criticism two however, and in particular the widespread claim that "BMI overexaggerates weight in tall people and underexaggerates weight in short people". Relatedly, I have wondered (as I'm sure many many people have before me), why divide weight by height to the power of 2? Why 2? Why not 3, given that if people of different heights had the same proportions we would expect weight to be proportionate to height cubed? It turns out that this is because people's horizontal dimensions don't increase as much as by their height; but by how much less?

You don't have to follow your nose far in the discussions on BMI to come across references to Oxford University Professor of Numerical Analysis Nick Trefethen's letter to the Economist on 5 January 2013:

SIR – The body-mass index that you (and the National Health Service) count on to assess obesity is a bizarre measure. We live in a three-dimensional world, yet the BMI is defined as weight divided by height squared. It was invented in the 1840s, before calculators, when a formula had to be very simple to be usable. As a consequence of this ill-founded definition, millions of short people think they are thinner than they are, and millions of tall people think they are fatter.

Trefethen elsewhere expands on his idea of a BMI for the post-calculator age, providing the archetypal exemplar of my second criticism category. His proposed approach is to put height to the power of 2.5, and scale the result by multiplying it by 1.3 so the numbers are similar to the current familiar range:

New formula: BMI = 1.3 * weight(kg) / height(m) ^ 2.5

The scaling (which is designed to make the new BMI identical to the old for someone who is 1.69 metres tall) seems an unnecessary complication. After all, BMI is a completely artificial measure and we can put the boundaries for underweight, healthy, overweight etc wherever we wish; if we want a "new BMI" measure there is no reason for it to appear similar in magnitude to the old one (and in fact some good arguments against).
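To make the comparison concrete, here is a small sketch of the two formulas in R (the example heights and weights are made up):

bmi_classic   <- function(weight_kg, height_m) weight_kg / height_m ^ 2
bmi_trefethen <- function(weight_kg, height_m) 1.3 * weight_kg / height_m ^ 2.5

# By design the two agree for someone about 1.69 metres tall...
bmi_classic(70, 1.69)    # roughly 24.5
bmi_trefethen(70, 1.69)  # roughly 24.5

# ...while for a taller person the proposed formula gives a lower value
bmi_classic(90, 1.95)    # roughly 23.7
bmi_trefethen(90, 1.95)  # roughly 22.0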

Trefethen's views are mentioned in most post-2013 critiques of BMI, nearly always uncritically, for example in the subheading "BMI exaggerates thinness in short people and fatness in tall people" in this MedicalNewsToday piece, which quotes Trefethen as saying "Certainly if you plot typical weights of people against their heights, the result comes out closer to height ^ 2.5 than height ^ 2" (I haven't been able to find the original source of this, but it certainly is consistent with what I have been able to find written by Trefethen). But despite the widespread acceptance of Trefethen's critique everywhere from the popular press to Reddit sites for tall people, I can't see any actual evidence for the use of 2.5. So are "short people fatter than they think", as The Telegraph reported Oxford University researchers to have found?

How do height and weight relate in a real population?

Very well. Let us look at some data. The US Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS) is an extraordinary on-going survey of around 400,000 adult interviews each year. It claims to be the largest continuously conducted health survey system in the world, and I imagine it is one of the largest continuous surveys of any purpose. It includes questions on (self-reported) height and weight as well as standard demographics.

Let's start by looking at the relationship between weight and height, without any calculated summary of the two:

There's a few interesting things here. It is clear that the relationship between weight and height varies with sex, ethnicity and age. In particular, the relationship between height and weight is stronger for men (as in a steeper line) in all races and age groups. Also, the average values of these things change. Women are shorter and weigh less than men; people with an Asian ethnic background than the general population; etc.

I've applied as a layer of grey the healthy weights for a person of given height implied by the World Health Organization's BMI guidelines. This makes it clear that most US Americans weigh more than is recommended for their height, which of course is well known.

What does this mean for BMI?

My next step is to convert those weight and height measures into BMI. If BMI is a "fair" measure of how far an individual is from the average weight at their height, we would expect to see no relationship between height and BMI. That is, BMI should capture all the structural information about the norm for weight available in height. The essence of Trefethen's argument is that this is not the case, and that traditional BMI is systematically higher for taller people because the impact of height is shrunk by choosing a power of only 2 rather than 2.5.

Let us see:

In fact, we see that there is a very marked negative relationship between height and BMI, particularly for women, but also present for men. This is the precise opposite of Trefethen's claim, which is that tall people get exaggeratedly high BMIs in the current standard calculation. Based on sheer empiricism, we could make the case that BMI understates the scaled weight of tall people.

(There's an important nuance here, which is that I'm only looking at a 'positive' relationship. We can't deduce a normative statement from this. But I think that's a question for another day. I'm also disregarding for now measurement error, a survey in one time and place, and other restrictions on making general inferences from this sample. After all, this is only a blog.)

Here's an interesting point for those interested in gender relations and the development of science and measurement. The slopes in the chart above for men are nearly horizontal, certainly more so than for women; and most horizontal for people described as 'white, non-Hispanic'. One interpretation for this would be that:

"the current BMI calculation is a fair representation of the relationship between height and weight for white, non-Hispanic men, but underplays the weight of taller people (and exaggerates the weight of shorter) of other racial backgrounds and of women in general."

What power for height is weight proportional to?

I provided a second-hand quote of Trefethen earlier in which he argued that "if you plot typical weights of people against their heights, the result comes out closer to height ^ 2.5 than height ^ 2". Is this the case? Visually, in my opinion, there is nothing to tell at all:

We need to calculate. Luckily, if weight is expected to be proportionate to height to the power of something, then the logarithms of the two will have a linear relationship and the height coefficient of a regression of log(weight) on log(height) will tell us what the power is.

An interesting statistical issue arises here – should we do this regression for the full population of Americans, or should we let that coefficient vary by race, sex and age? From a modelling perspective, it definitely is a different value for different race, sex and age.

I decided fairly arbitrarily that I wanted a single exponent to raise height to, but that I wanted that exponent calculated after controlling for other obviously important characteristics. Fitting a regression of the type

log(weight) ~ log(height) + sex + race + age  

gets me a slope for log(height) of 1.55. In other words, after controlling for sex, race and age, Americans' weight varies in proportion to their height to the power of 1.55.

Alternatively, if we treat the population as one and fit the regression:

log(weight) ~ log(height)  

We get 1.87.
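As a minimal sketch of how those numbers are obtained (ignoring the survey weights here; the full code at the end of the post uses svyglm for the properly weighted versions), the exponent is simply the coefficient on log(height):

# llcp_all is the cleaned BRFSS data frame built in the code chunk below
fit_pooled   <- lm(log(weight) ~ log(height), data = llcp_all)
fit_adjusted <- lm(log(weight) ~ log(height) + sex + race + age, data = llcp_all)

coef(fit_pooled)["log(height)"]    # the survey-weighted equivalent comes out at about 1.87
coef(fit_adjusted)["log(height)"]  # about 1.55 once sex, race and age are controlled for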

I imagine I would get different values again with different populations or times.

So what do I conclude? Frankly, that "2" is good enough. I think the Trefethen critique of BMI is at the higher level an unnecessary distraction, and at the level of detail he almost certainly gets the direction of misrepresentation wrong (it is short people who are made to look fatter than they are by the standard calculation, not tall people).

Overall, having poked and prodded this measure a bit, I think that it meets many of the criteria of a useful measure. It's easy to calculate, easy to explain, lends itself to simple benchmarking, doesn't need specialist skills for data collection, and is clearly closely related to major outcomes of interest.

I doubt BMI is the best single number to represent someone's health, but I don't think this has often been claimed for it. It probably isn't even the best single number regarding the cluster of issues relating to weight, shape, size and body fat content, but it's not a bad starting point. It might well be the most cost-effective such measure in many situations. So I'm happy to say that as a metric, BMI isn't a bad one so long as you don't expect it to do more than is reasonable.

Population versus individuals

I want to make a final point about one point in the BMI debate. We often see BMI critiqued (or defended) from the angle that "BMI is meant to be a population measure about averages, not something that is meaningful for individuals". As sometimes expressed, this seems to me completely incoherent. If the average BMI tells us anything about a population, it is because it is meaningful for individuals; the population effect is the sum of a bunch of small individual effects. I think this argument seems to be a misguided extrapolation from the wise caution that one shouldn't try to diagnose the health of an individual from just their BMI. That seems absolutely fair. But this doesn't mean BMI tells us nothing about the individual, just that the BMI alone is not determinative of outcomes.

Code

Here's all the R code for the above in a single chunk. It's not terribly exciting, but there are some interesting visualisation challenges associated with:

  • the lumping of reported height and weight (I jitter points to make them more distinguishable)
  • taking care to convert measurements which can be either in metric or medieval measurements to a common unit
  • unequal survey weights
  • large sample sizes (with nearly half a million points, how to represent them on the screen? I use a high degree of transparency to get around this).

When it comes to the actual modelling that treats the survey weights and the sampling structure appropriately, Thomas Lumley's survey package rules supreme in terms of ease of use.

library(tidyverse)  library(scales)  library(patchwork)  library(MASS)  library(rio)  library(survey)  library(Cairo)    #--------------CDC annual Behavioral Risk Factor Surveillance System (BRFSS)  survey-    # 69.5 MB download, so only uncomment this and run it once:  # download.file("https://www.cdc.gov/brfss/annual_data/2018/files/LLCP2018XPT.zip",  #               destfile = "LLCP2018XPT.zip", mode = "wb")  #   # unzip("LLCP2018XPT.zip")    llcp <- import("LLCP2018.XPT")  dim(llcp) # 437,436 respondents and 275 questions!    # Survey weights. These are very unbalanced so take care:  summary(llcp$"_LLCPWT")    # Codebook:  # https://www.cdc.gov/brfss/annual_data/2018/pdf/codebook18_llcp-v2-508.pdf    # Function to clean and tidy up variables from the original  code_vars <- function(d){    d %>%      # height in metres:      mutate(height = case_when(              HEIGHT3 >= 9000 & HEIGHT3 <= 9998 ~ as.numeric(str_sub(HEIGHT3, 2, 4)) / 100,              HEIGHT3 < 800 ~ (as.numeric(str_sub(HEIGHT3, 1, 1)) * 12 +                                  as.numeric(str_sub(HEIGHT3, 2, 3))) * 2.54 / 100,              TRUE ~ NA_real_            )) %>%      # weight in kg:      mutate(weight = case_when(        WEIGHT2 < 1000 ~ WEIGHT2 * 0.453592,        WEIGHT2 >= 9000 & WEIGHT2 <= 9998 ~ as.numeric(str_sub(WEIGHT2, 2, 4)),        TRUE ~ NA_real_      ),        bmi = weight / height ^ 2,        trefethen = 1.3 * weight / height ^ 2.5,      # BMI cutoff points from https://www.who.int/gho/ncd/risk_factors/bmi_text/en/        who_cat = cut(bmi, c(0, 18.5, 25, 30, Inf), labels = c("Underweight", "Healthy",                                                               "Overweight", "Obese"))) %>%      mutate(sex = case_when(        SEX1 == 1 ~ "Male",        SEX1 == 2 ~ "Female",        TRUE ~ NA_character_      )) %>%      mutate(race = case_when(        `_HISPANC` == 1 ~ "Hispanic, Latino/a, or Spanish origin",        `_PRACE1` == 1 ~ "White, non-Hispanic",        `_PRACE1` == 2 ~ "Black or African American",        `_PRACE1` == 3 ~ "American Indian or Alaskan Native",        `_PRACE1` == 4 ~ "Asian",        `_PRACE1` == 5 ~ "Pacific Islander",        `_PRACE1` == 6 ~ "Other",        TRUE ~ NA_character_      ),      race = fct_relevel(race, "White, non-Hispanic")) %>%      mutate(race2 = fct_collapse(race,                                  "Other" = c("Pacific Islander", "Other",                                               "American Indian or Alaskan Native")),             race2 = str_wrap(race2, 20),             race2 = fct_reorder(race2, `_LLCPWT`, .fun = sum)) %>%      mutate(age = case_when(        `_AGE65YR` == 1 ~ "Age 18 to 64",        `_AGE65YR` == 2 ~ "Age 65 or older",        NA ~ NA_character_              )) %>%      mutate(race3 = str_wrap(race2, 20),             race3 = fct_reorder(race3, `_LLCPWT`, .fun = sum)) %>%      mutate(hhinc = case_when(        INCOME2 == 1 ~ "Less than $10k",        INCOME2 == 2 ~ "$10k to $15k",        INCOME2 == 3 ~ "$15k to $20k",        INCOME2 == 4 ~ "$20k to $25k",        INCOME2 == 5 ~ "$25k to $35k",        INCOME2 == 6 ~ "$35k to $50k",        INCOME2 == 7 ~ "$50k to $75k",        INCOME2 == 8 ~ "More than $75k",        TRUE ~ NA_character_      ),      hhinc = fct_reorder(hhinc, INCOME2)) %>%      mutate(education = case_when(        EDUCA == 1 ~ "Never attended school or only kindergarten",        EDUCA == 2 ~ "Grades 1 through 8",        EDUCA == 3 ~ "Grades 9 through 11",        EDUCA == 4 ~ "Grade 12 GED",        EDUCA == 5 ~ "College 1 to 3 
years",        EDUCA == 6 ~ "College 4 years or more",        TRUE ~ NA_character_      ),      education = fct_reorder(education, EDUCA)) %>%      # remove people taller than the world record height:      filter(height < 2.72) %>%      rename(psu = `_PSU`,             survey_weight = `_LLCPWT`) %>%      dplyr::select(height, weight, sex, race, race2, race3, age, hhinc, education, bmi, psu, survey_weight)  }    the_caption = "Source: http://freerangestats.info analysis of the US CDC Behavioral Risk Factor Surveillance System Survey for 2018"    # small subsampled version of the data used during dev, not referred to in blog:  set.seed(123)  llcp_small <- llcp %>%    sample_n(20000, weight = `_LLCPWT`, replace = TRUE) %>%    code_vars()     nrow(distinct(llcp_small)) # 17,370 - remembering higher weight observations likely to be resampled twice, and dropping NAs    # full data, with tidied variables  llcp_all <- llcp %>%    code_vars()      # Make a data frame for shading of healthy BMI, to get around problems with rectangles and log axes. See  # https://stackoverflow.com/questions/52007187/background-fill-via-geom-rect-is-removed-after-scale-y-log10-transformation  healthy <- tibble(    xmin = 0.9,    xmax = 2.6,    ymin = 18.5,    ymax = 25,    race3 = unique(llcp_small$race3)  ) %>%    drop_na()      # data frame to use for shading of healthy weight implied by BMI  healthy_by_height <- healthy %>%    dplyr::select(ymin, ymax, race3) %>%    gather(variable, bmi, -race3) %>%    mutate(link = 1) %>%    full_join(tibble(height = seq(from = 0.9, to = 2.3, length.out = 100), link = 1), by = "link") %>%    mutate(weight = bmi * height ^ 2) %>%     dplyr::select(-link, -bmi) %>%    spread(variable, weight)      p1b <- llcp_all %>%    dplyr::select(height, weight, sex, age, race3, survey_weight) %>%    drop_na() %>%    ggplot(aes(x = height)) +    facet_grid(age ~ race3) +    geom_ribbon(data = healthy_by_height, aes(ymin = ymin, ymax = ymax),                colour = NA, fill = "grey90") +    geom_jitter(alpha = 0.02, aes(y = weight, colour = sex, size = survey_weight)) +    geom_smooth(method = "rlm", se = FALSE, aes(y = weight, colour = sex)) +    scale_size_area(label = comma) +    theme(panel.grid.minor = element_blank()) +    labs(title = "Relationship between self-reported weight and height in USA adults",         subtitle = "Men's weight is more dependent on their height than is the case for women.  Grey ribbon shows WHO-recommended healthy weight. 
Straight lines show empirical relationship of weight and height.",         x = "Height (m)",         y = "Weight (kg)",         caption = the_caption,         colour = "",         size = "Each point represents X Americans:") +    guides(size = guide_legend(override.aes = list(alpha = 1)))      print(p1b)           p3 <- llcp_all %>%    drop_na() %>%    ggplot(aes(x = height, y = bmi, colour = sex)) +    facet_grid(age ~ race3) +    geom_rect(data = healthy, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),               colour = NA, fill = "grey90", inherit.aes = FALSE) +    geom_jitter(alpha = 0.02, aes(size = survey_weight)) +    geom_smooth(method = "rlm", se = FALSE, size = 0.5) +    scale_size_area(label = comma) +     scale_x_log10(breaks = c(0.9, 1.1, 1.4, 1.8, 2.3)) +     scale_y_log10() +    theme(panel.grid.minor = element_blank()) +    labs(title = "Relationship between BMI and height in USA adults",         subtitle = bquote("Taller people tend to have lower BMIs because on average weight is proportionate to less than the square of height (in fact, "~weight%prop%height^1.6~")"),         x = "Height (metres), logarithmic scale",         y = bquote("Body Mass Index: "~frac(weight,  height ^ 2)~", logarithmic scale"),         caption = the_caption,          colour = "",         size = "Each point represents X Americans:") +    guides(size = guide_legend(override.aes = list(alpha = 1)))    print(p3)         #------------    p4 <- llcp_all %>%    mutate(h2 = height ^ 2,           h2.5 = height ^ 2.5) %>%    dplyr::select(h2, h2.5, weight, survey_weight) %>%    drop_na() %>%    gather(variable, value, -weight, -survey_weight) %>%    mutate(variable = gsub("h2", expression(height^2), variable)) %>%    ggplot(aes(x = value, y = weight)) +    facet_wrap(~variable, scales = "free_x") +    geom_jitter(alpha = 0.02, aes(size = survey_weight)) +    geom_smooth(method = 'gam', formula = y ~ s(x, k = 5)) +    scale_size_area(label = comma) +    labs(title = bquote("Weight compared to "~height^2~"and"~height^2.5),         subtitle = "It's not possible to choose visually which of the two better expresses the relationship.",         y = "Weight (kg)",         x = "Height to the power of 2.0 or 2.5",         size = "Each point represents X Americans:") +    guides(size = guide_legend(override.aes = list(alpha = 1)))    print(p4)     #-------Models---------------  llcp_svy <- svydesign(~psu, weights = ~survey_weight, data = llcp_all)    model0 <- svyglm(log(weight) ~ log(height), design = llcp_svy)      model1 <- svyglm(log(weight) ~ log(height) + sex + race + age, design = llcp_svy)    anova(model1)  summary(model1)  # best exponent for height is 1.55, not 2 (this is the raw coefficient for height)      # men of the same race, age and height weigh 6% more than women (exp(0.056))  # Asians weight 89% of white non-hispanic of same sex and age:    exp(coef(model1))

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.


R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years

Posted: 22 Feb 2020 04:00 AM PST

[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

It is almost the 29th of February 2020! A very interesting day for R, because it marks 20 years since the release of R v1.0.0, the first official public release of the R programming language.

In this post, we will look back on the 20 years of R with a bit of history and 3 interesting perspectives – how much faster did R get over the years, how many R packages were being released since 2000 and how did the number of package downloads grow.

The first release of R, 29th February 2000

The first official public release of R happened on the 29th of February, 2000. In the release announcement, Peter Dalgaard notes:

"The release of a current major version indicates that we believe that R has reached a level of stability and maturity that makes it suitable for production use. Also, the release of 1.0.0 marks that the base language and the API for extension writers will remain stable for the foreseeable future. In addition we have taken the opportunity to tie up as many loose ends as we could."

Today, 20 years later, it is quite amazing how true the statement around the API remaining stable has proven. The original release announcement and full release statement are still available online.

You can also still download the very first public version of R. For instance, for Windows you can find it on the Previous Releases of R for Windows page. And it is quite runnable, even under Windows 10.

Further down in history, to 1977

Now to give R justice in terms of age, we need to go even further into history. In the full release statement of R v1.0.0, we can find that

R implements a dialect of the award-winning language S, developed at Bell Laboratories by John Chambers et al.

With some digging we can use the Wayback Machine Internet Archive to find interesting notes on Version 1 of S itself written by John Chambers, where he writes:

Over the summer of 1976, some actual implementation began. The paper record has a gap over this period (maybe we were too busy coding to write things down). My recollection is that by early autumn, a language was available for local use on the Honeywell system in use at Murray Hill. Certainly by early 1977 there was software and a first version of a user's manual.

As we can see, the ideas and principles behind R are actually much older than 20 years, and even 40 years. If you are interested in the history, I recommend watching the very interesting 40 years of S talk from useR! 2016.

Faster – How performant is R today versus 20 years ago?

With the 20th birthday of R approaching, I was curious as to how much faster the implementation of R has become with increasing versions. I wrote a very simple benchmarking script to solve the Longest Collatz sequence problem for the first 1 million numbers with a brute-force-ish algorithm.

I then executed it on the same hardware using 20 different versions of R, starting with the very original 1.0, through 2.0 and 3.0, all the way to today's development version.

Benchmarking code

Below is the code snippet with the implementation to be benchmarked:

col_len <- function(n) {
  len <- 0
  while (n > 1) {
    len <- len + 1
    if ((n %% 2) == 0)
      n <- n / 2
    else {
      n <- (n * 3 + 1) / 2
      len <- len + 1
    }
  }
  len
}

res <- lapply(
  1:10,
  function(i) {
    gc()
    system.time(
      max(sapply(seq(from = 1, to = 999999), col_len))
    )
  }
)
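Each element of res is a proc.time() result, so the median elapsed time for a given R version can be pulled out along these lines:

elapsed <- sapply(res, function(x) x[["elapsed"]])
median(elapsed)  # median run time in seconds for this R version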

Results

Now to the interesting part, the results – the below chart shows the boxplots of time required to execute the code in seconds, with R versions on the horizontal axis.

We can see that the median time to execute the above code to find the longest Collatz sequence amongst the first million numbers was:

  • February 2000: More than 17 minutes with the first R version, 1.0.0
  • January 2002: A large performance boost came already with the 1.4.1 release, decreasing the time by almost 4x, to around 4.5 minutes
  • October 2004: Even more interestingly, my measurements showed another big improvement with version 2.0.0 – to just 168 seconds, less than 3 minutes. I was, however, not able to get such good results for any of the later 2.x versions
  • April 2014 – Another speed improvement came 10 years later, with version 3.1 decreasing the time to around 145 seconds
  • April 2017 – Finally, the 3.4 release brought another significant performance boost; from this version on, the time needed to perform this calculation is less than 30 seconds.

Some details and notes

The above is by no means a proper benchmarking solution and was run purely out of interest. The benchmarks were run on a

  • Windows-based PC with Intel Core (TM) i5-4590 Processor and 8 GB DDR3 1600 MHz RAM.
  • using 32-bit versions of R, with no additional packages installed
  • the following options were used with R 1.0.0: --vsize=900M --nsize=20000k

Some interesting notes on running the same code with a 20-year-old version of R:

  • There was no message() function available
  • Integer literals using the L suffix were not accepted
  • The function do.call() needed a character function name as the first argument
  • Did not accept = for assignment. It did accept _ though 😉

Other than that, the code ran with no issues across all the tested versions.

Stronger – How many packages were released over the years?

The power of R comes by no small part from the fact that it is easily extensible and the extensions are easily accessible using The Comprehensive R Archive Network, known to most simply as CRAN.

Next on the list of interesting numbers was to look at how CRAN has grown into the powerhouse of more than 15 000 available packages it is today. Namely, I looked at the numbers of new packages (first releases to CRAN) and total releases (including newer versions of existing packages) over the years, using the pkgsearch package.

Results

Once again, the numbers speak for themselves

  • In 2000-2004 the number of newly released packages was less than 100
  • In 2010 CRAN saw more than 400 new packages
  • In 2014 more than 1000 packages had their first release
  • In 2017 over 2000 new packages were added to CRAN
  • In 2018 and 2019, the number of total CRAN releases was more than 10 000

I would like to take this opportunity to thank the team behind CRAN for making this amazing growth possible.

Bigger – How did downloads of R packages grow?

The size of the user and developer bases of programming languages is difficult to estimate, but we can use a simple proxy to get a picture in terms of growth. RStudio's CRAN mirror provides a REST API from which we can look at and visualize the number of monthly downloads of R packages in the past 7 years:

Note that the numbers above represent just one of many CRAN mirrors, and therefore the true number of package downloads is much higher; the informational value of the chart is mostly in the growth, which is quite impressive:

  • January 2013 saw around 1.1 million downloads
  • January 2015 it was 7.7 million
  • January 2017 it was 26.9 million
  • January 2020 saw more than 128 million downloads
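For anyone wanting to reproduce these monthly totals, here is a sketch using the cranlogs package, which wraps the same download-counts API (cranlogs.r-pkg.org, built on the RStudio CRAN mirror logs):

library(cranlogs)

# Daily download counts across all CRAN packages for January 2020
jan_2020 <- cran_downloads(from = "2020-01-01", to = "2020-01-31")
sum(jan_2020$count)  # total for the month, above 128 million at the time of writing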

Thank you for the 20 years

And here is to 20 more.

Cheers!

Resources

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.


The next package release into AWS Athena

Posted: 21 Feb 2020 04:00 PM PST

[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RBloggers|RBloggers-feedburner

RAthena 1.7.1 and noctua 1.5.1 have now been released to CRAN. They both bring several improvements to the connection to AWS Athena, most noticeably in performance, along with several creature comforts.

These packages have both been designed to reflect one another, even down to how they connect to AWS Athena. This means that all features going forward will exist in both packages. I will refer to these packages as one, as they basically work in the same way.

Performance improvements:

Initially the packages utilised AWS Athena SQL queries. This was to achieve all the functional requirements of the DBI package framework. However, the packages would always send a SQL query to AWS Athena, which in turn would have to lift a flat file from AWS S3 before returning the final result to R. This meant the performance of the packages was limited and fairly slow compared to other database backends.

The biggest change is the adoption of more functionality from the AWS SDKs (software development kits). The key component that has been adopted is AWS Glue. AWS Glue contains all of AWS Athena's table DDLs. This means that instead of going to AWS Athena for this information, AWS Glue can be used instead.

By utilising AWS Glue, the table metadata (column names, column types, schema hierarchy etc.) can easily be retrieved at a fraction of the time it would have taken to query AWS Athena. Previously the DBI function dbListTables would send a query to AWS Athena to retrieve all the tables listed in all schemas, which would take over 3 seconds. Now, using AWS Glue to retrieve the same data, it takes less than half a second.
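As a rough illustration (your numbers will of course vary with your own AWS account and region), the difference can be seen by simply timing the call:

library(DBI)
library(RAthena) # or library(noctua)

con <- dbConnect(athena())

# Previously around 3 seconds via AWS Athena; via AWS Glue well under a second
system.time(dbListTables(con))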

dplyr

When AWS Glue is used to collect metadata around a table in AWS Athena, a performance improvement in dplyr::tbl can be achieved. I would like to say thanks to @OssiLehtinen for developing the initial implementation, as this improvement would otherwise have been overlooked.

dplyr::tbl has two key methods when creating the initial object. The first is called SQL identifiers, and this is the method that benefits from the new AWS Glue functionality. Using SQL identifiers is fairly straightforward.

library(DBI)
library(dplyr)
library(RAthena) # Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

ident_iris = tbl(con, "iris")

dplyr can identify the iris table within the connected schema. When a user uses the SQL identifier method in dplyr::tbl, AWS Glue is called to retrieve all the metadata for dplyr. This reduces the time taken from 3.66 to 0.29 seconds. The second method is called SQL sub query. This unfortunately won't benefit from the new feature and will run slower, at 3.66 seconds.

subquery_iris = tbl(con, sql("select * from iris"))  

Therefore I recommend using the SQL identifier method when working with dplyr's interface.

Creature Comforts

AWS Athena Metadata

Due to user feature requests, the packages now return more metadata around each query sent to AWS Athena. The basic level of metadata returned is the amount of data scanned by AWS Athena, formatted into a readable unit depending on how much data was scanned.

library(DBI)
library(RAthena) # Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris")
Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

However if you set the new parameter statistics to TRUE then all the metadata around that query is printed out like so:

dbGetQuery(con, "select * from iris", statistics = TRUE)  
$EngineExecutionTimeInMillis
[1] 1568

$DataScannedInBytes
[1] 860

$DataManifestLocation
character(0)

$TotalExecutionTimeInMillis
[1] 1794

$QueryQueueTimeInMillis
[1] 209

$QueryPlanningTimeInMillis
[1] 877

$ServiceProcessingTimeInMillis
[1] 17

Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

This can also be retrieved by using dbStatistics:

res = dbExecute(con, "select * from iris")

# return query statistics
query_stats = dbStatistics(res)

# return query results
dbFetch(res)

# free all resources
dbClearResult(res)

RJDBC inspired function

I have to give full credit to the package RJDBC for inspiring me to create this function. DBI has a good function called dbListTables that will list all the tables that are in AWS Athena, however it won't return which schema each individual table belongs to. To overcome this, RJDBC has an excellent function called dbGetTables, which returns all the tables from AWS Athena as a data.frame. This has the advantage of detailing schema, table and table type. With the new integration with AWS Glue this can be returned quickly.

dbGetTables(con)  
     Schema             TableName      TableType
 1:  default             df_bigint EXTERNAL_TABLE
 2:  default                  iris EXTERNAL_TABLE
 3:  default               mtcars2 EXTERNAL_TABLE
 4:  default         nyc_taxi_2018 EXTERNAL_TABLE

This just makes it a little bit easier when working in different IDEs, for example Jupyter.

Backend option changes

This is not really a creature comfort, but it is still interesting and useful. Both packages depend on data.table to read data into R, which is down to the amazing speed data.table offers when reading files. However, a new package with equally impressive read speeds has come onto the scene, called vroom. As vroom, like readr, has been designed only to read data into R, data.table is still used for all of the other heavy lifting. However, if a user wishes to use vroom as the file parser, an *_options function has been created to enable this:

noctua_options(file_parser = c("data.table", "vroom"))

# Or

RAthena_options(file_parser = c("data.table", "vroom"))

By setting file_parser to "vroom", the backend will change to use vroom's file parser instead of data.table.

If you aren't sure whether to use vroom over data.table, I draw your attention to vroom boasting a whopping 1.40GB/sec throughput.

Statistics taken from vroom's github readme

package     version  time (sec)  speed-up  throughput
vroom       1.1.0          1.14     58.44  1.40 GB/sec
data.table  1.12.8        11.88      5.62  134.13 MB/sec
readr       1.3.1         29.02      2.30  54.92 MB/sec
read.delim  3.6.2         66.74      1.00  23.88 MB/sec

RStudio Interface!

Due to the ability of AWS Glue to retrieve metadata for AWS Athena at speed, it has now been possible to add the interface into RStudio's connection tab. When a connection is established:

library(DBI)
library(RAthena) # Or library(noctua)

con = dbConnect(athena())

The connection icon will appear as follows:

The AWS region you are connecting to will be reflected in the connection (highlighted above in the red square). This is to help users who connect to AWS Athena across multiple different regions.

Once you have connected to AWS Athena, the schema hierarchy will be displayed. In my example you can see some of the tables I created when testing these packages.

For more information around RStudio's connection tab please check out RStudio preview connections.

Summary

To sum up, the latest versions of RAthena and noctua have been released to CRAN with all the new goodies they bring. As these packages are based on the AWS SDKs, they are highly customisable and features can easily be added to improve the connection to AWS Athena. So please raise any feature requests or bug issues at: https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua

To leave a comment for the author, please follow the link and comment on their blog: Dyfan Jones Brain Dump HQ.


Correlogram in R: how to highlight the most correlated variables in a dataset

Posted: 21 Feb 2020 04:00 PM PST

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Photo by Pritesh Sudra

Introduction

Correlation, often computed as part of descriptive statistics, is a statistical tool used to study the relationship between two variables, that is, whether and how strongly pairs of variables are associated.

Correlations are measured between only 2 variables at a time. Therefore, for datasets with many variables, computing correlations can become quite cumbersome and time consuming.

Correlation matrix

A solution to this problem is to compute correlations and display them in a correlation matrix, which shows correlation coefficients for all possible combinations of two variables in the dataset.

For example, below is the correlation matrix for the dataset mtcars (which, as described by the help documentation of R, comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles):1

dat <- mtcars
round(cor(mtcars), 2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Even after rounding the correlation coefficients to 2 digits, you will agree that this correlation matrix is not quick or easy to interpret.

If you are using R Markdown, you can use the pander() function from the {pander} package to make it slightly more readable, but still, we must admit that this table is not optimal when it comes to visualizing correlations between several variables of a dataset, especially for large datasets.

Correlogram

To tackle this issue and make it much more insightful, let's transform the correlation matrix into a correlation plot. A correlation plot, also referred to as a correlogram, highlights the variables that are most (positively and negatively) correlated. Below is an example with the same dataset presented above:

The correlogram represents the correlations for all pairs of variables. Positive correlations are displayed in blue and negative correlations in red. The intensity of the color is proportional to the correlation coefficient so the stronger the correlation (i.e., the closer to -1 or 1), the darker the boxes. The color legend on the right hand side of the correlogram shows the correlation coefficients and the corresponding colors.

As a reminder, a negative correlation implies that the two variables under consideration vary in opposite directions, that is, if one variable increases the other decreases and vice versa. A positive correlation implies that the two variables under consideration vary in the same direction, that is, if one variable increases the other increases and if one variable decreases the other decreases as well. Furthermore, the stronger the correlation, the stronger the association between the two variables.

Correlation test

Finally, a white box in the correlogram indicates that the correlation is not significantly different from 0 at the specified significance level (in this example, at \(\alpha = 5\)%) for the pair of variables. A correlation not significantly different from 0 means that there is no linear relationship between the two variables considered (there could still be another, non-linear kind of association).

To determine whether a specific correlation coefficient is significantly different from 0, a correlation test has been performed. Recall that the null and alternative hypotheses of this test are:

  • \(H_0\): \(\rho = 0\)
  • \(H_1\): \(\rho \ne 0\)

where \(\rho\) is the correlation coefficient. The correlation test is based on two factors: the number of observations and the correlation coefficient. The more observations and the stronger the correlation between 2 variables, the more likely it is to reject the null hypothesis of no correlation between these 2 variables.

In the context of our example, the correlogram above shows that the variables cyl (the number of cylinders) and hp (horsepower) are positively correlated, while the variables mpg (miles per gallon) and wt (weight) are negatively correlated (both correlations make sense if we think about it). Furthermore, the variables gear and carb are not correlated. Even though the correlation coefficient between the 2 variables is 0.27, the correlation test shows that we cannot reject the hypothesis of no correlation. This is the reason the box for these two variables is white.
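This is easy to verify with a standard correlation test, for instance:

cor.test(mtcars$gear, mtcars$carb)
# correlation of about 0.27, but the p-value comes out at roughly 0.13 (> 0.05),
# so the null hypothesis of no correlation cannot be rejected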

Although this correlogram presents exactly the same information as the correlation matrix, it does so as a visual representation, allowing you to quickly scan through it and see which variables are correlated and which are not.

Code

For those interested in drawing this correlogram with their own data, here is the code of the function I adapted from the corrplot() function in the {corrplot} package (thanks again to all contributors of this package):
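A minimal sketch of such a wrapper, built on corrplot() and cor.mtest() from {corrplot}, could look like the following (the corrplot2() used for the plot above may differ in its details):

library(corrplot)

corrplot2 <- function(data,
                      method = "pearson",
                      sig.level = 0.05,
                      order = "original",
                      diag = FALSE,
                      type = "upper",
                      tl.srt = 90) {
  data <- data[complete.cases(data), ]          # drop rows with missing values
  mat <- cor(data, method = method)             # correlation matrix
  p.mat <- cor.mtest(data, method = method)$p   # matrix of p-values from correlation tests
  corrplot(mat,
           method = "color", type = type, order = order, diag = diag,
           p.mat = p.mat, sig.level = sig.level, insig = "blank",
           tl.col = "black", tl.srt = tl.srt)
}

# Example call producing a plot similar to the one above
corrplot2(mtcars, method = "pearson", sig.level = 0.05,
          order = "original", diag = FALSE, type = "upper", tl.srt = 75)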

The main arguments in the corrplot2() function are the following:

  • data: name of your dataset
  • method: the correlation method to be computed, one of "pearson" (default), "kendall", or "spearman". As a rule of thumb, if your dataset contains quantitative continuous variables, you can keep the Pearson method, if you have qualitative ordinal variables, the Spearman method is more appropriate
  • sig.level: the significance level for the correlation test, default is 0.05
  • order: order of the variables, one of "original" (default), "AOE" (angular order of the eigenvectors), "FPC" (first principal component order), "hclust" (hierarchical clustering order), "alphabet" (alphabetical order)
  • diag: display the correlation coefficients on the diagonal? The default is FALSE
  • type: display the entire correlation matrix or simply the upper/lower part, one of "upper" (default), "lower", "full"
  • tl.srt: rotation of the variable labels
  • (note that missing values in the dataset are automatically removed)

You can also play with the arguments of the corrplot2 function and see the results thanks to this R Shiny app.

Thanks for reading. I hope this article will help you to visualize correlations between variables in a dataset and to make correlation matrices more insightful and more appealing.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

Related articles:


  1. The dataset mtcars is preloaded in R by default, so there is no need to import it into R. Check the article "How to import an Excel file in R" if you need help in importing your own dataset.↩

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.

