[R-bloggers] Setting up R with Visual Studio Code quickly and easily with the languageserversetup package (and 8 more aRticles)



Setting up R with Visual Studio Code quickly and easily with the languageserversetup package

Posted: 21 Mar 2020 05:00 AM PDT

[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers.]

Introduction

Over the past years, R has been gaining popularity, bringing new tools to life with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code extensions for R, the most popular development environment among developers across the world now has very strong support for R as well.

In this post, we will look at the languageserversetup package that aims to make the setup of the R Language Server robust and easy to use by installing it into a separate, independent library and adjusting R startup in a way that initializes the language server when relevant.

Visual Studio Code and R

According to the 2019 StackOverflow developer survey, Visual Studio Code is the most popular development environment across the board, with amazing support for many languages and extensions ranging from improved code editing to advanced version control support and Docker integration.

Until recently, the support for R in Visual Studio Code was, in my view, not comprehensive enough to justify switching from other tools such as RStudio (Server) to using VS Code exclusively. This has changed with the work done by the teams implementing the R language server and the Visual Studio Code extensions for R.

The features now include all that we need to work efficiently, including auto-complete, definition provider, code formatting, code linting, information on functions on hover, color provider, code sections and more.

If you are interested in more steps around the setup and the overview of features I recommend the Writing R in VSCode: A Fresh Start blogpost by Kun Ren. I also recommend that you follow Kun on Twitter if you are interested in the latest developments.

Setup considerations, issues, and tweaks: creating the languageserversetup package

With my current team, we have almost fully embraced Visual Studio Code as an IDE for our work in R, which is especially great as the work is multi-language and multi-environment in nature and we can do our development in Scala, R and more, including implementing and testing Jenkins pipelines and designing Docker images without leaving VS Code.

While setting this up for the team on multiple systems and platforms, we found a few interesting points, which motivated me to write a small R package, languageserversetup, that should make the installation and setup of the R language server as easy and painless as possible.

Managing package libraries

One of the specifics of R is that all extensions (packages) are installed into package libraries, be it the packages we develop and use for our applications or the tools we use mostly as means to make our development life easier. We can therefore often end up in a situation where we need to use different versions of R packages for different purposes. For example, the languageserver package currently needs R6 (>= 2.4.1), stringr (>= 1.4.0) and more; in total, it recursively requires 75 other R packages to be installed. When installing and running the package, we can run into conflicts between the package versions our applications need and those the languageserver package requires to function properly.
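If you want to check the current dependency footprint yourself, base R's tools package can count it. A minimal sketch (the exact number will drift over time as dependencies change):

# count languageserver's recursive dependencies from the current CRAN snapshot
deps <- tools::package_dependencies("languageserver",
                                    db = available.packages(),
                                    recursive = TRUE)
length(deps$languageserver)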

Managing library paths

The second consideration, related to the first, is that simply installing the language server into the default library, for instance with install.packages, can change that library to a state that is possibly not desired. We can also run into unexpected crashes, where the languageserver works properly for a while, until a dependency with a hidden version conflict gets triggered.

A solution – Complete library separation and smart initialization

One possible solution to the above issues is to:

  1. Keep the package libraries of the languageserver and the other libraries that the user uses (perhaps apart from the main system library containing the base and recommended packages that come with the R installation itself) completely separated, including all non-base dependencies

  2. Initialize that library only when the R process in question is triggered by the language server; otherwise, keep the process untouched and use the user libraries as usual (see the sketch below for the idea)
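To make the second point concrete, here is a hypothetical sketch of what such an .Rprofile hook could look like. This is not the package's actual implementation; the library path and the detection heuristic are illustrative only:

local({
  # assumption: a language server process can be recognized from its command line
  args <- commandArgs(trailingOnly = FALSE)
  if (any(grepl("languageserver", args, fixed = TRUE))) {
    # prepend the dedicated library so its dependency versions take precedence
    .libPaths(c("~/.languageserver-library", .libPaths()))
  }
})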

Solving it with 2 R commands – the languageserversetup package

To make the above solution easily accessible, I have created a small R package called languageserversetup that will do all the work for you. It can be installed from CRAN and it has no dependencies on other R packages:

install.packages("languageserversetup")

Now the entire setup has only 2 steps:

  1. Install the languageserver package and all of its dependencies into a separate, independent library (will ask for confirmation before taking action) using:

languageserversetup::languageserver_install()

  2. Add code to .Rprofile to automatically align the library paths for the language server functionality if the process is an instance of the languageserver; otherwise the R session will run as usual, with library paths unaffected. This is achieved by running (will also ask for confirmation):

languageserversetup::languageserver_add_to_rprofile()

That's it. Now you can enjoy the functionality without caring about the setup of libraries or any package version conflicts. Thanks to the full separation of libraries, the removal is as trivial as deleting the library directory.

In action with VS Code

Installing languageserversetup and using languageserver_install():

[Figure: Installing the language server]

Initializing the functionality with languageserver_add_to_rprofile():

[Figure: Adding the language server to startup]

All done, now enjoy the awesomeness!

Technical details

If you are interested in more technical details:

  • visit the package's openly accessible GitHub repository
  • the README.md has information on options, configuration, installation, uninstallation, platforms and more
  • the help files for the functions can be accessed from R with ?languageserver_install, ?languageserver_startup, ?languageserver_add_to_rprofile and ?languageserver_remove_from_rprofile for more details on their arguments and customization
  • for testing, GitHub Actions are set up for multiple platforms to run all CRAN checks on the package on each commit


To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.


Working with Statistics Canada Data in R, Part 5: Retrieving Census Data

Posted: 20 Mar 2020 09:53 PM PDT

[This article was first published on Data Enthusiast's Blog, and kindly contributed to R-bloggers.]

Back to Working with Statistics Canada Data in R, Part 4.

Introduction

Now that we are ready to start working with Canadian Census data, let's first briefly address the question of why you may need it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered every five years and covers a limited number of questions.

The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country's largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you'll have to use the Census.

To illustrate the use of the cancensus package, let's do a small research project. First, in this post, we'll retrieve these key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada:

  • Labor force participation rate, employment rate, and unemployment rate.
  • Percent of workers by work situation: full time vs part time, by gender.
  • Education levels of people aged 25 to 64, by gender.

The cities (metropolitan areas) that we are going to look at are: Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We'll also get these data for Canada as a whole, for comparison and to illustrate the retrieval of data at different geographic levels.

Next, in the upcoming Part 6 of the "Working with Statistics Canada Data in R" series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks.

Keep in mind that cancensus also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp and sf formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in this series.

So, let's get started by loading the required packages:

library(cancensus)
library(tidyverse)

Searching for Data

cancensus retrieves census data with the get_census function. get_census can take a number of arguments, the most important of which are dataset, regions, and vectors, which have no defaults. Thus, in order to be able to retrieve census data, you'll first need to figure out:

  • your dataset,
  • your region(s), and
  • your data vector(s).

Find Census Datasets

Let's see which census datasets are available through the CensusMapper API:

list_census_datasets()

Currently, datasets earlier than 2001 are not available, so if you need to work with the 20th century census data, you won't be able to retrieve it with cancensus.

Find Census Regions

Next, let's find the regions that we'll be getting the data for. To search for census regions, use the search_census_regions function.

Let's take a look at what region search returns for Toronto. Note that cancensus functions return their output as dataframes, so it is easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)] to see all of it.

# all census levels
search_census_regions(searchterm = "Toronto",
                      dataset = "CA16")[c(1:5, 8)]

Returns:

# A tibble: 3 x 6
  region  name    level     pop municipal_status PR_UID
1 35535   Toronto CMA   5928040 B                35
2 3520    Toronto CD    2731571 CDR              35
3 3520005 Toronto CSD   2731571 C                35

You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column 'level' for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the 'PR_UID' column that contains numeric codes for Canada's provinces and territories, which can help you distinguish between different census regions that have the same or similar names. For an example, run the code above replacing "Toronto" with "Windsor".
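For instance, following the suggestion above, searching for Windsor returns same-named regions in different provinces, which the 'PR_UID' column disambiguates:

# same search, different city name
search_census_regions(searchterm = "Windsor",
                      dataset = "CA16")[c(1:5, 8)]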

Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level argument, which can take the following values: 'C' for Canada (national level), 'PR' for province, 'CMA' for census metropolitan area, 'CD' for census division, 'CSD' for census sub-division, or NA:

# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")

Let's now list census regions that may be relevant for our project:

# explore available census regions
names <- c("Canada", "Calgary", "Halifax",
           "Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))

The purrr::map_df function applies search_census_regions iteratively to each element of the names vector and returns the output as a single dataframe. Note also the ~ . syntax: think of the tilde as taking each element of names and passing it to the place indicated by the dot in the search_census_regions call; the two calls below are equivalent. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.
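The tilde-dot formula is just a compact way to write an anonymous function:

map_df(names, ~ search_census_regions(., dataset = "CA16"))
# is equivalent to
map_df(names, function(x) search_census_regions(x, dataset = "CA16"))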

As you can see, there are multiple entries for each search term, so we'll need to choose the results for census metropolitan areas, and the census sub-division in the case of Whitehorse, since Whitehorse is too small to be considered a census metropolitan area:

# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>%
  filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) &
         grepl("CMA", level) |
         grepl("Canada|Whitehorse$", name)) %>%
  as_census_region_list()

Pay attention to the use of logical operators to filter the output by several conditions at once; also note the use of the $ regex meta-character to choose the entry ending in 'Whitehorse' from the 'name' column (to filter out 'Whitehorse, Unorganized').

Finally, as_census_region_list converts list_census_regions output to a data object of type list that can be passed to the get_census function as its regions argument.

Find Census Vectors

Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census function.

cancensus has two functions that allow you to search through census data variables: list_census_vectors and search_census_vectors.

list_census_vectors returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:

# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))

# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))

As you can see, there are 6623 variables (as of this writing) in the 2016 census dataset, so list_census_vectors won't be the most convenient function for finding a specific vector. Note, however, that there are situations (such as when you need to select a lot of vectors at once) in which list_census_vectors is appropriate.

Usually it is more convenient to use search_census_vectors to search for vectors. Just pass the text string of what you are looking for as the searchterm argument. You don't have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term.

Let's now find census data vectors for labor force involvement rates:

# get census data vectors for labor force involvement rates
lf_vectors <-
  search_census_vectors(searchterm = "employment rate",
                        dataset = "CA16") %>%
  union(search_census_vectors("participation rate", "CA16")) %>%
  filter(type == "Total") %>%
  pull(vector)

Let's take a look at what this code does. Since searchterm doesn't have to be a precise match, the "employment rate" search term retrieves unemployment rate vectors too. In the next line, union merges the dataframes returned by search_census_vectors into a single dataframe. Note that in this case union could be substituted with bind_rows, but I recommend union in order to avoid data duplication (see the example below). Next, we choose only the "Total" numbers, since we are not going to plot labor force indicators by gender. Finally, the pull command extracts a single column from the dataframe, just like the $ subsetting operator: we need lf_vectors to be a data object of type vector in order to pass it to the vectors argument of the get_census function.
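A toy example of the difference (illustrative data, not census output): bind_rows keeps duplicated rows, while union returns only distinct rows, so a vector matched by both search terms would appear once:

a <- tibble(x = 1:2)
b <- tibble(x = 2:3)
bind_rows(a, b)  # four rows: the row x = 2 appears twice
union(a, b)      # three distinct rows (row order is not guaranteed)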

The second labor force indicator we are looking for is the number of people who work full-time and part-time, broken down by gender. But before we proceed with getting the respective vectors, let me show you another way to figure out the search terms to put inside the search_census_vectors function: use Statistics Canada's online Census Profile tool. It can be used to quickly explore census data, as well as to figure out variable names (search terms) and their hierarchical structure.

For example, let's look at census labor data for Calgary metropolitan area. Scrolling down, you will quickly find the numbers and text labels for full-time and part-time workers:

Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data:

# get census data vectors for full and part time work

# get vectors and labels
work_vectors_labels <-
  search_census_vectors("full year, full time", "CA16") %>%
  union(search_census_vectors("part year and/or part time", "CA16")) %>%
  filter(type != "Total") %>%
  select(1:3) %>%
  mutate(label = str_remove(label, ".*, |.*and/or ")) %>%
  mutate(type = fct_drop(type)) %>%
  setNames(c("vector", "gender", "type"))

# extract vectors
work_vectors <- work_vectors_labels$vector

Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no "Total" values here), we create a dataframe that assigns respective labels to each vector number. This work_vectors_labels dataframe will supply categorical labels to be attached to the data retrieved with get_census.

Also, note these three lines:

  mutate(label = str_remove(label, ".*, |.*and/or ")) %>%
  mutate(type = fct_drop(type)) %>%
  setNames(c("vector", "gender", "type"))

The first mutate call removes all text up to and including ', ' and 'and/or ' (spaces included) from the 'label' column, as the example below illustrates. The second drops the unused factor level "Total": it is good practice to make sure there are no unused factor levels if you are going to use ggplot2 to plot your data. Finally, setNames renames the variables for convenience.
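As an illustration of the first mutate, here is the pattern applied to two made-up labels in the same style as the census ones:

str_remove("Worked full year, full time", ".*, |.*and/or ")
# [1] "full time"
str_remove("Worked part year and/or part time", ".*, |.*and/or ")
# [1] "part time"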

Finally, let's retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I'd like to draw your attention to the fact that some census data is hierarchical: some variables (census vectors) are included in parent variables and/or include child variables. It is very important to choose vectors at the proper hierarchical levels so that you neither double-count nor omit data.

Education data is a good example of hierarchical data. You can explore the data hierarchy programmatically with the parent_census_vectors and child_census_vectors functions, as described here and sketched below. Alternatively, you may find it more convenient to explore the hierarchy visually using the Statistics Canada Census Profile tool.
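A minimal sketch of the programmatic route (the vector number here is purely illustrative; substitute one from your own search results):

# list the child vectors sitting under one education vector
list_census_vectors("CA16") %>%
  filter(vector == "v_CA16_5051") %>%   # hypothetical vector number
  child_census_vectors(leaves_only = TRUE)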

So, let's now retrieve and label the education data vectors:

# get census vectors for education levels data

# get vectors and labels
ed_vectors_labels <-
  search_census_vectors("certificate", "CA16") %>%
  union(search_census_vectors("degree", "CA16")) %>%
  union(search_census_vectors("doctorate", "CA16")) %>%
  filter(type != "Total") %>%
  filter(grepl("25 to 64 years", details)) %>%
  slice(-1, -2, -7, -8, -11:-14, -19, -20, -23:-28) %>%
  select(1:3) %>%
  mutate(label =
           str_remove_all(label,
                          " cert.*diploma| dipl.*cate|, CEGEP| level")) %>%
  mutate(label =
           str_replace_all(label,
                           c("No.*" = "None",
                             "Secondary.*" = "High school or equivalent",
                             "other non-university" = "equivalent",
                             "University above" = "Cert. or dipl. above",
                             "medicine.*" = "health**",
                             ".*doctorate$" = "Doctorate*"))) %>%
  mutate(type = fct_drop(type)) %>%
  setNames(c("vector", "gender", "level"))

# extract vectors
ed_vectors <- ed_vectors_labels$vector

Note the slice function, which allows you to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice to drop the hierarchical levels that are either too generalized or too granular. Note also that I had to edit the text strings in the data. Finally, I added asterisks after "Doctorate" and "health". These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on.

Now that we have figured out our dataset, regions, and data vectors (and labelled the vectors, too), we are finally ready to retrieve the data itself.

Retrieve Census Data

To retrieve census data, feed the dataset, regions, and data vectors into get_census as its respective arguments. Note also that get_census has a use_cache argument (TRUE by default), which tells get_census to retrieve data from the local cache if available. If there is no cached data, the function will query the CensusMapper API for the data and save it in the cache, while use_cache = FALSE will force get_census to query the API and update the cache, as in the sketch below.
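For example, to force a fresh API query and refresh the cache (a sketch reusing the objects defined above):

labor_fresh <- get_census(dataset = "CA16",
                          regions = regions,
                          vectors = lf_vectors,
                          use_cache = FALSE)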

# get census data for labor force involvement rates
# feed regions and vectors into get_census()
labor <-
  get_census(dataset = "CA16",
             regions = regions,
             vectors = lf_vectors) %>%
  select(-c(1, 2, 4:7)) %>%
  setNames(c("region", "employment rate",
             "unemployment rate",
             "participation rate")) %>%
  mutate(region = str_remove(region, " (.*)")) %>%
  pivot_longer("employment rate":"participation rate",
               names_to = "indicator",
               values_to = "rate") %>%
  mutate_if(is.character, as_factor)

The select call drops columns with irrelevant data; setNames renames the columns to remove vector numbers from the variable names, which will then be converted to values of the 'indicator' column; str_remove inside the mutate call drops the municipal status codes '(B)' and '(CY)' from region names; finally, mutate_if converts characters to factors for subsequent plotting.

An important function here is tidyr::pivot_longer. It converts the dataframe from wide to long format. It takes three columns ('employment rate', 'unemployment rate', and 'participation rate') and converts their names into values of the 'indicator' variable, while their numeric values are passed to the 'rate' variable; see the toy example below. The reason for the conversion is that we are going to plot all three labor force indicators in the same graphic, which requires storing the indicators as a single factor variable.
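A toy before-and-after of the reshape (made-up numbers, one region):

wide <- tibble(region = "Toronto",
               `employment rate` = 60.1,
               `unemployment rate` = 8.2,
               `participation rate` = 65.5)
wide %>%
  pivot_longer(-region, names_to = "indicator", values_to = "rate")
# 'region' is repeated, 'indicator' holds the three former column names,
# and 'rate' holds their values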

Next, let's retrieve census data about the percent of full time vs part time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender:

# get census data for full time and part time work
work <-
  get_census(dataset = "CA16",
             regions = regions,
             vectors = work_vectors) %>%
  select(-c(1, 2, 4:7)) %>%
  rename(region = "Region Name") %>%
  pivot_longer(2:5, names_to = "vector",
                    values_to = "count") %>%
  mutate(region = str_remove(region, " (.*)")) %>%
  mutate(vector = str_remove(vector, ":.*")) %>%
  left_join(work_vectors_labels, by = "vector") %>%
  mutate(gender = str_to_lower(gender)) %>%
  mutate_if(is.character, as_factor)

# get census data for education levels
education <-
  get_census(dataset = "CA16",
             regions = regions,
             vectors = ed_vectors) %>%
  select(-c(1, 2, 4:7)) %>%
  rename(region = "Region Name") %>%
  pivot_longer(2:21, names_to = "vector",
                     values_to = "count") %>%
  mutate(region = str_remove(region, " (.*)")) %>%
  mutate(vector = str_remove(vector, ":.*")) %>%
  left_join(ed_vectors_labels, by = "vector") %>%
  mutate_if(is.character, as_factor)

Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join function, which joins the labels to the census data.

We now have the data and are ready to visualize it, which will be done in the next post.

Annex: Notes and Definitions

For those of you who are outside of Canada, Canada's geographic regions and their largest metropolitan areas are:

  • The Atlantic Provinces – Halifax
  • Central Canada – Toronto
  • The Prairie Provinces – Calgary
  • The West Coast – Vancouver
  • The Northern Territories – Whitehorse

These regions should not be confused with 10 provinces and 3 territories, which are Canada's sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada's geographic regions and territorial structure here (pages 44 to 51).

For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada's Guide to the Labour Force Survey.

You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts.


To leave a comment for the author, please follow the link and comment on their blog: Data Enthusiast's Blog.


Impact of a country’s age breakdown on COVID-19 case fatality rate by @ellis2013nz

Posted: 20 Mar 2020 06:00 AM PDT

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

Italy is routinely and correctly described as particularly vulnerable to COVID-19 because of its older age profile. I set out to understand for myself how important this factor is. What would happen if the case fatality rates observed in Italy were applied to demographic profiles of other countries?

Fatality rates by age and sex so far in Italy

The Istituto Superiore di Sanità, Roma is publishing regular bulletins with the latest data on COVID-19 cases in Italy. I used the 19 March 2020 version. These are the observed case fatality rates for around 35,000 cases to that point:

It's worth pointing out that the snapshots presented in these bulletins change fast; the raw fatality rate (for both sexes) increased from 5.8% seven days earlier to 8.5% on 19 March. Further rapid change is to be expected, remembering that deaths lag the beginning of the illness by days or weeks, and diagnoses lag infections by days (symptoms start on average around five days after exposure).

It's also worth pointing out how much worse this disease seems to be for men. Of the deaths in Italy at the time of this bulletin, 71% were men. Of diagnosed cases, 59% were male (more than 200 Italian boys had the illness, but none had died at the time of the bulletin). There were more male fatalities aged 80 and over than female fatalities of all ages. And while the disease is definitely worse for older people, fatality rates are pretty bad for middle-aged people too: about 1% for those between 30 and 59. That's bad for a disease expecting as many cases as this one.

Population profiles in selected countries

I took population breakdowns by age and sex from the United Nations' World Population Prospects. To illustrate I chose nine countries representing a range of cultural and economic situations. I've chosen to present these as density charts, not population pyramids (which I find difficult to make comparisons with). We can readily see the contrast between Italy and (for an extreme example) economically poor Timor Leste:

Applying fatality rates to population profiles

It's straightforward to take a country's population and apply the Italian case fatality rates to it to get a weighted average fatality rate. In effect, this tells us what the fatality rate would be in a country, if the Italian rates applied to its whole population or a subpopulation that was representative of the overall age and sex balance. Here's what we get for our nine 'countries' (including the World aggregate):
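Concretely, the weighted average is sum(cfr * prop) over each country's age-sex groups, which is exactly the summarise step in the full code at the end of the post. A one-line helper (the function name is mine, for illustration) makes the idea explicit:

# weighted average fatality rate across groups with population shares `prop`
weighted_cfr <- function(cfr, prop) sum(cfr * prop) / sum(prop)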

Two things stand out.

First, the different demographics of the different countries make a huge difference. On these sorts of age-based rates, Italy can expect twice the fatality rate of China (and nearly five times that of Timor Leste).

Second, the death rate for Italy from this method is much lower than the actual fatality rate in the 19 March bulletin – 3.9% compared to 8.5%. This isn't a mistake – it comes about because the profile of Italians diagnosed with COVID-19 is older and more male than Italians in general.

Older people and men are not just more likely to die if they get COVID-19, they are also more likely to be diagnosed with it in the first place.

As I note on the graphic, this could be due to women and younger people of either sex being less likely to be diagnosed given they have the disease; or it might mean they are less likely to have the disease at all. There is no way to tell with this data.

We can adjust the fatality rates by scaling them up to match Italy's observed 19 March level. This gives a more realistic, but still very rough, answer to the question "what would Italy's case fatality rates mean, translated to other countries?" It's very rough because it assumes away a whole bunch of possible complexities and interactions between variables, but it's probably as thorough a method as is warranted at the moment given the fast-changing data. Here are the scaled results:

What does it all mean?

Well, the danger to people over 50, particularly but not only men, is very very real from this disease. And the age profiles of countries vary enough for this to make big differences to the overall impact.

But regardless of this, the necessary actions are clear. Work hard to avoid getting this horrible disease and to avoid passing it on. Work to help others do the same, and pull together to manage society through some difficult months ahead. Wash your hands and practice social distancing.

Here's the code behind those charts. The Italian data is just entered by hand because it's only 20 numbers, not worth trying to automate.

#------------setup---------------
# 59% cases male
20686 / (20686 + 14378)

# 71% deaths men (no boys)
2139 / (2139 + 890)

library(tidyverse)
library(scales)
library(wpp2019)

# colours for male and female used by Washington Post 2017; see https://blog.datawrapper.de/gendercolor/
sex_cols <- c(Male = "#F4BA3B", Female = "#730B6D")

#---------------------Italian fatality rates---------

italy_rates <- tibble(
  age_grp = rep(c('0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+'), 2),
  sex = rep(c("Male", "Female"), each = 10),
  cfr = c(0, 0, 0, 0.6, 0.7, 1.7, 6.0, 17.8, 26.4, 32.5,
          0, 0, 0, 0.2, 0.4, 0.6, 2.8, 10.7, 19.1, 22.3) / 100,
  age_midpt = rep(c(5, 15, 25, 35, 45, 55, 65, 75, 85, 95), 2)
)

italy_rates %>%
  ggplot(aes(x = age_midpt, y = cfr, colour = sex)) +
  geom_point() +
  geom_text(data = filter(italy_rates, cfr > 0.01),
            aes(label = percent(cfr), y = cfr + 0.012), size = 3) +
  geom_line() +
  scale_x_continuous(breaks = italy_rates$age_midpt, labels = italy_rates$age_grp) +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  scale_colour_manual(values = sex_cols) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(x = "Age group", colour = "", y = "Observed case fatality rate",
       title = "Observed fatality rate of diagnosed COVID-19 cases in Italy to 19 March 2020",
       subtitle = "20,686 men and boys with case fatality rate of 10.3%; 14,378 women and girls with case fatality rate of 6.2%",
       caption = "Source: Istituto Superiore di Sanità, Roma")

#----------------Population rates ------------------
data(popF)
data(popM)

selected_countries <- c("Australia", "Italy", "Timor-Leste", "United States of America", "World",
                        "China", "Brazil", "Japan", "Germany")

age_lu <- tibble(age = unique(popF$age),
                 age_grp = c(rep(unique(italy_rates$age_grp), each = 2), "90+")) %>%
  mutate(age_grp = factor(age_grp, levels = unique(age_grp)))

# Visual check that this shorthand worked ok
# View(age_lu)

pop_2020 <- popF %>%
  mutate(sex = "Female") %>%
  rbind(mutate(popM, sex = "Male")) %>%
  select(country = name, age, pop = `2020`, sex) %>%
  left_join(age_lu, by = "age") %>%
  group_by(country, age_grp, sex) %>%
  summarise(pop = sum(pop)) %>%
  ungroup() %>%
  filter(country %in% selected_countries) %>%
  mutate(country = fct_drop(country)) %>%
  group_by(country) %>%
  mutate(prop = pop / sum(pop)) %>%
  ungroup()

# check no misspellings in countries
stopifnot(sum(!selected_countries %in% unique(pop_2020$country)) == 0)

pop_2020 %>%
  ggplot(aes(x = as.numeric(age_grp), y = prop, colour = sex)) +
  geom_line() +
  facet_wrap(~country) +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = 1:10, labels = levels(pop_2020$age_grp)) +
  scale_colour_manual(values = sex_cols) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Age group",
       y = "",
       colour = "",
       title = "Estimated proportion of the population in 2020",
       subtitle = "By age group and sex",
       caption = "Source: UN World Population Prospects 2019")

#----------Combine fatality rate with population--------------------

the_caption = "Source: Italian case fatality rates to 19 March 2020 from Istituto Superiore di Sanità, Roma, combined with UN World Population Prospects 2019"

projected_cfr <- pop_2020 %>%
  mutate(age_grp = as.character(age_grp)) %>%
  left_join(italy_rates, by = c("age_grp", "sex")) %>%
  group_by(country) %>%
  summarise(cfr = sum(cfr * prop) / sum(prop)) %>%
  ungroup() %>%
  mutate(country = fct_reorder(country, -cfr))

xlabel <- "Case fatality rate if rates observed in Italy applied to each country's total age and sex profile.\nDo not treat these as forecasts of actual case fatality rate."

# Version 1:
projected_cfr %>%
  ggplot(aes(y = country, x = cfr)) +
  geom_point(colour = "steelblue") +
  geom_text(aes(label = percent(cfr, accuracy = 0.1)), nudge_x = 0.001, size = 3) +
  geom_segment(aes(yend = country, xend = 0), colour = "steelblue") +
  scale_x_continuous(label = percent_format(accuracy = 0.1)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank()) +
  labs(subtitle = xlabel,
       y = "",
       title = "Different age profiles can make a big difference to overall fatality rates, based on Italian data",
       x = "Note that in observed situations (eg Italy 8.5% to 19 March 2020), raw case fatality rates are more than double\nthose shown here, suggesting younger cases are either not diagnosed or not occurring.",
       caption = the_caption)

# Version 2, calibrated to actual Italy case fatality rate so far
projected_cfr %>%
  mutate(cfr_adj = cfr / cfr[country == "Italy"] * 0.085) %>%
  ggplot(aes(y = country, x = cfr_adj)) +
  geom_point(colour = "steelblue") +
  geom_text(aes(label = percent(cfr_adj, accuracy = 0.1)), nudge_x = 0.002, size = 3) +
  geom_segment(aes(yend = country, xend = 0), colour = "steelblue") +
  scale_x_continuous(label = percent_format(accuracy = 0.1)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank()) +
  labs(subtitle = xlabel,
       y = "",
       title = "Different age profiles can make a big difference to overall fatality rates, based on Italian data",
       x = "Estimates have been scaled to match Italy's raw case fatality rate to 19 March, to\nreflect likely patterns in younger people's case rate and diagnosis.",
       caption = the_caption)

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.


Modeling Pandemics (3)

Posted: 20 Mar 2020 04:41 AM PDT

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

In Statistical Inference in a Stochastic Epidemic SEIR Model with Control Intervention, a more complex model than the one we've seen yesterday was considered (the so-called SEIR model). Consider a population of size N, and assume that S is the number of susceptible, E the number of exposed, I the number of infectious, and R the number of recovered (or immune) individuals,

\displaystyle{\begin{aligned}{\frac {dS}{dt}}&=-\beta {\frac {I}{N}}S\\[8pt]{\frac {dE}{dt}}&=\beta {\frac {I}{N}}S-aE\\[8pt]{\frac {dI}{dt}}&=aE-bI\\[8pt]{\frac {dR}{dt}}&=bI\end{aligned}}

Between S and I, the transition rate is \beta I, where \beta is the average number of contacts per person per unit time, multiplied by the probability of disease transmission in a contact between a susceptible and an infectious subject. Between I and R, the transition rate is b (simply the rate of recovery or death, that is, the number of recovered or dead during a period of time, divided by the total number of infected over that same period). Finally, the incubation period is a random variable with exponential distribution with parameter a, so that the average incubation period is a^{-1}.
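As a side note, the SEIR system above translates directly into a gradient function in the same style as the code below; a minimal sketch, with illustrative (not fitted) parameter values:

library(deSolve)
SEIR = function(t, Z, p) {
  S = Z[1]; E = Z[2]; I = Z[3]; R = Z[4]
  N = S + E + I + R
  dS = -p["beta"] * I/N * S
  dE = p["beta"] * I/N * S - p["a"] * E
  dI = p["a"] * E - p["b"] * I
  dR = p["b"] * I
  list(c(dS, dE, dI, dR))
}
Z0 = c(S = 1 - 0.001, E = 0.001, I = 0, R = 0)
resol_seir = ode(y = Z0, times = seq(0, 60, by = .1), func = SEIR,
                 parms = c(beta = 0.5, a = 1/7, b = 1/9))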

Probably more interesting, Understanding the dynamics of ebola epidemics suggested a more complex model: susceptible people S; exposed E; infectious, either in the community I or in hospital H; people who have died but are not yet buried F; and finally those who have either recovered or been buried, and are therefore no longer susceptible, R.

Thus, the following dynamic model is considered:

\displaystyle{\begin{aligned}{\frac {dS}{dt}}&=-(\beta_II+\beta_HH+\beta_FF)\frac{S}{N}\\[8pt]\frac {dE}{dt}&=(\beta_II+\beta_HH+\beta_FF)\frac{S}{N}-\alpha E\\[8pt]\frac {dI}{dt}&=\alpha E-\theta\gamma_H I-(1-\theta)(1-\delta)\gamma_RI-(1-\theta)\delta\gamma_FI\\[8pt]\frac {dH}{dt}&=\theta\gamma_HI-\delta\lambda_FH-(1-\delta)\lambda_RH\\[8pt]\frac {dF}{dt}&=(1-\theta)\delta\gamma_FI+\delta\lambda_FH-\nu F\\[8pt]\frac {dR}{dt}&=(1-\theta)(1-\delta)\gamma_RI+(1-\delta)\lambda_RH+\nu F\end{aligned}}

In that model, the parameters are: \alpha^{-1}, the (average) incubation period (7 days); \gamma_H^{-1}, the onset to hospitalization (5 days); \gamma_F^{-1}, the onset to death (9 days); \gamma_R^{-1}, the onset to "recovery" (10 days); \lambda_F^{-1}, the hospitalization to death (4 days); \lambda_R^{-1}, the hospitalization to recovery (5 days); and \nu^{-1}, the death to burial (2 days). Here, the numbers are from Understanding the dynamics of ebola epidemics (in the context of ebola). The other parameters are \beta_I, the transmission rate in the community (0.588), \beta_H, the transmission rate in hospital (0.794), and \beta_F, the transmission rate at funerals (7.653). Thus

epsilon = 0.001
Z = c(S = 1 - epsilon, E = epsilon, I = 0, H = 0, F = 0, R = 0)
p = c(alpha = 1/7*7, theta = 0.81, delta = 0.81, betai = 0.588,
      betah = 0.794, blambdaf = 7.653, N = 1, gammah = 1/5*7,
      gammaf = 1/9.6*7, gammar = 1/10*7, lambdaf = 1/4.6*7,
      lambdar = 1/5*7, nu = 1/2*7)

If \boldsymbol{Z}=(S,E,I,H,F,R) and we write \frac{\partial \boldsymbol{Z}}{\partial t} = SEIHFR(\boldsymbol{Z}), then SEIHFR is

SEIHFR = function(t, Z, p) {
  S = Z[1]; E = Z[2]; I = Z[3]; H = Z[4]; F = Z[5]; R = Z[6]
  alpha = p["alpha"]; theta = p["theta"]; delta = p["delta"]
  betai = p["betai"]; betah = p["betah"]; gammah = p["gammah"]
  gammaf = p["gammaf"]; gammar = p["gammar"]; lambdaf = p["lambdaf"]
  lambdar = p["lambdar"]; nu = p["nu"]; blambdaf = p["blambdaf"]
  N = S + E + I + H + F + R
  dS = -(betai*I + betah*H + blambdaf*F) * S/N
  dE = (betai*I + betah*H + blambdaf*F) * S/N - alpha*E
  # flows out of I: hospitalization, recovery in community, death in community
  dI = alpha*E - theta*gammah*I - (1-theta)*(1-delta)*gammar*I - (1-theta)*delta*gammaf*I
  # flows out of H: death in hospital, recovery in hospital
  dH = theta*gammah*I - delta*lambdaf*H - (1-delta)*lambdar*H
  # F receives deaths (community and hospital) and empties through burial
  dF = (1-theta)*delta*gammaf*I + delta*lambdaf*H - nu*F
  dR = (1-theta)*(1-delta)*gammar*I + (1-delta)*lambdar*H + nu*F
  dZ = c(dS, dE, dI, dH, dF, dR)
  list(dZ)
}

We can solve it, or at least study the dynamics from some starting values

library(deSolve)
times = seq(0, 50, by = .1)
resol = ode(y = Z, times = times, func = SEIHFR, parms = p)

For instance, the proportion of people infected is the following

plot(resol[, "time"], resol[, "I"], type = "l", xlab = "time", ylab = "", col = "red")
lines(resol[, "time"], resol[, "H"], col = "blue")

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.


Modeling pandemics (1)

Posted: 19 Mar 2020 06:48 PM PDT

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

The most popular model of epidemics is the so-called SIR model, or Kermack-McKendrick model. Consider a population of size N, and assume that S is the number of susceptible, I the number of infectious, and R the number of recovered (or immune) individuals,

\displaystyle {\begin{aligned}&{\frac {dS}{dt}}=-{\frac {\beta IS}{N}},\\[6pt]&{\frac {dI}{dt}}={\frac {\beta IS}{N}}-\gamma I,\\[6pt]&{\frac {dR}{dt}}=\gamma I,\end{aligned}}

so that

\displaystyle{{\frac{dS}{dt}}+{\frac {dI}{dt}}+{\frac {dR}{dt}}=0,}

which implies that S+I+R=N. In order to be more realistic, consider some (constant) birth rate \mu, so that the model becomes

\displaystyle {\begin{aligned}&{\frac {dS}{dt}}=\mu(N-S)-{\frac {\beta IS}{N}},\\[6pt]&{\frac {dI}{dt}}={\frac {\beta IS}{N}}-(\gamma+\mu) I,\\[6pt]&{\frac {dR}{dt}}=\gamma I-\mu R,\end{aligned}}

Note that in this model people get sick (infected) but they do not die; they recover. So here we can model chickenpox, for instance, not SARS.

The dynamics of the infectious class depends on the following ratio:

\displaystyle{R_{0}={\frac {\beta }{\gamma +\mu}}}

which is the so-called basic reproduction number (or reproductive ratio). The effective reproductive ratio is R_0S/N, and the turnover of the epidemic happens exactly when R_0S/N=1, i.e. when the fraction of remaining susceptibles is R_0^{-1}. As shown in Directly transmitted infectious diseases: Control by vaccination, if S/N < R_0^{-1}, the disease (the number of people infected) will start to decrease.

Want to see it? Start with parameters such that R_0=4 (we set \beta=2, \gamma=1/2 and \mu=0 below). We also need starting values

epsilon = .001
N = 1
S = 1 - epsilon
I = epsilon
R = 0

Then use the ordinary differential equation solver in R. The idea is to write \boldsymbol{Z}=(S,I,R), with gradient \frac{\partial \boldsymbol{Z}}{\partial t} = SIR(\boldsymbol{Z}), where SIR is a function of the various parameters. Hence, set

p = c(mu = 0, N = 1, beta = 2, gamma = 1/2)
start_SIR = c(S = 1 - epsilon, I = epsilon, R = 0)

Then we must define the time grid, and the function that returns the gradient,

times = seq(0, 10, by = .1)
SIR = function(t, Z, p) {
  S = Z[1]; I = Z[2]; R = Z[3]; N = S + I + R
  mu = p["mu"]; beta = p["beta"]; gamma = p["gamma"]
  dS = mu * (N - S) - beta * S * I/N
  dI = beta * S * I/N - (mu + gamma) * I
  dR = gamma * I - mu * R
  dZ = c(dS, dI, dR)
  # deSolve::ode expects the gradient wrapped in a list
  return(list(dZ))
}

To solve this problem use

library(deSolve)
resol = ode(y = start_SIR, times = times, func = SIR, parms = p)

We can visualize the dynamics below

par(mfrow = c(1, 2))
t = resol[, "time"]
plot(t, resol[, "S"], type = "l", xlab = "time", ylab = "")
lines(t, resol[, "I"], col = "red")
lines(t, resol[, "R"], col = "blue")
plot(t, t*0 + 1, type = "l", xlab = "time", ylab = "", ylim = 0:1)
polygon(c(t, rev(t)), c(resol[, "R"], rep(0, nrow(resol))), col = "blue")
polygon(c(t, rev(t)), c(resol[, "R"] + resol[, "I"], rev(resol[, "R"])), col = "red")

We can also visualize the effective reproductive number R_0S/N, where

R0 = p["beta"] / (p["gamma"] + p["mu"])

The effective reproductive number is on the left and, as we mentioned above, when it reaches 1 we reach the maximum number of infected,

plot(t, resol[, "S"] * R0, type = "l", xlab = "time", ylab = "")
abline(h = 1, lty = 2, col = "red")
abline(v = max(t[resol[, "S"] * R0 >= 1]), col = "darkgreen")
points(max(t[resol[, "S"] * R0 >= 1]), 1, pch = 19)
plot(t, resol[, "S"], type = "l", xlab = "time", ylab = "", col = "grey")
lines(t, resol[, "I"], col = "red", lwd = 3)
lines(t, resol[, "R"], col = "lightblue")
abline(v = max(t[resol[, "S"] * R0 >= 1]), col = "darkgreen")
points(max(t[resol[, "S"] * R0 >= 1]), max(resol[, "I"]), pch = 19)

And when adding a \mu parameter, we can obtain some interesting dynamics on the number of infected,

times = seq(0, 100, by = .1)
p = c(mu = 1/100, N = 1, beta = 50, gamma = 10)
start_SIR = c(S = 0.19, I = 0.01, R = 0.8)
resol = ode(y = start_SIR, times = times, func = SIR, parms = p)
plot(resol[, "time"], resol[, "I"], type = "l", xlab = "time", ylab = "")

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.


Apply for the Google Summer of Code and Help Us Improve the R Code Optimizer

Posted: 19 Mar 2020 05:00 PM PDT

[This article was first published on Pachá, and kindly contributed to R-bloggers.]

Are you a BSc, MSc, or PhD student who would like to contribute to open source this summer (winter down south) while earning some cash?

Then you will be interested to know that @CancuCS and I are going to mentor for the Google Summer of Code 2020 this year.

Take a look at The R Code Optimizer, apply, and help us grow rco, the R code optimizer package.

To leave a comment for the author, please follow the link and comment on their blog: Pachá.


The significance of population size, year, and per cent women on the education level in Sweden

Posted: 19 Mar 2020 05:00 PM PDT

[This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers.]

In twelve posts I have analysed how different factors are related to salaries in Sweden, with data from Statistics Sweden. In this post, I will analyse a new dataset from Statistics Sweden: population by region, age, level of education, sex, and year. Not knowing exactly what I will find, I will use a criterion-based procedure to find the model that minimises the AIC; a sketch of the approach follows below. Then I will perform some tests to see how robust the model is. Finally, I will plot the findings.
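As a sketch of what "criterion-based" means here, MASS::stepAIC can search the model space for the fit that minimises the AIC. This is an illustration of the approach, not the exact search performed in this post, and it assumes the tbnum data frame constructed further below:

tbtest <- tbnum %>%
  dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women)
model_full <- lm(eduyears ~ sum_pop + sum_edu_region_year + year_n + perc_women,
                 data = tbtest)
MASS::stepAIC(model_full, direction = "both")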

First, define libraries and functions.

library(tidyverse)

## -- Attaching packages -------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(broom)
library(car)

## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
##     recode
## The following object is masked from 'package:purrr':
##
##     some

library(sjPlot)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car
##   influence.merMod                car
##   dfbeta.influence.merMod         car
##   dfbetas.influence.merMod        car

library(leaps)
library(splines)
library(MASS)

##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select

library(mgcv)

## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.

library(lmtest)

## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric

library(earth)

## Warning: package 'earth' was built under R version 3.6.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 3.6.3
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 3.6.3

library(acepack)

## Warning: package 'acepack' was built under R version 3.6.3

library(lspline)

## Warning: package 'lspline' was built under R version 3.6.3

library(lme4)

## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
##     expand, pack, unpack
##
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
##
##     lmList

library(pROC)

## Warning: package 'pROC' was built under R version 3.6.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

readfile <- function(file1) {
  read_csv(file1, col_types = cols(),
           locale = readr::locale(encoding = "latin1"),
           na = c("..", "NA")) %>%
    gather(starts_with("19"), starts_with("20"),
           key = "year", value = groupsize) %>%
    drop_na() %>%
    mutate(year_n = parse_number(year))
}

perc_women <- function(x) {
  ifelse(length(x) == 2, x[2] / (x[1] + x[2]), NA)
}

nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 3, 4))

The data table is downloaded from Statistics Sweden, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/. It is saved as a comma-delimited file without a heading, UF0506A1.csv.

I will calculate the percentage of women for the different education levels in the different regions for each year. In my later analysis, I will use the number of people in each education level, region, and year.

The table: Population 16-74 years of age by region, highest level of education, age and sex. Year 1985 – 2018, NUTS 2 level 2008, 10-year intervals (16-74).

tb <- readfile("UF0506A1.csv") %>%
  mutate(edulevel = `level of education`) %>%
  group_by(edulevel, region, year, sex) %>%
  mutate(groupsize_all_ages = sum(groupsize)) %>%
  group_by(edulevel, region, year) %>%
  mutate(sum_edu_region_year = sum(groupsize)) %>%
  mutate(perc_women = perc_women(groupsize_all_ages[1:2])) %>%
  group_by(region, year) %>%
  mutate(sum_pop = sum(groupsize)) %>%
  rowwise() %>%
  mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>%
  rowwise() %>%
  mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>%
  mutate(age_n = (age_l + age_h) / 2) %>%
  left_join(nuts %>% distinct(NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en"))

## Warning: Column `region`/`NUTS2_en` joining character vector and factor,
## coercing into character vector
numedulevel <- read.csv("edulevel_1.csv")

numedulevel %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Initial approach, length of education')
Table 1: Initial approach, length of education
level.of.education eduyears
primary and secondary education less than 9 years (ISCED97 1) 8
primary and secondary education 9-10 years (ISCED97 2) 9
upper secondary education, 2 years or less (ISCED97 3C) 11
upper secondary education 3 years (ISCED97 3A) 12
post-secondary education, less than 3 years (ISCED97 4+5B) 14
post-secondary education 3 years or more (ISCED97 5A) 15
post-graduate education (ISCED97 6) 19
no information about level of educational attainment NA
tbnum <- tb %>%
  right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
  filter(!is.na(eduyears)) %>%
  drop_na()

## Warning: Column `level of education`/`level.of.education` joining character
## vector and factor, coercing into character vector
tbnum %>%
  ggplot() +
  geom_point(mapping = aes(x = NUTS2_sh, y = perc_women, colour = year_n)) +
  facet_grid(. ~ eduyears)

Figure 1: Population by region, level of education, percent women and year, Year 1985 – 2018

summary(tbnum)
##     region              age            level of education     sex
##  Length:22848       Length:22848       Length:22848       Length:22848
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##      year             groupsize         year_n       edulevel
##  Length:22848       Min.   :    0   Min.   :1985   Length:22848
##  Class :character   1st Qu.: 1634   1st Qu.:1993   Class :character
##  Mode  :character   Median : 5646   Median :2002   Mode  :character
##                     Mean   : 9559   Mean   :2002
##                     3rd Qu.:14027   3rd Qu.:2010
##                     Max.   :77163   Max.   :2018
##  groupsize_all_ages sum_edu_region_year   perc_women        sum_pop
##  Min.   :    45     Min.   :   366      Min.   :0.1230   Min.   : 266057
##  1st Qu.: 20033     1st Qu.: 40482      1st Qu.:0.4416   1st Qu.: 515306
##  Median : 45592     Median : 90871      Median :0.4816   Median : 740931
##  Mean   : 57353     Mean   :114706      Mean   :0.4641   Mean   : 823034
##  3rd Qu.: 86203     3rd Qu.:172120      3rd Qu.:0.5217   3rd Qu.:1195658
##  Max.   :271889     Max.   :486270      Max.   :0.6423   Max.   :1716160
##      age_l           age_h        age_n         NUTS2_sh
##  Min.   :16.00   Min.   :24   Min.   :20.00   Length:22848
##  1st Qu.:25.00   1st Qu.:34   1st Qu.:29.50   Class :character
##  Median :40.00   Median :49   Median :44.50   Mode  :character
##  Mean   :40.17   Mean   :49   Mean   :44.58
##  3rd Qu.:55.00   3rd Qu.:64   3rd Qu.:59.50
##  Max.   :65.00   Max.   :74   Max.   :69.50
##     eduyears
##  Min.   : 8.00
##  1st Qu.: 9.00
##  Median :12.00
##  Mean   :12.57
##  3rd Qu.:15.00
##  Max.   :19.00

In a previous post, I approximated the number of years of education for every education level. Since this approximation matters for the rest of the analysis, I will see if I can do better. I use Multivariate Adaptive Regression Splines (MARS) to find the combination of years of education, within the given constraints, that gives the highest R-Squared value. I chose not to calculate combinations outside the range of 7 to 19 years because I assessed that it would take too much time. From the table, we can see that only post-graduate education would gain from a value above that range. A manual test shows that setting its years of education to 22 gives a higher R-Squared without producing large residuals (a sketch of this check follows the recalculated table below).

educomb <- as_tibble(t(combn(7:19, 7))) %>%
  filter((V6 - V4) > 2) %>%
  filter((V4 - V2) > 2) %>%
  filter(V2 > 8) %>%
  mutate(na = NA)
## Warning: `as_tibble.matrix()` requires a matrix with column names or a
## `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.
summary_table <- vector()

# For each candidate combination of years of education, refit the MARS
# model and collect its R-Squared
for (i in 1:dim(educomb)[1]) {
  numedulevel[, 2] <- t(educomb[i, ])

  suppressWarnings(tbnum <- tb %>%
    right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
    filter(!is.na(eduyears)) %>%
    drop_na())

  tbtest <- tbnum %>%
    dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women)

  mmod <- earth(eduyears ~ ., data = tbtest, nk = 12, degree = 2)

  summary_table <- rbind(summary_table, summary(mmod)$rsq)
}

which.max(summary_table)
## [1] 235
educomb[which.max(summary_table),] #235
## # A tibble: 1 x 8
##      V1    V2    V3    V4    V5    V6    V7 na   
##   <int> <int> <int> <int> <int> <int> <int> <lgl>
## 1     8     9    10    12    13    15    19 NA
numedulevel[, 2] <- t(educomb[235, ])
numedulevel[7, 2] <- 22

numedulevel %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Recalculated length of education')
Table 2: Recalculated length of education
level.of.education eduyears
primary and secondary education less than 9 years (ISCED97 1) 8
primary and secondary education 9-10 years (ISCED97 2) 9
upper secondary education, 2 years or less (ISCED97 3C) 10
upper secondary education 3 years (ISCED97 3A) 12
post-secondary education, less than 3 years (ISCED97 4+5B) 13
post-secondary education 3 years or more (ISCED97 5A) 15
post-graduate education (ISCED97 6) 22
no information about level of educational attainment NA
tbnum <- tb %>%
  right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
  filter(!is.na(eduyears)) %>%
  drop_na()
## Warning: Column `level of education`/`level.of.education` joining character
## vector and factor, coercing into character vector
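
The manual test mentioned above can be sketched as follows. This is a hypothetical re-run, not code from the original analysis: it refits the MARS model with the post-graduate level (row 7 of numedulevel) set to 19 and then 22 years, reusing the objects and settings from the search loop, and compares the R-Squared values.

# Hypothetical check: post-graduate education at 19 versus 22 years
for (pg in c(19, 22)) {
  numedulevel[7, 2] <- pg

  suppressWarnings(tbcheck <- tb %>%
    right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
    filter(!is.na(eduyears)) %>%
    drop_na() %>%
    dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women))

  mfit <- earth(eduyears ~ ., data = tbcheck, nk = 12, degree = 2)

  cat("post-graduate =", pg, "years, RSq =", summary(mfit)$rsq, "\n")
}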

Let's investigate the shape of the relationship between the response and the predictors. The shape of these relationships has a great impact on the rest of the analysis. I use acepack to fit an ACE model and plot the optimal transformations of both the response and the predictors.

tbtest <- tbnum %>% dplyr::select(sum_pop, sum_edu_region_year, year_n, perc_women)

tbtest <- data.frame(tbtest)

acefit <- ace(tbtest, tbnum$eduyears)

plot(tbnum$eduyears, acefit$ty,
  xlab = "Years of education", ylab = "transformed years of education")

Figure 2: Plots of the response and predictors using acepack

plot(tbtest[,1], acefit$tx[,1], xlab = "Population in region", ylab = "transformed population in region")

Figure 3: Plots of the response and predictors using acepack

plot(tbtest[,2], acefit$tx[,2], xlab = "# persons with same edulevel, region, year", ylab = "transformed # persons with same edulevel, region, year")

Figure 4: Plots of the response and predictors using acepack

plot(tbtest[,3], acefit$tx[,3], xlab = "Year", ylab = "transformed year")

Figure 5: Plots of the response and predictors using acepack

plot(tbtest[,4], acefit$tx[,4], xlab = "Percent women", ylab = "transformed percent women")

Figure 6: Plots of the response and predictors using acepack
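
As a numeric companion to these plots, the overall R-Squared of the ACE fit summarises how linear the relationship becomes after the optimal transformations. This is a sketch, assuming the fitted ace object exposes an rsq component as described in the acepack documentation:

# Multiple R-squared of the transformed regression (assumed acepack component)
acefit$rsq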

I use MARS to fit hockey-stick functions for the predictors. I do not wish to overfit by using a finer approximation at this point. I will include interactions of degree two.
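
For reference, the hockey-stick terms that earth reports, written h(x - c) and h(c - x), are hinge functions: zero on one side of the knot c and linear on the other. A minimal sketch:

# Hinge function: zero below the knot, linear above it
h <- function(x) pmax(0, x)

# e.g. the term h(perc_women - 0.492816) from the fitted model below
hinge_perc_women <- function(perc_women) h(perc_women - 0.492816)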

tbtest <- tbnum %>% dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women)

mmod <- earth(eduyears ~ ., data = tbtest, nk = 9, degree = 2)

summary (mmod)
## Call: earth(formula=eduyears~., data=tbtest, degree=2, nk=9)
## 
##                                                       coefficients
## (Intercept)                                               9.930701
## h(37001-sum_edu_region_year)                              0.000380
## h(sum_edu_region_year-37001)                              0.000003
## h(0.492816-perc_women)                                    9.900436
## h(perc_women-0.492816)                                   49.719932
## h(1.32988e+06-sum_pop) * h(37001-sum_edu_region_year)     0.000000
## h(sum_pop-1.32988e+06) * h(37001-sum_edu_region_year)     0.000000
## h(sum_edu_region_year-37001) * h(2004-year_n)            -0.000001
## 
## Selected 8 of 9 terms, and 4 of 4 predictors
## Termination condition: Reached nk 9
## Importance: sum_edu_region_year, perc_women, sum_pop, year_n
## Number of terms at each degree of interaction: 1 4 3
## GCV 3.774465    RSS 86099.37    GRSq 0.8049234    RSq 0.8052222
plot (mmod)

Figure 7: Hockey-stick functions fit with MARS for the predictors, Year 1985 – 2018

plotmo (mmod)
##  plotmo grid:    sum_pop sum_edu_region_year year_n perc_women
##                   740931             90870.5 2001.5  0.4815703

Figure 8: Hockey-stick functions fit with MARS for the predictors, Year 1985 – 2018

model_mmod <- lm(eduyears ~ lspline(sum_edu_region_year, c(37001)) +
    lspline(perc_women, c(0.492816)) +
    lspline(sum_pop, c(1.32988e+06)):lspline(sum_edu_region_year, c(37001)) +
    lspline(sum_edu_region_year, c(1.32988e+06)):lspline(year_n, c(2004)),
  data = tbnum)

summary(model_mmod)$r.squared
## [1] 0.7792166
anova (model_mmod)
## Analysis of Variance Table
## 
## Response: eduyears
##                                                                        Df
## lspline(sum_edu_region_year, c(37001))                                  2
## lspline(perc_women, c(0.492816))                                        2
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))     4
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))       2
## Residuals                                                           22837
##                                                                     Sum Sq
## lspline(sum_edu_region_year, c(37001))                              292982
## lspline(perc_women, c(0.492816))                                     39071
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))   9629
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))     2763
## Residuals                                                            97595
##                                                                     Mean Sq
## lspline(sum_edu_region_year, c(37001))                               146491
## lspline(perc_women, c(0.492816))                                      19535
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))    2407
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))      1382
## Residuals                                                                 4
##                                                                      F value
## lspline(sum_edu_region_year, c(37001))                              34278.55
## lspline(perc_women, c(0.492816))                                     4571.22
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))   563.27
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))     323.30
## Residuals                                                                   
##                                                                        Pr(>F)
## lspline(sum_edu_region_year, c(37001))                              < 2.2e-16
## lspline(perc_women, c(0.492816))                                    < 2.2e-16
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880)) < 2.2e-16
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))   < 2.2e-16
## Residuals                                                                    
##                                                                         
## lspline(sum_edu_region_year, c(37001))                              ***
## lspline(perc_women, c(0.492816))                                    ***
## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880)) ***
## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))   ***
## Residuals                                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I will use regsubsets to find the model that minimises the AIC. I will also calculate the multi-class Receiver Operating Characteristic (ROC), treating each level of years of education as a class, for the model I find.
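
For reference, the criterion computed in the chunk below is AIC = n * log(RSS / n) + 2p for models with p = 2 to 21 coefficients, with n fixed at 50. A variant using the actual number of observations, which may select a different model size, would be this sketch (run after the chunk below, since it reuses rs):

# AIC with the actual sample size instead of the fixed n = 50
n <- nrow(tbnum)
AIC_n <- n * log(rs$rss / n) + (2:21) * 2
which.min(AIC_n)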

b <- regsubsets(eduyears ~
    (lspline(sum_pop, c(1.32988e+06)) + lspline(perc_women, c(0.492816)) +
     lspline(year_n, c(2004)) + lspline(sum_edu_region_year, c(37001))) *
    (lspline(sum_pop, c(1.32988e+06)) + lspline(perc_women, c(0.492816)) +
     lspline(year_n, c(2004)) + lspline(sum_edu_region_year, c(37001))),
  data = tbnum, nvmax = 20)

rs <- summary(b)

# AIC = n * log(RSS / n) + 2 * p, with n fixed at 50 here
AIC <- 50 * log(rs$rss / 50) + (2:21) * 2

which.min(AIC)
## [1] 9
names (rs$which[9,])[rs$which[9,]]
##  [1] "(Intercept)"                                                              
##  [2] "lspline(sum_pop, c(1329880))1"                                            
##  [3] "lspline(sum_edu_region_year, c(37001))2"                                  
##  [4] "lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1"          
##  [5] "lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1"                  
##  [6] "lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1"    
##  [7] "lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1"              
##  [8] "lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1"              
##  [9] "lspline(perc_women, c(0.492816))1:lspline(sum_edu_region_year, c(37001))2"
## [10] "lspline(year_n, c(2004))1:lspline(sum_edu_region_year, c(37001))2"
model <- lm(eduyears ~
    lspline(sum_pop, c(1329880)) +
    lspline(sum_edu_region_year, c(37001)) +
    lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
    lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
    lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
    lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
    lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
    lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)),
  data = tbnum)

summary(model)$r.squared
## [1] 0.8455547
anova (model)
## Analysis of Variance Table
## 
## Response: eduyears
##                                                                            Df
## lspline(sum_pop, c(1329880))                                                2
## lspline(sum_edu_region_year, c(37001))                                      2
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))               4
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       4
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))         4
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                   4
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))     4
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))             4
## Residuals                                                               22819
##                                                                         Sum Sq
## lspline(sum_pop, c(1329880))                                                 0
## lspline(sum_edu_region_year, c(37001))                                  306779
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            35378
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      775
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       7224
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 8932
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   6979
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           7700
## Residuals                                                                68271
##                                                                         Mean Sq
## lspline(sum_pop, c(1329880))                                                  0
## lspline(sum_edu_region_year, c(37001))                                   153389
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))              8844
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       194
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))        1806
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                  2233
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))    1745
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))            1925
## Residuals                                                                     3
##                                                                          F value
## lspline(sum_pop, c(1329880))                                                0.00
## lspline(sum_edu_region_year, c(37001))                                  51269.26
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            2956.20
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      64.80
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       603.67
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 746.37
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   583.19
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           643.44
## Residuals                                                                       
##                                                                         Pr(>F)
## lspline(sum_pop, c(1329880))                                                 1
## lspline(sum_edu_region_year, c(37001))                                  <2e-16
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))           <2e-16
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                   <2e-16
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))     <2e-16
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))               <2e-16
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816)) <2e-16
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))         <2e-16
## Residuals                                                                     
##                                                                             
## lspline(sum_pop, c(1329880))                                                
## lspline(sum_edu_region_year, c(37001))                                  ***
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))           ***
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                   ***
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))     ***
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))               ***
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816)) ***
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))         ***
## Residuals                                                                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot (model)

Figure 9: Find the model that minimises the AIC, Year 1985 – 2018


Figure 10: Find the model that minimises the AIC, Year 1985 – 2018


Figure 11: Find the model that minimises the AIC, Year 1985 – 2018


Figure 12: Find the model that minimises the AIC, Year 1985 – 2018

tbnumpred <- bind_cols(tbnum, as_tibble(predict(model, tbnum, interval = "confidence")))

suppressWarnings(multiclass.roc(tbnumpred$eduyears, tbnumpred$fit))
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## 
## Call:
## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$fit)
## 
## Data: tbnumpred$fit with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.
## Multi-class area under the curve: 0.8743

There are a few things I would like to investigate to improve the credibility of the analysis. First, this is a longitudinal study: a great proportion of the same people are measured each year, and the majority of the people in a region remain in that region from year to year. I will therefore treat each year of birth and each region as a group, modelling them as random effects and the rest of the predictors as fixed effects. I use the mean age in each age group to derive the year of birth.

temp <- tbnum %>%
  mutate(yob = year_n - age_n) %>%
  mutate(region = tbnum$region)

mmodel <- lmer(eduyears ~
    lspline(sum_pop, c(1329880)) +
    lspline(sum_edu_region_year, c(37001)) +
    lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
    lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
    lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
    lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
    lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
    lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)) +
    (1 | yob) +
    (1 | region),
  data = temp)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## boundary (singular) fit: see ?isSingular
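
The singular fit corresponds to the variance of the yob random intercept being estimated as zero (visible in the model summary further down). A minimal sketch of how one might confirm this and simplify the model, assuming the lme4 fit object mmodel:

# TRUE when the fit is on the boundary of the parameter space
isSingular(mmodel)

# One possible simplification: drop the random intercept for yob
# mmodel2 <- update(mmodel, . ~ . - (1 | yob))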
plot(mmodel)

Figure 13: A diagnostic plot of the model with random effects components

qqnorm (residuals(mmodel), main="")

Figure 14: A diagnostic plot of the model with random effects components

summary (mmodel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## eduyears ~ lspline(sum_pop, c(1329880)) + lspline(sum_edu_region_year, 
##     c(37001)) + lspline(sum_pop, c(1329880)):lspline(perc_women, 
##     c(0.492816)) + lspline(sum_pop, c(1329880)):lspline(year_n, 
##     c(2004)) + lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, 
##     c(37001)) + lspline(perc_women, c(0.492816)):lspline(year_n, 
##     c(2004)) + lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, 
##     c(37001)) + lspline(year_n, c(2004)):lspline(sum_edu_region_year, 
##     c(37001)) + (1 | yob) + (1 | region)
##    Data: temp
## 
## REML criterion at convergence: 90514.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1175 -0.5978 -0.0137  0.5766  2.8735 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  yob      (Intercept) 0.000    0.000   
##  region   (Intercept) 1.115    1.056   
##  Residual             2.970    1.723   
## Number of obs: 22848, groups:  yob, 108; region, 8
## 
## Fixed effects:
##                                                                             Estimate
## (Intercept)                                                                2.516e+01
## lspline(sum_pop, c(1329880))1                                              1.514e-04
## lspline(sum_pop, c(1329880))2                                              2.912e-03
## lspline(sum_edu_region_year, c(37001))1                                    2.314e-03
## lspline(sum_edu_region_year, c(37001))2                                   -2.288e-03
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            5.502e-05
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            7.840e-05
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           -2.061e-05
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            1.467e-05
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -7.788e-08
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                   -1.428e-06
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -3.009e-07
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.430e-07
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -4.707e-10
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -2.387e-09
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      2.554e-13
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      1.137e-12
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -1.659e-02
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                3.580e-02
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.888e-01
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -1.008e+00
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  9.201e-05
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -4.149e-04
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2 -1.441e-04
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.086e-04
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         -1.211e-06
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          1.240e-06
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         -2.615e-06
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.146e-06
##                                                                           Std. Error
## (Intercept)                                                                6.548e-01
## lspline(sum_pop, c(1329880))1                                              1.494e-05
## lspline(sum_pop, c(1329880))2                                              6.394e-03
## lspline(sum_edu_region_year, c(37001))1                                    3.150e-04
## lspline(sum_edu_region_year, c(37001))2                                    7.229e-05
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            1.344e-06
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            1.213e-05
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            2.853e-06
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            1.540e-05
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    7.362e-09
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    3.191e-06
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    1.349e-08
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    7.352e-08
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      9.596e-12
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      8.271e-11
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      7.991e-13
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      2.836e-12
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                4.545e-04
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                4.504e-03
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.671e-02
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                9.737e-02
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  2.688e-05
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  1.117e-05
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  2.526e-04
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.429e-05
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          1.586e-07
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          3.623e-08
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          4.441e-07
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          6.085e-08
##                                                                           t value
## (Intercept)                                                                38.420
## lspline(sum_pop, c(1329880))1                                              10.137
## lspline(sum_pop, c(1329880))2                                               0.455
## lspline(sum_edu_region_year, c(37001))1                                     7.345
## lspline(sum_edu_region_year, c(37001))2                                   -31.645
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            40.921
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1             6.463
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            -7.226
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2             0.952
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -10.579
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    -0.448
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -22.303
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                     1.945
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -49.052
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -28.855
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2       0.320
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2       0.401
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -36.497
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                 7.949
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                10.593
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -10.350
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1   3.423
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -37.150
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  -0.571
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2   7.602
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          -7.639
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          34.226
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          -5.887
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          18.833
## 
## Correlation matrix not shown by default, as p = 29 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## convergence code: 0
## boundary (singular) fit: see ?isSingular
anova (mmodel)
## Analysis of Variance Table
##                                                                         Df
## lspline(sum_pop, c(1329880))                                             2
## lspline(sum_edu_region_year, c(37001))                                   2
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            4
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                    4
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))      4
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                4
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))  4
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))          4
##                                                                         Sum Sq
## lspline(sum_pop, c(1329880))                                                 0
## lspline(sum_edu_region_year, c(37001))                                  308190
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            35415
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      589
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       7737
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 8202
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   7316
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           6809
##                                                                         Mean Sq
## lspline(sum_pop, c(1329880))                                                  0
## lspline(sum_edu_region_year, c(37001))                                   154095
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))              8854
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       147
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))        1934
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                  2051
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))    1829
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))            1702
##                                                                           F value
## lspline(sum_pop, c(1329880))                                                0.000
## lspline(sum_edu_region_year, c(37001))                                  51879.188
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            2980.805
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      49.613
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       651.234
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 690.377
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   615.763
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           573.138
tbnumpred <- bind_cols(temp, as_tibble(predict(mmodel, temp, interval = "confidence")))
## Warning in predict.merMod(mmodel, temp, interval = "confidence"): unused
## arguments ignored
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
suppressWarnings (multiclass.roc (tbnumpred$eduyears, tbnumpred$value))
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## 
## Call:
## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$value)
## 
## Data: tbnumpred$value with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.
## Multi-class area under the curve: 0.8754

Another problem could be that the response variable is limited in its range. To get more insight into this issue, we can model the response with Poisson regression.
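
One quick sanity check on the Poisson assumption is the dispersion ratio, residual deviance divided by residual degrees of freedom, which should be near 1 for a well-specified model. A sketch, assuming the fitted object pmodel from the chunk below:

# Dispersion check: residual deviance / residual degrees of freedom
deviance(pmodel) / df.residual(pmodel)

With the residual deviance reported further down (5899.4 on 22819 degrees of freedom) this ratio is about 0.26, which points to underdispersion rather than overdispersion.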

pmodel <- glm(eduyears ~
    lspline(sum_pop, c(1329880)) +
    lspline(sum_edu_region_year, c(37001)) +
    lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
    lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
    lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
    lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
    lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
    lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)),
  family = poisson,
  data = tbnum)

plot (pmodel)

Figure 15: A diagnostic plot of Poisson regression


Figure 16: A diagnostic plot of Poisson regression


Figure 17: A diagnostic plot of Poisson regression


Figure 18: A diagnostic plot of Poisson regression

tbnumpred <- bind_cols(tbnum, as_tibble(predict(pmodel, tbnum, interval = "confidence")))

suppressWarnings(multiclass.roc(tbnumpred$eduyears, tbnumpred$value))
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## 
## Call:
## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$value)
## 
## Data: tbnumpred$value with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.
## Multi-class area under the curve: 0.8716
summary (pmodel)
## 
## Call:
## glm(formula = eduyears ~ lspline(sum_pop, c(1329880)) + lspline(sum_edu_region_year, 
##     c(37001)) + lspline(sum_pop, c(1329880)):lspline(perc_women, 
##     c(0.492816)) + lspline(sum_pop, c(1329880)):lspline(year_n, 
##     c(2004)) + lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, 
##     c(37001)) + lspline(perc_women, c(0.492816)):lspline(year_n, 
##     c(2004)) + lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, 
##     c(37001)) + lspline(year_n, c(2004)):lspline(sum_edu_region_year, 
##     c(37001)), family = poisson, data = tbnum)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.32031  -0.33091  -0.01716   0.30301   1.40215  
## 
## Coefficients:
##                                                                             Estimate
## (Intercept)                                                                3.403e+00
## lspline(sum_pop, c(1329880))1                                              5.825e-06
## lspline(sum_pop, c(1329880))2                                             -8.868e-05
## lspline(sum_edu_region_year, c(37001))1                                    3.722e-04
## lspline(sum_edu_region_year, c(37001))2                                   -2.310e-04
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            3.838e-06
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            8.103e-06
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           -2.276e-06
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2           -3.732e-06
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -3.188e-09
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    4.535e-08
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -2.600e-08
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.616e-08
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -2.870e-11
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -1.718e-10
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     -2.527e-13
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2     -2.193e-14
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -9.758e-04
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                2.556e-03
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.188e-02
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -1.221e-01
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 -1.020e-05
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -2.991e-05
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  1.916e-05
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.271e-05
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         -1.874e-07
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          1.224e-07
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         -1.952e-07
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.122e-07
##                                                                           Std. Error
## (Intercept)                                                                3.236e-02
## lspline(sum_pop, c(1329880))1                                              1.792e-06
## lspline(sum_pop, c(1329880))2                                              9.916e-04
## lspline(sum_edu_region_year, c(37001))1                                    4.837e-05
## lspline(sum_edu_region_year, c(37001))2                                    1.222e-05
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            1.962e-07
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            2.131e-06
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            4.682e-07
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            2.516e-06
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    9.022e-10
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    4.948e-07
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    1.917e-09
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.155e-08
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      1.422e-12
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      1.343e-11
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      1.161e-13
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      4.747e-13
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                6.510e-05
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                6.648e-04
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                5.260e-03
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                1.564e-02
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  4.161e-06
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  1.813e-06
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  3.734e-05
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  2.408e-06
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          2.435e-08
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          6.124e-09
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          6.510e-08
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.002e-08
##                                                                           z value
## (Intercept)                                                               105.166
## lspline(sum_pop, c(1329880))1                                               3.251
## lspline(sum_pop, c(1329880))2                                              -0.089
## lspline(sum_edu_region_year, c(37001))1                                     7.694
## lspline(sum_edu_region_year, c(37001))2                                   -18.907
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            19.559
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1             3.803
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            -4.861
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            -1.483
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    -3.534
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                     0.092
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -13.558
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                     1.400
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -20.183
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -12.790
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      -2.176
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      -0.046
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -14.991
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                 3.845
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                 6.060
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                -7.810
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  -2.451
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -16.498
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2   0.513
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2   5.280
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          -7.698
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          19.994
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          -2.998
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          11.202
##                                                                           Pr(>|z|)
## (Intercept)                                                                < 2e-16
## lspline(sum_pop, c(1329880))1                                             0.001151
## lspline(sum_pop, c(1329880))2                                             0.928739
## lspline(sum_edu_region_year, c(37001))1                                   1.42e-14
## lspline(sum_edu_region_year, c(37001))2                                    < 2e-16
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            < 2e-16
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1           0.000143
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           1.17e-06
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2           0.138097
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   0.000410
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                   0.926973
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    < 2e-16
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                   0.161556
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      < 2e-16
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      < 2e-16
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     0.029521
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2     0.963157
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                < 2e-16
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1               0.000121
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2               1.36e-09
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               5.70e-15
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 0.014246
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  < 2e-16
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2 0.607856
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2 1.29e-07
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         1.39e-14
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          < 2e-16
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         0.002713
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          < 2e-16
##                                                                              
## (Intercept)                                                               ***
## lspline(sum_pop, c(1329880))1                                             ** 
## lspline(sum_pop, c(1329880))2                                                
## lspline(sum_edu_region_year, c(37001))1                                   ***
## lspline(sum_edu_region_year, c(37001))2                                   ***
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1           ***
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1           ***
## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           ***
## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2              
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   ***
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                      
## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   ***
## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                      
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     ***
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     ***
## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     *  
## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2        
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               ***
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1               ***
## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2               ***
## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               ***
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 *  
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 ***
## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2    
## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2 ***
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         ***
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1         ***
## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         ** 
## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 32122.2  on 22847  degrees of freedom
## Residual deviance:  5899.4  on 22819  degrees of freedom
## AIC: 105166
## 
## Number of Fisher Scoring iterations: 4
anova (pmodel)
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: eduyears
## 
## Terms added sequentially (first to last)
## 
## 
##                                                                         Df
## NULL                                                                      
## lspline(sum_pop, c(1329880))                                             2
## lspline(sum_edu_region_year, c(37001))                                   2
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            4
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                    4
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))      4
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                4
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))  4
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))          4
##                                                                         Deviance
## NULL                                                                            
## lspline(sum_pop, c(1329880))                                                 0.0
## lspline(sum_edu_region_year, c(37001))                                   21027.5
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))             2729.6
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       51.2
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))        528.8
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                  601.3
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))    502.2
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))            782.2
##                                                                         Resid. Df
## NULL                                                                        22847
## lspline(sum_pop, c(1329880))                                                22845
## lspline(sum_edu_region_year, c(37001))                                      22843
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))               22839
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       22835
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))         22831
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                   22827
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))     22823
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))             22819
##                                                                         Resid. Dev
## NULL                                                                         32122
## lspline(sum_pop, c(1329880))                                                 32122
## lspline(sum_edu_region_year, c(37001))                                       11095
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))                 8365
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                         8314
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))           7785
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                     7184
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))       6682
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))               5899

Now let's see what we have found. Note that the models do not handle extrapolation well. I will plot all the models for comparison.
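
As a compact summary before the plots, the multi-class AUC values reported above can be collected side by side; a small sketch using the numbers from the outputs above:

# Multi-class AUCs reported earlier for the three models
tibble::tibble(
  model = c("linear (lm)", "mixed effects (lmer)", "Poisson (glm)"),
  auc   = c(0.8743, 0.8754, 0.8716)
)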

plot_model (model, type = "pred", terms = c("sum_pop"))

Figure 19: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("sum_pop"))

Figure 20: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("sum_pop"))

Figure 21: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("sum_edu_region_year"))

Figure 22: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("sum_edu_region_year"))

Figure 23: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("sum_edu_region_year"))

Figure 24: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_point(mapping = aes(x = sum_edu_region_year, y = eduyears)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education")

Figure 25: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("perc_women", "sum_pop"))

Figure 26: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("perc_women", "sum_pop"))

Figure 27: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("perc_women", "sum_pop"))

The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 - 2018

Figure 28: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

tbnum %>%    ggplot () +        geom_jitter (mapping = aes(x = perc_women, y = eduyears, colour = sum_pop)) +     labs(      x = "Percent women",      y = "Years of education"    )

The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 - 2018

Figure 29: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model(model, type = "pred", terms = c("year_n", "sum_pop"))

Figure 30: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model(mmodel, type = "pred", terms = c("year_n", "sum_pop"))

Figure 31: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model(pmodel, type = "pred", terms = c("year_n", "sum_pop"))

Figure 32: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_pop, y = eduyears, colour = year_n)) +
  labs(
    x = "Population in region",
    y = "Years of education"
  )

Figure 33: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model(model, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 34: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model(mmodel, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 35: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model(pmodel, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 36: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = sum_pop)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 37: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model(model, type = "pred", terms = c("year_n", "perc_women"))

Figure 38: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model(mmodel, type = "pred", terms = c("year_n", "perc_women"))

Figure 39: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model(pmodel, type = "pred", terms = c("year_n", "perc_women"))

Figure 40: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = perc_women, y = eduyears, colour = year_n)) +
  labs(
    x = "Percent women",
    y = "Years of education"
  )

Figure 41: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model(model, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 42: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model(mmodel, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 43: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model(pmodel, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 44: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = perc_women)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 45: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model(model, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 46: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model(mmodel, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 47: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model(pmodel, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 48: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = year_n)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 49: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden.



Rebalancing history

Posted: 19 Mar 2020 05:00 PM PDT

[This article was first published on R on OSM, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

Our last post on rebalancing struck an equivocal note. We ran a thousand simulations using historical averages across different rebalancing regimes to test whether rebalancing produced better absolute or risk-adjusted returns. The results suggested it did not. But we noted many problems with the tests—namely, unrealistic return distributions and correlation scenarios. We argued that if we used actual historical data and sampled from it, we might resolve many of these issues. But we also asked our readers whether it was worthwhile to test further. Based on the responses and page views, we believe the interest is there, so we'll proceed!

As we mentioned, historical data more closely approximates the fat-tailed, skewed distribution common to asset returns. But only if you have a long enough time series. While we weren't able to find 50 years' worth of major asset class returns, we were able to compile a 20-year series that includes two market downturns. The data isn't all from the same source, unfortunately. But it is fairly reputable—Vanguard's stock and US bond index funds, emerging market bond indices from the St. Louis Fed, and the S&P GSCI commodity index. The code will show how we aggregated it for those interested. Using this data series we should be able to test rebalancing more robustly.

Before we proceed, a brief word on methodology. To run the simulation, we need to sample (with replacement) from our twenty year period and combine each sample into an entire series. To capture the non-normal distribution and serial correlation of asset returns, we can't just sample one return, however. We need to sample a block of returns. This allows us to approximate the serial correlation of individual assets as well as the correlation between assets. But how long should the block be? Trying to answer that can get pretty complicated, pretty quickly.1 We decided to take a shortcut and use a simple block of 6 periods. This equates to six months, since our series is monthly returns. There's nothing magical about this number but it does feature as a period used in academic studies on momentum, a topic beyond the scope of this post.2

We sample six months of returns at a time and repeat that 42 times to build just over 20 years of data. We then repeat the whole process to create 1000 portfolios. From there we apply the different rebalancing regimes to each of the portfolios and aggregate the data. As before, we first use an equal weighting, and then a 60/35/5 weighting for stocks, bonds, and commodities. Let's see what we get.
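For readers who want the gist before the full code at the end of the post, here is a minimal sketch of the block-sampling step. It assumes `ret` is a matrix of monthly returns and, unlike the full block_samp() function in the appendix code, skips the wrap-around handling for blocks drawn near the end of the series:

# Stitch together `cycles` random blocks of `block` consecutive monthly
# returns (sampled with replacement) into one simulated return series
sample_series <- function(ret, block = 6, cycles = 42) {
  starts <- sample(nrow(ret) - block + 1, cycles, replace = TRUE)
  do.call(rbind, lapply(starts, function(s) ret[s:(s + block - 1), ]))
}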

First, we look at the average return for equal-weighted portfolios by rebalancing regime along with the range of outcomes.

Recall that the white line is the mean and the top and bottom of the boxes represent the middle 50% of outcomes. Interestingly, no rebalancing had far more positive and far fewer negative outliers (the red dots) than any of the rebalancing regimes.

Given where the averages line up, it doesn't look like there are significant differences. Let's run some t-tests for completeness.

Table 1: Aggregate p-values for simulation
Comparison P-value
None vs. Months 0.87
None vs. Quarters 0.87
None vs. Years 0.88
Months vs. Quarters 0.97
Months vs. Years 0.95
Quarters vs. Years 0.97

As expected, the p-values are quite high, meaning that any differences in mean returns are likely due to chance.

Now we'll check on the number of times each rebalancing strategy outperforms the others.

A dramatic result! No rebalancing beat the other strategies a majority of the time, and less frequent rebalancing outperformed more frequent rebalancing most of the time too. Now for the crux: does rebalancing lead to better risk-adjusted returns as calculated by the Sharpe ratio?
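Before the numbers, a quick note on calculation: the Sharpe ratios here are annualized from monthly returns, with the risk-free rate omitted, as in the code at the end of the post. For a single simulated portfolio `port` with one column of monthly returns per regime:

# Annualized Sharpe ratio per rebalancing regime
apply(port, 2, mean) / apply(port, 2, sd) * sqrt(12)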

Table 2: Sharpe ratios by rebalancing period
Period Ratio
None 0.76
Months 0.74
Quarters 0.75
Years 0.76

Not much difference. Recall that in our previous simulations, no rebalancing actually generated a slightly worse Sharpe ratio, by about 30-40 bps. But that result occurred in less than 90% of the simulations, so it could be due to randomness. Let's check the Sharpe ratios for the present simulation.

Table 3: Frequency of a better Sharpe ratio (%)
Periods Occurrence
None vs. Months 60.2
None vs. Quarters 53.4
None vs. Years 48.3
Months vs. Quarters 3.7
Months vs. Years 8.2
Quarters vs. Years 22.1

No rebalancing generates a better Sharpe ratio a majority of the time, but not often enough to conclude the result isn't due to chance. Interestingly, the frequency with which quarterly and yearly rebalancing produce better Sharpe ratios than monthly rebalancing looks significant: in both cases the frequency is greater than 90%. That lower-frequency rebalancing outperforms higher-frequency rebalancing likely plays a role in the significance of the Sharpe ratios, but that is an area of investigation we'll shelve for now.

Let's move to the next simulation where we weight the portfolios 60/35/5 for stocks, bonds, and commodities. First, we show the boxplot of mean returns and range of outcomes.

Like the equal-weighted simulations, the means don't look that dissimilar, and no rebalancing generates more positive and fewer negative outliers than the other rebalancing regimes. We can say with some confidence that the differences in average returns, if there are any, are likely due to chance. The p-values from the t-tests shown below should bear that out.

Table 4: Aggregate p-values for simulation
Comparison P-value
None vs. Months 0.91
None vs. Quarters 0.92
None vs. Years 0.92
Months vs. Quarters 0.98
Months vs. Years 0.97
Quarters vs. Years 0.98

Now let's calculate and present the frequency of outperformance by rebalancing strategy.

No rebalancing outperforms again! Less frequent rebalancing outperforms more frequent. And risk-adjusted returns?

Table 5: Sharpe ratios by rebalancing period
Period Ratio
None 0.66
Months 0.66
Quarters 0.67
Years 0.68

Here, no rebalancing performed slightly worse than rebalancing quarterly or yearly. How likely is it that these results are significant?

Table 6: Frequency of a better Sharpe ratio (%)
Periods Occurrence
None vs. Months 47.8
None vs. Quarters 40.9
None vs. Years 35.5
Months vs. Quarters 2.2
Months vs. Years 6.4
Quarters vs. Years 21.3

Slightly worse than 50/50 for no rebalancing. But less frequent rebalancing appears to have the potential to produce higher risk-adjusted returns than more frequent rebalancing.

Let's briefly sum up what we've discovered thus far. Rebalancing does not seem to produce better risk-adjusted returns. If we threw in taxation and slippage, we think rebalancing would likely be a significant underperformer most of the time.

Does this mean you should never rebalance? No. You should definitely rebalance if you've got a crystal ball. Barring that, if your risk-return parameters change, then you should rebalance. But that would not be bringing the weights back to their original targets; rather, it would be rebalancing to new targets. An entirely separate case.

What do we see as the biggest criticism of the foregoing analysis? That it was a straw man argument. In practice, few professional investors rebalance because it's July 29th or October 1st. Of course, there is quarter-end and year-end rebalancing, but those dates are usually coupled with a threshold. That is, only rebalance if the weights have exceeded some threshold, say five or ten percentage points from target. Analyzing the effects of only rebalancing based on thresholds would require more involved code on our part.3 Given the results thus far, we're not convinced that rebalancing based on the thresholds would produce meaningfully better risk-adjusted returns.
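That said, the threshold rule itself is simple enough to sketch; the involved part is wiring it into the full simulation machinery. A rough illustration of the logic (our own sketch with illustrative names, not from PerformanceAnalytics): drift the weights with each month's returns and reset to target only when a weight strays beyond the threshold.

# `ret_mat` is a matrix of monthly returns (one column per asset),
# `target` the target weights, `threshold` the allowed drift per asset
rebal_threshold <- function(ret_mat, target, threshold = 0.05) {
  wt <- target
  port_ret <- numeric(nrow(ret_mat))
  for (t in seq_len(nrow(ret_mat))) {
    port_ret[t] <- sum(wt * ret_mat[t, ])                # period return at current weights
    wt <- wt * (1 + ret_mat[t, ]) / (1 + port_ret[t])    # let weights drift with returns
    if (any(abs(wt - target) > threshold)) wt <- target  # rebalance only on a breach
  }
  port_ret
}

Even with such a rule, though, we doubt the picture on risk-adjusted returns would change much.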

However, rebalancing based on changes in risk-return constraints might do so. Modeling that would be difficult since we'd also have to model (or assume) new risk-return forecasts. But we could model the traditional shift recommended by financial advisors to clients as they age; that is, slowly shifting from high-risk to low-risk assets. In other words, lower the exposure to stocks and increase the exposure to bonds.

As a toy example, we use our data set to compare a no rebalancing strategy with an initial 60/35/5 split between stocks, bonds, and commodities to a yearly rebalancing strategy that starts at a 90/5/5 split and changes to a 40/60/0 split over the period. The Sharpe ratio for the rebalanced portfolio is actually a bit worse than the no rebalancing one. Mean returns are very close. Here's the graph of the cumulative return.

This is clearly only one example and highly time-dependent. But we see that rebalancing wasn't altogether different from not rebalancing, and we're not even including tax and slippage effects. To test this notion we'd have to run some more simulations, but that will be for another post.

We'll end this post with a question for our readers. Are you convinced rebalancing doesn't improve returns or do you think more analysis is required? Please send us your answer to nbw dot osm at gmail dot com. Until next time, here's all the code behind the simulations, analyses, and charts.

## Load packages
library(tidyquant)
library(tidyverse)

### Load data
## Stocks
symbols <- c("VTSMX", "VGTSX", "VBMFX", "VTIBX")
prices <- getSymbols(symbols, src = "yahoo",
                     from = "1990-01-01",
                     auto.assign = TRUE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(tolower(symbols))

## Bonds
# Source for bond indices:
# https://fred.stlouisfed.org/categories/32413

em_hg <- getSymbols("BAMLEMIBHGCRPITRIV",
                    src = "FRED",
                    from = "1990-01-01",
                    auto.assign = FALSE)

em_hg <- em_hg %>% na.locf()

em_hy <- getSymbols("BAMLEMHBHYCRPITRIV",
                    src = "FRED",
                    from = "1990-01-01",
                    auto.assign = FALSE)

em_hy <- em_hy %>% na.locf()

# Commodity data
# Source for commodity data:
# https://www.investing.com/indices/sp-gsci-commodity-total-return-historical-data
# Unfortunately, the data doesn't open into a separate link, so you'll need to download it into a
# csv file unless you're good at web scraping. We're not. Note too, the dates get a little funky
# when being transferred into the csv, so you'll need to clean that up. Finally, the dates are
# given as beginning of the month. But when we spot checked a few, they were actually end of the
# month, which lines up with the other data.

cmdty <- read_csv("sp_gsci.csv")
cmdty$Date <- as.Date(cmdty$Date, "%m/%d/%Y")
cmd_price <- cmdty %>%
  filter(Date >= "1998-12-01", Date <= "2019-12-31")

## Merged
merged <- merge(prices[, 1:3], em_hg, em_hy)
colnames(merged) <- c("us_stock", "intl_stock", "us_bond", "em_hg", "em_hy")
merged <- merged["1998-12-31/2019-12-31"] %>% na.locf()

merge_mon <- to.monthly(merged, indexAt = "lastof", OHLC = FALSE)
merge_mon$cmdty <- cmd_price$Price
merge_yr <- to.yearly(merge_mon, indexAt = "lastof", OHLC = FALSE)

merge_ret <- ROC(merge_mon, type = "discrete") %>% na.omit()
merge_ret_yr <- ROC(merge_yr, type = "discrete") %>% na.omit()

## Data frame
df <- data.frame(date = index(merge_ret), coredata(merge_ret))
df_yr <- data.frame(date = index(merge_ret_yr), coredata(merge_ret_yr))

### Block sampling
## Create function
block_samp <- function(dframe, block, cycles){

  idx <- seq(1, block*cycles, block)

  assets <- ncol(dframe)
  size <- block*cycles

  mat <- matrix(rep(0, assets*size), ncol = assets)

  for(i in 1:cycles){

    start <- sample(size, 1)

    # Blocks that would run past the end of the series wrap around to the start
    if(start <= (size - block + 1)){
      end <- start + block - 1
      len <- start:end
    }else if(start > (size - block + 1) & start < size){
      end <- size
      step <- block - (end - start) - 1
      if(step == 1){
        adder <- 1
      }else{
        adder <- 1:step
      }

      len <- c(start:end, adder)
    }else{
      adder <- 1:(block - 1)
      len <- c(start, adder)
    }

    mat[idx[i]:(idx[i] + block - 1), ] <- data.matrix(dframe[len, ])

  }

  mat
}

# Create 1000 samples
set.seed(123)
block_list <- list()
for(i in 1:1000){
  block_list[[i]] <- block_samp(df[, 2:7], 6, 42)
}

### Rebalancing on simulation
## Create function
rebal_func <- function(port, wt, ...){

  # Default to equal weights if none are supplied
  if(missing(wt)){
    wt <- rep(1/ncol(port), ncol(port))
  }

  port <- ts(port, start = c(1999, 1), frequency = 12)

  port_list <- list()
  rebals <- c("none", "months", "quarters", "years")

  for(pd in rebals){
    if(pd == "none"){
      port_list[[pd]] <- Return.portfolio(port, wt) %>%
        `colnames<-`(pd)
    }else{
      port_list[[pd]] <- Return.portfolio(port, wt, rebalance_on = pd) %>%
        `colnames<-`(pd)
    }
  }

  port_r <- port_list %>%
    bind_cols() %>%
    data.frame()

  port_r
}

## Run function on simulations
# Note this may take 10 minutes to run. We hope to figure out a way to speed this up in later
# versions.
rebal_test <- list()
for(i in 1:1000){
  rebal_test[[i]] <- rebal_func(block_list[[i]])
}

### Analyze results
## Average results
rebal_mean_df <- data.frame(none = rep(0, 1000),
                            monthly = rep(0, 1000),
                            quarterly = rep(0, 1000),
                            yearly = rep(0, 1000))
for(i in 1:1000){
  rebal_mean_df[i, ] <- colMeans(rebal_test[[i]]) %>% as.vector()
}

port_names <- c("None", "Months", "Quarters", "Years")

# Boxplot of results
rebal_mean_df %>%
  `colnames<-`(port_names) %>%
  gather(key, value) %>%
  mutate(key = factor(key, levels = port_names)) %>%
  ggplot(aes(key, value*1200)) +
  geom_boxplot(fill = "blue", color = "blue", outlier.colour = "red") +
  stat_summary(geom = "crossbar", width = 0.7, fatten = 0, color = "white",
               fun.data = function(x){ return(c(y = mean(x), ymin = mean(x), ymax = mean(x))) }) +
  labs(x = "",
       y = "Return (%)",
       title = "Range of mean annualized returns by rebalancing period")

## Find percentage of time one rebalancing period generates higher returns than another
# Create means comparison function
freq_comp <- function(df){
  count <- 1
  opf <- data.frame(comp = rep(0, 6), prob = rep(0, 6))
  port_names <- c("None", "Months", "Quarters", "Years")

  for(i in 1:4){
    for(j in 2:4){
      if(i < j & count < 7){
        opf[count, 1] <- paste(port_names[i], " vs. ", port_names[j])
        opf[count, 2] <- mean(df[, i]) > mean(df[, j])
        count <- count + 1
      }
    }
  }
  opf
}

# Aggregate function across simulations
prop_df <- matrix(rep(0, 6000), nrow = 1000)
for(i in 1:1000){
  prop_df[i, ] <- freq_comp(rebal_test[[i]])[, 2]
}

long_names <- c()
count <- 1
for(i in 1:4){
  for(j in 2:4){
    if(i < j & count < 7){
      long_names[count] <- paste(port_names[i], " vs. ", port_names[j])
      count <- count + 1
    }
  }
}

prop_df %>%
  data.frame() %>%
  summarize_all(mean) %>%
  `colnames<-`(long_names) %>%
  gather(key, value) %>%
  mutate(key = factor(key, levels = long_names)) %>%
  ggplot(aes(key, value*100)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "",
       y = "Frequency (%)",
       title = "Number of times one rebalancing strategy outperforms another") +
  geom_text(aes(label = value*100), nudge_y = 2.5)

## Run t-test
# Create function
t_test_func <- function(df){
  count <- 1
  t_tests <- c()

  for(i in 1:4){
    for(j in 2:4){
      if(i < j & count < 7){
        t_tests[count] <- t.test(df[, i], df[, j])$p.value
        count <- count + 1
      }
    }
  }

  t_tests
}

t_tests <- matrix(rep(0, 6000), ncol = 6)
for(i in 1:1000){
  t_tests[i, ] <- t_test_func(rebal_test[[i]])
}

t_tests <- t_tests %>%
  data.frame() %>%
  `colnames<-`(long_names)

t_tests %>%
  summarise_all(function(x) round(mean(x), 2)) %>%
  gather(Comparison, `P-value`) %>%
  knitr::kable(caption = "Aggregate p-values for simulation")

## Sharpe ratios
sharpe <- matrix(rep(0, 4000), ncol = 4)
for(i in 1:1000){
  sharpe[i, ] <- apply(rebal_test[[i]], 2, mean)/apply(rebal_test[[i]], 2, sd) * sqrt(12)
}

sharpe <- sharpe %>%
  data.frame() %>%
  `colnames<-`(port_names)

# Table
sharpe %>%
  summarise_all(mean) %>%
  gather(Period, Ratio) %>%
  mutate(Ratio = round(Ratio, 2)) %>%
  knitr::kable(caption = "Sharpe ratios by rebalancing period")

# Permutation test for sharpe: how often one regime's Sharpe beats another's
sharpe_t <- data.frame(Periods = names(t_tests), Occurrence = rep(0, 6))
count <- 1
for(i in 1:4){
  for(j in 2:4){
    if(i < j & count < 7){
      sharpe_t[count, 2] <- mean(sharpe[, i] > sharpe[, j])
      count <- count + 1
    }
  }
}

# Table
sharpe_t %>%
  mutate(Occurrence = round(Occurrence, 3)*100) %>%
  knitr::kable(caption = "Frequency of a better Sharpe ratio (%)")

## Run simulation pt 2
# This may take 10 minutes or so to run.
wt1 <- c(0.30, 0.30, 0.2, 0.075, 0.075, 0.05)
rebal_wt <- list()
for(i in 1:1000){
  rebal_wt[[i]] <- rebal_func(block_list[[i]], wt1)
}

## Average results
rebal_wt_mean_df <- data.frame(none = rep(0, 1000),
                               monthly = rep(0, 1000),
                               quarterly = rep(0, 1000),
                               yearly = rep(0, 1000))
for(i in 1:1000){
  rebal_wt_mean_df[i, ] <- colMeans(rebal_wt[[i]]) %>% as.vector()
}

# Boxplot
rebal_wt_mean_df %>%
  `colnames<-`(port_names) %>%
  gather(key, value) %>%
  mutate(key = factor(key, levels = port_names)) %>%
  ggplot(aes(key, value*1200)) +
  geom_boxplot(fill = "blue", color = "blue", outlier.colour = "red") +
  stat_summary(geom = "crossbar", width = 0.7, fatten = 0, color = "white",
               fun.data = function(x){ return(c(y = mean(x), ymin = mean(x), ymax = mean(x))) }) +
  labs(x = "",
       y = "Return (%)",
       title = "Range of mean annualized returns by rebalancing period")

## Find percentage of time one rebalancing period generates higher returns than another
# Aggregate function across simulations
prop_wt_df <- matrix(rep(0, 6000), nrow = 1000)
for(i in 1:1000){
  prop_wt_df[i, ] <- freq_comp(rebal_wt[[i]])[, 2]
}

prop_wt_df %>%
  data.frame() %>%
  summarize_all(mean) %>%
  `colnames<-`(long_names) %>%
  gather(key, value) %>%
  mutate(key = factor(key, levels = long_names)) %>%
  ggplot(aes(key, value*100)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "",
       y = "Frequency (%)",
       title = "Number of times one rebalancing strategy outperforms another") +
  geom_text(aes(label = value*100), nudge_y = 2.5)

## Run t-test
t_tests_wt <- matrix(rep(0, 6000), ncol = 6)
for(i in 1:1000){
  t_tests_wt[i, ] <- t_test_func(rebal_wt[[i]])
}

t_tests_wt <- t_tests_wt %>%
  data.frame() %>%
  `colnames<-`(long_names)

t_tests_wt %>%
  summarise_all(function(x) round(mean(x), 2)) %>%
  gather(Comparison, `P-value`) %>%
  knitr::kable(caption = "Aggregate p-values for simulation")

## Sharpe ratios
sharpe_wt <- matrix(rep(0, 4000), ncol = 4)
for(i in 1:1000){
  sharpe_wt[i, ] <- apply(rebal_wt[[i]], 2, mean)/apply(rebal_wt[[i]], 2, sd) * sqrt(12)
}

sharpe_wt <- sharpe_wt %>%
  data.frame() %>%
  `colnames<-`(port_names)

# Table
sharpe_wt %>%
  summarise_all(mean) %>%
  gather(Period, Ratio) %>%
  mutate(Ratio = round(Ratio, 2)) %>%
  knitr::kable(caption = "Sharpe ratios by rebalancing period")

# Permutation test for sharpe
sharpe_wt_t <- data.frame(Periods = names(t_tests_wt), Occurrence = rep(0, 6))
count <- 1
for(i in 1:4){
  for(j in 2:4){
    if(i < j & count < 7){
      sharpe_wt_t[count, 2] <- mean(sharpe_wt[, i] > sharpe_wt[, j])
      count <- count + 1
    }
  }
}

sharpe_wt_t %>%
  mutate(Occurrence = round(Occurrence, 3)*100) %>%
  knitr::kable(caption = "Frequency of a better Sharpe ratio (%)")

# Create weight change data frame: glide from a 90/5/5 to a 40/60/0 split
weights <- data.frame(us_stock = seq(.45, .2, -0.0125),
                      intl_stock = seq(.45, .2, -0.0125),
                      us_bond = seq(.025, .3, .01375),
                      em_hg = seq(0.0125, .15, .006875),
                      em_hy = seq(0.0125, .15, .006875),
                      cmdty = seq(.05, 0, -0.0025))

# Change into ts objects
yr_ts <- ts(df_yr[, 2:7], start = c(1999, 1), frequency = 1)
wts_ts <- ts(weights, start = c(1998, 1), frequency = 1)

# Run portfolio rebalancing
no_rebal <- Return.portfolio(yr_ts, wt1)
rebal <- Return.portfolio(yr_ts, wts_ts)

# Convert into data frame
rebal_yr <- data.frame(date = index(no_rebal), no_rebal = as.numeric(no_rebal),
                       rebal = as.numeric(rebal))

# Graph
rebal_yr %>%
  gather(key, value, -date) %>%
  group_by(key) %>%
  mutate(value = (cumprod(1 + value) - 1)*100) %>%
  ggplot(aes(date, value, color = key)) +
  geom_line() +
  scale_color_manual("", labels = c("No rebalancing", "Rebalancing"),
                     values = c("black", "blue")) +
  labs(x = "",
       y = "Return (%)",
       title = "Rebalancing or not?") +
  theme(legend.position = "top")

  1. We started experimenting with the cross-correlation function to see which lag had a higher or more significant correlation across the assets. But it became clear, quite quickly, that choosing a reasonable lag would take longer than the time we had allotted for this post. So we opted for the easy way out. Make a simplifying assumption! If anyone can point us to cross-correlation function studies that might apply to this scenario, please let us know.↩

  2. Please see this article for more detail.↩

  3. The PerformanceAnalytics package in R doesn't offer a rebalancing algorithm based on thresholds, unless we missed it.↩

To leave a comment for the author, please follow the link and comment on their blog: R on OSM.



COVID-19 Tracker: Days since N

Posted: 18 Mar 2020 05:00 PM PDT

[This article was first published on R Bloggers on , and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)

There's no shortage of dashboards and data visualizations covering some aspect of the ongoing coronavirus pandemic, but since I hadn't come across a tool that let me easily compare countries myself, I developed this COVID-19 Tracker Shiny app, both for my own use and to get more experience working with Shiny.

This app was inspired by a visualization produced by John Burn-Murdoch for the Financial Times (FT) that I thought did a very nice job of allowing cross-country comparisons of the trajectories of total confirmed cases by standardizing countries using the "Number of days since the 100th case" on the x-axis.

The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) maintains a dashboard whose data serve as the underlying source for the FT visualization, as well as for many others floating around the internet at the moment.

At the time of writing, the JHU CSSE dashboard does not allow for an easy way to select countries for direct comparison. The Shiny app presented here allows the user to select any of the country/region units available in the entire dataset, standardize them on the x-axis using "Days since N", and automatically generate fairly clean level and log plots with dynamically rendered titles and axis labels. The data in the app is timestamped and updated automatically along with the JHU CSSE repo, and there are download buttons for the plots and the filtered long-format data tables used for those plots, in PNG and CSV formats, respectively.
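As an aside for Shiny newcomers, download buttons like these are typically wired up with downloadHandler(). A minimal sketch with illustrative names (not the app's actual code), where current_plot() is assumed to be a reactive returning the displayed ggplot:

# ui fragment
downloadButton("download_plot", "Download plot (PNG)")

# server fragment: save the currently rendered ggplot to a PNG
output$download_plot <- downloadHandler(
  filename = function() paste0("covid_plot_", Sys.Date(), ".png"),
  content = function(file) ggplot2::ggsave(file, plot = current_plot(), device = "png")
)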

Currently, a maximum of six countries can be compared at a time. The limit was set simply to allow for better readability of the resulting plots. Users can select between total confirmed cases, deaths, and total recovered as the different y-axis outcome variables.

The default N number for the total confirmed cases outcome is set to 100, in keeping with the most widely used convention at the moment. For deaths, N=10 can be used.
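The "days since N" standardization itself is straightforward. A minimal dplyr sketch, with illustrative column names rather than the app's actual schema:

library(dplyr)

# For each country, keep days at or above the threshold and count the
# days elapsed since the threshold was first crossed
days_since_n <- function(df, n = 100) {
  df %>%
    group_by(country) %>%
    filter(cases >= n) %>%
    mutate(days_since_n = as.numeric(date - min(date))) %>%
    ungroup()
}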

There are a few countries that include more detailed regional breakdowns of the data. Where this is the case, the totals for those countries are given by the country name + "(all territories)".

Additional features and edits will be added on an ongoing basis. Feedback or comments are welcome.

To leave a comment for the author, please follow the link and comment on their blog: R Bloggers on .


