[R-bloggers] Setting up R with Visual Studio Code quickly and easily with the languageserversetup package (and 8 more aRticles)
- Setting up R with Visual Studio Code quickly and easily with the languageserversetup package
- Working with Statistics Canada Data in R, Part 5: Retrieving Census Data
- Impact of a country’s age breakdown on COVID-19 case fatality rate by @ellis2013nz
- Modeling Pandemics (3)
- Modeling pandemics (1)
- Apply for the Google Summer of Code and Help Us Improve The R Code Optimizer
- The significance of population size, year, and per cent women on the education level in Sweden
- Rebalancing history
- COVID-19 Tracker: Days since N
Setting up R with Visual Studio Code quickly and easily with the languageserversetup package

Posted: 21 Mar 2020 05:00 AM PDT [This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers].

Introduction

Over the past years, R has been gaining popularity, bringing new tools to life with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code extensions for R, the most popular development environment amongst developers across the world now has very strong support for R as well.
Contents

Visual Studio Code and R

According to the 2019 StackOverflow developer survey, Visual Studio Code is the most popular development environment across the board, with amazing support for many languages and extensions ranging from improved code editing to advanced version control support and Docker integration. Until recently, the support for R in Visual Studio Code was in my view not comprehensive enough to justify switching from other tools such as RStudio (Server) to using VS Code exclusively. This has changed with the work done by the team implementing the following three tools:
If you are interested in more steps around the setup and the overview of features, I recommend the Writing R in VSCode: A Fresh Start blogpost by Kun Ren. I also recommend that you follow Kun on Twitter if you are interested in the latest developments.

Setup considerations, issues, and tweaks: creating the …
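The package itself aims to make the setup a near one-liner. A minimal sketch of what that looks like, assuming the function names documented for languageserversetup (treat them as assumptions rather than a verified walkthrough of the current release):

```r
# Install the helper package from CRAN
install.packages("languageserversetup")

# Install the languageserver package and all its dependencies into a
# self-contained library, so it does not touch your regular package library
languageserversetup::languageserver_install()

# Add a startup hook to ~/.Rprofile so editors such as VS Code can
# launch the language server automatically
languageserversetup::languageserver_add_to_rprofile()
```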
Working with Statistics Canada Data in R, Part 5: Retrieving Census Data

Posted: 20 Mar 2020 09:53 PM PDT [This article was first published on Data Enthusiast's Blog, and kindly contributed to R-bloggers]. Back to Working with Statistics Canada Data in R, Part 4.

Introduction

Now that we are ready to start working with Canadian Census data, let's first briefly address the question of why you may need to use it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered every five years on a limited number of questions. The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country's largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you'll have to use the Census. To illustrate the use of the cancensus package, let's do a small research project. First, in this post we'll retrieve these key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada: the employment, unemployment, and labor force participation rates, the number of people working full-time and part-time by gender, and the education levels of people aged 25 to 64 by gender.
The cities (metropolitan areas) that we are going to look at are: Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We'll also get these data for Canada as a whole, for comparison and to illustrate the retrieval of data at different geographic levels. Next, in the upcoming Part 6 of the "Working with Statistics Canada Data in R" series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks. Keep in mind that cancensus also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp and sf formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in this series. So, let's get started by loading the required packages:

library(cancensus)
library(tidyverse)

Searching for Data

cancensus retrieves census data with the get_census function. get_census can take a number of arguments, the most important of which are dataset, regions, and vectors, which have no defaults. Thus, in order to be able to retrieve census data, you'll first need to figure out the dataset, the regions, and the census vectors (variables) you are looking for.
Find Census Datasets

Let's see which census datasets are available through the CensusMapper API:

list_census_datasets()

Currently, datasets earlier than 2001 are not available, so if you need to work with the 20th century census data, you won't be able to retrieve it with cancensus.

Find Census Regions

Next, let's find the regions that we'll be getting the data for. To search for census regions, use the search_census_regions function. Let's take a look at what region search returns for Toronto. Note that cancensus functions return their output as dataframes, so it is easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)] to see all of it.

# all census levels
search_census_regions(searchterm = "Toronto", dataset = "CA16")[c(1:5, 8)]

Returns:

# A tibble: 3 x 6
  region  name    level pop     municipal_status PR_UID
1 35535   Toronto CMA   5928040 B                35
2 3520    Toronto CD    2731571 CDR              35
3 3520005 Toronto CSD   2731571 C                35

You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column 'level' for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the 'PR_UID' column that contains numeric codes for Canada's provinces and territories, which can help you distinguish between different census regions that have the same or similar names. For an example, run the code above replacing "Toronto" with "Windsor". Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level argument, which can take the following values: 'C' for Canada (national level), 'PR' for province, 'CMA' for census metropolitan area, 'CD' for census division, 'CSD' for census sub-division, or NA:

# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")

Let's now list census regions that may be relevant for our project:

# explore available census regions
names <- c("Canada", "Calgary", "Halifax", "Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))

The purrr::map_df function applies search_census_regions iteratively to each element of the names vector and returns the output as a single dataframe. Note also the ~ . syntax. Think of it as the tilde taking names and passing it as an argument to the place indicated by the dot in the search_census_regions function. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.
So as you can see, there are multiple entries for each search term, so we'll need to choose the results for census metropolitan areas, and for the census sub-division in the case of Whitehorse, since Whitehorse is too small to be considered a census metropolitan area:

# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>%
  filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) & grepl("CMA", level) |
         grepl("Canada|Whitehorse$", name)) %>%
  as_census_region_list()

Pay attention to the use of logical operators to filter the output by several conditions at once; also note the use of the $ regex meta-character to choose the entry ending with 'Whitehorse' from the 'name' column (to filter out 'Whitehorse, Unorganized'). Finally, as_census_region_list converts the list_census_regions output to a data object of type list that can be passed to the get_census function as its regions argument.

Find Census Vectors

Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census function. cancensus has two functions that allow you to search through census data variables: list_census_vectors and search_census_vectors. list_census_vectors returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:

# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))

# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))

As you can see, there are 6623 (as of the time of writing) variables in the 2016 census dataset, so list_census_vectors won't be the most convenient function to find a specific vector. Note however that there are situations (such as when you need to select a lot of vectors at once) in which list_census_vectors would be appropriate. Usually it is more convenient to use search_census_vectors to search for vectors. Just pass the text string of what you are looking for as the searchterm argument. You don't have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term. Let's now find census data vectors for labor force involvement rates:

# get census data vectors for labor force involvement rates
lf_vectors <- search_census_vectors(searchterm = "employment rate", dataset = "CA16") %>%
  union(search_census_vectors("participation rate", "CA16")) %>%
  filter(type == "Total") %>%
  pull(vector)

Let's take a look at what this code does. Since searchterm doesn't have to be a precise match, the "employment rate" search term retrieves unemployment rate vectors too. In the next line, union merges the dataframes returned by search_census_vectors into a single dataframe. Note that in this case union could be substituted with bind_rows. I recommend using union in order to avoid data duplication. Next, we choose only the "Total" numbers, since we are not going to plot labor force indicators by gender. Finally, the pull command extracts a single vector from the dataframe, just like the $ subsetting operator: we need 'lf_vectors' to be a data object of type vector in order to pass it to the vectors argument of the get_census function. The second labor force indicator we are looking for is the number of people who work full-time and part-time, broken down by gender.
But before we proceed with getting the respective vectors, let me show you another way to figure out the search terms to put inside the search_census_vectors function: use Statistics Canada's online Census Profile tool. It can be used to quickly explore census data as well as to figure out variable names (search terms) and their hierarchical structure. For example, let's look at census labor data for the Calgary metropolitan area. Scrolling down, you will quickly find the numbers and text labels for full-time and part-time workers.

Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data. Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no "Total" values here), we create a dataframe that assigns respective labels to each vector number. This work_vectors_labels dataframe will supply categorical labels to be attached to the data retrieved with get_census. Also, note these three lines: the first mutate call removes all text up to and including ', ' and 'and/or ' (spaces included) from the 'label' column; the second drops the unused factor level "Total" – it is a good practice to make sure there are no unused factor levels if you are going to use ggplot2 to plot your data; finally, setNames renames variables for convenience.

Finally, let's retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I'd like to draw your attention to the fact that some of the census data is hierarchical, which means that some variables (census vectors) are included in parent and/or include child variables. It is very important to choose vectors at the proper hierarchical levels so that you do not double-count or omit your data. Education data is a good example of hierarchical data. You can explore data hierarchy using the parent_census_vectors and child_census_vectors functions as described here. However, you may find exploring the hierarchy visually using the Statistics Canada Census Profile tool to be more convenient.

So, let's now retrieve and label the education data vectors. Note the slice function that allows you to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice to drop the hierarchical levels from the data that are either too generalized or too granular. Note also that I had to edit text strings in the data. Finally, I added asterisks after "Doctorate" and "health". These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on. Now that we have figured out our dataset, regions, and data vectors (and labelled the vectors, too), we are finally ready to retrieve the data itself.

Retrieve Census Data

To retrieve census data, feed the dataset, regions, and data vectors into get_census as its respective arguments. Note also that get_census has a use_cache argument (set to TRUE by default), which tells get_census to retrieve data from the cache if available. If there is no cached data, the function will query the CensusMapper API for the data and will save it in the cache, while use_cache = FALSE will force get_census to query the API and update the cache.
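The retrieval code itself is not reproduced in this digest. A rough sketch of what the labor force retrieval and reshaping step explained below might look like (the selected column names and their order are assumptions, and a CensusMapper API key is required):

```r
# options(cancensus.api_key = "your_api_key")  # needed before querying the API

lf_data <- get_census(dataset = "CA16",
                      regions = regions,         # region list built above
                      vectors = lf_vectors) %>%  # labor force vectors found above
  select(`Region Name`, contains("rate")) %>%
  setNames(c("region", "employment rate", "unemployment rate", "participation rate")) %>%
  mutate(region = str_remove(region, " \\(B\\)| \\(CY\\)")) %>%
  pivot_longer(-region, names_to = "indicator", values_to = "rate") %>%
  mutate_if(is.character, as.factor)
```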
The select call drops columns with irrelevant data; setNames renames columns to remove vector numbers from variable names, which will then be converted to values in the 'indicator' column; str_remove inside the mutate call drops the municipal status codes '(B)' and '(CY)' from region names; finally, mutate_if converts characters to factors for subsequent plotting. An important function here is tidyr::pivot_longer. It converts the dataframe from wide to long format. It takes three columns: 'employment rate', 'unemployment rate', and 'participation rate', and converts their names into values of the 'indicator' variable, while their numeric values are passed to the 'rate' variable. The reason for the conversion is that we are going to plot the data for all three labor force indicators in the same graphic, which makes it necessary to store the indicators as a single factor variable. Next, let's retrieve census data about the percentage of full-time vs part-time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender. Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join function that joins labels to the census data. We now have the data and are ready to visualize it, which will be done in the next post.

Annex: Notes and Definitions

For those of you who are outside of Canada, Canada's geographic regions and their largest metropolitan areas are: the Atlantic provinces (Halifax), Central Canada (Toronto), the Prairies (Calgary), the West Coast (Vancouver), and the North (Whitehorse).
These regions should not be confused with the 10 provinces and 3 territories, which are Canada's sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada's geographic regions and territorial structure here (pages 44 to 51). For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada's Guide to the Labour Force Survey. You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts. The post Working with Statistics Canada Data in R, Part 5: Retrieving Census Data appeared first on Data Enthusiast's Blog.
Impact of a country's age breakdown on COVID-19 case fatality rate by @ellis2013nz

Posted: 20 Mar 2020 06:00 AM PDT [This article was first published on free range statistics - R, and kindly contributed to R-bloggers].

Italy is routinely and correctly described as particularly vulnerable to COVID-19 because of its older age profile. I set out to understand for myself how important this factor is. What would happen if the case fatality rates observed in Italy were applied to the demographic profiles of other countries?

Fatality rates by age and sex so far in Italy

The Istituto Superiore di Sanità, Roma is publishing regular bulletins with the latest data on COVID-19 cases in Italy. I used the 19 March 2020 version. These are the observed case fatality rates for around 35,000 cases to that point. It's worth pointing out that the snapshots presented in these bulletins change fast, including the raw fatality rate (for both sexes), which has increased from 5.8% seven days earlier to 8.5% on 19 March. Further rapid change is to be expected, remembering that deaths lag the beginning of the illness by days or weeks, and diagnoses lag infections by days (symptoms start on average around five days after exposure). It's also worth pointing out how much worse this disease seems to be for men. Of the deaths in Italy at the time of this bulletin, 71% were men. Of diagnosed cases, 59% were male (more than 200 Italian boys have the illness but none had died at the time of the bulletin). There were more male fatalities aged 80 and over than female fatalities of all ages. Also, it's worth pointing out that while the disease is definitely worse for older people, fatality rates are pretty bad for middle-aged people – about 1% for those between 30 and 59. That's bad for a disease expecting as many cases as this one.

Population profiles in selected countries

I took population breakdowns by age and sex from the United Nations' World Population Prospects. To illustrate, I chose nine countries representing a range of cultural and economic situations. I've chosen to present these as density charts, not population pyramids (which I find difficult to make comparisons with). We can readily see the contrast between Italy and (for an extreme example) economically poor Timor Leste.

Applying fatality rates to population profiles

It's straightforward to take a country's population and apply the Italian case fatality rates to it to get a weighted average fatality rate. In effect, this tells us what the fatality rate would be in a country if the Italian rates applied to its whole population, or to a subpopulation that was representative of the overall age and sex balance. Here's what we get for our nine 'countries' (including the World aggregate). Two things stand out. First, the different demographics of the different countries make a huge difference. On these sorts of age-based rates, Italy can expect twice the fatality rate of China (and nearly five times that of Timor Leste). Second, the death rate for Italy from this method is much lower than the actual fatality rate in the 19 March bulletin – 3.9% compared to 8.5%. This isn't a mistake – it comes about because the profile of Italians diagnosed with COVID-19 is older and more male than Italians in general.
As I note on the graphic, this could be due to women and younger people of either sex being less likely to be diagnosed given they have the disease; or it might mean they are less likely to have the disease at all. There is no way to tell with this data. We can adjust the fatality rates by scaling them up to match Italy's 19 March observed level. This gives a more realistic but still very rough answer to the question "what would Italy's case fatality rates mean, translated to other countries?" It's very rough because doing this assumes away a whole bunch of possible complexities and interactions between variables, but it's probably as thorough a method as is warranted at the moment with the fast-changing data. Here are those scaled results.

What does it all mean?

Well, the danger to people over 50, particularly but not only men, is very, very real from this disease. And the age profiles of countries vary enough for this to make big differences to the overall impact. But regardless of this, the necessary actions are clear. Work hard to avoid getting this horrible disease and to avoid passing it on. Work to help others do the same, and pull together to manage society through some difficult months ahead. Wash your hands and practice social distancing. Here's the code behind those charts. The Italian data is just entered by hand because it's only 20 numbers, not worth trying to automate.
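The full code is on the original blog. As a minimal illustration of the weighted-average calculation described above, a sketch along these lines (with placeholder numbers, not the bulletin's figures) would be:

```r
library(dplyr)

# Apply age-specific case fatality rates to a population's age structure to get
# a demographically weighted average rate. All numbers below are placeholders
# for illustration only, NOT the figures from the Italian bulletin.
cfr_by_age <- tibble(
  age_group = c("0-29", "30-59", "60-79", "80+"),
  cfr       = c(0.000, 0.010, 0.080, 0.200)
)

population <- tibble(
  age_group = c("0-29", "30-59", "60-79", "80+"),
  pop       = c(18e6, 25e6, 13e6, 4e6)
)

population %>%
  inner_join(cfr_by_age, by = "age_group") %>%
  summarise(weighted_cfr = sum(cfr * pop) / sum(pop))
```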
Modeling Pandemics (3)

Posted: 20 Mar 2020 04:41 AM PDT [This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].

In Statistical Inference in a Stochastic Epidemic SEIR Model with Control Intervention, a more complex model than the one we've seen yesterday was considered (and is called the SEIR model). Consider a population of size N, and assume that S is the number of susceptible, E the number of exposed, I the number of infectious, and R the number of recovered (or immune) individuals:

\displaystyle{\begin{aligned}{\frac {dS}{dt}}&=-\beta {\frac {I}{N}}S\\[8pt]{\frac {dE}{dt}}&=\beta {\frac {I}{N}}S-aE\\[8pt]{\frac {dI}{dt}}&=aE-b I\\[8pt]{\frac {dR}{dt}}&=b I\end{aligned}}

Between S and E, the transition rate is \beta I/N, where \beta is the average number of contacts per person per unit of time, multiplied by the probability of disease transmission in a contact between a susceptible and an infectious subject. Between I and R, the transition rate is b (simply the rate of recovered or dead, that is, the number of recovered or dead during a period of time divided by the total number of infected over that same period of time). And finally, the incubation period is a random variable with exponential distribution with parameter a, so that the average incubation period is a^{-1}. Probably more interesting, Understanding the dynamics of ebola epidemics suggested a more complex model, with susceptible people S, exposed E, infectious people either in the community I or in hospitals H, some people who died F, and finally those who either recover or are buried and therefore are no longer susceptible R.
If \boldsymbol{Z}=(S,E,I,H,F,R), we can write \frac{\partial \boldsymbol{Z}}{\partial t} = SEIHFR(\boldsymbol{Z}), where SEIHFR is the gradient function of the model.
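The original post defines this gradient in R. As an illustration of the approach, here is a sketch for the simpler SEIR system written out above; the full SEIHFR version follows the same pattern, with additional compartments (H, F) and transition parameters (the parameter names here are assumptions, not the post's code):

```r
library(deSolve)

# Gradient of the SEIR system above, in the form expected by deSolve::ode():
# a function of time, the state vector and the parameters, returning a list
# with the derivative of each compartment.
seir <- function(t, y, p) {
  with(as.list(c(y, p)), {
    N  <- S + E + I + R
    dS <- -beta * I / N * S
    dE <-  beta * I / N * S - a * E
    dI <-  a * E - b * I
    dR <-  b * I
    list(c(dS, dE, dI, dR))
  })
}
```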
We can solve it, or at least study the dynamics from some starting values
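A sketch of that step with deSolve, using illustrative starting values and parameters (assumptions, not the values from the paper or the post):

```r
times <- seq(0, 365, by = 1)                 # one year, daily steps
init  <- c(S = 999999, E = 0, I = 1, R = 0)  # one infectious case in a million
parms <- c(beta = 0.4, a = 1/6, b = 1/10)    # contact, incubation and removal rates

out <- as.data.frame(ode(y = init, times = times, func = seir, parms = parms))
head(out)
```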
For instance, the proportion of people infected is the following
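Continuing the sketch above, that curve could be drawn as:

```r
# Proportion of the population currently infectious over time
plot(out$time, out$I / sum(init), type = "l",
     xlab = "time (days)", ylab = "proportion infected")
```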
Modeling pandemics (1)

Posted: 19 Mar 2020 06:48 PM PDT [This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].

The most popular model to model epidemics is the so-called SIR model – or Kermack-McKendrick. Consider a population of size N, and assume that S is the number of susceptible, I the number of infectious, and R the number of recovered (or immune) individuals:

\displaystyle {\begin{aligned}&{\frac {dS}{dt}}=-{\frac {\beta IS}{N}},\\[6pt]&{\frac {dI}{dt}}={\frac {\beta IS}{N}}-\gamma I,\\[6pt]&{\frac {dR}{dt}}=\gamma I,\end{aligned}}

so that \displaystyle{{\frac{dS}{dt}}+{\frac {dI}{dt}}+{\frac {dR}{dt}}=0}, which implies that S+I+R=N. In order to be more realistic, consider some (constant) birth rate \mu, so that the model becomes

\displaystyle {\begin{aligned}&{\frac {dS}{dt}}=\mu(N-S)-{\frac {\beta IS}{N}},\\[6pt]&{\frac {dI}{dt}}={\frac {\beta IS}{N}}-(\gamma+\mu) I,\\[6pt]&{\frac {dR}{dt}}=\gamma I-\mu R,\end{aligned}}

Note, in this model, that people get sick (infected) but they do not die, they recover. So here, we can model chickenpox, for instance, not SARS. The dynamics of the infectious class depends on the following ratio: \displaystyle{R_{0}={\frac {\beta }{\gamma +\mu}}}, which is the so-called basic reproduction number (or reproductive ratio). The effective reproductive ratio is R_0S/N, and the turnover of the epidemic happens exactly when R_0S/N=1, or when the fraction of remaining susceptibles is R_0^{-1}. As shown in Directly transmitted infectious diseases: Control by vaccination, if S/N<R_0^{-1} the disease (the number of people infected) will start to decrease. Want to see it? Start with some parameter values – here, R_0=4. We also need starting values.
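The original parameter and starting-value code blocks did not survive in this digest; a sketch with illustrative values consistent with R_0=4 (these exact numbers are assumptions) would be:

```r
library(deSolve)

# Parameters: with mu = 0, R_0 = beta / gamma = 2 / (1/2) = 4
parms <- c(beta = 2, gamma = 1/2, mu = 0)

# Starting values: population normalised to 1, with a tiny infectious seed
init <- c(S = 1 - 1e-6, I = 1e-6, R = 0)
```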
Then use the ordinary differential equation solver in R. The idea is to say that \boldsymbol{Z}=(S,I,R) and we have the gradient \frac{\partial \boldsymbol{Z}}{\partial t} = SIR(\boldsymbol{Z}), where SIR is a function of the various parameters. Hence, we set things up as follows.
Then we must define the time grid, and the function that returns the gradient:
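A sketch of those two pieces (continuing from the parameter block above):

```r
times <- seq(0, 50, by = 0.1)   # time grid (units are arbitrary in this sketch)

# Gradient of the SIR system, in the form expected by deSolve::ode()
sir <- function(t, y, p) {
  with(as.list(c(y, p)), {
    N  <- S + I + R
    dS <- -beta * I * S / N
    dI <-  beta * I * S / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}
```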
To solve this problem use
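With deSolve, the call would look something like this (a sketch, not the post's exact code):

```r
# Integrate the system over the time grid and keep the result as a data frame
resol <- as.data.frame(ode(y = init, times = times, func = sir, parms = parms))
head(resol)
```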
We can visualize the dynamics below
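For example, with base graphics (an illustrative sketch, not the post's figure code):

```r
# Plot the three compartments through time
matplot(resol$time, resol[, c("S", "I", "R")], type = "l", lty = 1,
        xlab = "time", ylab = "proportion of the population")
legend("right", legend = c("S", "I", "R"), col = 1:3, lty = 1)
```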
We can actually also visualize the effective reproductive number R_0S/N:
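A sketch of that two-panel figure (again illustrative):

```r
R0 <- parms[["beta"]] / (parms[["gamma"]] + parms[["mu"]])
N  <- sum(init)

par(mfrow = c(1, 2))
plot(resol$time, R0 * resol$S / N, type = "l",
     xlab = "time", ylab = "effective reproductive number")
abline(h = 1, lty = 2)   # turnover of the epidemic
plot(resol$time, resol$I, type = "l", xlab = "time", ylab = "infected")
par(mfrow = c(1, 1))
```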
The effective reproductive number is on the left, and as we mentioned above, when we reach 1, we actually reach the maximum of the infected,
And when adding a \mu parameter, we can obtain some interesting dynamics on the number of infected,
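A sketch of the model with the birth rate added, following the second set of equations above (the value of \mu here is an illustrative assumption):

```r
sir_mu <- function(t, y, p) {
  with(as.list(c(y, p)), {
    N  <- S + I + R
    dS <- mu * (N - S) - beta * I * S / N
    dI <- beta * I * S / N - (gamma + mu) * I
    dR <- gamma * I - mu * R
    list(c(dS, dI, dR))
  })
}

parms_mu <- c(beta = 2, gamma = 1/2, mu = 1/100)   # illustrative birth rate
resol_mu <- as.data.frame(ode(y = init, times = seq(0, 500, by = 0.5),
                              func = sir_mu, parms = parms_mu))
plot(resol_mu$time, resol_mu$I, type = "l", xlab = "time", ylab = "infected")
```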
Apply for the Google Summer of Code and Help Us Improve The R Code Optimizer

Posted: 19 Mar 2020 05:00 PM PDT [This article was first published on Pachá, and kindly contributed to R-bloggers].

Are you a BSc, MSc or PhD student, and would you like to spend this summer (winter down south) contributing to an open-source R project? Then you will be interested to know that with @CancuCS we are proposing a Google Summer of Code project this year. Take a look at The R Code Optimizer to learn more and apply.
The significance of population size, year, and per cent women on the education level in Sweden

Posted: 19 Mar 2020 05:00 PM PDT [This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers].

In twelve posts I have analysed how different factors are related to salaries in Sweden with data from Statistics Sweden. In this post, I will analyse a new dataset from Statistics Sweden: population by region, age, level of education, sex and year. Not knowing exactly what to find, I will use a criterion-based procedure to find the model that minimises the AIC. Then I will perform some tests to see how robust the model is. Finally, I will plot the findings. First, define libraries and functions. The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without a header, UF0506A1.csv, from http://www.statistikdatabasen.scb.se/pxweb/en/ssd/. I will calculate the percentage of women for the different education levels in the different regions for each year. In my later analysis I will use the number of people in each education level, region and year. The table: Population 16-74 years of age by region, highest level of education, age and sex. Year 1985 – 2018; NUTS 2 level 2008-; 10-year intervals (16-74).
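The data preparation code is not shown in this digest; a rough sketch of the step described above might look like the following (the column layout of UF0506A1.csv and the value labels are assumptions):

```r
library(tidyverse)

# Read the headerless extract and compute the percentage of women per
# region, education level and year (column names are assumed here)
tb <- read_csv("UF0506A1.csv",
               col_names = c("region", "age", "level", "sex", "year", "population"))

perc_women <- tb %>%
  group_by(region, level, year) %>%
  summarise(perc_women = sum(population[sex == "women"]) / sum(population),
            .groups = "drop")
```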
Figure 1: Population by region, level of education, percent women and year, Year 1985 – 2018

In a previous post, I approximated the number of years of education for every education level. Since this approximation is significant for the rest of the analysis, I will see if I can do a better approximation. I use Multivariate Adaptive Regression Splines (MARS) to find the permutation of years of education, within the given limitations, which gives the highest adjusted R-Squared value. I chose not to calculate combinations beyond the ages of 7 to 19 because I assessed it would take too much time. From the table, we can see that the R-Squared only gains from a higher education year for post-graduate education. A manual test shows that setting years of education to 22 gives a higher R-Squared without getting high residuals.
Let's investigate the shape of the function for the response and predictors. The shape of the predictors has a great impact on the rest of the analysis. I use acepack to fit a model and plot both the response and the predictors.

Figures 2–6: Plots of the response and predictors using acepack

I use MARS to fit hockey-stick functions for the predictors. I do not wish to overfit by using a better approximation at this point. I will include interactions of degree two (see the sketch after this section).

Figures 7–8: Hockey-stick functions fit with MARS for the predictors, Year 1985 – 2018

I will use regsubsets to find the model which minimises the AIC. I will also calculate the Receiver Operating Characteristic (ROC) for the model I find for each level of years of education.

Figures 9–12: Find the model that minimises the AIC, Year 1985 – 2018

There are a few things I would like to investigate to improve the credibility of the analysis. First, the study is a longitudinal study. A great proportion of people is measured each year, and the majority of the people in a region remain in the region from year to year. I will assume that each birthyear and each region is a group, set them as random effects, and set the rest of the predictors as fixed effects. I use the mean age in each age group as the year of birth.

Figures 13–14: Diagnostic plots of the model with random effects components

Another problem could be that the response variable is limited in its range. To get more insight into this issue we could model with Poisson regression.

Figures 15–18: Diagnostic plots of Poisson regression

Now let's see what we have found. Note that the models do not handle extrapolation well. I will plot all the models for comparison.
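None of the model code survives in this digest. As a rough sketch of the three specifications described above (MARS with degree-two interactions, the mixed model with region and birthyear as random effects, and the Poisson alternative), using variable names that are assumptions about the prepared data frame:

```r
library(earth)
library(lme4)

# MARS with two-way interactions (hockey-stick basis functions)
mars_fit <- earth(eduyears ~ year + population + perc_women,
                  data = tb_model, degree = 2)

# Mixed model: region and birthyear as random intercepts, the rest fixed
lmm_fit <- lmer(eduyears ~ year + population + perc_women +
                  (1 | region) + (1 | birthyear),
                data = tb_model)

# Poisson regression as a check on the bounded response
pois_fit <- glm(eduyears ~ year + population + perc_women,
                family = poisson, data = tb_model)

summary(mars_fit); summary(lmm_fit); summary(pois_fit)
```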
Figures 19–49: Plots of the significance of the population in the region, the number of persons with the same level of education, per cent women, year, and their pairwise interactions on the level of education, Year 1985 – 2018.
Rebalancing history

Posted: 19 Mar 2020 05:00 PM PDT [This article was first published on R on OSM, and kindly contributed to R-bloggers].

Our last post on rebalancing struck an equivocal note. We ran a thousand simulations using historical averages across different rebalancing regimes to test whether rebalancing produced better absolute or risk-adjusted returns. The results suggested it did not. But we noted many problems with the tests—namely, unrealistic return distributions and correlation scenarios. We argued that if we used actual historical data and sampled from it, we might resolve many of these issues. But we also asked our readers whether it was worthwhile to test further. Based on the responses and page views, we believe the interest is there, so we'll proceed! As we mentioned, historical data more closely approximates the fat-tailed, skewed distribution common to asset returns. But only if you have a long enough time series. While we weren't able to find 50 years worth of major asset class returns, we were able to compile a 20-year series that includes two market downturns. The data isn't all from the same source, unfortunately. But it is fairly reputable—Vanguard's stock and US bond index funds, emerging market bond indices from the St. Louis Fed, and the S&P GSCI commodity index. The code will show how we aggregated it for those interested. Using this data series we should be able to test rebalancing more robustly. Before we proceed, a brief word on methodology. To run the simulation, we need to sample (with replacement) from our twenty-year period and combine each sample into an entire series. To capture the non-normal distribution and serial correlation of asset returns, we can't just sample one return, however. We need to sample a block of returns. This allows us to approximate the serial correlation of individual assets as well as the correlation between assets. But how long should the block be? Trying to answer that can get pretty complicated, pretty quickly.1 We decided to take a shortcut and use a simple block of 6 periods. This equates to six months, since our series is monthly returns. There's nothing magical about this number but it does feature as a period used in academic studies on momentum, a topic beyond the scope of this post.2 We sample six months of returns at a time. Repeat 42 times to get over 20 years of data. Repeat to create 1000 portfolios. From there we apply the different rebalancing regimes on each of the portfolios and then aggregate the data. As before, we first use an equal weighting, and then a 60/35/5 weighting for stocks, bonds, and commodities. Let's see what we get. First, we look at the average return for equal-weighted portfolios by rebalancing regime, along with the range of outcomes. Recall, the white line is the mean and the top and bottom of the boxes represent the middle 50% of outcomes. Interestingly, no rebalancing had far more positive and far less negative outliers (the red dots) than any of the rebalancing regimes. Given where the averages line up, it doesn't look like there are significant differences. Let's run some t-tests for completeness.
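The simulation code is in the original post; a minimal sketch of the block-sampling scheme described above (not the authors' code, and with fake data standing in for the actual return series) could look like this:

```r
# Sample 6-month blocks of monthly returns, with replacement, and stitch
# 42 of them together to build one simulated ~21-year return history.
simulate_history <- function(returns, block_size = 6, n_blocks = 42) {
  starts <- sample(1:(nrow(returns) - block_size + 1), n_blocks, replace = TRUE)
  blocks <- lapply(starts, function(s) returns[s:(s + block_size - 1), , drop = FALSE])
  do.call(rbind, blocks)
}

set.seed(123)
# Fake data for illustration only: 20 years of monthly returns for 3 asset classes
fake_returns <- matrix(rnorm(240 * 3, mean = 0.005, sd = 0.03), ncol = 3,
                       dimnames = list(NULL, c("stocks", "bonds", "commodities")))

# 1000 simulated portfolios' worth of return histories
portfolios <- replicate(1000, simulate_history(fake_returns), simplify = FALSE)
```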
As expected, the p-values are quite high, meaning that any differences in mean returns are likely due to chance. Now we'll check on the number of times each rebalancing strategy outperforms the others. A dramatic result! No rebalancing beat the other strategies a majority of the time, and less rebalancing outperformed more most of the time too. Now for the crux: does rebalancing lead to better risk-adjusted returns as calculated by the Sharpe ratio?
Not much difference. Recall that from our previous simulations, no rebalancing actually generated a slightly worse Sharpe ratio by about 30-40 bps. But that result occurred less than 90% of the time, so it could be due to randomness. Let's check the Sharpe ratios for the present simulation.
No rebalancing generates a better Sharpe ratio a majority of the time, but not enough to conclude it isn't due to chance. Interestingly, the frequency with which quarterly and yearly rebalancing produce better Sharpe ratios than monthly rebalancing looks significant. In both cases the frequency is greater than 90% of the time. That the lower frequency rebalancing outperforms the higher frequency likely plays a role in the significance of the Sharpe ratios, but that is an area of investigation we'll shelve for now. Let's move to the next simulation, where we weight the portfolios 60/35/5 for stocks, bonds, and commodities. First, we show the boxplot of mean returns and range of outcomes. Like the equal-weighted simulations, the means don't look that dissimilar, and no rebalancing generates more positive and less negative outliers than the other rebalancing regimes. We can say almost undoubtedly that the differences in average returns, if there are any, are likely due to chance. The p-values from the t-tests we show below should prove that.
Now let's calculate and present the frequency of outperformance by rebalancing strategy. No rebalancing outperforms again! Less frequent rebalancing outperforms more frequent. And risk-adjusted returns?
Here, no rebalancing performed slightly worse than rebalancing quarterly or yearly. How likely is it that these results are significant?
Slightly worse than 50/50 for no rebalancing. But less frequent rebalancing appears to have the potential to produce higher risk-adjusted returns than more frequent rebalancing. Let's briefly sum up what we've discovered thus far. Rebalancing does not seem to produce better risk-adjusted returns. If we threw in taxation and slippage, we think rebalancing would likely be a significant underperformer most of the time. Does this mean you should never rebalance? No. You should definitely rebalance if you've got a crystal ball. Barring that, if your risk-return parameters change, then you should rebalance. But it would not be bringing the weights back to their original targets; rather, it would be new targets. An entirely separate case. What do we see as the biggest criticism of the foregoing analysis? That it was a straw man argument. In practice, few professional investors rebalance because it's July 29th or October 1st. Of course, there is quarter-end and year-end rebalancing, but those dates are usually coupled with a threshold. That is, only rebalance if the weights have exceeded some threshold, say five or ten percentage points from target. Analyzing the effects of only rebalancing based on thresholds would require more involved code on our part.3 Given the results thus far, we're not convinced that rebalancing based on thresholds would produce meaningfully better risk-adjusted returns. However, rebalancing based on changes in risk-return constraints might do so. Modeling that would be difficult since we'd also have to model (or assume) new risk-return forecasts. But we could model the traditional shift recommended by financial advisors to clients as they age; that is, slowly shifting from high-risk to low-risk assets. In other words, lower the exposure to stocks and increase the exposure to bonds. As a toy example, we use our data set to compare a no-rebalancing strategy with an initial 60/35/5 split between stocks, bonds, and commodities to a yearly rebalancing strategy that starts at a 90/5/5 split and changes to a 40/60/0 split over the period. The Sharpe ratio for the rebalanced portfolio is actually a bit worse than the no-rebalancing one. Mean returns are very close. Here's the graph of the cumulative return. This is clearly one example and highly time-dependent. But we see that rebalancing wasn't altogether different than not rebalancing, and we're not including tax and slippage effects. To test this notion we'd have to run some more simulations, but that will be for another post. We'll end this post with a question for our readers. Are you convinced rebalancing doesn't improve returns, or do you think more analysis is required? Please send us your answer to nbw dot osm at gmail dot com. Until next time, here's all the code behind the simulations, analyses, and charts.
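The post's full code follows in the original. As a self-contained illustration of comparing buy-and-hold with periodic rebalancing (reusing the fake_returns matrix from the sketch above; the Sharpe calculation here ignores the risk-free rate), one might write:

```r
# Grow $1 split across assets, optionally rebalancing back to the target
# weights every `rebalance_every` months; return the portfolio value path.
grow_portfolio <- function(returns, weights, rebalance_every = NULL) {
  values <- weights
  total  <- numeric(nrow(returns))
  for (i in seq_len(nrow(returns))) {
    values   <- values * (1 + returns[i, ])
    total[i] <- sum(values)
    if (!is.null(rebalance_every) && i %% rebalance_every == 0)
      values <- total[i] * weights
  }
  total
}

bh  <- grow_portfolio(fake_returns, c(0.60, 0.35, 0.05))
reb <- grow_portfolio(fake_returns, c(0.60, 0.35, 0.05), rebalance_every = 12)

sharpe <- function(total) {
  r <- diff(total) / head(total, -1)   # monthly portfolio returns
  mean(r) / sd(r) * sqrt(12)           # annualized
}
c(buy_and_hold = sharpe(bh), yearly_rebalance = sharpe(reb))
```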
COVID-19 Tracker: Days since N

Posted: 18 Mar 2020 05:00 PM PDT [This article was first published on R Bloggers on, and kindly contributed to R-bloggers].

There's no shortage of dashboards and data visualizations covering some aspect of the ongoing coronavirus pandemic, but not having come across a tool that allowed me to easily compare countries myself, I developed this COVID-19 Tracker shiny app, both for my own personal use and to get some more experience working with Shiny. This app was inspired by a visualization produced by John Burn-Murdoch for the Financial Times (FT) that I thought did a very nice job of allowing cross-country comparisons of the trajectories of total confirmed cases, by standardizing countries using the "Number of days since the 100th case" on the x-axis. The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) maintains a dashboard whose data source serves as the underlying data for the FT visualization, as well as many others floating around on the internet at the moment. At the time of writing, the JHU CSSE dashboard does not allow for an easy way to select countries for direct comparison. The Shiny app presented here allows the user to select any of the country/region units available in the entire dataset, standardize them on the x-axis using "Days since N", and automatically generate fairly clean level- and log-scale plots with dynamically rendered titles and axis labels. The data in the app is timestamped and updated automatically along with the JHU CSSE repo, and there are download buttons for the plots and the filtered long-format data tables used for those plots, in PNG and CSV formats respectively. Currently, a maximum of six countries can be compared at a time. The limit was set simply to allow for better readability of the resulting plots. Users can select between total confirmed cases, deaths, and total recovered as the different y-axis outcome variables. The default N number for the total confirmed cases outcome is set to 100, in keeping with the most widely used convention at the moment. For deaths, N=10 can be used. There are a few countries that include more detailed regional breakdowns of the data. Where this is the case, the totals for those countries are given by the country name + "(all territories)". Additional features and edits will be added on an ongoing basis. Feedback or comments are welcome.
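As an illustration of the "Days since N" standardization the app applies, a sketch with assumed column names (not the app's actual code):

```r
library(dplyr)

# For each country, keep the days at or above N confirmed cases and
# re-index time so that day 0 is the day the threshold was reached.
days_since_n <- function(df, n = 100) {
  df %>%
    group_by(country) %>%
    filter(confirmed >= n) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(days_since_n = row_number() - 1) %>%
    ungroup()
}
```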