[R-bloggers] A look at past bear markets and implications for the future (and 8 more aRticles)

Link to R-bloggers

A look at past bear markets and implications for the future

Posted: 16 Mar 2020 11:47 AM PDT

[This article was first published on Data based investing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The S&P 500 is officially in a bear market, and the crash from high valuation levels has been fast and painful. There is, however, light at the end of the tunnel. In this post I'll demonstrate how the US stock market has developed during past bear markets and how the market has recovered during the ten years after the peak.
I chose a ten-year horizon because I believe you should not invest in stocks any money that you will need within the next ten years. The chance of a positive return increases substantially with time and is almost ninety percent for a ten-year period. Since 1928, the worst annual return over any ten-year period has been about negative four percent (sources).

We'll use monthly total return data for the S&P 500 from Shiller, beginning in 1871 and running through the very end of last year. The index has been reconstructed to represent the US stock market for dates before the S&P 500 existed. The reason we go so far back in time is to include as many bear markets as possible. Panics and manias have always existed, and human nature has not changed enough in the past 150 years to make the older data less valid. There has, however, been a substantial change in how information spreads, which lets panic spread faster and may make bear markets shorter and deeper.
First, let's take a look at the 14 bear markets found in the data in nominal terms, which describes how a portfolio would have developed without taking inflation into account. The horizontal black line indicates the 20 percent drop needed to reach bear market territory. A blue color indicates that the return was positive in the 10 years following the peak, i.e. the ending value is higher than the value at the peak; a red color indicates the opposite.
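The full R code behind the analysis is linked at the end of the post; purely as an illustrative sketch of the approach (the shiller_monthly data frame below is a synthetic stand-in, not the actual Shiller series), drawdowns and the -20 percent bear-market flag could be computed along these lines:

# Illustrative sketch only: flag bear markets (-20% from the running peak)
# in a monthly total-return index. shiller_monthly is synthetic stand-in data.
library(dplyr)

set.seed(1)
shiller_monthly <- tibble(
  date  = seq(as.Date("1871-01-01"), by = "month", length.out = 600),
  price = cumprod(1 + rnorm(600, mean = 0.005, sd = 0.04))  # total-return index
)

bear_flags <- shiller_monthly %>%
  arrange(date) %>%
  mutate(
    running_peak = cummax(price),              # highest level reached so far
    drawdown     = price / running_peak - 1,   # drop from the running peak
    bear_market  = drawdown <= -0.20           # the -20% bear market threshold
  )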

Only two of the fourteen bear markets did not recover within ten years of the initial peak. Not surprisingly, these were the two that peaked in bubble territory, in 1929 and 2000. Notice that the bear markets that peaked in 1919 and 1987 were followed by those exact same bubbles.
Below is the same plot with real returns, so the returns describe the actual change in purchasing power after taking inflation into account. Notice that since bear markets are defined as a twenty percent drop in nominal terms, the returns might not dip below the black line because of deflation.
In real terms, four of the fourteen bear markets did not recover within ten years of the peak. Judging by history, this still leaves an over 70 percent chance of the index being higher ten years later, after inflation. Note that the bear market that peaked in 1968 overlaps heavily with the one that peaked in 1972, so they could be considered the same bear market, which would improve our odds even further.
Let's then plot the bear markets in red on top of the index to get a sense of the lengths of the bear markets, from peak to full recovery.
The average length of a bear market, from peak until full recovery, has been 3.95 years, and the average fall from peak to trough has taken 1.45 years. The longest bear market, during the 1930s Great Depression, lasted 15.33 years, and the longest peak-to-trough decline lasted 2.75 years.
Lastly, let's take a look at just the drawdowns. The bear market threshold is again indicated with a black horizontal line. The monthly data is only until the end of the year 2019, so the recent drawdown of early 2020 is missing from the graph. At the time of writing, the index is down 27 percent, with only seven of the historical drawdowns being as severe as this one.
The average drop in a bear market using monthly data has been 33.9 percent, with a maximum of 81.8 percent during the 1930s. Notice again that these are total returns. The drawdowns have been worse during periods with high valuations, as measured by Shiller CAPE or P/B. The maximum drawdowns seem to have also increased with time, which may be caused by lower valuations at the beginning of the time frame and possibly also because people have been more connected than ever, which makes the spread of panic easier.
To conclude, this bear market has been rough and short so far. Judging by history, however, most bear markets recover fully within ten years. Valuations that are still elevated relative to history may nevertheless keep the index from recovering as strongly as it did after past bear markets.
Be sure to follow me on Twitter for updates about new blog posts like this!
The R code used in the analysis can be found here.

To leave a comment for the author, please follow the link and comment on their blog: Data based investing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Free Coupon for our R Video Course: Introduction to Data Science

Posted: 16 Mar 2020 09:46 AM PDT

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For all our remote learners, we are sharing a free coupon code for our R video course Introduction to Data Science. The code is ITDS2020, and can be used at this URL https://www.udemy.com/course/introduction-to-data-science/?couponCode=ITDS2020 . Please check it out and share it!

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


A Little Something From Practical Data Science with R Chapter 1

Posted: 16 Mar 2020 07:39 AM PDT

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Here is a small quote from Practical Data Science with R Chapter 1.

It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong domain empathy to help define and solve the right problems.

Interested? Please check it out.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Online R, Python & Git Training!

Posted: 16 Mar 2020 07:34 AM PDT

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hey there!

Here at Jumping Rivers, we have the capabilities to teach you R, Python & Git virtually. For the last three years we have been running online training courses for small groups (and even 1 to 1).

How is it different to an in-person course?

It's the same, but also different! The course content is the same, but obviously the structure is adapted to online training. For example, rather than a single long session, we break the day up over a couple of days and allow regular check-in points.

For the courses, we use whereby.com. This provides screen sharing for both instructor and attendees, so none of the interactivity is lost.

What about IT restrictions?

Don't worry! If your current IT security/infrastructure is a problem, we have two solutions:

  1. Training can be done using cloud services. We can provide a secure RStudio server or Jupyter notebook environment just for your team. This means attendees simply have to log on to our cloud service to be able to use the appropriate software and packages.

  2. We have a fleet of state of the art Chromebooks, available to post to attendees. Each Chromebook comes with all required software and packages pre-installed. A microphone headset can also be provided if necessary.

What is the classroom size?

We have a maximum online classroom size of 12, including the instructor. Attendees will get the opportunity for a follow-up "virtual coding clinic", split into smaller class sizes, in order to enquire about anything related to the course or how they can apply it to their work.

If you would like to enquire about virtual training, either email info@jumpingrivers.com or contact us via our website.


The post Online R, Python & Git Training! appeared first on Jumping Rivers.

To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


All you need to know on Multiple Factor Analysis …

Posted: 15 Mar 2020 10:00 PM PDT

[This article was first published on François Husson, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Multiple factor analysis deals with datasets in which variables are organized into groups, typically when the data come from different sources. The method highlights the structure common to all the groups and the specificity of each group. It allows you to compare the results of several PCAs or MCAs in a single frame of reference. The groups of variables can be continuous, categorical, or contingency tables.

Implementation with R software

See this video and the audio transcription of this video:
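As a quick orientation before the videos, a minimal MFA call with the FactoMineR package might look like the sketch below; it uses the wine dataset bundled with FactoMineR, and the group sizes, types and names follow that package's documentation example rather than anything specific to this post.

# Sketch of a basic MFA call (based on FactoMineR's documented wine example)
library(FactoMineR)

data(wine)  # 21 wines described by groups of origin and sensory variables
res <- MFA(wine,
           group = c(2, 5, 3, 10, 9, 2),   # number of variables in each group
           type = c("n", rep("s", 5)),     # "n" = categorical, "s" = scaled continuous
           name.group = c("origin", "odor", "visual",
                          "odor.after.shaking", "taste", "overall"),
           num.group.sup = c(1, 6))        # groups treated as supplementary
summary(res)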

Course videos

Theoretical and practical information on Multiple Factor Analysis is available in these 4 course videos:

  1. Introduction
  2. Weighting and global PCA
  3. Study of the groups of variables
  4. Complements: qualitative groups, frequency tables

Here are the slides and the audio transcription of the course.

Materials

Here is the material used in the videos:

To leave a comment for the author, please follow the link and comment on their blog: François Husson.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

U.S. Census Counts Data

Posted: 15 Mar 2020 07:12 PM PDT

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As promised previously, I packaged up the U.S. Census data that I pulled together to make the population density and pyramid animations. The package is called uscenpops and it's available to install via GitHub or with install.packages() if you set up drat first. The instructions are on the package homepage.
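For reference, the two installation routes might look roughly like this (the GitHub and drat repository names below are my assumptions; the package homepage has the authoritative instructions):

# Assumed installation sketch; see the uscenpops homepage for the exact steps
# Route 1: install from GitHub (repository name assumed to be kjhealy/uscenpops)
# install.packages("remotes")
remotes::install_github("kjhealy/uscenpops")

# Route 2: set up drat first, then use install.packages()
# install.packages("drat")
drat::addRepo("kjhealy")       # drat repository name is an assumption
install.packages("uscenpops")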

A small multiple of population pyramids in selected years

A small multiple plot of selected population pyramids

Instead of an animation, let's make the less-flashy but, frankly, in all likelihood more useful small multiple plot seen here. With the package installed we can produce it as follows:

library(tidyverse)
library(uscenpops)

uscenpops
#> # A tibble: 10,520 x 5
#>     year   age     pop   male female
#>  1  1900     0 1811000 919000 892000
#>  2  1900     1 1835000 928000 907000
#>  3  1900     2 1846000 932000 914000
#>  4  1900     3 1848000 932000 916000
#>  5  1900     4 1841000 928000 913000
#>  6  1900     5 1827000 921000 906000
#>  7  1900     6 1806000 911000 895000
#>  8  1900     7 1780000 899000 881000
#>  9  1900     8 1750000 884000 866000
#> 10  1900     9 1717000 868000 849000
#> # … with 10,510 more rows

That's what the dataset looks like. We'll lengthen it, calculate a relative frequency (that we won't use in this particular plot) and add a base value that we'll use for the ribbon boundaries below.

pop_pyr <- uscenpops %>%
  select(year, age, male, female) %>%
  pivot_longer(male:female, names_to = "group", values_to = "count") %>%
  group_by(year, group) %>%
  mutate(total = sum(count),
         pct = (count/total)*100,
         base = 0)

pop_pyr
#> # A tibble: 21,040 x 7
#> # Groups:   year, group [240]
#>     year   age group   count    total   pct  base
#>  1  1900     0 male   919000 38867000  2.36     0
#>  2  1900     0 female 892000 37227000  2.40     0
#>  3  1900     1 male   928000 38867000  2.39     0
#>  4  1900     1 female 907000 37227000  2.44     0
#>  5  1900     2 male   932000 38867000  2.40     0
#>  6  1900     2 female 914000 37227000  2.46     0
#>  7  1900     3 male   932000 38867000  2.40     0
#>  8  1900     3 female 916000 37227000  2.46     0
#>  9  1900     4 male   928000 38867000  2.39     0
#> 10  1900     4 female 913000 37227000  2.45     0
#> # … with 21,030 more rows

Next we set up some little vectors of labels and colors, and then make a mini-dataframe of what we'll use as labels in the plot area, rather than using the default strip labels in facet_wrap().

## Axis labels
mbreaks <- c("1M", "2M", "3M")

## colors
pop_colors <- c("#E69F00", "#0072B2")

## In-plot year labels
dat_text <- data.frame(
  label = c(seq(1900, 2015, 5), 2019),
  year  = c(seq(1900, 2015, 5), 2019),
  age   = rep(95, 25),
  count = rep(-2.75e6, 25)
)

As before, the trick to making the pyramid is to set all the values for one category (here, males) to negative numbers.

pop_pyr$count[pop_pyr$group == "male"] <- -pop_pyr$count[pop_pyr$group == "male"]

p <- pop_pyr %>%
  filter(year %in% c(seq(1900, 2015, 5), 2019)) %>%
  ggplot(mapping = aes(x = age, ymin = base,
                       ymax = count, fill = group))

p + geom_ribbon(alpha = 0.9, color = "black", size = 0.1) +
  geom_label(data = dat_text,
             mapping = aes(x = age, y = count,
                           label = label), inherit.aes = FALSE,
             vjust = "inward", hjust = "inward",
             fontface = "bold",
             color = "gray40",
             fill = "gray95") +
  scale_y_continuous(labels = c(rev(mbreaks), "0", mbreaks),
                     breaks = seq(-3e6, 3e6, 1e6),
                     limits = c(-3e6, 3e6)) +
  scale_x_continuous(breaks = seq(10, 100, 10)) +
  scale_fill_manual(values = pop_colors, labels = c("Females", "Males")) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(x = "Age", y = "Population in Millions",
       title = "Age Distribution of the U.S. Population, 1900-2019",
       subtitle = "Age is top-coded at 75 until 1939, at 85 until 1979, and at 100 since then",
       caption = "Kieran Healy / kieranhealy.org / Data: US Census Bureau.",
       fill = "") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = rel(2), face = "bold"),
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  coord_flip() +
  facet_wrap(~ year, ncol = 5)

The calls to geom_ribbon() and geom_label() draw the actual plots, and everything else is just a little attention to detail in order to make it come out nicely.

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Outlier Days with R and Python

Posted: 15 Mar 2020 05:00 PM PDT

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.




Welcome to another installment of Reproducible Finance. Today's post will be topical as we look at the historical behavior of the stock market after days of extreme returns and it will also explore one of my favorite coding themes of 2020 – the power of RMarkdown as an R/Python collaboration tool.

This post originated when Rishi Singh, the founder of tiingo and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it's a great opportunity to use RMarkdown's R and Python interoperability superpowers, fueled by the reticulate package. If you are an R coder and someone sends you Python code as part of a project, RMarkdown + reticulate makes it quite smooth to incorporate that Python code into your work. It was interesting to learn how a very experienced Python coder might tackle a problem and then think about how to tackle that problem with R. Unsurprisingly, I couldn't resist adding a few elements of data visualization.

Before we get started, if you're unfamiliar with using R and Python chunks throughout an RMarkdown file, have a quick look at the reticulate documentation here.

Let's get to it. Since we'll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we'll also load the reticulate package and source a Python script. Here's what that looks like.

library(tidyverse)
library(tidyquant)
library(riingo)
library(timetk)
library(plotly)
library(roll)
library(slider)
library(reticulate)

riingo_set_token("your tiingo token here")

# Python file that holds my tiingo token
reticulate::source_python("credentials.py")

knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)

Note that I set my tiingo token twice: first using riingo_set_token() so I can use the riingo package in R chunks and then by sourcing the credentials.py file, where I have put tiingoToken = 'my token'. Now I can use the tiingoToken variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.

Next we will use a Python chunk to load the necessary Python libraries. If you haven't installed these yet, you can open the RStudio terminal and run pip install. Since we'll be interspersing R and Python code chunks throughout, I will add a # Python Chunk to each Python chunk and, um, # R Chunk to each R chunk.

# Python chunk
import pandas as pd
import numpy as np
import tiingo

Let's get to the substance. The goal today is to look back at the last 43 years of S&P 500 price history and analyze how the market has performed following a day that sees an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.

We will use the mutual fund VFINX as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like SPY or VOO.

Let's start by passing a URL string from tiingo to the pandas function read_csv, along with our tiingoToken.

# Python chunk
pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)

We just created a Python object called pricesDF. We can look at that object in an R chunk by calling py$pricesDF.

# R chunk
py$pricesDF %>%
  head()

Next, let's reformat pricesDF so that the date column becomes the index, in datetime format.

# Python chunk
pricesDF = pricesDF.set_index(['date'])
pricesDF.index = pd.DatetimeIndex(pricesDF.index)

Heading back to R for viewing, we see that the date column is no longer a column – it is the index of the data frame, and in pandas the index is more like a label than a new column. In fact, here's what happens when we call the row names of this data frame.

# R chunk
py$pricesDF %>%
  head() %>%
  rownames()

We now have our prices, indexed by date. Let's convert adjusted closing prices to log returns and save the results in a new column called returns. Note the use of the shift(1) operator here. That is analogous to the lag(..., 1) function in dplyr.

# Python chunk
pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))

Next, we want to calculate the 3-month rolling standard deviation of these daily log returns, and then divide daily returns by the previous rolling 3-month volatility in order to prevent look-ahead error. We can think of this as normalizing today's return by the previous 3-months' rolling volatility and will label it as stdDevMove.

# Python chunk
pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std()
pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)

Finally, we eventually want to calculate how the market has performed on the day following a large negative move. To prepare for that, let's create a column of next day returns using shift(-1).

# Python chunk
pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)

Now we can filter on the stdDevMove and returns columns to isolate days where the move was more than 3 standard deviations to the downside and the return was less than -3%. We use mean() to find the mean next-day return following such large events.

# Python chunk
nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()

Finally, let's loop through and see how the mean next-day return changes as we filter on different extreme negative returns, which we'll call drop tolerances. We label the drop tolerance i, set it to -.03, and then run a while loop that decrements i by .0025 at each pass. In this way we can look at the mean next-day return following different levels of negative returns.

# Python chunk
i = -.03
while i >= -.0525:
    nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']]
    print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6)))

    i -= .0025

It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.

Let's reproduce these results in R.

First, we import prices using the riingo_prices() function from the riingo package.

# R chunk
sp_500_prices <-
  "VFINX" %>%
  riingo_prices(start_date = "1976-01-01", end_date = today())

We can use mutate() to add a column of daily returns, rolling volatility, standard deviation move and next day returns.

# R chunk
sp_500_returns <-
  sp_500_prices %>%
  select(date, adjClose) %>%
  mutate(daily_returns_log = log(adjClose/lag(adjClose)),
         rolling_vol = roll_sd(as.matrix(daily_returns_log), 63),
         sd_move = daily_returns_log/lag(rolling_vol),
         next_day_returns = lead(daily_returns_log))

Now let's filter() on an sd_move of less than -3 and a daily_returns_log of less than our drop tolerance of -.03.

# R chunk
sp_500_returns %>%
  na.omit() %>%
  filter(sd_move < -3 & daily_returns_log < -.03) %>%
  select(date, daily_returns_log, sd_move, next_day_returns) %>%
  summarise(mean_return = mean(next_day_returns)) %>%
  add_column(drop_tolerance = scales::percent(.03), .before = 1)
# A tibble: 1 x 2
  drop_tolerance mean_return
1 3%                 0.00625

We used a while() loop to iterate across different drop tolerances in Python; let's see how to implement that using map_dfr() from the purrr package.

First, we will define a sequence of drop tolerances using the seq() function.

# R chunk
drop_tolerance <- seq(.03, .05, .0025)

drop_tolerance
[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500

Next, we will create a function called outlier_mov_fun that takes a data frame of returns, filters on a drop tolerance and gives us the mean return following large negative moves.

# R chunk
outlier_mov_fun <- function(drop_tolerance, returns) {
  returns %>%
    na.omit() %>%
    filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>%
    select(date, daily_returns_log, sd_move, next_day_returns) %>%
    summarise(mean_return = mean(next_day_returns) %>% round(6)) %>%
    add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>%
    add_column(drop_tolerance_raw = drop_tolerance, .before = 1)
}

Notice how that function takes two arguments: a drop tolerance and data frame of returns.

Next, we pass our sequence of drop tolerances, stored in a variable called drop_tolerance to map_dfr(), along with our function and our sp_500_returns object. map_dfr will iterate through our sequence of drops and apply our function to each one.

# R chunk
map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>%
  select(-drop_tolerance_raw)
# A tibble: 9 x 2
  drop_tolerance mean_return
1 3%                 0.00625
2 3%                 0.00700
3 3%                 0.00967
4 4%                 0.0109
5 4%                 0.0122
6 4%                 0.0132
7 4%                 0.0149
8 5%                 0.0149
9 5%                 0.0162

Have a quick glance up at the results of our Python while() loop and we should see that the results are consistent.

Alright, let's have some fun and get to visualizing these results with ggplot and plotly.

# R chunk
(
  sp_500_returns %>%
    map_dfr(drop_tolerance, outlier_mov_fun, .) %>%
    ggplot(aes(x = drop_tolerance_raw, y = mean_return,
               text = str_glue("drop tolerance: {drop_tolerance}
                                mean next day return: {mean_return * 100}%"))) +
    geom_point(color = "cornflowerblue") +
    labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    theme_minimal()
) %>% ggplotly(tooltip = "text")

Here's what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.

# R chunk
drop_tolerance_2 <- seq(.02, .05, .00125)

(
  sp_500_returns %>%
    map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>%
    ggplot(aes(x = drop_tolerance_raw, y = mean_return,
               text = str_glue("drop tolerance: {drop_tolerance}
                                mean next day return: {mean_return * 100}%"))) +
    geom_point(color = "cornflowerblue") +
    labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    theme_minimal()
) %>% ggplotly(tooltip = "text")

Check out what happens when we expand the lower bound, to a -6% drop tolerance.

# R chunk
drop_tolerance_3 <- seq(.02, .06, .00125)

(
  sp_500_returns %>%
    map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>%
    ggplot(aes(x = drop_tolerance_raw, y = mean_return,
               text = str_glue("drop tolerance: {drop_tolerance}
                                mean next day return: {mean_return * 100}%"))) +
    geom_point(color = "cornflowerblue") +
    labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    theme_minimal()
) %>% ggplotly(tooltip = "text")

I did not expect that gap upward when the daily drop passes 5.25%.

A quick addendum, which I would not have included had I gotten my act together and finished this 4 days ago: I'm curious how this last week compares with other weeks in terms of volatility. I have in mind to visualize weekly return dispersion, and that seemed a mighty tall task until the brand new slider package came to the rescue! slider has a function called slide_period() that, among other things, allows us to break up a time series according to different periodicities.

To break up our returns by week, we call slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week"), where first_monday_december is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that's a Monday or we could use some good ol' code. Let's assume we want to find the first Monday in December of 2016.

We first filter our data with filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))). Then create a column of weekday names with wday(date, label = TRUE, abbr = FALSE) and filter to our first value of "Monday".

# R Chunk
first_monday_december <-
  sp_500_returns %>%
  mutate(date = ymd(date)) %>%
  filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>%
  mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>%
  filter(day_week == "Monday") %>%
  slice(1) %>%
  pull(date)

Now we run our slide_period_dfr() code and it will start on the first Monday in December of 2016, and break our returns into weeks. Since we set .names_to = "week", the function will create a new column called week and give a unique number to each of our weeks.

# R chunk
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>%
  head(10)
# A tibble: 10 x 3
    week date                daily_returns_log
 1     1 2016-12-05 00:00:00           0.00589
 2     1 2016-12-06 00:00:00           0.00342
 3     1 2016-12-07 00:00:00           0.0133
 4     1 2016-12-08 00:00:00           0.00226
 5     1 2016-12-09 00:00:00           0.00589
 6     2 2016-12-12 00:00:00          -0.00105
 7     2 2016-12-13 00:00:00           0.00667
 8     2 2016-12-14 00:00:00          -0.00810
 9     2 2016-12-15 00:00:00           0.00392
10     2 2016-12-16 00:00:00          -0.00172

From here, we can group_by that week column and treat each week as a discrete time period. Let's use ggplotly to plot each week on the x-axis and the daily returns of each week on the y-axis, so that the vertical dispersion shows us the dispersion of weekly returns. Hover on the point to see the exact date of the return.

# R chunk
(
  sp_500_returns %>%
    select(date, daily_returns_log) %>%
    filter(date >= first_monday_december) %>%
    slide_period_dfr(.,
                     .$date,
                     "week",
                     ~ .x,
                     .origin = first_monday_december,
                     .names_to = "week") %>%
    group_by(week) %>%
    mutate(start_week = ymd(min(date))) %>%
    ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) +
    geom_point(color = "cornflowerblue", alpha = .5) +
    scale_y_continuous(labels = scales::percent,
                       breaks = scales::pretty_breaks(n = 8)) +
    scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
    labs(y = "", x = "", title = "Weekly Daily Returns") +
    theme_minimal()
) %>% ggplotly(tooltip = "text")

We can also plot the standard deviation of returns for each week.

# R chunk
(
  sp_500_returns %>%
    select(date, daily_returns_log) %>%
    filter(date >= first_monday_december) %>%
    slide_period_dfr(.,
                     .$date,
                     "week",
                     ~ .x,
                     .origin = first_monday_december,
                     .names_to = "week") %>%
    group_by(week) %>%
    summarise(first_of_week = first(date),
              sd = sd(daily_returns_log)) %>%
    ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) +
    geom_point(aes(color = sd)) +
    labs(x = "", title = "Weekly Standard Dev of Returns", y = "") +
    theme_minimal()
) %>% ggplotly(tooltip = "text")

That's all for today! Thanks for reading and stay safe out there.

To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Flatten the COVID-19 curve

Posted: 15 Mar 2020 04:00 PM PDT

[This article was first published on Theory meets practice..., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Abstract:

We discuss why the message of flattening the COVID-19 curve is right, but why some of the visualizations used to show the effect are wrong: Reducing the basic reproduction number does not just stretch the outbreak, it also reduces the final size of the outbreak.



Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Motivation

Current discussions about interventions for the ongoing COVID-19 outbreak talk a lot about flattening the epidemic curve, i.e. slowing down the outbreak dynamics. Because of limited health-care capacity, stretching the outbreak over a longer time period ensures that a larger proportion of those in need of hospital treatment will actually get it. Other advantages of this approach are winning time to find better forms of treatment and, possibly, to eventually develop a vaccine. Visualizations of the flatten-the-curve effect often look like this one taken from Twitter:


The origin of these illustrations is discussed here.

As much as I support the message and the reasons for flattening the curve, some of the visualizations have shortcomings from an infectious disease modelling point of view: they convey the message that the number of individuals which, as a result of the outbreak, will need hospitalization is fixed. Hence, the figure suggests that it's impossible to avoid a certain number of infected (say 40-70% of the population), but that we save lives by stretching hospital cases out over time. Although the conclusion is correct, the premise is IMHO wrong: reducing the basic reproduction number by drastically reducing contacts or by quickly isolating infectious cases also reduces the size of the outbreak. Others, like Ben Bolker, have pointed out this flaw as well.

We shall use a simple and common mathematical model from infectious disease modelling to illustrate this point. This model is easily implemented in R – showing how is a secondary objective of this post. The R code of this post is available from github.

A word of caution at this point: The numbers and illustrations used in this post address the point of visualization and are not an attempt to generate actual policy advice.

Susceptible-Infectious-Recovered modelling

A well-known mathematical model to describe the dynamics of an infectious disease in a population is the so-called susceptible-infectious-recovered (SIR) compartment model (Kermack and McKendrick 1927). This model assumes that each individual in the population belongs to one of three states:

  • Susceptible: the individual has not yet had the disease, but is not immune to the disease and thus can become infectious when having an infectious contact with an infected individual
  • Infectious: a person currently affected by the disease and able to transmit the disease to others
  • Recovered: an infectious person who is not affected anymore by the disease and not able to transmit the disease anymore. No re-infection can occur, i.e. recovered individuals are immune to the disease once they had it.

It is assumed that at time zero everyone is susceptible except for \(m\) individuals, which are infectious at time zero. Once infected an individual becomes infectious and then recovers. Mathematically, we shall by \(S(t)\), \(I(t)\) and \(R(t)\) denote the number of susceptible, infectious and recovered in the population at time \(t\). Furthermore, it is assumed that the population consists of a constant number of \(N\) individuals and at all times \(S(t)+I(t)+R(t)=N\). In other words the population is closed and does not vary over time.

The dynamics in the number of susceptibles, infectious and recovered are now described using the following deterministic ordinary differential equation (ODE) system:

\[
\begin{eqnarray*}
\frac{dS(t)}{dt} &=& -\beta S(t) I(t), \\
\frac{dI(t)}{dt} &=& \beta S(t) I(t) - \gamma I(t), \\
\frac{dR(t)}{dt} &=& \gamma I(t).
\end{eqnarray*}
\]

What does this mean? It denotes the movement of individuals between the three categories, in particular the movement from \(S\rightarrow I\) and \(I \rightarrow R\). The most important term in the equation system is \(\beta S(t) I(t)\) and can be motivated as follows: Consider one specific infectious individual at time \(t\); this individual meets a specific other person in the population at a rate of \(\beta\) contacts per time unit (say day). It is assumed that this rate is the same no matter which other person we talk about (aka homogeneous mixing in the population). Of course this is a very strong simplification, because it ignores, e.g., the distance between two individuals and the fact that you tend to mix more with your peers. But in a large population, the average is a good description. Hence, the number of contacts with susceptible individuals per time unit is \(\beta S(t)\). Now summing these contacts over all infectious individuals at time \(t\) leads to \(\sum_{j=1}^{I(t)}\beta S(t) = \beta I(t) S(t)\). Note that this is a non-linear term consisting of both \(I(t)\) and \(S(t)\).

In the above process description, for ease of exposition, it is assumed that once an infectious individual meets a susceptible person, the disease is always transmitted from the infected to the susceptible person. Hence, the transmission probability does not depend on, e.g., how long the infectious individual has already been infectious. An equivalent way of formulating this statement is to say that each individual has contacts at rate \(\alpha\) for meeting a specific other person, and a proportion \(p\) of these contacts results in an infection. Then \(\beta = \alpha p\) is the rate at which infectious contacts occur.

The second component of the model is the term \(\gamma I(t)\). Again considering one infectious individual, it is assumed that the \(I\rightarrow R\) transition occurs at the constant rate \(\gamma\). This means that individuals are infectious for \(1/\gamma\) days on average before they recover from the disease. In other words: the smaller \(\gamma\) is, the longer people are infectious and, hence, the longer they can transmit the disease to others. Note: recovering, from an epidemic modelling point of view, does not distinguish between individuals who recover by becoming healthy and those who die; what is important is that they do not contribute to the spread of the disease anymore.

One important quantity, which can be derived from the above ODE equation system is the so called basic reproduction number, aka. \(R_0\) and is defined as (Diekmann, Heesterbeek, and Britton 2013)

the expected number of secondary cases per primary case in a completely susceptible population. It is computed as \(R_0 = N \frac{\beta}{\gamma}\).

This means that if we consider the dynamics of a disease in generation time, i.e. on a time scale where one time unit is the period between infection of the primary case and infection of the secondary case, then \(R_0\) denotes the growth factor in the number of cases at the beginning of the outbreak. What is special about the beginning of the outbreak? Well, more or less all contacts an infectious individual has will be with susceptible individuals. However, once a large part of the population has already been infected, \(R_0\) does not necessarily describe the expected number of cases per primary case anymore. For the COVID-19 outbreak, since it is assumed that little immunity exists against the disease, all individuals will be susceptible and, hence, almost all contacts an infectious individual has will be with susceptible persons. At a later stage in the epidemic, however, the number of secondary cases per primary case falls below \(R_0\) because of the depletion of susceptibles.

Assuming \(R(0)=0\) and letting \(I(0)=m\), we obtain \(S(0) = N-m\). We can use this initial configuration together with a numerical solver for ODEs as implemented, e.g., in the lsoda function of the R package deSolve (Soetaert, Petzoldt, and Setzer 2010). For this, we need to implement a function which computes the derivative of the ODE system:

######################################################################
# Function to compute the derivative of the ODE system
#
#  t - time
#  y - current state vector of the ODE at time t
#  parms - Parameter vector used by the ODE system
#
# Returns:
#  list with one component being a vector of length two containing
#  dS(t)/dt and dI(t)/dt
######################################################################

sir <- function(t, y, parms) {
  beta <- parms[1]
  gamma <- parms[2]
  S <- y[1]
  I <- y[2]
  return(list(c(S = -beta * S * I, I = beta * S * I - gamma * I)))
}
# Population size
N <- 1e6
# Rate at which person stays in the infectious compartment (disease specific and tracing specific)
gamma <- 1/5
# Infectious contact rate - beta = R0/N*gamma and when R0 \approx 2.25 then 2.25/N*gamma
beta <- 4.5e-07
# R0 for the beta and gamma values
R0 <- beta*N/gamma

Assuming a hypothetical population of \(N = 1,000,000\), the contact rate \(\beta = 0.00000045\) means that a given pair of individuals has 0.00000045 contacts per day. The choice of \(\gamma = 0.2\) corresponds to an average length of the infectious period of 5 days. Altogether, this leads to an \(R_0\) of 2.25, which roughly corresponds to the \(R_0\) of SARS-CoV-2.

We can now solve the ODE system using the above parameters and an initial number of infectious of, say, 10:

# Load package to numerically solve ODEs
suppressPackageStartupMessages(library(deSolve))

# Grid where to evaluate
max_time <- 150
times <- seq(0, max_time, by=0.01)

# Solve ODE system using Runge-Kutta numerical method.
ode_solution <- rk4(y = c(N - 10, 10), times = times, func = sir, parms = c(beta, gamma)) %>%
  as.data.frame() %>%
  setNames(c("t", "S", "I")) %>%
  mutate(beta = beta, gama = gamma, R0 = N * beta / gamma, s = S / N, i = I / N, type = "without_intervention")

Here we have introduced \(s(t) = S(t)/N\) and \(i(t) = I(t)/N\) as, respectively, the proportion of susceptible and infectious individuals in the population. Note that \(I(t)\) is the number of currently infectious persons. Since a person is usually infectious for more than one day this curve is not equivalent to the number of new infections per day. If interest is in this value, which would typically be what is reported by health authorities, this can be computed as \[
C(t) = \int_{t-1}^t \beta S(u) I(u) du.
\]
or, since the SIR structure does not allow re-infections, simply as \(S(t-1) - S(t)\).

The epidemic curve of new infections per day is shown below:
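(The plotting code is not shown here; a minimal sketch of how such a curve could be computed from the ode_solution data frame above, keeping only whole-day time points and differencing \(S\), is given below.)

# Sketch (not the post's exact code): daily new infections as S(t-1) - S(t),
# computed from the ode_solution data frame on the fine time grid above.
library(dplyr)

daily_new <- ode_solution %>%
  filter(abs(t - round(t)) < 1e-8) %>%      # keep whole-day time points only
  arrange(t) %>%
  mutate(new_infections = lag(S) - S)       # susceptibles lost per day

plot(daily_new$t, daily_new$new_infections, type = "l",
     xlab = "day", ylab = "new infections per day")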

Another important quantity of the model is an estimate for how many individuals are ultimately infected by the disease, i.e. \(1-s(\infty)\) in a population where initially everyone is susceptible to the disease. This can be either calculated numerically from the above numerical solution as:

(1 - tail(ode_solution, n=1) %>% pull(s)) 
## [1] 0.8534244

or by numerically solving the following recursive equation (Diekmann, Heesterbeek, and Britton 2013, 15): \[
s(\infty) = \exp(-R_0 (1-s(\infty)))
\]
This can be solved in R using:

# Function to compute the final size.
s_inf <- function(R0) {
  f_target <- function(x) { x - exp(-R0*(1-x)) }
  result <- uniroot(f_target, lower=1e-12, upper=1-1e-12)$root
  return(result)
}

# Final proportion of infected.
1 - s_inf(R0)
## [1] 0.8534382

We can use the above equation to verify that the larger \(R_0\), the larger is the final size of the outbreak:

R0_grid <- c(1.25, 1.5, 1.75, 2, 2.25, 2.5, 3)
map_dbl(R0_grid, ~ 1-s_inf(.x)) %>% setNames(R0_grid) %>% scales::percent(accuracy=1)
##  1.25   1.5  1.75     2  2.25   2.5     3
## "37%" "58%" "71%" "80%" "85%" "89%" "94%"

However, despite a value of \(R_0>1\), not the entire population will be infected, because of the depletion of susceptibles. Hence, the exponential growth rate interpretation of \(R_0\) is only valid at the beginning of an outbreak.

Reducing \(R_0\)

As we could see from the equation defining \(R_0\) in our simple SIR-model, there are two ways to reduce the \(R_0\) of a disease:

  1. Reduce the number of contacts persons have with each-other, i.e. make \(\beta\) smaller
  2. Reduce the duration for which people effectively spread the disease (e.g. by quarantine), i.e. shorten the average infectious period \(1/\gamma\) by increasing the recovery rate \(\gamma\).

For simplicity we shall only be interested in the first case, and we pursue the following simple strategy, where the measures depend only on time:

\[
\beta(t) =
\left\{
\begin{array}{ll}
\beta_0 & \text{if } t\leq t_1, \\
\beta_1 & \text{if } t_1 < t\leq t_2, \\
\beta_2 & \text{if } t_2 < t
\end{array}
\right.
\]
where \(\beta_0\) is the ordinary \(\beta\) value of the disease. We use a drastic reduction of contacts within the interval \([t_1, t_2]\), and thus \(\beta_1 < \beta_0\). After some time, the control measures are relaxed slightly, i.e. \(\beta_1 < \beta_2 < \beta_0\). We shall use \(\beta_1 = r_1 \beta_0\) and \(\beta_2 = r_2 \beta_0\) with \(r_1 \leq r_2\). The two epidemic curves can now be plotted as follows:
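The code behind the intervention scenario is not shown in this post; a minimal sketch of how the piecewise \(\beta(t)\) could be encoded within the same setup is given below (the change points \(t_1 = 30\) and \(t_2 = 60\) days are illustrative choices, not necessarily those used for the figures).

# Sketch (not the post's exact code): SIR derivative with a piecewise-constant
# contact rate beta(t); parms = c(beta0, gamma, t1, t2, r1, r2).
sir_intervention <- function(t, y, parms) {
  beta0 <- parms[1]; gamma <- parms[2]
  t1 <- parms[3]; t2 <- parms[4]
  r1 <- parms[5]; r2 <- parms[6]
  beta_t <- if (t <= t1) beta0 else if (t <= t2) r1 * beta0 else r2 * beta0
  S <- y[1]; I <- y[2]
  list(c(S = -beta_t * S * I, I = beta_t * S * I - gamma * I))
}

# Solved exactly like the baseline scenario, here with r1 = 0.6 and r2 = 0.8
# and the illustrative change points at day 30 and day 60.
ode_intervention <- rk4(y = c(N - 10, 10), times = times, func = sir_intervention,
                        parms = c(beta, gamma, 30, 60, 0.6, 0.8))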

The final size in the two cases:

## # A tibble: 2 x 2
## # Groups:   type [2]
##   type                 final_fraction
## 1 without_intervention 85%
## 2 with_intervention    68%

Here we used the rather conservative estimate that \(R_0\) can be reduced to 60% of its original value, i.e. to 1.35, for a few weeks; afterwards the measures are relaxed so that \(R_0\) becomes 80% of the original value, i.e. 1.8. Things of course become more optimistic the larger the reduction is. One trap, though, is to reduce \(R_0\) drastically and then lift the measures too much: in this case the outbreak is delayed, but then has almost the same peak and size, only later.

The simple analysis in this post shows that the final size proportion with interventions is several percentage points smaller than without interventions. The larger the interventions, if done right and timed right, the smaller the final size. In other words: the spread of an infectious disease in a population is a dynamic phenomenon. Time matters. The timing of interventions matters. If done correctly, they stretch the outbreak and reduce the final size!

Discussion

The epidemic-model-based approach to flattening the curve shows that the effect of reducing the basic reproduction number is not just to stretch out the outbreak, but also to limit its size. This aspect seems to be ignored in some visualizations of the effect.

The simple SIR model used in this post suffers from a number of limitations: it is a deterministic construction averaging over many stochastic phenomena. However, at the large scale of the outbreak we are now talking about, this simplification appears acceptable. Furthermore, it assumes homogeneous mixing between individuals, which is way too simple. Dividing the population into age groups and geographic locations, and modelling the interactions between these groups, would be a more realistic reflection of how the population is shaped. Again, for the purpose of visualizing the flatten-the-curve effect, I think a simple model is OK. More involved modelling covering the establishment of the disease as endemic in the population is beyond this post, as is the effectiveness of case tracing. For more background on the modelling see, for example, the YouTube video about the Mathematics of the Coronavirus Outbreak by my colleague Tom Britton, or the work by Fraser et al. (2004).

It is worth pointing out that mathematical models are only tools to gain insight. They are based on assumptions which are likely to be wrong. The question is, if a violation is crucial or if the component is still adequately captured by the model. A famous quote says: All models are wrong, but some are useful… Useful in this case is the message: flatten the curve by reducing contacts and by efficient contact tracing.

Literature

Diekmann, Odo, Hans Heesterbeek, and Tom Britton. 2013. Mathematical Tools for Understanding Infectious Disease Dynamics. Princeton University Press.

Fraser, Christophe, Steven Riley, Roy M Anderson, and Neil M Ferguson. 2004. "Factors That Make an Infectious Disease Outbreak Controllable." Proceedings of the National Academy of Sciences of the United States of America 101 (16): 6146–51.

Kermack, W. O., and A. G. McKendrick. 1927. "A Contribution to the Mathematical Theory of Epidemics." Proceedings of the Royal Society, Series A 115: 700–721.

Soetaert, Karline, Thomas Petzoldt, and R. Woodrow Setzer. 2010. "Solving Differential Equations in R: Package deSolve." Journal of Statistical Software 33 (9): 1–25. https://doi.org/10.18637/jss.v033.i09.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Vectorising like a (semi)pro

Posted: 14 Mar 2020 05:00 PM PDT

[This article was first published on Data Imaginist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is a short practical post about programming with R. Take it for what it
is and nothing more…

R is slow! That is what they keep telling us (they being someone who "knows"
about "real" programming and has another language that they for some reason fail
to be critical about).

R is a weird thing, especially for people who have been trained in a classical
programming language. One of the main reasons for this is its vectorised nature,
which is not just about the fact that vectors are prevalent in the language, but
is an underlying principle that should guide the design of efficient algorithms
in the language. If you write R like you write C (or Python), then sure it is
slow, but really, you are just using it wrong.

This post will take you through the design of a vectorised function. The genesis
of the function comes from my generative art, but I thought it was so nice and
self-contained that it would make a good blog post. If that seems like something
that could take your mind off the pandemic, then buckle up!

The problem

I have a height-map, that is, a matrix of numeric values. You know what? Let's
make this concrete and create one:

library(ambient)
library(dplyr)

z <- long_grid(1:100, 1:100) %>%
  mutate(val = gen_simplex(x, y, frequency = 0.02)) %>%
  as.matrix(val)

image(z, useRaster = TRUE)

This is just some simplex noise of course, but it fits our purpose…

Anyway, we have a height-map and we want to find the local extrema, that is, the
local minima and maxima. That's it. Quite a simple and understandable
challenge, right?

Vectorised, smecktorised

Now, had you been a trained C-programmer you would probably have solved this
with a loop. This is the way it should be done in C, but applying this to R will
result in a very annoyed programmer who will tell anyone who cares to listen
that R is slow.

We already knew this. We want something vectorised, right? But what is
vectorised anyway? All over the internet the recommendation is to use the
apply() family of functions to vectorise your code, but I have some bad news
for you: this is the absolute wrong way to vectorise. There are a lot of good
reasons to use the functional approach to looping instead of the for-loop, but
when it comes to R, performance is not one of them.
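To make that claim concrete, here is a quick, purely illustrative timing (not from the original post) of an element-wise doubling done with vapply() versus the built-in vectorised arithmetic; the exact numbers will vary by machine, but the gap is typically dramatic.

x <- runif(1e6)

# "vectorised" only in the loose sense: an R-level loop over the elements
system.time(y_loop <- vapply(x, function(v) v * 2, numeric(1)))

# vectorised in the performant sense: a single call into compiled code
system.time(y_vec <- x * 2)

identical(y_loop, y_vec)  # same result, very different cost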

Shit…

To figure this out, we need to be a bit more clear about what we mean by a
vectorised function. There are a few different ways to think about it:

  1. The broad and lazy definition is a function that operates on the elements of
    a vector. This is where apply()-based functions (and friends) reside.
  2. The narrow and performant definition is a function that operates on the
    elements of a vector in compiled code. This is where many of R's base
    functions live, along with properly designed functions implemented in C or C++.
  3. The middle ground is a function that is composed of calls to 2. to avoid
    explicit loops, thus deferring most element-wise operations to compiled code.

We want to talk about number 3. Simply implementing this in compiled code would be
cheating, and we wouldn't learn anything.

Thinking with vectors

R comes with a lot of batteries included. Some of the more high-level functions
are not implemented with performance in mind (sadly), but a lot of the basic
stuff is, e.g. indexing, arithmetic, summations, etc. It turns out that these
are often enough to implement pretty complex functions in an efficient
vectorised manner.

Going back to our initial problem of finding extrema: what we are effectively
asking for is a moving-window function where each cell is evaluated on whether
it is the largest or smallest value in its respective window. If you think a bit
about this, it is mainly an issue of indexing. For each element in the matrix,
we want the indices of all the cells within its window. Once we have that, it is
pretty easy to extract all the relevant values, use the vectorised pmin()
and pmax() functions to find the smallest and largest value in each window, and
use the (vectorised) == to see whether a cell's value equals either extreme.
That's a lot of talk, here is the final function:

extrema <- function(z, neighbors = 2) {
  ind <- seq_along(z)
  rows <- row(z)
  cols <- col(z)
  n_rows <- nrow(z)
  n_cols <- ncol(z)

  window_offsets <- seq(-neighbors, neighbors)
  window <- outer(window_offsets, window_offsets * n_rows, `+`)
  window_row <- rep(window_offsets, length(window_offsets))
  window_col <- rep(window_offsets, each = length(window_offsets))

  windows <- mapply(function(i, row, col) {
    row <- rows + row
    col <- cols + col
    new_ind <- ind + i
    new_ind[row < 1 | row > n_rows | col < 1 | col > n_cols] <- NA
    z[new_ind]
  }, i = window, row = window_row, col = window_col, SIMPLIFY = FALSE)

  windows <- c(windows, list(na.rm = TRUE))
  minima <- do.call(pmin, windows) == z
  maxima <- do.call(pmax, windows) == z

  extremes <- matrix(0, ncol = n_cols, nrow = n_rows)
  extremes[minima] <- -1
  extremes[maxima] <- 1

  extremes
}

(don't worry, we'll go through it in a bit)

This function takes a matrix, and a neighborhood radius and returns a new matrix
of the same dimensions as the input, with 1 in the local maxima, -1 in the
local minima, and 0 everywhere else.

Let's go through it:

# ...
  ind <- seq_along(z)
  rows <- row(z)
  cols <- col(z)
  n_rows <- nrow(z)
  n_cols <- ncol(z)
# ...

Here we are simply doing some quick calculations upfront for reuse later. The
ind variable is simply the index for each cell in the matrix. Matrices are
simply vectors underneath, so they can be indexed like that as well. rows and
cols hold the row and column index of each cell, and n_rows and n_cols
are pretty self-explanatory.

# ...
  window_offsets <- seq(-neighbors, neighbors)
  window <- outer(window_offsets, window_offsets * n_rows, `+`)
  window_row <- rep(window_offsets, length(window_offsets))
  window_col <- rep(window_offsets, each = length(window_offsets))
# ...

Most of the magic happens here, but it is not that apparent. We use the outer()
function to construct a matrix, the size of our window, holding the index offset
from the center for each of the cells in the window. We also construct vectors
holding the row and column offsets for each cell.

# ...
  windows <- mapply(function(i, row, col) {
    row <- rows + row
    col <- cols + col
    new_ind <- ind + i
    new_ind[row < 1 | row > n_rows | col < 1 | col > n_cols] <- NA
    z[new_ind]
  }, i = window, row = window_row, col = window_col, SIMPLIFY = FALSE)
# ...

This is where all the magic appears to happen. For each cell in the window, we
calculate its respective value for each cell in the input matrix. I can
already hear you scream about me using an apply()-like function, but the key
thing is that I'm not using it to loop over the elements of the input vector (or
matrix), but over a much smaller (and often fixed) number of elements.

If you want to leave now because I'm moving the goal-posts, be my guest.

Anyway, what is happening inside the mapply() call? Inside the function we
figure out which row and column the offset cell is part of. Then we calculate
the indices of the cells for the offset. In order to guard against out-of-bounds
errors we set all the indices that are out of bounds to NA, and then we simply
index into our matrix. The crucial part is that all of the operations here are
vectorised (indexing, arithmetic, and comparisons). In the end we get a list
holding a vector of values for each cell in the window.

# ..
  windows <- c(windows, list(na.rm = TRUE))
  minima <- do.call(pmin, windows) == z
  maxima <- do.call(pmax, windows) == z
  extremes <- matrix(0, ncol = n_cols, nrow = n_rows)
  extremes[minima] <- -1
  extremes[maxima] <- 1
  extremes
# ..

This is really just wrapping up, even though the actual computations happen
here. We use pmin() and pmax() to find the minimum and maximum across each
window and compare them to the values in our input matrix (again, all properly
vectorised functions). In the end we construct a matrix holding 0s and use the
calculated positions to set 1 or -1 at the locations of the local extremes.

Does it work?

I guess that is the million dollar question, closely followed by "is it
faster?". I don't really care enough to implement a "dumb" vectorisation, so
I'll just put my head on the block with the last question and insist that, yes,
it is much faster. You can try to beat me with an apply() based solution and
I'll eat a sticker if you succeed (unless you cheat).

As for the first question, let's have a look

extremes <- extrema(z)
extremes[extremes == 0] <- NA

image(z, useRaster = TRUE)
image(extremes, col = c('black', 'white'), add = TRUE)

Lo and behold, it appears as if we succeeded.

Can vectorisation save the world?

No…

More to the point, not every problem has a nice vectorised solution. Further,
the big downside with proper vectorisation is that it often requires expanding a
lot of variables to the size of the input vector. In our case we needed to hold
all windows in memory simultaneously, and it does not take too much imagination
to think up scenarios where that may make our computer explode. Still, more
often than not it is possible to write super performant R code, and usually the
crucial part is to figure out how to do some intelligent indexing.

If you are still not convinced then read through
Brodie Gaslam's blog. He has a penchant for
implementing ridiculously complicated stuff in highly efficient R
code. It goes without saying that his posts are often more involved than this,
but if you have kept reading until this point, I think you are ready…

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

