[R-bloggers] Riddler: Can You Roll The Perfect Bowl? (and 6 more aRticles)


Riddler: Can You Roll The Perfect Bowl?

Posted: 31 May 2020 06:41 AM PDT

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers.]

FiveThirtyEight's Riddler Express


At the recent World Indoor Bowls Championships in Great Yarmouth,
England, one of the rolls by Nick Brett went viral. Here it is in all
its glory:

In order for Nick's green bowl to split the two red bowls, he needed
expert precision in both the speed of the roll and its final angle of
approach.

Suppose you were standing in Nick's shoes, and you wanted to split two
of your opponent's bowls. Let's simplify the math a little, and say
that each bowl is a sphere with a radius of 1. Let's further suppose
that your opponent's two red bowls are separated by a distance of 3 —
that is, the centers of the red bowls are separated by a distance of
5. Define phi as the angle between the path your bowl is on and the
line connecting your opponent's bowls.

For example, here's how you could split your opponent's bowls when phi
is 75 degrees:

What is the minimum value of phi that will allow your bowl to split
your opponents' bowls without hitting them?

Plan

I will approximate the solution to this puzzle by simulating the game from many different angles. Thankfully, because the setup is vertically and horizontally symmetric, I only need to simulate the green ball up to the midpoint between the red balls and check whether it collides with the top red ball.

Setup

knitr::opts_chunk$set(echo = TRUE, comment = "#>", cache = TRUE)

library(glue)
library(clisymbols)
library(ggforce)
library(gganimate)
library(tidyverse)

theme_set(theme_minimal())

# Some standard colors used throughout
green <- "#54c761"
red <- "#c75454"
purple <- "#a06bdb"
light_grey <- "grey70"
grey <- "grey40"

set.seed(0)

Simulate a single pass

I split the code into two pieces. The first simulates a bowl at a given angle, and the second decides which angles to try in order to narrow down the approximation. The following functions take care of the first part: simulating a bowl.

A single simulation can be run by calling run_bowl_simulation() with an angle (in degrees). The function works by shrinking the hypotenuse, starting at h_start = 5 and decreasing it to 0 in steps of step_size (the values are held in the numeric vector h_vals). The actual position of the ball is calculated from the length of the hypotenuse and the angle with a bit of trigonometry in make_green_ball(). For each hypotenuse value, the green ball is positioned and then tested for a collision with the red ball (set at (x, y) = (0, 2.5) as per the riddle) using the function did_balls_collide(). This information is recorded in a single data frame with one row per step of the simulation, which is returned at the end.

# Run a simulation of the bowling game.
run_bowl_simulation <- function(angle,
                                step_size = 0.1,
                                red_ball_loc = list(x = 0, y = 2.5)) {
  h_start <- 5
  h_vals <- seq(h_start, 0, by = -step_size)
  angle <- angle * (pi / 180)

  all_ball_pos <- NULL

  for (h in h_vals) {
    green_ball <- make_green_ball(h, angle)
    collision <- did_balls_collide(green_ball, red_ball_loc, radius = 1)
    all_ball_pos <- bind_rows(
      all_ball_pos,
      tibble(h = h,
             x = green_ball$x,
             y = green_ball$y,
             collision = collision)
    )
  }

  return(all_ball_pos)
}
# Make a green ball location from the hypotenuse length and angle.
make_green_ball <- function(h, angle) {
  x <- -1 * h * cos(pi/2 - angle)
  y <- h * sin(pi/2 - angle)
  list(x = x, y = y)
}
# Decide whether two balls of radius `radius` collided.
did_balls_collide <- function(ball1, ball2, radius) {
  d <- sqrt((ball1$x - ball2$x)^2 + (ball1$y - ball2$y)^2)
  return(d <= 2 * radius)
}

Below are the results from running the simulation at angles between 90
degrees (horizontal) and 0 degrees (vertical) at 10 degree increments.
Each line is an individual simulation, and each point is a round of the
simulation. A red ball is positioned as per the riddle, and the purple
points indicate where the green ball would collide with the red ball.
These example simulations show that the run_bowl_simulation() function
is working as expected.

map(seq(90, 0, -10), run_bowl_simulation, step_size = 0.1) %>%
  map2(seq(90, 0, -10), ~ .x %>% add_column(angle = .y)) %>%
  bind_rows() %>%
  mutate(collision = ifelse(collision, "collision", "safe")) %>%
  ggplot() +
  geom_point(aes(x, y, color = collision), size = 2) +
  geom_circle(aes(x0 = x0, y0 = y0, r = r),
              data = tibble(x0 = 0, y0 = 2.5, r = 1),
              color = red, fill = red, alpha = 0.5) +
  scale_color_manual(values = c(purple, light_grey)) +
  coord_fixed() +
  theme(
    legend.position = c(0.15, 0.9),
    legend.title = element_blank()
  ) +
  labs(x = "x", y = "y",
       title = "Example paths of the green ball",
       subtitle = "For the angles between 0 and 90 at 10 degree intervals.")

Find the smallest angle

The second part of the code finds the smallest (narrowest) angle at which there is no collision. Instead of trying every angle between 90 degrees and 0 degrees at some very small increment, I approach this problem a bit more efficiently. I built an algorithm that starts at 90 degrees and takes large steps until it reaches an angle that causes a collision. It then takes a step back and tries again with a progressively smaller step, until it no longer collides. This continues with the step size getting smaller and smaller. The algorithm stops when the step size is small enough for a good approximation and the angle does not cause a collision. The code chunk below carries out this process, printing the information for each pass.

The purposes of the angle and previous_angle parameters are fairly obvious. The angle_delta parameter is the value by which the angle is reduced at each step. epsilon is used to shrink angle_delta whenever an angle causes a collision. Finally, min_angle_delta is one of the stopping criteria: when angle_delta falls below this value, the algorithm is sufficiently close to the correct answer and stops trying new angles. Thus, this parameter determines the precision of the algorithm. It is set relatively high for now, because this first pass is just a demonstration and prints out the results of each iteration.

For efficiency, the while loop uses a memoised version of run_bowl_simulation(), because whenever the balls collide the previous angle is tried again; memoising the function avoids re-running the simulation for the same angle multiple times.

# The starting angle.
angle <- 90
previous_angle <- angle

# The "learning rate" parameters.
angle_delta <- 10
epsilon <- 0.5
min_angle_delta <- 0.01

# Start with TRUE, though it doesn't matter.
collision <- TRUE

memo_bowl_sim <- memoise::memoise(run_bowl_simulation)

while (angle_delta >= min_angle_delta | collision) {
  # Run the bowling simulation with the current angle.
  sim_res <- memo_bowl_sim(angle = angle, step_size = 0.1)

  # Were there any collisions?
  collision <- any(sim_res$collision)

  # Print results.
  msg <- "collision: {ifelse(collision, symbol$cross, symbol$tick)}" %>%
    paste("{collision},") %>%
    paste("angle: {round(angle, 4)},") %>%
    paste("angle_delta: {round(angle_delta, 4)}")
  print(glue(msg))

  if (!collision) {
    # Reduce the angle if there is no collision.
    previous_angle <- angle
    angle <- angle - angle_delta
  } else {
    # Revert to the previous angle and reduce delta if there is a collision.
    angle_delta <- epsilon * angle_delta
    angle <- previous_angle
  }
}
#> collision: ✔ FALSE, angle: 90, angle_delta: 10
#> collision: ✔ FALSE, angle: 80, angle_delta: 10
#> collision: ✔ FALSE, angle: 70, angle_delta: 10
#> collision: ✔ FALSE, angle: 60, angle_delta: 10
#> collision: ✖ TRUE, angle: 50, angle_delta: 10
#> collision: ✔ FALSE, angle: 60, angle_delta: 5
#> collision: ✔ FALSE, angle: 55, angle_delta: 5
#> collision: ✖ TRUE, angle: 50, angle_delta: 5
#> collision: ✔ FALSE, angle: 55, angle_delta: 2.5
#> collision: ✖ TRUE, angle: 52.5, angle_delta: 2.5
#> collision: ✔ FALSE, angle: 55, angle_delta: 1.25
#> collision: ✔ FALSE, angle: 53.75, angle_delta: 1.25
#> collision: ✖ TRUE, angle: 52.5, angle_delta: 1.25
#> collision: ✔ FALSE, angle: 53.75, angle_delta: 0.625
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.625
#> collision: ✔ FALSE, angle: 53.75, angle_delta: 0.3125
#> collision: ✔ FALSE, angle: 53.4375, angle_delta: 0.3125
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.3125
#> collision: ✔ FALSE, angle: 53.4375, angle_delta: 0.1562
#> collision: ✔ FALSE, angle: 53.2812, angle_delta: 0.1562
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.1562
#> collision: ✔ FALSE, angle: 53.2812, angle_delta: 0.0781
#> collision: ✔ FALSE, angle: 53.2031, angle_delta: 0.0781
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.0781
#> collision: ✔ FALSE, angle: 53.2031, angle_delta: 0.0391
#> collision: ✔ FALSE, angle: 53.1641, angle_delta: 0.0391
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.0391
#> collision: ✔ FALSE, angle: 53.1641, angle_delta: 0.0195
#> collision: ✔ FALSE, angle: 53.1445, angle_delta: 0.0195
#> collision: ✖ TRUE, angle: 53.125, angle_delta: 0.0195
#> collision: ✔ FALSE, angle: 53.1445, angle_delta: 0.0098

From the print-out above, we can see how the algorithm jumps back and forth, narrowing in on a solution around 53 degrees.

With that successful proof-of-concept, the following code runs the
algorithm with a smaller min_angle_delta = 1e-5 to achieve greater
precision. Instead of printing out the results of each iteration, the
simulation results and parameters are saved to sim_results_tracker and
sim_parameters_tracker, respectively, and are inspected below.

angle <- 90
previous_angle <- angle

angle_delta <- 10
epsilon <- 0.7
min_angle_delta <- 1e-5

collision <- TRUE

sim_results_tracker <- tibble()
sim_parameters_tracker <- tibble()

memo_bowl_sim <- memoise::memoise(run_bowl_simulation)

while (angle_delta >= min_angle_delta | collision) {
  sim_res <- memo_bowl_sim(angle = angle, step_size = 0.01)
  collision <- any(sim_res$collision)

  sim_results_tracker <- bind_rows(sim_results_tracker,
                                   sim_res %>% add_column(angle = angle))

  sim_parameters_tracker <- bind_rows(sim_parameters_tracker,
                                      tibble(angle, angle_delta,
                                             collision, epsilon))

  if (!collision) {
    previous_angle <- angle
    angle <- angle - angle_delta
  } else {
    angle_delta <- epsilon * angle_delta
    angle <- previous_angle
  }
}

The simulation took 89 steps. The plot below shows the angle and
angle_delta at each step, colored by whether there was a collision or
not.

sim_parameters_tracker %>%
  mutate(row_idx = row_number()) %>%
  pivot_longer(-c(row_idx, epsilon, collision)) %>%
  ggplot(aes(x = row_idx, y = value)) +
  facet_wrap(~ name, nrow = 2, scales = "free") +
  geom_point(aes(color = collision), size = 0.9) +
  scale_color_manual(values = c(light_grey, purple)) +
  labs(x = "iteration number",
       y = "value",
       title = "Simulation parameters")

The following plot shows each of the paths tried, again, coloring the
locations of collisions in purple.

sim_results_tracker %>%
  mutate(collision = ifelse(collision, "collision", "safe")) %>%
  ggplot() +
  geom_point(aes(x = x, y = y, color = collision),
             size = 0.1) +
  scale_color_manual(values = c(collision = purple,
                                safe = light_grey)) +
  coord_fixed() +
  theme(legend.position = "none") +
  labs(x = "x",
       y = "y",
       title = "Paths of the green ball",
       subtitle = "Points marked in purple were collisions with the red ball.")

Finally, we can find the approximated angle by taking the smallest angle
tried in the rounds of simulation that did not have any collisions.

smallest_angle <- sim_parameters_tracker %>%
  filter(collision == FALSE) %>%
  top_n(1, wt = -angle) %>%
  pull(angle) %>%
  unique()

The algorithm approximates the solution to be: 53.1301 degrees (0.9273
in radians).
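As a quick sanity check (an addition here, not part of the original post), the same number falls out of the collision rule used in did_balls_collide(): the path through the midpoint must keep the green ball's centre at least 2 units (the sum of the two radii) away from the red ball centre at (0, 2.5), and the perpendicular distance of that centre from a path at angle phi is 2.5 * sin(phi).

# Closed-form check of the simulated answer, using the geometry described above
asin(2 / 2.5) * 180 / pi
#> [1] 53.13010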

The simulation with this angle is shown in an animated plot below.

final_result <- sim_results_tracker %>%
  filter(angle == smallest_angle) %>%
  mutate(row_idx = row_number()) %>%
  filter(row_idx == 1)

bind_rows(
  final_result,
  final_result %>%
    mutate(x = -1 * x, y = -1 * y)
) %>%
  mutate(row_idx = row_number()) %>%
  ggplot() +
  geom_point(aes(x = x, y = y),
             color = green, size = 2) +
  geom_circle(aes(x0 = x, y0 = y, r = 1),
              fill = green, alpha = 0.2, size = 0) +
  geom_point(aes(x, y),
             data = tibble(x = 0, y = 2.5),
             color = red, size = 2) +
  geom_circle(aes(x0 = x, y0 = y, r = r),
              data = tibble(x = 0, y = 2.5, r = 1),
              fill = red, alpha = 0.2, size = 0) +
  geom_point(aes(x, y),
             data = tibble(x = 0, y = -2.5),
             color = red, size = 2) +
  geom_circle(aes(x0 = x, y0 = y, r = r),
              data = tibble(x = 0, y = -2.5, r = 1),
              fill = red, alpha = 0.2, size = 0) +
  coord_fixed() +
  labs(
    x = "x",
    y = "y",
    title = glue(
      "The tightest angle of the perfect bowl: {round(smallest_angle, 3)} deg."
    )) +
  transition_states(row_idx, transition_length = 2,
                    state_length = 0, wrap = FALSE) +
  ease_aes("sine-in-out")


Acknowledgements

Repetitive tasks were sped up using the
'memoise' package for
memoization. Plotting was accomplished using
'ggplot2',
'ggforce', and
'gganimate'.

To leave a comment for the author, please follow the link and comment on their blog: Posts | Joshua Cook.


gratia 0.4.1 released

Posted: 31 May 2020 06:00 AM PDT

[This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers.]

After a slight snafu related to the 1.0.0 release of dplyr, a new version of gratia is out and available on CRAN. This release brings a number of new features, including differences of smooths, partial residuals on partial plots of univariate smooths, and a number of utility functions, while under the hood gratia works for a wider range of models that can be fitted by mgcv.

Partial residuals

The draw() method for gam() and related models produces partial effects plots. plot.gam() has long had the ability to add partial residuals to partial plots of univariate smooths, and with the latest release draw() can now do so too.

df1 <- data_sim("eg1", n = 400, seed = 42)
m1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df1, method = "REML")
draw(m1, residuals = TRUE)
Partial plots of estimated smooth functions with partial residuals

If the estimated functions have the correct degree of wiggliness, the partial residuals should be approximately uniformly distributed about the estimated smooth.

Simulating data

The previous example demonstrated another new feature of the latest release; data_sim(). This is a reimplementation of mgcv::gamSim(), which is used to simulate data for testing GAMs. Data can be simulated from several widely-used functions that illustrate the power and capabilities of estimating smooth functions using penalised splines.

data_sim() returns simulated data in a tidy fashion, and all the various example test data sets are returned in a consistent format. Also, data from the example functions can be simulated from a number of probability distributions; currently the Gaussian, Poisson, and Bernoulli distributions are supported, but future versions will offer a wider range to simulate from.
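For instance (a hypothetical sketch, not from the original post; the dist argument and its accepted value here are assumptions based on the description above), Poisson-distributed responses for the same example might be simulated with something like:

# Sketch only: simulate the "eg1" example with Poisson-distributed responses.
# The dist argument and its value are assumed from the description above.
df_pois <- data_sim("eg1", n = 400, dist = "poisson", seed = 42)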

For example, the response data modelled above came from the following four functions used by Gu and Wahba:

df1 %>%
  mutate(id = seq_len(nrow(df1))) %>%
  select(id, x0:x3, f0:f3) %>%
  pivot_longer(x0:f3, names_sep = 1, names_to = c("var", "fun")) %>%
  pivot_wider(names_from = var, values_from = value) %>%
  ggplot(aes(x = x, y = f)) +
    geom_line() +
    facet_wrap(~ fun)
Gu and Wahba four term additive example functions

Difference smooths

When GAMs contain smooth-factor interactions, we often want to compare smooths between levels of the factor to determine how the smooth effects vary between groups. The new release contains a function difference_smooths() that implements this idea.

The mgcv example for factor-smooth interactions using the by mechanism can be simulated with data_sim(). The model fitted to the data contains a smooth of covariate x0 and a smooth of x2 for each level of the factor fac. Note that we need the parametric effect for fac because the by smooths are all centred about 0; the parametric term models the different group means.

df <- data_sim("eg4", n = 1000, seed = 42)
m2 <- gam(y ~ fac + s(x2, by = fac) + s(x0), data = df, method = "REML")

difference_smooths() returns differences between the smooth functions for all pairs of the levels of fac, plus a credible interval for the difference.

sm_diffs <- difference_smooths(m2, smooth = "s(x2)")
sm_diffs
# A tibble: 300 x 9
   smooth by    level_1 level_2  diff    se   lower upper      x2
 1 s(x2)  fac   1       2       0.797 0.536 -0.253   1.85 0.00170
 2 s(x2)  fac   1       2       0.846 0.500 -0.135   1.83 0.0118
 3 s(x2)  fac   1       2       0.896 0.467 -0.0190  1.81 0.0219
 4 s(x2)  fac   1       2       0.945 0.435  0.0929  1.80 0.0319
 5 s(x2)  fac   1       2       0.994 0.405  0.200   1.79 0.0420
 6 s(x2)  fac   1       2       1.04  0.378  0.302   1.78 0.0521
 7 s(x2)  fac   1       2       1.09  0.354  0.397   1.78 0.0622
 8 s(x2)  fac   1       2       1.14  0.332  0.485   1.79 0.0722
 9 s(x2)  fac   1       2       1.18  0.314  0.566   1.80 0.0823
10 s(x2)  fac   1       2       1.22  0.298  0.641   1.81 0.0924
# … with 290 more rows

There is a draw() method for objects returned by difference_smooths(), which will plot the pairwise differences

draw(sm_diffs)
Differences between estimated smooth functions

Note that these differences exclude differences in the group means and the differences between smooths are computed on the scale of the link function. A future version will allow for differences that include the group means.

Fitted values and residuals utility functions

Two new utility functions are in the current release: add_fitted() and add_residuals() add fitted values and residuals, respectively, to a data frame of the observations used to fit a model.

df1 %>%
  add_fitted(m1, value = ".fitted") %>%
  add_residuals(m1, value = ".resid")
# A tibble: 400 x 12
       y    x0     x1    x2    x3     f    f0    f1     f2    f3 .fitted .resid
 1  2.99 0.915 0.0227 0.909 0.402  1.62 0.529  1.05 0.0397     0    2.57  0.419
 2  4.70 0.937 0.513  0.900 0.432  3.25 0.393  2.79 0.0630     0    3.91  0.788
 3 13.9  0.286 0.631  0.192 0.664 13.5  1.57   3.53 8.41       0   12.9   1.03
 4  5.71 0.830 0.419  0.532 0.182  6.12 1.02   2.31 2.79       0    6.57 -0.859
 5  7.63 0.642 0.879  0.522 0.838 10.4  1.80   5.80 2.76       0   10.3  -2.67
 6  9.80 0.519 0.108  0.160 0.917 10.4  2.00   1.24 7.18       0    9.23  0.571
 7 10.4  0.737 0.980  0.520 0.798 11.3  1.47   7.10 2.75       0   11.2  -0.754
 8 12.8  0.135 0.265  0.225 0.503 11.4  0.821  1.70 8.90       0   11.0   1.77
 9 13.8  0.657 0.0843 0.282 0.254 11.1  1.76   1.18 8.20       0   11.5   2.28
10  7.51 0.705 0.386  0.504 0.667  6.50 1.60   2.16 2.74       0    6.71  0.792
# … with 390 more rows

Other changes

This release contains a number of other less-visible changes. gratia now handles models fitted by gamm4::gamm4() in more functions than before, while the utility functions link() and inv_link() now work for all families in mgcv, including the general family functions and those used for fitting location scale models.
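As a small illustration (added here; it assumes, per the description above, that link() and inv_link() return the link and inverse-link functions as plain R functions), the two helpers can be used to move between the response and link scales:

# Sketch under the assumption stated above
lf  <- link(binomial())      # the logit link
ilf <- inv_link(binomial())  # its inverse
ilf(lf(0.25))                # should recover 0.25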

To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R.


RSqlParser – tool to parse your SQL queries.

Posted: 30 May 2020 11:01 PM PDT

[This article was first published on R – FordoX, and kindly contributed to R-bloggers.]

A slow-performing query is a ticking bomb that can explode into a huge performance overhead in your application at any time, especially when there is load on the database servers. Knowing the ins and outs of your SQL queries is of utmost importance in defusing that bomb.

This is not the only scenario in which knowing your SQL is important. From your slow query logs, you might want to find the most frequently used tables and the times when a particular table gets the most hits, in order to do some analysis. This information can help you decide on a time to take dumps or fire ALTER queries on the table.

Say, for instance, you have a relatively large SQL query embedded in your application code, with tens of bind variables scattered here and there. For debugging purposes, you might want to replace those variables with chosen values and run the query in an SQL execution tool that does not support dynamic bind-variable replacement.

To cater to all these needs, I felt there was a need for an SQL parser in R and came up with this package, RSqlParser, inspired by Java's JSqlParser. This tool will come in handy for carrying out many kinds of analysis on SQL queries.

With this package, you can design your own free tool to identify the reasons for your poorly performing queries, or to address your various other use cases.

RSqlParser is a non-validating SQL parser. It expects syntactically correct SQL statements. It can be used to get various components of SQL statements.

Currently, it supports only SELECT statements.

library(RSqlParser)

Methods

There are currently 4 methods in the package:

get_all_bind_variables: Get the bind variables in the SQL.
get_all_select_cols_with_alias: Get the names of the selected columns in the SQL.
get_all_subqueries: Get the subqueries in the SQL.
get_all_tables_with_alias: Get the names of the tables, with their aliases, present in the SQL.
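A usage sketch (hypothetical; it assumes each method simply takes the SQL statement as a single character string):

library(RSqlParser)

# Hypothetical example query
sql <- "SELECT e.name AS emp_name, d.name AS dept_name
        FROM employee e, department d
        WHERE e.dept_id = d.id AND d.location = :loc"

get_all_tables_with_alias(sql)       # tables and their aliases (employee/e, department/d)
get_all_select_cols_with_alias(sql)  # selected columns and their aliases
get_all_bind_variables(sql)          # bind variables such as :loc
get_all_subqueries(sql)              # none in this example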

There are many more methods waiting to be released in upcoming versions of the package. Not only that: in upcoming versions, the package should also be able to parse all DML and DDL statements.

Till then, if you are facing any issues using the package, please let me know.

To leave a comment for the author, please follow the link and comment on their blog: R – FordoX.


Learning Shiny for Production

Posted: 30 May 2020 05:00 PM PDT

[This article was first published on Colin Fay, and kindly contributed to R-bloggers.]

Hey Shiny devs of the world! I'm leading a training in July about building a Shiny application for production. It will run over 10 half-day sessions, with everything happening remotely, meaning that you can attend from anywhere in the world 🎉.

During this online workshop, I'll share a lot of what we know about
building a robust Shiny application that is ready to be sent to
production: {golem}, prototyping, modules, reactivity, a little bit of
JavaScript for Shiny, and how and where to deploy your application once
it's ready.

I'm very excited about this training as it will cover a lot of the
things we've been working on for the past months, and will guide you
towards being a better Shiny programmer. And on top of that, we will be
working on our brand new online e-learning platform, which I've been
developing since the beginning of 2020 (and I'm sure you will enjoy
it!).

We still have a couple of tickets left for this session, so if you want to know more and register, please visit Learning Shiny for Production – Remote session.

See you there!

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.


Superspreading and the Gini Coefficient

Posted: 30 May 2020 03:00 PM PDT

[This article was first published on Theory meets practice..., and kindly contributed to R-bloggers.]

Abstract:

We look at superspreading in infectious disease transmission from a statistical point of view. We characterise heterogeneity in the offspring distribution by the Gini coefficient instead of the usual dispersion parameter of the negative binomial distribution. This allows us to consider more flexible offspring distributions.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from GitHub.

Motivation

The recent Science report on Superspreading during the COVID-19 pandemic by Kai Kupferschmidt has made the dispersion parameter \(k\) of the negative binomial distribution a hot quantity[1] in the discussions of how to determine effective interventions. This short blog post aims at understanding the math behind statements such as "Probably about 10% of cases lead to 80% of the spread" and at replicating them with computations in R.

Warning: This post reflects more my own learning process of what is superspreading than trying to make any statements of importance.

Superspreading

Lloyd-Smith et al. (2005) show that the 2002-2004 SARS-CoV-1 epidemic was driven by a small number of events where one case directly infected a large number of secondary cases – a so called superspreading event. This means that for SARS-CoV-1 the distribution of how many secondary cases each primary case generates is heavy tailed. More specifically, the effective reproduction number describes the mean number of secondary cases a primary case generates during the outbreak, i.e. it is the mean of the offspring distribution. In order to address dispersion around this mean, Lloyd-Smith et al. (2005) use the negative binomial distribution with mean \(R(t)\) and over-dispersion parameter \(k\) as a probability model for the offspring distribution. The number of offspring that case \(i\), which got infected at time \(t_i\), causes is given by \[
Y_{i} \sim \operatorname{NegBin}(R(t_i), k),
\]
s.t. \(\operatorname{E}(Y_{i}) = R(t_i)\) and \(\operatorname{Var}(Y_{i}) = R(t_i) (1 + \frac{1}{k} R(t_i))\). This parametrisation makes it easy to see that the negative binomial model has an additional factor \(1 + \frac{1}{k} R(t_i)\) for the variance, which allows it to have excess variance (aka over-dispersion) compared to the Poisson distribution, which has \(\operatorname{Var}(Y_{i}) = R(t_i)\). If \(k\rightarrow \infty\) we get the Poisson distribution, and the closer \(k\) is to zero, the larger the variance (i.e. the heterogeneity) of the distribution. Note the deliberate use of the effective reproduction number \(R(t_i)\) instead of the basic reproduction number \(R_0\) (as done in Lloyd-Smith et al. (2005)) in the model. This is to highlight that one is likely to observe clusters in the context of interventions and depletion of susceptibles.

That the dispersion parameter \(k\) is gaining epidemiological fame is a little surprising, because it is a parameter of a specific parametric model, and that model might be inadequate for the observed data. A secondary objective of this post is thus to focus more on describing the heterogeneity of the offspring distribution using classical statistical concepts such as the Gini coefficient.

Negative binomial distributed number of secondary cases

Let's assume \(k=0.45\) as done in Adam et al. (2020). This is a slightly higher estimate than the \(k=0.1\) estimate by Endo et al. (2020)[2] quoted in the Science article. We want to derive statements like "the x% most active spreaders infected y% of all cases" as a function of \(k\). The PMF of the offspring distribution with mean 2.5 and dispersion 0.45 looks as follows:

Rt <- 2.5
k  <- 0.45

# Evaluate on a large enough grid, so E(Y_t) is determined accurately enough.
# We also include -1 in the grid to get a point (0,0) needed for the Lorenz curve.
df <- data.frame(x = -1:250) %>% mutate(pmf = dnbinom(x, mu = Rt, size = k))

So we observe that 43% of the cases never manage to infect a secondary case, whereas some cases manage to generate more than 10 new cases. The mean of the distribution is checked empirically to equal the specified \(R(t)\) of 2.5:

sum(df$x * df$pmf)
## [1] 2.5
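The 43% figure quoted above, and the variance formula from the model description, can be checked the same way (these two checks are additions to the original post):

# P(Y = 0): proportion of cases that generate no secondary cases (approx. 0.43)
dnbinom(0, mu = Rt, size = k)

# Variance computed from the PMF grid should match R(t) * (1 + R(t) / k)
sum(df$x^2 * df$pmf) - Rt^2
Rt * (1 + Rt / k)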

Lloyd-Smith et al. (2005) define a superspreader to be a primary case, which generates more secondary cases than the 99th quantile of the Poisson distribution with mean \(R(t)\). We use this to compute the proportion of superspreaders in our distribution:

(superspreader_threshold <- qpois(0.99, lambda=Rt))
## [1] 7
(p_superspreader <- pnbinom(superspreader_threshold, mu=Rt, size=k, lower.tail=FALSE))
## [1] 0.09539277

So 10% of the cases will generate more than 7 new cases. To get to statements such as "10% generate 80% of the cases" we also need to know how many cases those 10% generate out of the 2.5 average.

# Compute proportion of the overall expected number of new cases
df <- df %>% mutate(cdf = pnbinom(x, mu = Rt, size = k),
                    expected_cases = x * pmf,
                    prop_of_Rt = expected_cases / Rt,
                    cum_prop_of_Rt = cumsum(prop_of_Rt))

# Summarise
info <- df %>%
  filter(x > superspreader_threshold) %>%
  summarise(expected_cases = sum(expected_cases), prop_of_Rt = sum(prop_of_Rt))
info
##   expected_cases prop_of_Rt
## 1       1.192786  0.4771144

In other words, the superspreaders generate (on average) 1.19 of the 2.5 new cases of a generation, i.e. 48%.

These statements can also be made without formulating a superspreader threshold by graphing the cumulative share of the distribution of primary cases against the cumulative share of secondary cases these generate. This is exactly what the Lorenz curve is doing. However, for outbreak analysis it appears clearer to graph the cumulative distribution in decreasing order of the number of offspring, i.e. following Lloyd-Smith et al. (2005) we plot the cumulative share as \(P(Y\geq y)\) instead of \(P(Y \leq y)\). This is a variation of the Lorenz curve, but allows statements such as "the x% of cases with the highest number of offspring generate y% of the secondary cases".

# Add information for plotting the modified Lorenz curve
df <- df %>%
  mutate(cdf_decreasing = pnbinom(x - 1, mu = Rt, size = k, lower.tail = FALSE)) %>%
  arrange(desc(x)) %>%
  mutate(cum_prop_of_Rt_decreasing = cumsum(prop_of_Rt))
# Plot the modified Lorenz curve as in Fig 1b of Lloyd-Smith et al. (2005)
ggplot(df, aes(x = cdf_decreasing, y = cum_prop_of_Rt_decreasing)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 1)) +
  xlab("Proportion of the infectious cases (cases with most secondary cases first)") +
  ylab("Proportion of the secondary cases") +
  scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, length = 6)) +
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, length = 6)) +
  geom_line(data = data.frame(x = seq(0, 1, length = 100)) %>% mutate(y = x),
            aes(x = x, y = y), lty = 2, col = "gray") +
  ggtitle(str_c("Scenario: R(t) = ", Rt, ", k = ", k))

The Gini coefficient of a discrete distribution with support on the non-negative integers can be computed with the standard formula \[
G = \frac{1}{2\mu} \sum_{y=0}^\infty \sum_{z=0}^\infty f(y) f(z) |y-z|,
\]
where \(f(y)\), \(y=0,1,\ldots\), denotes the PMF of the distribution and \(\mu=\sum_{y=0}^\infty y f(y)\) is the mean of the distribution. In our case \(\mu=R(t)\). From this we get

# Gini index for a discrete probability distribution
gini_coeff <- function(df) {
  mu <- sum(df$x * df$pmf)
  sum <- 0
  for (i in 1:nrow(df)) {
    for (j in 1:nrow(df)) {
      sum <- sum + df$pmf[i] * df$pmf[j] * abs(df$x[i] - df$x[j])
    }
  }
  return(sum / (2 * mu))
}

gini_coeff(df)
## [1] 0.704049

A plot of the relationship between the dispersion parameter and the Gini index, given a fixed value of \(R(t)=2.5\), looks as follows
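(The figure appears in the original post without the code that generated it; a minimal sketch of such a plot, reusing gini_coeff() from above with an arbitrary grid of k values and assuming the tidyverse is attached as elsewhere in the post, could look like this.)

# Sketch: Gini coefficient of the negative binomial offspring distribution as a
# function of k, for fixed R(t) = 2.5. The k grid is an arbitrary choice.
k_grid <- c(0.05, 0.1, 0.2, 0.45, 1, 2, 5, 10, 50)

gini_by_k <- map_dfr(k_grid, function(k) {
  df_k <- data.frame(x = 0:250) %>% mutate(pmf = dnbinom(x, mu = Rt, size = k))
  tibble(k = k, gini = gini_coeff(df_k))
})

ggplot(gini_by_k, aes(x = k, y = gini)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  labs(x = "dispersion parameter k", y = "Gini coefficient",
       title = str_c("Gini coefficient of the offspring distribution, R(t) = ", Rt))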

We see that the Gini index converges from above to the Gini index of the Poisson distribution with mean \(R(t)\). In our case this limit is

gini_coeff( data.frame(x=0:250) %>% mutate(pmf = dpois(x, lambda=Rt)))
## [1] 0.3475131

Red Marble Toy Example

Consider the toy example offspring distribution used by Christian Drosten in his Coronavirus Update podcast episode 44 on COVID-19 superspreading (in German). The described hypothetical scenario translates to an offspring distribution where a primary case generates either 1 (with probability 9/10) or 10 (with probability 1/10) secondary cases:

# Offspring distribution
df_toyoffspring <- data.frame(x = c(1, 10), pmf = c(9/10, 1/10))

# Hypothetical outbreak with 10000 cases from this offspring distribution
y_obs <- sample(df_toyoffspring$x, size = 10000, replace = TRUE, prob = df_toyoffspring$pmf)

# Fit the negative binomial distribution to the observed offspring distribution.
# Note: it would be better to fit the PMF directly instead of to the hypothetical
# outbreak data.
(fit <- MASS::fitdistr(y_obs, "negative binomial"))
##       size          mu
##   1.69483494   1.90263640
##  (0.03724779) (0.02009563)
# Note: different parametrisation of the k parameter
(k.hat <- 1/fit$estimate["size"])
##     size
## 0.590028

In other words, when fitting a negative binomial distribution to these data (probably not a good idea) we get a dispersion parameter of 0.59.

The Gini coefficient allows for a more sensible description of offspring distributions that are clearly not negative binomial.

gini_coeff(df_toyoffspring) 
## [1] 0.4263158

Discussion

The effect of superspreaders underlines the stochastic nature of the dynamics of a person-to-person transmitted disease in a population. The dispersion parameter \(k\) is conditional on the assumption of a given parametric model for the offspring distribution (negative binomial). The Gini index is an alternative characterisation to measure heterogeneity. However, in both cases the parameters are to be interpreted together with the expectation of the distribution. Estimation of the dispersion parameter is orthogonal to the mean in the negative binomial distribution, and it's straightforward to also get confidence intervals for it. This is less straightforward for the Gini index.

A heavy tailed offspring distribution can make the disease easier to control by targeting intervention measures to restrict superspreading (Lloyd-Smith et al. 2005). The hope is that such interventions are "cheaper" than interventions which target the entire population of infectious contacts. However, the success of such a targeted strategy also depends on how large the contribution of superspreaders really is. Hence, some effort is needed to quantify the effect of superspreaders. Furthermore, the above treatment also underlines that heterogeneity can be a helpful feature to exploit when trying to control a disease. Another aspect of such heterogeneity, namely its influence on the threshold of herd immunity, has recently been investigated by my colleagues at Stockholm University (Britton, Ball, and Trapman 2020).

Literature

Adam, DC, P Wu, J Wong, E Lau, T Tsang, S Cauchemez, G Leung, and B Cowling. 2020. "Clustering and Superspreading Potential of Severe Acute Respiratory Syndrome Coronavirus 2 (Sars-Cov-2) Infections in Hong Kong." Research Square. https://doi.org/10.21203/rs.3.rs-29548/v1.

Britton, T, F Ball, and P Trapman. 2020. "The Disease-Induced Herd Immunity Level for Covid-19 Is Substantially Lower Than the Classical Herd Immunity Level." https://arxiv.org/abs/2005.03085.

Endo, A, Centre for the Mathematical Modelling of Infectious Diseases COVID-19 Working Group, S Abbott, AJ Kucharski, and S Funk. 2020. "Estimating the Overdispersion in Covid-19 Transmission Using Outbreak Sizes Outside China [Version 1; Peer Review: 1 Approved, 1 Approved with Reservations]." Wellcome Open Res. https://doi.org/10.12688/wellcomeopenres.15842.1.

Lloyd-Smith, J. O., S. J. Schreiber, P. E. Kopp, and W. M. Getz. 2005. "Superspreading and the Effect of Individual Variation on Disease Emergence." Nature 438 (7066): 355–59. https://doi.org/10.1038/nature04153.


  1. To be added to the list of characterising quantities such as doubling time, reproduction number, generation time, serial interval, …↩

  2. Lloyd-Smith et al. (2005) estimated \(k=0.16\) for SARS-CoV-1.↩

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....


Mimic Excel’s Conditional Formatting in R

Posted: 30 May 2020 02:14 PM PDT

[This article was first published on triKnowBits, and kindly contributed to R-bloggers.]

The DT package is an interface between R and the JavaScript DataTables library (RStudio DT documentation). In Example 3 (at this page) they show how to heatmap-format a table. This post modifies the example to

  1. format each column individually
  2. shade in green rather than red
  3. use base R syntax rather than piping
  4. omit the extra accoutrements of the displayed table (from the answer to this stackoverflow post), except
  5. include a title.
Here we generate data similar to that in Example 3, but with average values growing by column:
set.seed(12345)
df = as.data.frame(
  cbind(round(rnorm(10, mean = 0), 3), 
  round(rnorm(10, mean = 4), 3), 
  round(rnorm(10, mean = 8), 3), 
  round(rnorm(10, mean = 16), 3), 
  round(rnorm(10, mean = 32), 3), 
  sample(0:1, 10, TRUE)))
Using the code in the example — modified to green — the darker values naturally appear in columns V4 and V5.

But that's not what we want.
For each column to have its own scale, simply apply RStudio's algorithm to each column of df in a loop. The trick to notice is that formatStyle wants a datatable object as its first argument and produces a datatable object as its result. Therefore, start off with a plain-Jane datatable and successively format each column, saving the result each time, almost like building a ggplot. At the end, view the final result.
# Start with a (relatively) plain, unformatted datatable object
dt <- DT::datatable(df,
                    options = list(dom = 't', ordering = FALSE),
                    caption = "Example 3 By Column")
# Loop through the columns, formatting according to that column's distribution
for (j in seq_along(df)) {
  # Create breaks for shading column values high to low
  brks <- stats::quantile(x <- df[[j]], probs = seq(.05, .95, .05), na.rm = TRUE)
  # Create shades of green for backgrounds
  y <- round(seq(255, 40, length.out = length(brks) + 1), 0)
  clrs <- paste0("rgb(", y, ", 255,", y, ")")
  # Format cells in j-th column
  dt <- DT::formatStyle(dt, j, backgroundColor = DT::styleInterval(brks, clrs))
}
dt
Actuaries in the crowd might recognize the image at the top of the post as the table of link ratios from the GenIns dataset in the ChainLadder package. There do not appear to be any distinctive trends in the ratios by age.
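(For the curious, the ratios behind that image can be reproduced along these lines; this aside is an addition and assumes the ChainLadder package with its GenIns triangle and ata() age-to-age factor helper.)

library(ChainLadder)
# Age-to-age (link) ratios of the GenIns run-off triangle
ata(GenIns)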

To leave a comment for the author, please follow the link and comment on their blog: triKnowBits.


drat 0.1.6: Rewritten macOS binary support

Posted: 30 May 2020 12:01 PM PDT

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]


A new version of drat arrived on CRAN overnight, once again taking advantage of the fully automated process available for such packages with few reverse depends and no open issues. As we remarked at the last release fourteen months ago when we scored the same nice outcome: Being a simple package can have its upsides…

This release is mostly the work of Felix Ernst who took on what became a rewrite of how binary macOS packages are handled. If you need to distribute binary packages for macOS users, this may help. Two more small updates were made, see below for full details.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users, because repositories with marked releases are the better way to distribute code.

As your mother told you: Friends don't let friends install random git commit snapshots. Rolled-up releases it is. drat is easy to use, documented by five vignettes and just works.
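For readers who have not used it before, a minimal sketch of the typical workflow (the package name, account, and local path below are placeholders, not from this post) looks roughly like this:

## Maintainer side: insert a built package into a local checkout of your drat repo
drat::insertPackage("myPackage_0.1.0.tar.gz", repodir = "~/git/drat")

## User side: add the drat repository and install from it
drat::addRepo("someAccount")    # placeholder GitHub account
install.packages("myPackage")   # placeholder package name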

The NEWS file summarises the release as follows:

Changes in drat version 0.1.6 (2020-05-29)

  • Changes in drat functionality

    • Support for the various (current) macOS binary formats was rewritten (Felix Ernst in #89 fixing #88).

    • Travis CI use was updated to R 4.0.0 and bionic (Dirk).

    • A drat repo was added to the README (Thomas Fuller in #86)

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .
