[R-bloggers] A brief primer on Variational Inference (and 14 more aRticles)

A brief primer on Variational Inference
81st TokyoR Meetup Roundup: A Special Session in {Shiny}!
The Mysterious Case of the Ghost Interaction
Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively
Enlarging the eBook supply
What’s new in DALEX v 0.4.9?
Extracting basic Plots from Novels: Dracula is a Man in a Hole
(Re)introducing skimr v2 – A year in the life of an open source R project
An Amazon SDK for R!?
Sept 2019: “Top 40” New R Packages
Mocking is catching
Any one interested in a function to quickly generate data with many predictors?
Dogs of New York
Spelunking macOS ‘ScreenTime’ App Usage with R
The Chaos Game: an experiment about fractals, recursivity and creative coding

Posted: 30 Oct 2019 05:00 AM PDT

[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let's start unpacking!

Preliminaries

Bayes' rule states that

$\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,$

where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.¹ Bayes' rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral — the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much easier and generally faster than the latter, and so our goal will be to reframe this integration problem as one of optimization.

Variational objective

We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:

$q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,$

where $\text{KL}(. \lvert \lvert.)$ denotes the Kullback-Leibler divergence:

$\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .$

We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:

$\begin{aligned} \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace , \end{aligned} %]]>$

where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:

$\begin{aligned} \text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em] &= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace . \end{aligned} %]]>$

We can expand the joint probability to get more insight into this equation:

$\begin{aligned} \text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace . \end{aligned} %]]>$

This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbier, & McAuliffe, 2017).

ELBO stands for evidence lower bound. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:

$\begin{aligned} \text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em] \text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em] \text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace , \end{aligned} %]]>$

since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see this and this blog post.

Why variational?

Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the single best value that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example this blog post). In our setting, we do not want to find a single best value but rather a single best function. To do this, we can use variational calculus from which variational inference derives its name (Bishop, 2006, p. 462).

A function takes an input value and returns an output value. We can define a functional which takes a whole function and returns an output value. The entropy of a probability distribution is a widely used functional:

$\text{H}[p] = \int p(x) \, \text{log } p(x) \mathrm{d} x \enspace ,$

which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that minimizes the ELBO, which is a functional.

In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.

Mean-field variational family

A frequently used approximation is to assume that the latent variables $z_j$ for $j = \{1, \ldots, m\}$ are mutually independent, each governed by their own variational density:

$q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .$

Note that this mean-field variational family cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.

We start from our definition of the ELBO and apply the mean-field assumption:

$\begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em] &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace . \end{aligned} %]]>$

In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:

$\begin{aligned} \text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em] &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em] &\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace . \end{aligned} %]]>$

One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution

$\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,$

where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:

$\begin{aligned} \text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em] &= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em] &= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace . \end{aligned} %]]>$

Thus, maximizing the ELBO with respect to $q_j(z_j)$ is minimizing the Kullback-leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; it is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:

$\begin{aligned} q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em] &= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace , \end{aligned} %]]>$

see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_i)$ and then cycle through them, updating them conditional on the updates of the other. Such a procedure is known as Coordinate Ascent Variational Inference (CAVI). Further, note that

$p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,$

which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks a lot like Gibbs sampling, which we discussed in detail in a previous blog post. In the next section, we implement CAVI for a simple linear regression problem.

Application: Linear regression

In a previous blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:

$\begin{aligned} y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em] \beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em] \sigma^2 &\propto \frac{1}{\sigma^2} \enspace , \end{aligned} %]]>$

where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys' prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification:

$\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .$

As a heads up, we have to do a surprising amount of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.

Variational density for $\sigma^2$

Our optimal variational density $q^\star(\sigma^2)$ is given by:

$q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .$

To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:

$\begin{aligned} p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em] &= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace , \end{aligned} %]]>$

which is proportional to an inverse Gamma distribution. Moving on, we exploit the linearity of the expectation and write:

$\begin{aligned} q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em] &= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em] &= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace . \end{aligned} %]]>$

This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:

$q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .$

Note that this quantity depends on $\beta$. In the next section, we derive the variational density for $\beta$.

Variational density for $\beta$

Our optimal variational density $q^\star(\beta)$ is given by:

$q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,$

and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:

$\begin{aligned} p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em] &=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em] &\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace , \end{aligned} %]]>$

where we have "completed the square" (see also this blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:

$\begin{aligned} q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em] &= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace , \end{aligned} %]]>$

which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:

$q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace ,$

Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.

To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian distribution and $\sigma^2$ a Jeffreys's prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.

Removing expectations

Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, for the variational density $q^\star(\sigma^2)$ we write:

$\begin{aligned} \mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em] &= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace . \end{aligned} %]]>$

Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:

$\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2 = \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,$

the expectation becomes:

$\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .$

For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. Note that since $\sigma^2$ follows an inverse Gamma distribution, $1 / \sigma^2$ follows a Gamma distribution which has mean:

$\begin{aligned} \mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em] &= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace . \end{aligned} %]]>$

Monitoring convergence

The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:

$\begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em] &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{p(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em] &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace . \end{aligned} %]]>$

Let's take a deep breath and tackle the second term first:

$\begin{aligned} \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\sigma^2\tau^2}{\sigma^2_\beta} + \frac{\frac{1}{\sigma^2\tau^2} \beta^2}{\frac{1}{\sigma^2_\beta} (\beta - \mu_\beta)^2}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[\text{log}\frac{\sigma^2\tau^2}{\sigma^2_\beta} + \frac{\sigma^2_\beta + \mu_\beta^2}{\sigma^2\tau^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace . \end{aligned} %]]>$

Note that there are three expectations left. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{p(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte carlo integration:

$\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\text{log } q(\sigma^2)}_{\sigma^2 \, \sim \, q(\sigma^2)} \enspace ,$

We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:

$\begin{aligned} \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma \mathrm{d}\beta \\[.5em] &=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma \mathrm{d}\beta \\[.5em] &= \frac{n}{4} \text{log}\left(2\pi\right)\int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \, \mathrm{d}\beta\int q(\sigma^2) \, \, \text{log} \left(\sigma^2\right)\frac{1}{\sigma^2} \, \mathrm{d}\sigma \enspace . \end{aligned} %]]>$

Since we have solved a similar problem already above, we evaluate the expecation with respect to $q(\beta)$ analytically:

$\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .$

In the next section, we implement the algorithm for our linear regression problem in R.

Implementation in R

Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:

library('MCMCpack')     #' Computes the ELBO for the linear regression example  #'   #' @param y univariate outcome variable  #' @param x univariate predictor variable  #' @param beta_mu mean of the variational density for \beta  #' @param beta_sd standard deviation of the variational density for \beta  #' @param nu parameter of the variational density for \sigma^2  #' @param nr_samples number of samples for the Monte carlo integration  #' @returns ELBO  compute_elbo <- function(y, x, beta_mu, beta_sd, nu, tau2, nr_samples = 1e4) {    n <- length(y)    sum_y2 <- sum(y^2)    sum_x2 <- sum(x^2)    sum_yx <- sum(x*y)        # Takes a function and computes its expectation with respect to q(\beta)    E_q_beta <- function(fn) {      integrate(function(beta) {        dnorm(beta, beta_mu, beta_sd) * fn(beta)      }, -Inf, Inf)$value    }        # Takes a function and computes its expectation with respect to q(\sigma^2)    E_q_sigma2 <- function(fn) {      integrate(function(sigma) {        dinvgamma(sigma^2, (n + 1)/2, nu) * fn(sigma)      }, 0, Inf)$value    }            # Compute expectations of log p(\sigma^2)    E_log_p_sigma2 <- E_q_sigma2(function(sigma) log(1/sigma^2))        # Compute expectations of log p(\beta \mid \sigma^2)    E_log_p_beta <- (      log(tau2 / beta_sd^2) * E_q_sigma2(function(sigma) log(sigma^2)) +      (beta_sd^2 + tau2) / (tau2) * E_q_sigma2(function(sigma) 1/sigma^2)    )        # Compute expectations of the log variational densities q(\beta)    E_log_q_beta <- E_q_beta(function(beta) dnorm(beta, beta_mu, beta_sd, log = TRUE))    # E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu))) # fails        # Compute expectations of the log variational densities q(\sigma^2)    sigma2 <- rinvgamma(nr_samples, (n + 1)/2, nu)    E_log_q_sigma2 <- mean(log(dinvgamma(sigma2, (n + 1)/2, nu)))            # Compute the expected log likelihood    E_log_y_b <- sum_y2 - 2*sum_yx*beta_mu + (beta_sd^2 + beta_mu^2)*sum_x2    E_log_y_sigma2 <- E_q_sigma2(function(sigma) log(sigma^2) * 1/sigma^2)    E_log_y <- n/4 * log(2*pi) * E_log_y_b * E_log_y_sigma2            # Compute and return the ELBO    ELBO <- E_log_y + E_log_p_beta + E_log_p_sigma2 - E_log_q_beta - E_log_q_sigma2    ELBO  }

The function below implements coordinate ascent mean-field variational inference for our simple linear regression problem. Recall that the variational parameters are:

$\begin{aligned} \nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em] \mu_\beta &= \frac{\sum_{i=1}^N y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em] \sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace . \end{aligned} %]]>$

The following function implements the iterative updating of these variational parameters until the ELBO has converged.

#' Implements CAVI for the linear regression example  #'   #' @param y univariate outcome variable  #' @param x univariate predictor variable  #' @param tau2 prior variance for the standardized effect size  #' @returns parameters for the variational densities and ELBO  lmcavi <- function(y, x, tau2, nr_samples = 1e5, epsilon = 1e-2) {    n <- length(y)    sum_y2 <- sum(y^2)    sum_x2 <- sum(x^2)    sum_yx <- sum(x*y)        # is not being updated through variational inference!    beta_mu <- sum_yx / (sum_x2 + 1/tau2)        res <- list()    res[['nu']] <- 5    res[['beta_mu']] <- beta_mu    res[['beta_sd']] <- 1    res[['ELBO']] <- 0        j <- 1    has_converged <- function(x, y) abs(x - y) < epsilon    ELBO <- compute_elbo(y, x, beta_mu, 1, 5, tau2, nr_samples = nr_samples)        # while the ELBO has not converged    while (!has_converged(res[['ELBO']][j], ELBO)) {            nu_prev <- res[['nu']][j]      beta_sd_prev <- res[['beta_sd']][j]            # used in the update of beta_sd and nu      E_qA <- sum_y2 - 2*sum_yx*beta_mu + (beta_sd_prev^2 + beta_mu^2)*(sum_x2 + 1/tau2)            # update the variational parameters for sigma2 and beta      nu <- 1/2 * E_qA      beta_sd <- sqrt(((n + 1) / E_qA) / (sum_x2 + 1/tau2))            # update results object      res[['nu']] <- c(res[['nu']], nu)      res[['beta_sd']] <- c(res[['beta_sd']], beta_sd)      res[['ELBO']] <- c(res[['ELBO']], ELBO)            # compute new ELBO      j <- j + 1      ELBO <- compute_elbo(y, x, beta_mu, beta_sd, nu, tau2, nr_samples = nr_samples)    }        res  }

Let's run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values for $\lvert \beta \rvert$ larger than two standard deviations ($0.50$) receive about $0.68$ prior probability.

gen_dat <- function(n, beta, sigma) {    x <- rnorm(n)    y <- 0 + beta*x + rnorm(n, 0, sigma)    data.frame(x = x, y = y)  }     set.seed(1)  dat <- gen_dat(100, 0.30, 1)     mc <- lmcavi(dat$y, dat$x, tau2 = 0.50^2)  mc

## $nu  ## [1]  5.00000 88.17995 45.93875 46.20205 46.19892 46.19895  ##   ## $beta_mu  ## [1] 0.2800556  ##   ## $beta_sd  ## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512  ##   ## $ELBO  ## [1]       0.0000 -297980.0495     493.4807    -281.4578    -265.1289  ## [6]    -265.3197

From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.

Comparison with Stan

Whenever one goes down a rabbit hole of calculations, it is good to sanity check one's results. Here, we use Stan's variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a "black-box" way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors arrived by Markov chain Monte carlo. The simple linear regression model in Stan is:

data {    int n;    vector[n] y;    vector[n] x;    real tau;  }     parameters {    real b;    real sigma;  }     model {    target += -log(sigma);    target += normal_lpdf(b | 0, sigma*tau);    target += normal_lpdf(y | b*x, sigma);  }

We use Stan's black-box variational inference scheme:

library('rstan')     # save the above model to a file and compile it  # model <- stan_model(file = 'regression.stan')     stan_dat <- list('n' = nrow(dat), 'x' = dat$x, 'y' = dat$y, 'tau' = 0.50)  fit <- vb(    model, data = stan_dat, output_samples = 20000, adapt_iter = 10000,    init = list('b' = 0.30, 'sigma' = 1), refresh = FALSE, seed = 1  )

This gives similar estimates as ours:

fit

## Inference for Stan model: variational-regression.  ## 1 chains, each with iter=20000; warmup=0; thin=1;   ## post-warmup draws per chain=20000, total post-warmup draws=20000.  ##   ##       mean   sd 2.5%  25%  50%  75% 97.5%  ## b     0.28 0.13 0.02 0.19 0.28 0.37  0.54  ## sigma 0.99 0.09 0.82 0.92 0.99 1.05  1.18  ## lp__  0.00 0.00 0.00 0.00 0.00 0.00  0.00  ##   ## Approximate samples were drawn using VB(meanfield) at Wed Oct 30 13:20:01 2019.

## We recommend genuine 'sampling' from the posterior distribution for final inferences!

Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems that it has converged to a different local optimum for each run. (For seed = 3, Stan gives completely nonsensical results). Stan warns that the algorithm is experimental and may be unstable, and it is probably wise to not use it in production.

Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed-form (see the Post Scriptum), we check our results against exact inference using Markov chain Monte carlo by visual inspection.

fit <- sampling(model, data = stan_dat, iter = 8000, refresh = FALSE, seed = 1)

The Figure below overlays our closed-form results to the histogram of posterior samples obtained using Stan.

plot of chunk unnamed-chunk-10

Note that the posterior variance of $\beta$ is slightly overestimated when using our variational scheme. This is in contrast to the fact that variational inference generally underestimates variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear by increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less then a tenth of a second!

Conclusion

In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic "black-box" algorithms which avoid problem-specific calculations hold great promise.

We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan's variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (2017).

I would like to thank Don van den Bergh for helpful comments on this blog post.

Post Scriptum

Normal-inverse-gamma Distribution

The posterior distribution is a Normal-inverse-gamma distribution:

$p(\beta, \sigma^2 \mid \mathbf{y}) = \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,$

where

$\begin{aligned} \mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i + \frac{1}{\tau^2}} \\[.5em] \lambda &= \sum_{i=1}^n x_i + \frac{1}{\tau^2} \\[.5em] \alpha &= \frac{n + 1}{2} \\[.5em] \gamma &= \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i + \frac{1}{\tau^2}}\right)\right) \enspace . \end{aligned} %]]>$

Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, contrary to what we assume in our variational inference scheme.

References

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic variational inference in Stan. In Advances in Neural Information Processing Systems (pp. 568-576).
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1), 430-474.

Footnotes

The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbier, and McAuliffe (2017), and so I use their (machine learning) notation.

To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

81st TokyoR Meetup Roundup: A Special Session in {Shiny}!

Posted: 29 Oct 2019 05:00 PM PDT

[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As another sweltering summer ends, another TokyoR Meetup! With global
warming in full swing and it still being around 30 degrees at the end of
September, this month's meetup was held at DIP
Corporation, an personnel/recruitment
services company, in their headquarters in Roppongi, Tokyo. This month's
session was another special-themed session involving Shiny
apps!

In line with my previous round up posts:

I will be going over around half of all the talks. Hopefully, my efforts
will help spread the vast knowledge of Japanese R users to the wider R
community. Throughout I will also post helpful blog posts and links from
other sources if you are interested in learning more about the topic of
a certain talk. You can follow Tokyo.R by searching for the
#TokyoR hashtag on Twitter.

Anyways…

Let's get started!

BeginneR Session

As with every TokyoR meetup, we began
with a set of beginner user focused talks:

Main Talks

hoxo_m: Asynchronous Programming for Shiny!

Slides

@hoxo_m of HOXO-M Inc. talked about asynchronous programming with
Shiny. Starting off with an introduction into the history of single and
multi-threading in both R and how the growing popularity of Shiny has
lead to a demand for multithreadedness to cater to the multitude of
users using a Shiny app at once!

The main package that facilitates this in both R and Shiny is the
{future} package which allows one to be able to evaluate R expressions
asynchronously (in parallel or concurrently on a cluster). By using
{future} in a Shiny context you can shift resource-intensive tasks (ex.
grabbing data from an API) activated by one user into another process
and free up time for other users' tasks and reduce their waiting time.

The plan() function allows you to choose from a variety of options for
launching/attaching R processes. The choices are multisession,
multicore, and multiprocess. You can read more about it
here.

There's not a whole lot you need to do to grab the results from another
process as all the render_*() are able to take "promise" objects. As a
reminder, a "promise" in this context is an object that takes a result
from an asynchronous process that happens later/slightly later. A
"promise" object takes the result from a {future} code result and it
will wait until a result appears from another process finishes running
the code.

Another important component of the asynchronous framework in R is the
{promises} package. It's this package that allows for the actual
abstractions within your R code for asynchronous programming such as the
"promise pipe", %...>%! You insert whatever long task code you have
into the future() function then use the "promise pipe" to pass it to
the rest of the code. As a future/promise object is not a data frame you
can't pass filter() or other functions to it, so you have to pass the
"promise pipe" first before other regular functions can be run.

In a Shiny context, you can't use reactives inside a future() function
so one needs to assign a reactive as an object before the future()
code and then pass that object into the function.

You also need to carefully think about WHERE (as in which process)
the code is running. For example in the code below, the results are the
same in both the top and bottom code. The code in black is done by the
main process while the code in green is done by another process.

Although the above code works in both cases, for some functions such as
plot() and print() can be run in another process and but their
output can not be returned by the main process. The solution is to
use the "promise pipe" to make sure that plot()/print() is being run
by the main process instead. On the other hand you can still use
promises within observe*() and eventReactive*()/reactive() code,
you just have to remember to use the "promise pipes".

Np_Ur: A Simple Shiny App in 30 Minutes!

Slides

@Np_Ur is known in the Japan community for his love of {shiny}, he
even wrote a book on it called "Building Web Applications in R with
Shiny". This presentation was largely a demonstration as @Np_Ur
explained, from the ground up, a lot of the basic functions that can get
you a working Shiny app in the space of 30 minutes! From creating a
navigation bar via navbarPage() to creating different drop-down
options for customizing a plot and talking about extra functionality
from other packages such as {DT} and {shinycssloaders}, @Np_Ur took us
through the code and showed us completed Shiny apps for each step of the
way.

I recommend going through his slides (also hosted on Shiny) as well as
checking out the code for each of the Shiny apps he made for all
different functionalities he talked about by clicking on the link below!

Code for all Shiny Apps

kashitan: Making {shiny} Maps with {leaflet}!

Slides

@kashitan presented some tips (that you normally won't see in
books/articles) for using {leaflet} with Shiny for map applications! He
took us through four different functions that he found very useful for
making {leaflet} maps with Japan census data.

The first function: updateSelectInput() allows you to update a
drop-down menu with new values after selecting a certain input. In
@kashitan's case using the Japan census Shiny app, he wanted to be
able to update the choices of the city/district after choosing a
different prefecture on the app. Using the updateSelectInput()
function the list of choices from the drop down menu updates to the
city/districts of the newly chosen prefecture!

You can check out the documentation
here.

The second function: leafletProxy() allows you to customize a
{leaflet} map even after it has been rendered by Shiny. For @kashitan
this was necessary as he didn't want the map's active zoom level and
center coordinates to change even after choosing a new prefecture to
look at.

The third function: fitBounds() allows you to set the boundaries of
the map. For @kashitan similar to the previous function shown, he
wanted the bounds to the view, following a change in the city/district,
to always be within a certain bounding box.

The last function: input${id}shape_click shows you information about
the polygon shape of the map you just clicked. {leaflet}'s "click" event
currently only shows you the coordinate and id values from this
function.

okiyuki: Software Engineering for Shiny!

Slides

@okiyuki presented on the various R packages used for the software
engineering that supports Shiny apps.

{memoise}: Caches data when
certain function is run for the first time (useful for dashboard
shiny apps where similar use cases can be predicted)
{pool}: Easy database connection
management in an interactive context. After inserting/accessing SQL
database connection info, the connection is closed when app itself
closes!
{shinyProxy}: Deploy Shiny apps with
LDAP authentication/authorization and TLS protocols for an
enterprise context. It uses Docker so that each user is using the
app in their own single Docker container.
{shinyloadtest}: Helps
analyze load tests and Shiny app performance with multiple users.

@okiyuki also talked about some of his personal struggles and pitfalls
that he has come across when building Shiny apps at work. These include:

Deployed on ShinyServer but there was an error! Even though it was
working fine a minute ago!
- Solution: Use {Shinytest} and {testthat} to test deployment
  and other actions in Shiny
Unknowingly/unintentionally using functions from a different
namespace
- Solution: Make sure to explicitly :: your functions
- Also restart your app via altering restart.txt in your Shiny
  app directory

An extra section talked about various helpful packages for Shiny app
aesthetics such as:

{shinycssloaders}:
{shinyace}:
dreamRs' suite of Shiny packages such
as {shinyWidgets}
I introduced some of dreamRs' packages in my useR!2019 blog post
here.
Various packages to create Shiny Templates: {bs4dash},
{shinymaterial}, {fullpage}, {shiny.semantic}

LTs

igjit: Edit Your Photos with {shinyroom}!

Slides

You might remember from a few months back, @igjit presented on "RAW
image processing with R" (TokyoR
#79).
Continuing where he left off he decided to create a photo-editing UI
using the power of {shiny}. Motivated by comments following the previous
presentation, @igjit decided to base it on "Adobe Lightroom", and call
it the {shinyroom} package. You can take a look at it
here.

In terms of actually building the Shiny app he used the {imager} package
for the actual photo editing functionality while {golem} was used as the
package framework for the app. For appearances @igjit used the
{shinythemes} package

During the course of building the app, @igjit came across a peculiar
problem concerning the screen when the Shiny app was busy. By default, a
particular panel becomes slightly opaque when the server is busy doing
stuff in the background but this is annoying when you are working on
editing images. To get around this problem, @igjit created another
package called {shinyloadermessage} so that instead of the screen
graying out a small message will appear instead.

flaty13: Reproducible Shiny with {shinymeta}!

Slides

@flaty13 talked about the recently made public {shinymeta} package and
reproducibility with Shiny apps. This is a topic that has taken a while
to develop due to the complexity of the issue, where the end goal was to
find a way to document and describe the actions of users who interacted
with very dynamic Shiny apps with many different features. With the
{shinymeta} package you can now download R scripts highlight the steps
you took in interacting with the app.

The next step that is currently in development is to output an .RMD
report among a number of other features as the package is still in the
experimental phase. See the resources below for more details, especially
Joe Cheng's keynote for all the stuff under-the-hood that's making this
exciting new development possible!

Other talks

Food, Drinks, and Conclusion

TokyoR happens almost monthly and it's a great way to mingle with
Japanese R users as it's the largest regular meetup here in Japan. The
next meetup will be on October
26 and I will also be one of
the presenters!

Talks in English are also welcome so if you're ever in Tokyo come join
us!

To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

The Mysterious Case of the Ghost Interaction

Posted: 29 Oct 2019 05:00 PM PDT

[This article was first published on R on I Should Be Writing: Est. 1641, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

_{This spooky post was written in collaboration with Yoav Kessler (@yoav_kessler) and Naama Yavor (@namivor).}.

Experimental psychology is moving away from repeated-measures-ANOVAs, and towards linear mixed models (LMM¹). LMMs have many advantages over rmANOVA, including (but not limited to):

Analysis of single trial data (as opposed to aggregated means per condition).
Specifying more than one random factor (typically crossed random intercepts of subject and item).
The use of continuous variables as predictors.
~~Making you look like you know what you're doing.~~
~~Defeating the un-dead / reviewer 2.~~
The ability to specify custom models.²

This post will focus on this last point. Specifically, why you should always include main-effects when modeling interactions, and what happens if you don't (spooky).

Fitting the Right (yet oh so wrong) Model

Say you've finally won that grant you submitted to study candy consumption during ghostly themed holidays. As part of your first study, you decide to measure the effects of costume type (scary / cute) and level of neighborhood decor (high / low levels of house decorations) on the total weight of collected candy (in Kg). A simple, yet informative 2-by-2 design.

Being the serious scientist you are, you have several hypotheses:

A main effect for decor level – neighborhoods with more decorations will overall give out more candy.
No main effect for costume – overall, children with cute and scary costumes will receive the same amount of candy (in Kg).
A decor level $\times$ costume interaction – high decor neighborhoods will favor scary costumes, while low decor neighborhoods will favor cute costumes.

It would only make sense to specify your statistical model accordingly – after all, why shouldn't your model represent your hypotheses?

In R, such a model is described as candy_kg ~ decor + decor:costume, instructing R to model candy_kg as a function of the effect for decor + the interaction decor:costume.

And so, you fit the model:

options(contrasts = c('contr.sum', 'contr.poly')) # set effects coding (just once)    fit <- aov(candy_kg ~ decor + decor:costume, data = spooky_data)

Term	df	SS	MS	F	p-value	$\eta^2$
decor	1	30.00	30.00	23.64	<0.001	0.10
decor:costume	2	120.00	60.00	47.28	<0.001	0.40
Residuals	116	147.20	1.27

As predicted, you find both a significant main effect for decor and the interaction decor $\times$ costume, with the interaction explaining 40% of the variance in collected candy weight. So far so good – your results reflect your hypotheses!

But then you plot your data, and to your horror you find…

It looks like there is no interaction at all! Your interaction was nothing more than a ghost! An apparition! How is this possible?? Where has all of variance explained by it gone???

What IS This??

In fact, had you fit the full model, you would have found:

fit <- aov(candy_kg ~ decor * costume, data = spooky_data)

Term	df	SS	MS	F	p-value	$\eta^2$
decor	1	30.00	30.00	23.64	<0.001	0.10
costume	1	120.00	120.00	94.56	<0.001	0.40
decor:costume	1	0.00	0.00	0.00	1.000	0.00
Residuals	116	147.20	1.27

The interaction actually explains 0% of the variance! And the effect of costume is the one that explains 40% of the variance!³ How could this be?? Have we angered Fisher's spirit somehow?

What happened was that because we did not account for costume in our model, the variance explained by costume was swallowed by the interaction decor $\times$ costume!

The Math

_{If you find math too scary, feel free to skip to conclusion.}

Travel back to Intro to Stats, and recall that the interaction's sum-of-squares – $SS_{A\times B}$ – is calculated as:

$SS_{A\times B} = (\bar{x}_{ij} – \bar{x}_{i.} – \bar{x}_{.j} + \bar{\bar{x}}_{..})^2$

This is a simplification of the following equation:

$SS_{A\times B} = \sum \sum (\bar{x}_{ij} – (\bar{x}_{i.} – \bar{\bar{x}}_{..}) – (\bar{x}_{.j} – \bar{\bar{x}}_{..}) + \bar{\bar{x}}_{..})^2$

Where $(\bar{x}_{i.} – \bar{\bar{x}}_{..})$ represents the main effect for $A$ and $(\bar{x}_{.j} – \bar{\bar{x}}_{..})$ represents the main effect for $B$. We can see that $SS_{A\times B}$ represents the deviation from the additive model – i.e., it is the degree by which the observed cells' means deviate from what would be expected if there were only the two main effects.

When we exclude the main effect of $B$ from out model, we are telling our model that there is no need to estimate the main effect. That is, we set $(\bar{x}_{.j} – \bar{\bar{x}}_{..})=0$. The resulting $SS_{A\times B}$ is computed not as above, but as:

$SS_{A\times B} = \sum \sum (\bar{x}_{ij} – (\bar{x}_{i.} – \bar{\bar{x}}_{..}) + \bar{\bar{x}}_{..})^2$

This formula represents the degree by which the observed cells' means deviate from what would be expected if there was only the main effect of $A$. But now if the cells' means deviate in a way that would otherwise have been part of a main effect for $B$, the cells' deviations from the main effect for $A$ will now include the deviations that would otherwise have been accounted for by a main effect of $B$! This results in the main effect for $B$ essentially getting "pooled" into $SS_{A\times B}$. Furthermore, had you also excluded a main effect for $A$, this effect too would have been "pooled" into the so-called $A\times B$ interaction.

In other words:

When we don't estimate (model) main effects, we change the meaning of interactions – they no longer represents a deviation from the additive model.

Conclusion

Sure, you can specify a model with no main effect and only interactions, but in such a case the interactions no longer mean what we expect them to mean. If we want interactions to represent deviation from the additive model, our model must also include the additive model!

For simplicity's sake, this example has focused on a simple 2-by-2 between subject design, but the conclusions drawn here are relevant for any design in which a factor interacts with or moderates the effect of another factor or continuous variable.

Or hierarchical linear models (HLM)… or mixed linear models (MLM)…↩
Whereas in an AVONA analysis with 4 factors you always have: Four main effects + Six 2-way interaction + Four 3-way interaction + One 4-way interaction.↩
Note also that the $df_{residual}$ is the same for both models, indicating the same number of parameters overall have been estimated in both. E.g., while in the full model we would have 3 parameters – one for each main effect + one for the interaction, in the misspecified model we have one for the main effect, and two for the interaction. That is, no matter how you tell the model to split the $SS$s, the number of parameters needed to model 4 cells will always be 3.↩

To leave a comment for the author, please follow the link and comment on their blog: R on I Should Be Writing: Est. 1641.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively

Posted: 29 Oct 2019 12:00 PM PDT

[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley's blog. Part of the problem is all the jargon in survey sampling—I couldn't understand Lumley's language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a a very simple population of $N^{\textrm{pop}}$ items with values normally distributed members with standard deviation known to be 2,

$y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ i \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we'll assume a standard normal prior on $\mu$ ,

$\mu \sim \textrm{normal}(0, 1).$

Now we're not going to observe all $y_n$ , but only a sample of the $N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expection given a random sample of items $y_n$ from the population.

Missing data

Now let's assume the sample of $y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $N$ items from the population, and for each item $n$ , there is a probability $\pi_n$ that the item will be included in the sample. We'll take the log odds of inclusion to be equal to the item's value,

$\pi_n = \textrm{logit}^{-1}(y_n)$ .

Now when we collect our sample, we'll do something like poll $N = 2000$ people from the population, but each person $n$ only has a $\pi_n$ chance of responding. So we only wind up with $N^{\textrm{obs}}$ observations, with $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew's post on pre-election polling, Don't believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$\sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

  for (n in 1:N_obs)    target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);

If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $\hat{\pi}$ is the true value $\pi$ ). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan's blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $N$ items when we've only observed $N^{\textrm{obs}}$ items. That is, their weights are what we'd expect to get if we'd observed $N$ items.

So my next thought was to standardize. Let's take the inverse weights and normalize so the sum of inverse weights is equal to $N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We're Bayesians and should be thinking generatively. Maybe that's why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generativey? We poll $N$ people out of a population of $N^{\textrm{pop}}$ , each of which has a $\pi_n$ chance of responding, leading to a set of responses of size $N^{\textrm{obs}}.$

Given that we know how $\pi$ relates to $y$ , we can just model everything (in the real world, this stuff is really hard and everything's estimated jointly).

Specifically, the $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ missing items each get parameters $y^{\textrm{miss}}_n$ representing how they would've responded had they responded. We also model response, so we have an extra term $\textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $\textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here's the Stan program.

  data {    int N_miss;    int N_obs;    vector[N_obs] y_obs;  }  parameters {    real mu;    vector[N_miss] y_miss;  }  model {    // prior    mu ~ normal(0, 1);    // observed data likelihood    y_obs ~ normal(mu, 2);    1 ~ bernoulli_logit(y_obs);    // missing data likelihood and missingness    y_miss ~ normal(mu, 2);    0 ~ bernoulli_logit(y_miss);  }

The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

  for (n in 1:N_miss)    0 ~ bernoulli(y_miss[n] | inv_logit(y_miss[n]))

And here's the simulation code, including a cheap run at SBC:

  library(rstan)  rstan_options(auto_write = TRUE)  options(mc.cores = parallel::detectCores(), logical = FALSE)    printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }  inv_logit <- function(u) 1 / (1 + exp(-u))    printf("Compiling model.")  model <- stan_model('missing.stan')    for (m in 1:20) {    # SIMULATE DATA  mu <- rnorm(1, 0, 1);  N_tot <- 1000  y <- rnorm(N_tot, mu, 2)  z <- rbinom(N_tot, 1, inv_logit(y))  y_obs <- y[z == 1]  N_obs <- length(y_obs)  N_miss <- N_tot - N_obs    # COMPILE AND FIT STAN MODEL  fit <- sampling(model,                  data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),                  chains = 1, iter = 5000, refresh = 0)  mu_ss <- extract(fit)$mu  mu_hat <- mean(mu_ss)  q25 <- quantile(mu_ss, 0.25)  q75 <- quantile(mu_ss, 0.75)  printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",         mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))    }

Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

  mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55  mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62  mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05  mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71  mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03  mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81

The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-resondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Enlarging the eBook supply

Posted: 29 Oct 2019 06:59 AM PDT

[This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There has been a campaign by members of the Canadian library community to ask international publishers to reduce prices to libraries for eBooks. If my experience in the scientific publishing arena is any guide, the publishers will continue to charge the maximum they can get until the marketplace — that is us — forces them to change or get out of business.

This does not mean that the campaign is without merit. I just feel it needs to be augmented by some explorations of alternatives that could lead to a different ecosystem for writers and readers, possibly without traditional publishers.

Where am I coming from?

I am a retired university professor with a number of techinical and scientific books to my name over a long period. Oddly the first published, in February 1979, is a computer book that is still in print on real paper. All the others have gone out of print and I have put those for which I could generate an e-version on archive.org.
Since retiring in 2008, I have published a traditional book with Wiley. The royalties were pitiful, and the publisher had deigned to ignore requests to fix the web-site. However, I believe they are still selling the book, though I've not had any reports. If it was worthwhile, I'd ask a lawyer to prod them. However, I know the amounts involved are much, much less than any fee a lawyer would charge.
I've also written four novels as well as many shorter pieces, and am one of a collective that has published three creative writing anthologies. My novels are all epub versions and I have put them on obooko.com. They are freely available for individual readers.

How does this help libraries?

First, I'd be really happy if libraries where my novels are of local interest would make them freely available. There are clearly some minor costs to doing this, but we are talking of a few tens of dollars in labour to add a web-page and a link or two, and some cents of storage cost. I offered my first novel, which involves Ottawa, to the Ottawa Public Library, but was told I would have to pay an external (indeed US) company for the DRM to be applied. But I offered an unlimited number of downloads! Is it surprising that I went to archive.org and obooko.com? And the library has fewer offerings for its readers, but also I miss out on local readers being offered my work.

Second, I believe there are other authors whose motivations are not primarily financial who may be willing to offer their works in a similar fashion. My own motivations are to record anecdotes and events not otherwise easily accessed, if at all. I put these in a fictional shell to provide readability and structure. Given that obooko.com seems to be doing reasonably well, though some of the offerings are less than stellar, there are clearly other authors willing to make their works available to readers.

Third, there may be authors willing to offer their works for free for a limited time to allow initiatives by libraries to be tried out, even if they do want or need to be remunerated eventually.

What about the longer term?

DRM imposes considerable costs on both libraries and readers. Whether it protects authors can be debated. Its main damage is that it cedes control of works from authors and readers to foreign, often very greedy, companies, whose interests are not in creative works but in selling software systems. Worse, it only really protects the income of those companies. All DRM schemes are breakable, though they are always a nuisance.

Some workers talk of "social DRM" or watermarking. I believe this could be a useful tool. The borrower/buyer is identified on the work. I have done this with my books in the past, using a scheme developed to put a name on each page of student examination papers so each student had a unique paper. This avoided the need for strict invigilation, since copied answers could be wrong. In an ebook, the idea could be enhanced with invisible encoding of the name (steganography). However, each additional feature is another cost, another obstacle to getting the work to the reader. And no scheme is unbreakable.

Publishers now insist on "renewal" of the license for an eBook. The library may not get a "reference" copy. Personally, I believe one of the important roles of libraries is to serve as repositories of works. A single disk can store an awful number of eBooks, so the cost is small. As a possibility, reading the archival copies could be restricted to in-library access only, but users would have a chance to verify material for study and reference purposes. Audiobooks require more storage, and are less easy to use for reference purposes, but could be similarly archived, as the audiobook reader's voice may be of interest.

For a sustainable writer-library-reader system, there does need to be a way to reward writers. The current scheme, with DRM, counts downloads. This is costly to libraries. How many times have you abandoned a book after a few pages? This might be overcome with some sort of sampling mechanism providing more material than currently offered as a "tease".

If eBooks are available in unlimited quantities, authors could actually benefit more. There are often temporary fads or surges of interest, or else book club members all want to read a title. At the moment, I know some club members will decide not to bother with a given book, or will find ways to "share". Clearly, libraries will not want to have open-ended costs, but it is likely that works will be temporarily popular. As I understand things, traditional publishers allow a fixed number of borrowings, so there is in essence a per-borrowing charge. In effect this is infinite for a never-borrowed work. Those of us in the non-traditional community who still want some reward might be happy with a smaller per-borrowing reward, and may also be willing to accept a cap or a per-year maximum or some similar arrangement that still keeps costs lower for the library.

My view

I want my work read. Readers want choice and diversity. Libraries want to grow the community of written work. It is time to think out of the DRM box.

To leave a comment for the author, please follow the link and comment on their blog: Blog: John C. Nash.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

What’s new in DALEX v 0.4.9?

Posted: 29 Oct 2019 01:26 AM PDT

[This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Few days ago a new version of DALEX was accepted by CRAN (v 0.4.9). Here you will find short overview what was added/changed.

DALEX is an R package with methods for visual explanation and exploration of predictive models.
Here you will find short overview with examples based on Titanic data.
For real world use cases:
Here you will find a conference talk related to credit scoring based on FICO data.
Here you will find an example use case for insurance pricing.

Major changes in the last version

Verbose model wrapping

Function explain() is now more verbose. During the model wrapping it checks the consistency of arguments. This works as unit tests for a model. This way most common problems with model exploration can be identified early during the exploration.

Support for new families of models

We have added more convenient support for gbm models. The ntrees argument for predict_function is guessed from the model structure.
Support for mlr, scikit-learn, h2o and mljar was moved to DALEXtra in order to limit number of dependencies.

Integration with other packages

DALEX has now better integration with the auditor package. DALEX explainers can be used with any function form the auditor package. So, now you can easily create an ROC plot, LIFT chart or perform analysis of residuals. This way we have access to a large number of loss functions.

Richer explainers

Explainers have now new elements. Explainers store information about packages that were used for model development along with their versions.
Latest version of explainers stores also sampling weights for the data argument.

A bit of philosophy

Cross-comparisons of models is tricky because predictive models may have very different structures and interfaces. DALEX is based on an idea of an universal adapter that can transform any model into a wrapper with unified interface that can be digest by any model agnostic tools.

In this medium article you will find a longer overview of this philosophy.

To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Extracting basic Plots from Novels: Dracula is a Man in a Hole

Posted: 29 Oct 2019 01:00 AM PDT

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In 1965 the University of Chicago rejected Kurt Vonnegut's college thesis, which claimed that all stories shared common structures, or "shapes", including "Man in a Hole", "Boy gets Girl" and "Cinderella". Many years later the then already legendary Vonnegut gave a hilarious lecture on this idea – before continuing to read on please watch it here (about 4 minutes):

When you think about it the shape "Man in a Hole" (characters plunge into trouble and crawl out again) really is one of the most popular – even the Bible follows this simple script (see below)!

A colleague of mine, Professor Matthew Jockers from the University of Nebraska, has analyzed 50,000 novels and found out that Vonnegut was really up to something: there are indeed only half a dozen possible plots most novels follow.

You can read more about this project here: The basic plots of fiction

Professor Jockers has written a whole book about this topic: "The Bestseller Code". But what is even more mind-blowing than this unifying pattern of all stories is that you can do these analyses yourself – with any text of your choice! Professor Jockers made the syuzhet package publicly available on CRAN ("Syuzhet" is the Russian term for narrative construction).

A while ago I finished Dracula, the (grand-)father of all vampire and zombie stories. What a great novel that is! Admittedly it is a little slow-moving but the atmosphere is better than in any of the now popular TV series. Of course, I wanted to do an analysis of the publicly available Dracula text.

The following code should be mostly self-explanatory. First the original text (downloaded from Project Gutenberg: Bram Stoker: Dracula) is broken down into separate sentences. After that the sentiment for each sentence is being evaluated and all the values smoothed out (by using some kind of specialized low pass filter). Finally the transformed values are plotted:

  library(syuzhet)  dracula <- get_text_as_string("data/pg345.txt")  Dracula <- get_sentences(dracula)  Dracula_sent <- get_sentiment(Dracula, method = "bing")  ft_values <- get_dct_transform(Dracula_sent, low_pass_size = 3, scale_range = TRUE)  plot(ft_values, type = "l", main = "Dracula using Transformed Values", xlab = "Narrative Time", ylab = "Emotional Valence", col = "red")  abline(h = 0)

In a way R has "read" the novel in no time and extracted the basic plot – pretty impressive, isn't it! As you can see the story follows the "Man in a Hole"-script rather exemplary, which makes sense because at the beginning everything seems to be fine and well, then Dracula appears and, of course, bites several protagonists, but in the end they catch and kill him – everything is fine again.

THE END

…as a bonus, here is the plot that shows that the Bible also follows a simple "Man in a Hole" narrative (paradise, paradise lost, paradise regained). Fortunately, you can conveniently install the King James Bible as a package: https://github.com/JohnCoene/sacred

  # devtools::install_github("JohnCoene/sacred")  library(sacred)  KJV_sent <- get_sentiment(king_james_version$text, method = "bing")  ft_values <- get_dct_transform(KJV_sent, low_pass_size = 3, scale_range = TRUE)  plot(ft_values, type = "l", main = "King James Bible using Transformed Values", xlab = "Narrative Time", ylab = "Emotional Valence", col = "red")  abline(h = 0)

Simple stories often work best!

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

(Re)introducing skimr v2 – A year in the life of an open source R project

Posted: 28 Oct 2019 05:00 PM PDT

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Theme song: PSA by Jay-Z

We announced the testing version of skimr v2 on
June 19, 2018. After more than a
year of (admittedly intermittent) work, we're thrilled to be able to say that
the package is ready to go to CRAN. So, what happened over the last year? And
why are we so excited for v2?

Wait, what is a "skimr"?

skimr is an R package for summarizing your data. It extends tidyverse packages,
and dplyr in particular, so that you can get a broad set of summary statistics
with a single function call. You can install a pre-release version from the
package's GitHub repo.

devtools::install_github("ropensci/skimr")

skimr is also on
CRAN, and v2 should
be appearing there soon. For those of you that might have never seen skimr,
here's a typical call.

library(skimr)  library(dplyr)  options(width = 90)    skim(iris)    ## ── Data Summary ────────────────────────  ##                            Values  ## Name                       iris  ## Number of rows             150  ## Number of columns          5  ## _______________________  ## Column type frequency:  ##   factor                   1  ##   numeric                  4  ## ________________________  ## Group variables            None  ##  ## ── Variable type: factor ─────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate ordered n_unique top_counts  ## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50  ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist  ## 1 Sepal.Length          0             1  5.84 0.828   4.3   5.1  5.8    6.4   7.9 ▆▇▇▅▂  ## 2 Sepal.Width           0             1  3.06 0.436   2     2.8  3      3.3   4.4 ▁▆▇▂▁  ## 3 Petal.Length          0             1  3.76 1.77    1     1.6  4.35   5.1   6.9 ▇▁▆▇▂  ## 4 Petal.Width           0             1  1.20 0.762   0.1   0.3  1.3    1.8   2.5 ▇▁▇▅▃

Setting the stage

Before we can talk about the last year of skimr development, we need to lay out
the timeline that got us to this point. For those deeply enmeshed in skimr lore,
all dozens of you, bear with.

skimr was originally an
rOpenSci unconf17 project, a big
collaboration between eight different participants that resulted in a conceptual
outline of the package and a basic working version. Participating in the unconf
was a truly magical experience, with everyone bringing a tremendous amount of
energy and ideas to the project, and implementation happening over a flurry of
"fancy git commits".

About six months later, we released our first version on CRAN. The time between
these two milestones was mostly spent on fleshing out all of the different ideas
that were generated during the unconf (like handling grouped data frames) and
fixing all the bugs we discovered along the way.

Getting the package on CRAN opened the gates for bug reports and feature
requests on GitHub. About the same
time we pushed our first version to CRAN, Elin got skimr's rOpenSci package
peer review started
(thank you Jennifer and
Jim!), opening another incredibly useful channel
for collecting feedback on the package. All of these new ideas and suggestions
gave us the opportunity to really push skimr to the next level, but doing that
would require rethinking the package, from the ground up.

A month after finishing the peer review (and six months after the process
began), we announced v2. Over the first phase of skimr's life, we accumulated
700 commits, two releases, 400 GitHub stars, 95 percent code coverage and a
lifetime's worth of
unicode rendering bugs!

Just kidding! We love our little histograms, even when they don't love us back!

Getting it right

Under normal circumstances (i.e. not during a hackathon), most software
engineering projects begin with a design phase and series of increasingly
detailed design docs. skimr is only a few hundred lines of code, which means
"increasingly detailed design docs" translates to one doc. But we did actually
write it!
It's here.
And it still goes a good job of laying out some of the big ideas we were
interested in taking on for v2.

Eliminating frictions that resulted from differences in the way we stored
data vs how it was displayed to users
Getting away from using a global environment to configure skimr
Making it easier for others to extend skimr
Create more useful ways to use skimr

Better internal data structures

In v1, skimr stored all of its data in a "long format", data frame. Although
hidden from the user by its print methods, this format would appear any time
you'd try do something with the results of a skim() call. It looked something
like this:

skim(mtcars) %>% dplyr::filter(stat=="hist")    # A tibble: 11 x 6     variable type    stat  level value formatted                  1 mpg      numeric hist  .all     NA ▃▇▇▇▃▂▂▂   2 cyl      numeric hist  .all     NA ▆▁▁▃▁▁▁▇   3 disp     numeric hist  .all     NA ▇▆▁▂▅▃▁▂   4 hp       numeric hist  .all     NA ▃▇▃▅▂▃▁▁   5 drat     numeric hist  .all     NA ▃▇▁▅▇▂▁▁   6 wt       numeric hist  .all     NA ▃▃▃▇▆▁▁▂   7 qsec     numeric hist  .all     NA ▃▂▇▆▃▃▁▁   8 vs       numeric hist  .all     NA ▇▁▁▁▁▁▁▆   9 am       numeric hist  .all     NA ▇▁▁▁▁▁▁▆  10 gear     numeric hist  .all     NA ▇▁▁▆▁▁▁▂  11 carb     numeric hist  .all     NA ▆▇▂▇▁▁▁▁

Big ups to anyone who looked at the rendered output and saw that this was how
you actually filtered the results. Hopefully there are even better applications
of your near-telepathic abilities.

Now, working with skimr is a bit more sane.

skimmed <- iris %>%    skim() %>%    dplyr::filter(numeric.sd > 1)    skimmed    ## ── Data Summary ────────────────────────  ##                            Values  ## Name                       Piped data  ## Number of rows             150  ## Number of columns          5  ## _______________________  ## Column type frequency:  ##   numeric                  1  ## ________________________  ## Group variables            None  ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist  ## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

And

dplyr::glimpse(skimmed)    ## Observations: 1  ## Variables: 15  ## $ skim_type          "numeric"  ## $ skim_variable      "Petal.Length"  ## $ n_missing          0  ## $ complete_rate      1  ## $ factor.ordered     NA  ## $ factor.n_unique    NA  ## $ factor.top_counts  NA  ## $ numeric.mean       3.758  ## $ numeric.sd         1.765298  ## $ numeric.p0         1  ## $ numeric.p25        1.6  ## $ numeric.p50        4.35  ## $ numeric.p75        5.1  ## $ numeric.p100       6.9  ## $ numeric.hist       "▇▁▆▇▂"

It's still not perfect, as you need to rely on a pseudo-namespace to refer to
the column that you want. But this is unfortunately a necessary trade-off. As
the Rstats Bible, errr Hadley Wickham's Advanced R, states, all elements of
an atomic vector must have the same type.
This normally isn't something that you have to think too much about, that is
until you try to combine the means of all your Date columns with the means of
your numeric columns and everything comes out utterly garbled. So instead of
that basket of laughs, we prefix columns names by their data type.

There's a couple of other nuances here:

The data frame skim() produces always starts off with some metadata
columns
Functions that always produce the same, regardless of input type, can be
treated as base_skimmers and don't need a namespace

Manipulating internal data

A better representation of internal data comes with better tools for reshaping
the data and getting it for other contexts. A common request in v1 was tooling
to handle the skimr subtables separately. We now do this with partition(). It
replaces the v1 function skim_to_list().

partition(skimmed)    ## $numeric  ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist  ## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

You can undo a call to partition() with bind(), which joins the subtables
into the original skim_df object and properly accounts for metadata. You can
skip a step with the function yank(), which calls partition and pulls out a
particular subtable

yank(skimmed, "numeric")    ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist  ## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

Last, with support something close to the older format with the to_long()
function. This can be added for something close to backwards compatibility.
Being realistic on open source sustainability means that we are not able to
support 100% backward compatibility in v2 even with new functions. Meanwhile you
can keep using v1 if you are happy with it. However, because skimr's
dependencies are under ongoing development, sooner or later skimr v1 will no
longer work with updates to them.

Working with dplyr

Using skimr in a dplyr pipeline was part of the original package design, and
we've needed to devote some extra love to making sure that everything is as
seamless as possible. Part of this is due to the object produce by skim(),
which we call skim_df. It's a little weird in that it needs both metadata and
columns in the underlying data frame.

In practice, this means that you can coerce it into a different type through
normal dplyr operations. Here's one:

select(skimmed, numeric.mean)    ## # A tibble: 1 x 1  ##   numeric.mean  ##            ## 1         3.76

To get around this, we've added some helper functions and methods. The more
skimr-like replacement for select() is focus(), which preserves metadata
columns.

focus(skimmed, numeric.mean)    ## ── Data Summary ────────────────────────  ##                            Values  ## Name                       Piped data  ## Number of rows             150  ## Number of columns          5  ## _______________________  ## Column type frequency:  ##   numeric                  1  ## ________________________  ## Group variables            None  ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable  mean  ## 1 Petal.Length   3.76

Configuring and extending skimr

Most of skimr's magic, to
steal a term,
comes from the fact that you can do most everything with one function. But
believe it or not, there's actually a bit more to the package.

One big one is customization. We like the skimr defaults, but that doesn't
guarantee you will. So what if you want to do something different, we have a
function factory for that!

my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))  my_skim(faithful)    ## ── Data Summary ────────────────────────  ##                            Values  ## Name                       faithful  ## Number of rows             272  ## Number of columns          2  ## _______________________  ## Column type frequency:  ##   numeric                  2  ## ________________________  ## Group variables            None  ##  ## ── Variable type: numeric ────────────────────────────────────────────────────────────────  ##   skim_variable n_missing complete_rate  mean    sd    p0   p50  p100 hist    iqr  ## 1 eruptions             0             1  3.49  1.14   1.6     4   5.1 ▇▂▂▇▇  2.29  ## 2 waiting               0             1 70.9  13.6   43      76  96   ▃▃▂▇▂ 24

Those of you familiar with customizing skim() in v1 will notice a couple
differences:

we now has an object called sfl() for managing skimr function lists; more
below
instead of setting global options, we now have a function factory

Yes! A function factory. skim_with() gives us a new function each time we call
it, and the returned function is configured by the arguments in skim_with().
This works the same way as ecdf() in the stats package or colorRamp in
grDevices. Creating new functions has a few advantages over the previous
approach.

you can export a skim() function in a package or create it in a
.Rprofile
you avoid a bunch of potential side effects from setting options with
skim_with()

The other big change is how we now handle different data types. Although many
will never see it, a key piece of skimr customization comes from the
get_skimmers() generic. It's used to detect different column types in your
data and set the appropriate summary functions for that type. It's also designed
to work with sfl(). Here's an example from the "Supporting additional objects"
vignette. Here, we'll create some skimmers for
sf data types:

get_skimmers.sfc_POINT <- function(column) {    sfl(      skim_type = "sfc_POINT",      n_unique = n_unique,      valid = ~ sum(sf::st_is_valid(.))    )  }

While it was required in skim_with(), users must provide a skim_type value
when creating new methods. With that, you can export this method in a new
package (be sure to import the generic), and the new default skimmer is added
when you load the package.

get_default_skimmer_names()    ...  $sfc_POINT  [1] "missing"  "complete" "n"        "n_unique" "valid"  ...

Even if you don't go the full route of supporting a new data type, creating a
couple of skimr function lists has other benefits. For example, you can add some
to your .Rprofile as a way to quickly configure skimr interactively.

sfc_point_sfl <- sfl(    n_unique = n_unique,    valid = ~ sum(sf::st_is_valid(.))  )    my_skimmer <- skim_with(sfc_POINT = sfc_point_sfl)

Using skimr in other contexts

In skimr v1, we developed some slightly hacky approaches to getting nicer
skim() output in RMarkdown docs. These have been removed in favor of the
actually-supported knit_print
API. Now, calling skim(), within an RMarkdown doc should produce something
nice by default.

skim(chickwts)

**Data summary**
Name	chickwts
Number of rows	71
Number of columns	2
_______________________
Column type frequency:
factor	1
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
feed	0	1	FALSE	6	soy: 14, cas: 12, lin: 12, sun: 12

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
weight	0	1	261.31	78.07	108	204.5	258	323.5	423	▆▆▇▇▃

You get a nice html version of both the summary header and the skimr subtables
for each type of data.

In this context, you configure the output the same way you handle other knitr
code chunks.

This means that we're dropping direct support for kable.skim_df() and
pander.skim_df(). But you can still get pretty similar results to these
functions by using the reshaping functions described above to get subtables. You
can also still use Pander and other nice rendering packages on an ad hoc basis
as you would for other data frames or tibbles.

We also have a similarly-nice rendered output in
Jupyter
and RMarkdown notebooks. In the latter, the summary is separated from the rest
of the output when working interactively. We like it that way, but we'd be happy
to hear what the rest of you think!

Wait, that took over a year?

Well, we think that's a lot! But to be fair, it wasn't exactly simple to keep up
with skimr. Real talk, open source development takes up a lot of time, and the
skimr developers have additional important priorities. Michael's family added a
new baby, and despite swearing up and down otherwise, he got absolutely nothing
not-baby-related done during his paternity leave (take note new dads!). Elin
ended up taking a much bigger role on at Lehman, really limiting time for any
other work.

Even so, these are just the highlights in the normal ebb and flow of this sort
of work. Since it's no one's real job, it might not always be the first focus.
And that's OK! We've been really lucky to have a group of new users that have
been very patient with this slow development cycle while still providing really
good feedback throughout. Thank you all!

We're really excited about this next step in the skimr journey. We've put a huge
amount of work into this new version. Hopefully it shows. And hopefully it
inspires some of you to send more feedback and help us find even more ways to
improve!

If you want to learn more about skimr, check out our
GitHub repo. GitHub is also the best place
to file issues. We'd love to hear
from you!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

An Amazon SDK for R!?

Posted: 28 Oct 2019 05:00 PM PDT

[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RBloggers|RBloggers-feedburner

Intro:

For a long time I have found it difficult to appreciate the benefits of "cloud compute" in my R model builds. This was due to my initial lack of understanding and the setting up of R on cloud compute environments. When I noticed that AWS was bringing out a new product AWS Sagemaker, the possiblities of what it could provide seemed like a dream come true.

Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. Your models get to production faster with much less effort and lower cost. (https://aws.amazon.com/sagemaker/)

A question about AWS Sagemake came to mind: Does it work for R developers??? Well…not exactly. True it provides a simple way to set up an R environment in the cloud but it doesn't give the means to access other AWS products for example AWS S3 and AWS Athena out of the box. However for Python this is not a problem. Amazon has provided a Software Development Kit (SDK) for Python called boto3, which comes pre-installed on AWS Sagemaker.

It isn't all bad news, RStudio has developed a package called reticulate that lets R interfaced into Python. So using reticulate in combination with boto3 gives R full access to all of AWS products from Sagemaker similar to Python. However are there any other methods for R user to connect to AWS?

AWS interfaces for R:

`paws` an R SDK:

Paws is a Package for Amazon Web Services in R. Paws provides access to the full suite of AWS services from within R.(https://github.com/paws-r/paws)

When I want to connect to AWS I usually turn to Python. AWS's boto3 is an excellent means of connecting to AWS and exploit its resources. However R now has it's own SDK into AWS, paws. This came as a little surprise to me as I started to accept that R might never have an SDK for AWS. How wrong I was.

What's pleasing to me was how well developed and easy the package was to use. It felt natural to switch between boto3 and paws. Almost like it was a long lost brother.

Here is a quick example to show the comparison between boto3 and paws. Returning a list of all objects in S3 inside a prefix:

Python

import boto3    s3 = boto3.Session().client("s3")  obj = s3.list_objects(Bucket = 'mybucket', Prefix = "prefix_1/")  [x.get("Key") for x in obj.get("Contents")]

s3 <- paws::s3()    obj <- s3$list_objects(Bucket = 'mybucket', Prefix = "prefix_1/")  lapply(obj$Contents, function(x) x$Key)

From this quick example it is clear that the paws SDK's syntax is extremely similar to boto3, although with an R twist. This can only a good thing, as hundreds of people know boto3 already and therefore they will be familiar with paws by association. I can't express the potential the package paws gives R users. A good project that utilises the paws sdk is the package noctua. noctua creates a wrapper of the paws connection to AWS Athena and developes a DBI interface for R users. We will go into the package noctua in the next blog. First here is an example how of to work with AWS Athena when using paws.

Querying to AWS Athena using paws

# create an AWS Athena object  athena <- paws::athena()    # Submit query to AWS Athena  res <- athena$start_query_execution(              QueryString = "show Databases",              ResultConfiguration =                   list(OutputLocation = "s3://mybucket/queries/"))    # Get Status of query  result <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)    # Return results if query is successful  if(result$QueryExecution$Status$State == "FAILED") {    stop(result$QueryExecution$Status$StateChangeReason, call. = FALSE)  } else {output <-             athena$get_query_results(                QueryExecutionId = res$QueryExecutionId,                MaxResults = 1)}

From an initial view it might look daunting however this is exactly the same interface that boto3 provides when working with AWS Athena. The good news is that noctua wraps all of this and creates the DBI method dbGetQuery for paws.

paws is an excellent R SDK into AWS, so please download paws and give it ago, I am sure you will be pleasantly surprised like myself.

install.packages("paws")

Note: For more examples, the developers of paws have created some code examples https://github.com/paws-r/paws/tree/master/examples and a documentation website https://paws-r.github.io/.

`botor` :

This R package provides raw access to the 'Amazon Web Services' ('AWS') 'SDK' via the 'boto3' Python module and some convenient helper functions (currently for S3 and KMS) and workarounds, eg taking care of spawning new resources in forked R processes. (https://daroczig.github.io/botor/)

When using botor on AWS Sagemaker, R users can easily interact with all of AWS products in the exact same manner as a Python user. However botor's convenient helper functions certainly does make the experience working on AWS Sagemaker easier. Here is a quick example to demostrate how easy/ useful these helper function are:

Upload iris data.frame to s3 bucket

library(botor)    write_s3(iris, data.table::fwrite, "s3://mybucket/iris.csv")

Read s3 file back into R as a data.frame

read_s3("s3:://mybucket/iris.csv", data.table::fread)

These convenient helper functions are not limited to just reading/writing data in csv format. They can also be used to upload R models, which can be really useful when wanted to store pre-built models. Here is a quick example of what I like to call a crap model.

train <- iris[1:20,1:4]  test <- iris[21:40,1:4]     model <- lm(Petal.Width ~., train)

Uploading and downloading R models to S3

s3_write(model, saveRDS, "s3://mybucket/crap_model.RDS")  s3_model <- s3_read("s3://mybucket/crap_model.RDS", readRDS)

It is clear to see how useful botor is when working with AWS S3.

Cloudyr Project:

I personally haven't used the AWS cloudyr packages, however I don't want to leave them out. The cloudyr project aim is to bring R onto the cloud compute:

The goal of this initiative is to make cloud computing with R easier, starting with robust tools for working with cloud computing platforms.(https://cloudyr.github.io/)

As I haven't utilised the wide range of packages that the cloudyr project provides I won't give examples. Please go to the cloudyr github https://github.com/cloudyr as a lot of work has gone into making R easier to work with cloud computing. They have a lot of documentation plus they are actively developing R packages to make user experience better.

Summary:

I believe that all of these packages have advantages in working with AWS when using R. As R has a SDK paws for AWS it would be great if it was added to the base image, as it allows R developers to utilise AWS products in their AWS Sagemaker environments. Alternatively the botor package would be another package for AWS to consider putting in their AWS Sagemaker image.

To leave a comment for the author, please follow the link and comment on their blog: Dyfan Jones Brain Dump HQ.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Sept 2019: “Top 40” New R Packages

Posted: 28 Oct 2019 05:00 PM PDT

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One hundred and thirteen new packages made it to CRAN in September. Here are my "Top 40" picks in eight categories: Computational Methods, Data, Economics, Machine Learning, Statistics, Time Series, Utilities, and Visualization.

Computational Methods

eRTG3D v0.6.2: Provides functions to create realistic random trajectories in a 3-D space between two given fixed points (conditional empirical random walks), based on empirical distribution functions extracted from observed trajectories (training data), and thus reflect the geometrical movement characteristics of the mover. There are several small vignettes, including sample data sets, linkage to the sf package, and point cloud analysis.

freealg v1.0: Implements the free algebra in R: multivariate polynomials with non-commuting indeterminates. See the vignette for the math.

HypergeoMat v3.0.0: Implements Koev & Edelman's algorithm (2006) to evaluate the hypergeometric functions of a matrix argument, which appear in random matrix theory. There is a vignette.

opart v2019.1.0: Provides a reference implementation of standard optimal partitioning algorithm in C using square-error loss and Poisson loss functions as described by Maidstone (2016), Hocking (2016), Rigaill (2016), and Fearnhead (2016) that scales quadratically with the number of data points in terms of time-complexity. There are vignettes for Gaussian and Poisson squared error loss.

Data

cde v0.4.1: Facilitates searching, download and plotting of Water Framework Directive (WFD) reporting data for all water bodies within the UK Environment Agency area. This package has been peer-reviewed by rOpenSci. There is a Getting Started Guide and a vignette on output reference.

eph v0.1.1: Provides tools to download and manipulate data from the Argentina Permanent Household Survey. The implemented methods are based on INDEC (2016).

leri v0.0.1: Fetches Landscape Evaporative Response Index (LERI) data using the raster package. The LERI product measures anomalies in actual evapotranspiration, to support drought monitoring and early warning systems. See the vignette for examples.

rwhatsapp v0.2.0: Provides functions to parse and digest history files from the popular messenger service WhatsApp. There is a vignette.

tidyUSDA v0.2.1: Provides a consistent API to pull United States Department of Agriculture census and survey data from the National Agricultural Statistics Service (NASS) QuickStats service. See the vignette.

Economics

bunching v0.8.4: Implements the bunching estimator from economic theory for kinks and knots. There is a vignette on Theory, and another with Examples.

fixest v0.1.2: Provides fast estimation of econometric models with multiple fixed-effects, including ordinary least squares (OLS), generalized linear models (GLM), and the negative binomial. The method to obtain the fixed-effects coefficients is based on Berge (2018). There is a vignette.

raceland v1.0.3: Implements a computational framework for a pattern-based, zoneless analysis, and visualization of (ethno)racial topography for analyzing residential segregation and racial diversity. There is a vignette describing the Computational Framework, one describing Patterns of Racial Landscapes, and a third on SocScape Grids.

Machine Learning

biclustermd v0.1.0: Implements biclustering, a statistical learning technique that simultaneously partitions, and clusters rows and columns of a data matrix in a manner that can deal with missing values. See the vignette for examples.

bbl v0.1.5: Implements supervised learning using Boltzmann Bayes model inference, enabling the classification of data into multiple response groups based on a large number of discrete predictors that can take factor values of heterogeneous levels. See Woo et al. (2016) for background, and the vignette for how to use the package.

corporaexplorer v0.6.3: Implements Shiny apps to dynamically explore collections of texts. Look here for more information.

fairness v1.0.1: Offers various metrics to assess and visualize the algorithmic fairness of predictive and classification models using methods described by Calders and Verwer (2010), Chouldechova (2017), Feldman et al. (2015), Friedler et al. (2018), and Zafar et al. (2017). There is a tutorial for the package.

imagefluency v0.2.1: Provides functions to collect image statistics based on processing fluency theory that include scores for several basic aesthetic principles that facilitate fluent cognitive processing of images: contrast, complexity / simplicity, self-similarity, symmetry, and typicality. See Mayer & Landwehr (2018) and Mayer & Landwehr (2018) for the theoretical background, and the vignette for an introduction.

ineqJD v1.0: Provides functions to compute and decompose Gini, Bonferroni, and Zenga 2007 point and synthetic concentration indexes. See Zenga M. (2015), Zenga & Valli (2017), and Zenga & Valli (2018) for more information.

lmds v0.1.0: Implements Landmark Multi-Dimensional Scaling (LMDS), a dimensionality reduction method scaleable to large numbers of samples, because rather than calculating a complete distance matrix between all pairs of samples, it only calculates the distances between a set of landmarks and the samples. See the README for an example.

modelStudio v0.1.7: Implements an interactive platform to help interpret machine learning models. There is a vignette, and look here for a demo of the interactive features.

nlpred v1.0: Provides methods for obtaining improved estimates of non-linear cross-validated risks obtained using targeted minimum loss-based estimation, estimating equations, and one-step estimation. Cross-validated area under the receiver operating characteristics curve ( LeDell sr al. (2015) ) and other metrics are included. There is a vignette on small sample estimates.

pyMTurkR v1.1: Provides access to the latest Amazon Mechanical Turk' ('MTurk') Requester API (version '2017–01–17'), replacing the now deprecated MTurkR package.

stagedtrees v1.0.0: Creates and fits staged event tree probability models, probabilistic graphical models capable of representing asymmetric conditional independence statements among categorical variables. See Görgen et al. (2018), Thwaites & Smith (2017), Barclay et al. (2013), and Smith & Anderson](doi:10.1016/j.artint.2007.05.004) for background, and look here for and overview.

Statistics

confoundr v1.2: Implements three covariate-balance diagnostics for time-varying confounding and selection-bias in complex longitudinal data, as described in Jackson (2016) and Jackson (2019). There is a Demo vignette and another Describing Selection Bias from Dependent Censoring

distributions3 v0.1.1: Provides tools to create and manipulate probability distributions using S3. Generics random(), pdf(), cdf(), and quantile() provide replacements for base R's r/d/p/q style functions. The documentation for each distribution contains detailed mathematical notes. There are several vignettes: Intro to hypothesis testing, One-sample sign tests,
One-sample T confidence interval, One-sample T-tests, Z confidence interval for a mean, One-sample Z-tests for a proportion, One-sample Z-tests, Paired tests, and Two-sample Z-tests.

dobin v0.8.4: Implements a dimension reduction technique for outlier detection, which constructs a set of basis vectors for outlier detection that bring outliers to the forefront using fewer basis vectors. See Kandanaarachchi & Hyndman (2019) for background, and the vignette for a brief introduction.

glmpca v0.1.0: Implements a generalized version of principal components analysis (GLM-PCA) for dimension reduction of non-normally distributed data, such as counts or binary matrices. See Townes et al. (2019) and Townes (2019) for details, and the vignette for examples.

immuneSIM v0.8.7: Provides functions to simulate full B-cell and T-cell receptor repertoires using an in-silico recombination process that includes a wide variety of tunable parameters to introduce noise and biases. See Weber et al. (2019) for background, and look here for information about the package.

irrCAC v1.0: Provides functions to calculate various chance-corrected agreement coefficients (CAC) among two or more raters, including Cohen's kappa, Conger's kappa, Fleiss' kappa, Brennan-Prediger coefficient, Gwet's AC1/AC2 coefficients, and Krippendorff's alpha. There are vignettes on benchmarking, Calculating Chance-corrected Agreement Coefficients, and Computing weighted agreement coefficients.

LPBlg v1.2: Provides functions that estimate a density and derive a deviance test to assess if the data distribution deviates significantly from the postulated model, given a postulated model and a set of data. See Algeri S. (2019) for details.

SynthTools v1.0.0: Provides functions to support experimentation with partially synthetic data sets. Confidence interval and standard error formulas have options for either synthetic data sets or multiple imputed data sets. For more information, see Reiter & Raghunathan (2007).

Time Series

fable v0.1.0: Provides a collection of commonly used univariate and multivariate time series forecasting models, including automatically selected exponential smoothing (ETS) and autoregressive integrated moving average (ARIMA) models. There is an Introduction and a vignette on transformations.

nsarfima v0.1.0.0: Provides routines for fitting and simulating data under autoregressive fractionally integrated moving average (ARFIMA) models, without the constraint of stationarity. Two fitting methods are implemented: a pseudo-maximum likelihood method and a minimum distance estimator. See Mayoral (2007) and Beran (1995) for reference.

Utilities

nc v2019.9.16: Provides functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. There is a vignette on capture first match and another on capture all match.

pins v0.2.0: Provides functions that "pin" remote resources into a local cache in order to work offline, improve speed, avoid recomputing, and discover and share resources in local folders, GitHub, Kaggle and RStudio Connect. There is a Getting Started Guide and vignettes on Extending Boards, Using GitHub Boards, Using Kaggle Boards, Using RStudio Connect Boards, Using Website Boards, Using Pins in RStudio, Understanding Boards, and Extending Pins.

queryparser v0.1.1: Provides functions to translate SQL SELECT statements into lists of R expressions.

rawr v0.1.0: Retrieves pure R code from popular R websites, including github, kaggle, datacamp, and R blogs made using blogdown.

Visualization

FunnelPlotR v0.2.1: Implements Spiegelhalter (2005) Funnel plots for reporting standardized ratios, with overdispersion adjustment. The vignette offers examples.

ggBubbles v0.1.4: Implements mini bubble plots to display more information for discrete data than traditional bubble plots do. The vignette provides examples.

gghalves v0.0.1: Implements a ggplot2 extension for easy plotting of half-half geom combinations: think half boxplot and half jitterplot, or half violinplot and half dotplot.

To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Mocking is catching

Posted: 28 Oct 2019 05:00 PM PDT

[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When writing unit tests for a package, you might find yourself wondering about how to best test the behaviour of your package

when the data it's supposed to munge has this or that quirk,
when the operating system is Windows,
when a package enhancing its functionality is not there,
when a web API returns an error;

or you might even wonder how to test at least part of that package of yours that calls a web API or local database… without accessing the web API or local database during testing.

In some of these cases, the programming concept you're after is mocking, i.e. making a function act as if something were a certain way! In this blog post we shall offer a round-up of resources around mocking, or not mocking, when unit testing an R package.

also YES you know i'm trying to hit 100% but once i read i would have to do something called "mocking" my brain was like ok let's go on twitter instead

— Sharla Gelfand (@sharlagelfand) August 1, 2019

Please keep reading, do not flee to Twitter! (The talented Sharla did end up using mocking for her package!)

Packages for mocking

General mocking

Nowadays, when using testthat for testing, the recommended tool for mocking is the mockery package, not testthat's own with_mock() function. To read how they differ in their implementation of mocking, refer to this issue and that section of mockery README. In brief, with mockery you can stub (i.e. replace) a function in a given environment e.g. the environment of a function. Let's create a small toy example to illustrate that.

# a function that says encoding is a pain  # when the OS is Windows  is_encoding_a_pain <- function(){    if (Sys.info()[["sysname"]] == "Windows"){      return("YES")    } else {      return("no")    }  }    # The post was rendered on Ubuntu  Sys.info()[["sysname"]]

## [1] "Linux"

# So, is encoding a pain?  is_encoding_a_pain()

## [1] "no"

# stub/replace Sys.info() in is_encoding_a_pain()  # with a mock that answers "Windows"  mockery::stub(where = is_encoding_a_pain,                what = "Sys.info",                 how = c(sysname = "Windows"))    # Different output  is_encoding_a_pain()

## [1] "YES"

# NOT changed  Sys.info()[["sysname"]]

## [1] "Linux"

Let's also look at a real life example, from keyring tests:

test_that("auto windows", {    mockery::stub(default_backend_auto, "Sys.info", c(sysname = "Windows"))    expect_equal(default_backend_auto(), backend_wincred)  })

What happens after the call to mockery::stub() is that inside the test, when default_backend_auto() is called, it won't use the actual Sys.info() but instead a mock that returns c(sysname = "Windows") so the test can assess what default_backend_auto() returns on Windows… without the test being run on a Windows machine.

Instead of directly defining the return value as is the case in this example, one could stub the function with a function, as seen in one of the tests for the remotes package.

To find more examples of how to use mockery in tests, you can use GitHub search in combination with R-hub's CRAN source code mirror: https://github.com/search?l=&q=%22mockery%3A%3Astub%22+user%3Acran&type=Code

Web mocking

In the case of a package doing HTTP requests, you might want to test what happens when an error code is received for instance. To do that, you can use either httptest or webmockr (compatible with both httr and crul).

Temporarily modify the global state

To test what happens when, say, an environment variable has a particular value, one can set it temporarily within a test using the withr package. You could argue it's not technically mocking, but it's an useful trick. You can see it in action in keyring's tests.

To mock or… not to mock

Sometimes, you might not need mocking and can resort to an alternative approach instead, using the real thing/situation. You could say it's a less "unit" approach and requires more work.

Fake input data

For say a plotting or modelling library, you can tailor-make data. Comparing approaches or packages for creating fake data are beyond the scope of this post, so let's just name a few packages:

charlatan,
fakir,
salty (for "salting" clean data i.e. adding errors in it).

Stored data from a web API / a database

As explained in this discussion about testing web API packages, when testing a package accessing and munging web data you might want to separate testing of the data access and of the data munging, on the one hand because failures will be easier to trace back to a problem in the web API vs. your code, on the other hand to be able to selectively turn off some tests based on internet connection, the presence of an API key, etc. Storing and replaying HTTP requests is supported by:

vcr with webmockr,
httptest.

What about applying the same idea to packages using a database connection?

The test suite of dbplyr could be a good inspiration.
There is a nascent package called dbtest that is similar to httptest, but for databases.

Different operating systems

Say you want to be sure your packages builds correctly on another operating system… you can use R-hub package builder or maybe a continuous integration service.

Different system configurations or libraries

Regarding the case where you want to test your package when a suggested dependency is or is not installed, you can use the configuration script of a continuous integration service to have at least one build without that dependency:

I'd suggest skip_if_not_installed() plus some Travis wrangling so you have a build that doesn't have the package installed (probably by running remove.packages() in before_script)

— Hadley Wickham (@hadleywickham) October 4, 2019

Conclusion

In this post we offered a round-up of resources around mocking when unit testing R packages, as well as around not mocking. To learn about more packages for testing your package, refer to the list published on Locke Data's blog. Now, what if you're not sure about the best approach for that quirky thing you want to test, mocking or not mocking, and how exactly? Well, you can fall back on two methods: Reading the source code of other packages, and Asking for help! Good luck!

To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Any one interested in a function to quickly generate data with many predictors?

Posted: 28 Oct 2019 05:00 PM PDT

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A couple of months ago, I was contacted about the possibility of creating a simple function in simstudy to generate a large dataset that could include possibly 10's or 100's of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other "regularization" methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.

I'm presenting a new function here as a work-in-progress. I am putting it out there in case other folks have opinions about what might be most useful; feel free to let me know if you do. If not, I am likely to include something very similar to this in the next iteration of simstudy, which will be version 0.1.16.

Function genMultPred

In its latest iteration, the new function has three interesting arguments. The first two are predNorm and predBin, which are each vectors of length 2. The first value indicates the number of predictors to generate with either a standard normal distribution or a binary distribution, respectively. The second value in each vector represents the number of variables that will actually be predictive of the outcome. (Obviously, the second value cannot be greater than the first value.)

The third interesting argument is corStrength, which is a non-negative number indicating the overall strength of the correlation between the predictors. When corStrength is set to 0 (which is the default), the variables are generated assuming independence. When corStrength is non-zero, a random correlation matrix is generated using package clusterGeneration [Weiliang Qiu and Harry Joe. (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).] The corStrength value is passed on to the argument ratioLambda in the function genPositiveDefMat. As the value of corStrength increases, higher levels of correlation are induced in the random correlation matrix for the predictors.

Currently, the outcome can only have one of three distributions: normal, binomial, or Poisson.

One possible enhancement would be to allow the distributions of the predictors to have more flexibility. However, I'm not sure the added complexity would be worth it. Again, you could always take the more standard simstudy approach of function genData if you wanted more flexibility.

Here's the function, in case you want to take a look under the hood:

genMultPred <- function(n, predNorm, predBin,                           dist = "normal", sdy = 1, corStrength = 0) {        normNames <- paste0("n", 1:predNorm[1])    binNames <- paste0("b", 1:predBin[1])        ## Create the definition tables to be used by genData        defn <- data.table(varname = normNames,                       formula = 0,                       variance = 1,                       dist = "normal",                       link = "identity")        defb <- data.table(varname = binNames,                       formula = 0.5,                       variance = NA,                       dist = "binary",                       link = "identity")        defx <- rbind(defn, defb)    attr(defx, which = "id") <- "id"        ## Create the coefficient values - all normally distributed        ncoefs <- rnorm(predNorm[1], 0, 1)    setzero <- sample(1:predNorm[1], (predNorm[1] - predNorm[2]),                       replace = FALSE)    ncoefs[setzero] <- 0        bcoefs <- rnorm(predBin[1], 0, 1)    setzero <- sample(1:predBin[1], (predBin[1] - predBin[2]),                       replace = FALSE)    bcoefs[setzero] <- 0        coefs <- c(ncoefs, bcoefs)    names(coefs) <- c(normNames, binNames)        ## Generate the predictors        if (corStrength <= 0) {     # predictors are independent            dx <- genData(n, defx)          } else {            rLambda <- max(1, corStrength)      covx <- cov2cor(genPositiveDefMat(nrow(defx),                           lambdaLow = 1, ratioLambda = rLambda)$Sigma)      dx <- genCorFlex(n, defx, corMatrix = covx)          }        ## Generate the means (given the predictors)        mu <- as.matrix(dx[,-"id"]) %*% coefs    dx[, mu := mu]        ## Generate the outcomes based on the means        if (dist == "normal") {      dx[, y := rnorm(n, mu, sdy)]    } else if (dist == "binary") {      dx[, y := rbinom(n, 1, 1/(1 + exp(-mu)))]  # link = logit    } else if (dist == "poisson") {      dx[, y := rpois(n, exp(mu))]               # link = log    }         dx[, mu := NULL]        return(list(data = dx[], coefs = coefs))  }

A brief example

Here is an example with 7 normally distributed covariates and 4 binary covariates. Only 3 of the continuous covariates and 2 of the binary covariates will actually be predictive.

library(simstudy)  library(clusterGeneration)    set.seed(732521)    dx <- genMultPred(250, c(7, 3), c(4, 2))

The function returns a list of two objects. The first is a data.table containing the generated predictors and outcome:

round(dx$data, 2)

##       id    n1    n2    n3    n4    n5    n6    n7 b1 b2 b3 b4     y  ##   1:   1  0.15  0.12 -0.07 -1.38 -0.05  0.58  0.57  1  1  0  1 -1.07  ##   2:   2  1.42 -0.64  0.08  0.83  2.01  1.18  0.23  1  1  0  0  4.42  ##   3:   3 -0.71  0.77  0.94  1.59 -0.53 -0.05  0.26  0  0  0  0  0.09  ##   4:   4  0.35 -0.80  0.90 -0.79 -1.72 -0.16  0.09  0  0  1  1 -0.58  ##   5:   5 -0.22 -0.72  0.62  1.40  0.17  2.21 -0.45  0  1  0  1 -2.18  ##  ---                                                                  ## 246: 246 -1.04  1.62  0.40  1.46  0.80 -0.77 -1.27  0  0  0  0 -1.19  ## 247: 247 -0.85  1.56  1.39 -1.25 -0.82 -0.63  0.13  0  1  0  0 -0.70  ## 248: 248  0.72 -0.83 -0.04 -1.38  0.61 -0.71 -0.06  1  0  1  1  0.74  ## 249: 249 -0.15  1.62 -1.01 -0.79 -0.53  0.44 -0.46  1  1  1  1  0.95  ## 250: 250 -0.59  0.34 -0.31  0.18 -0.86 -0.90  0.22  1  0  1  0 -1.90

The second object is the set of coefficients that determine the average response conditional on the predictors:

round(dx$coefs, 2)

##    n1    n2    n3    n4    n5    n6    n7    b1    b2    b3    b4   ##  2.48  0.62  0.28  0.00  0.00  0.00  0.00  0.00  0.00  0.53 -1.21

Finally, we can "recover" the original coefficients with linear regression:

lmfit <- lm(y ~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + b1 + b2 + b3 + b4,               data = dx$data)

Here's a plot showing the 95% confidence intervals of the estimates along with the true values. The yellow lines are covariates where there is truly no association.

Addendum: correlation among predictors

Here is a pair of examples using the corStrength argument. In the first case, the observed correlations are close to 0, whereas in the second case, the correlations range from -0.50 to 0.25. The impact of corStrength will vary depending on the number of potential predictors.

set.seed(291212)    # Case 1    dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 0)  round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)

##       n1    n2    n3    n4    b1    b2  ## n1  1.00 -0.02  0.02  0.03 -0.01 -0.01  ## n2 -0.02  1.00 -0.01  0.03 -0.03  0.00  ## n3  0.02 -0.01  1.00  0.00 -0.04 -0.01  ## n4  0.03  0.03  0.00  1.00  0.06 -0.01  ## b1 -0.01 -0.03 -0.04  0.06  1.00 -0.01  ## b2 -0.01  0.00 -0.01 -0.01 -0.01  1.00

# Case 2    dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 50)  round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)

##       n1    n2    n3    n4    b1    b2  ## n1  1.00  0.09  0.08 -0.32  0.25  0.04  ## n2  0.09  1.00 -0.29 -0.47 -0.05 -0.02  ## n3  0.08 -0.29  1.00 -0.46 -0.01 -0.01  ## n4 -0.32 -0.47 -0.46  1.00 -0.20 -0.05  ## b1  0.25 -0.05 -0.01 -0.20  1.00 -0.04  ## b2  0.04 -0.02 -0.01 -0.05 -0.04  1.00

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Dogs of New York

Posted: 28 Oct 2019 10:18 AM PDT

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The other week I took a few publicly-available datasets that I use for teaching data visualization and bundled them up into an R package called nycdogs. The package has datasets on various aspects of dog ownership in New York City, and amongst other things you can draw maps with it at the zip code level. The package homepage has installation instructions and an example.

Using this data, I made a poster called Dogs of New York. It's a small multiple map of the relative prevalence of the twenty five most common dog breeds in the city, based on information from the city's dog license database. This morning, reflecting on a series of conversations I had with people about the original poster, I tweaked it a little further. In the original version, I used a discrete version of one of the Viridis palettes, initially the "Inferno" variant, and subsequently the "Plasma" one. These palettes are really terrific in everyday use for visualizations of all kinds, because they are vivid, colorblind-friendly, and perceptually uniform across their scale. I use them all the time, for instance with the Mortality in France poster I made last year.

When it came to using this palette in a map, though, I ran into an interesting problem. Here's a detail from the "Plasma" palette version of the poster.

Dogs of New York - Plasma version — Detail from the Plasma version of Dogs of New York

Now, these colors are vivid. But when I showed it to people, opinion was pretty evenly split between people who intuitively saw the darker, purplish areas as signifying "more", and people who intuitively saw the warmer, yellowish areas as signifying "more". So, for example, a number of people asked if I could make the map with the colors range flipped, with yellow meaning "more" or "high" (and indeed, in the very first version of the map I originally had done this). A friend with conflicting intuitions incisively noted that she associated darker colors with "more", in contrast to lighter colors, but also associated warmer colors with "more". The fact that the scale moved from a cooler, darker color through to a different, warmer, lighter color was thus confusing. The warm and vivid quality of the yellow end of the Plasma spectrum seemed to be particularly prone to this confusion.

I think the small-multiple character of the graph exacerbated this confusion. It shows the selected dog breeds from most (top left) to least (bottom right) common, whereas the guide to the scale (in the top right) showed the scale running from low to high values, or least to most common.

In the end, after a bit of experimentation I decided to redo the figure in a one-hue HCL scale, the "Oranges" palette from the Colorspace package. I also reversed the guide and added a few other small details to aid the viewer in interpreting the graph. Here's the final version.

Dogs of New York Poster — Dogs of New York

The palette isn't quite as immediately vivid as the Viridis version, but it seems to solve the problems of interpretation, and that's more important. The image is made to be blown up to quite a large size, which I plan on doing myself. To that end, here's a PDF version of the poster as well.

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

Spelunking macOS ‘ScreenTime’ App Usage with R

Posted: 28 Oct 2019 07:33 AM PDT

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Apple has brought Screen Time to macOS for some time now and that means it has to store this data somewhere. Thankfully, Sarah Edwards has foraged through the macOS filesystem for us and explained where these bits of knowledge are in her post, Knowledge is Power! Using the macOS/iOS knowledgeC.db Database to Determine Precise User and Application Usage, which ultimately reveals the data lurks in ~/Library/Application Support/Knowledge/knowledgeC.db. Sarah also has a neat little Python utility dubbed APOLLO (Apple Pattern of Life Lazy Output'er) which has a smattering of knowledgeC.db canned SQL queries that cover a myriad of tracked items.

Today, we'll show how to work with this database in R and the {tidyverse} to paint our own pictures of application usage.

There are quite a number of tables in the knowledgeC.db SQLite 3 database:

That visual schema was created in OmniGraffle via a small R script that uses the OmniGraffle automation framework. The OmniGraffle source files are also available upon request.

Most of the interesting bits (for any tracking-related spelunking) are in the ZOBJECT table and to get a full picture of usage we'll need to join it with some other tables that are connected via a few foreign keys:

There are a few ways to do this in {tidyverse} R. The first is an extended straight SQL riff off of one of Sarah's original queries:

library(hrbrthemes) # for ggplot2 machinations  library(tidyverse)    # source the knowledge db  kdb <- src_sqlite("~/Library/Application Support/Knowledge/knowledgeC.db")    tbl(    kdb,     sql('  SELECT    ZOBJECT.ZVALUESTRING AS "app",       (ZOBJECT.ZENDDATE - ZOBJECT.ZSTARTDATE) AS "usage",        CASE ZOBJECT.ZSTARTDAYOFWEEK         WHEN "1" THEN "Sunday"        WHEN "2" THEN "Monday"        WHEN "3" THEN "Tuesday"        WHEN "4" THEN "Wednesday"        WHEN "5" THEN "Thursday"        WHEN "6" THEN "Friday"        WHEN "7" THEN "Saturday"      END "dow",      ZOBJECT.ZSECONDSFROMGMT/3600 AS "tz",      DATETIME(ZOBJECT.ZSTARTDATE + 978307200, \'UNIXEPOCH\') as "start_time",       DATETIME(ZOBJECT.ZENDDATE + 978307200, \'UNIXEPOCH\') as "end_time",      DATETIME(ZOBJECT.ZCREATIONDATE + 978307200, \'UNIXEPOCH\') as "created_at",       CASE ZMODEL        WHEN ZMODEL THEN ZMODEL        ELSE "Other"      END "source"    FROM      ZOBJECT       LEFT JOIN        ZSTRUCTUREDMETADATA       ON ZOBJECT.ZSTRUCTUREDMETADATA = ZSTRUCTUREDMETADATA.Z_PK       LEFT JOIN        ZSOURCE       ON ZOBJECT.ZSOURCE = ZSOURCE.Z_PK       LEFT JOIN        ZSYNCPEER      ON ZSOURCE.ZDEVICEID = ZSYNCPEER.ZDEVICEID    WHERE      ZSTREAMNAME = "/app/usage"'    )) -> usage    usage  ## # Source:   SQL [?? x 8]  ## # Database: sqlite 3.29.0 [/Users/johndoe/Library/Application Support/Knowledge/knowledgeC.db]  ##    app                      usage dow         tz start_time          end_time            created_at         source         ##                                                                                    ##  1 com.bitrock.appinstaller    15 Friday      -4 2019-10-05 01:11:27 2019-10-05 01:11:42 2019-10-05 01:11:… MacBookPro13…  ##  2 com.tinyspeck.slackmacg…  4379 Tuesday     -4 2019-10-01 13:19:24 2019-10-01 14:32:23 2019-10-01 14:32:… Other          ##  3 com.tinyspeck.slackmacg…  1167 Tuesday     -4 2019-10-01 18:19:24 2019-10-01 18:38:51 2019-10-01 18:38:… Other          ##  4 com.tinyspeck.slackmacg…  1316 Tuesday     -4 2019-10-01 19:13:49 2019-10-01 19:35:45 2019-10-01 19:35:… Other          ##  5 com.tinyspeck.slackmacg… 12053 Thursday    -4 2019-10-03 12:25:18 2019-10-03 15:46:11 2019-10-03 15:46:… Other          ##  6 com.tinyspeck.slackmacg…  1258 Thursday    -4 2019-10-03 15:50:16 2019-10-03 16:11:14 2019-10-03 16:11:… Other          ##  7 com.tinyspeck.slackmacg…  2545 Thursday    -4 2019-10-03 16:24:30 2019-10-03 17:06:55 2019-10-03 17:06:… Other          ##  8 com.tinyspeck.slackmacg…   303 Thursday    -4 2019-10-03 17:17:10 2019-10-03 17:22:13 2019-10-03 17:22:… Other          ##  9 com.tinyspeck.slackmacg…  9969 Thursday    -4 2019-10-03 17:33:38 2019-10-03 20:19:47 2019-10-03 20:19:… Other          ## 10 com.tinyspeck.slackmacg…  2813 Thursday    -4 2019-10-03 20:19:52 2019-10-03 21:06:45 2019-10-03 21:06:… Other          ## # … with more rows

Before explaining what that query does, let's rewrite it {dbplyr}-style:

tbl(kdb, "ZOBJECT") %>%     mutate(      created_at = datetime(ZCREATIONDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),      start_dow = case_when(        ZSTARTDAYOFWEEK == 1 ~ "Sunday",        ZSTARTDAYOFWEEK == 2 ~ "Monday",        ZSTARTDAYOFWEEK == 3 ~ "Tuesday",        ZSTARTDAYOFWEEK == 4 ~ "Wednesday",        ZSTARTDAYOFWEEK == 5 ~ "Thursday",        ZSTARTDAYOFWEEK == 6 ~ "Friday",        ZSTARTDAYOFWEEK == 7 ~ "Saturday"      ),      start_time = datetime(ZSTARTDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),      end_time = datetime(ZENDDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),      usage = (ZENDDATE - ZSTARTDATE),      tz = ZSECONDSFROMGMT/3600     ) %>%     left_join(tbl(kdb, "ZSTRUCTUREDMETADATA"), c("ZSTRUCTUREDMETADATA" = "Z_PK")) %>%     left_join(tbl(kdb, "ZSOURCE"), c("ZSOURCE" = "Z_PK")) %>%     left_join(tbl(kdb, "ZSYNCPEER"), "ZDEVICEID") %>%     filter(ZSTREAMNAME == "/app/usage")  %>%     select(      app = ZVALUESTRING, created_at, start_dow, start_time, end_time, usage, tz, source = ZMODEL    ) %>%     mutate(source = ifelse(is.na(source), "Other", source)) %>%     collect() %>%     mutate_at(vars(created_at, start_time, end_time), as.POSIXct) -> usage

What we're doing is pulling out the day of week, start/end usage times & timezone info, app bundle id, source of the app interactions and the total usage time for each entry along with when that entry was created. We need to do some maths since Apple stores time-y whime-y info in its own custom format, plus we need to convert numeric DOW to labeled DOW.

The bundle ids are pretty readable, but they're not really intended for human consumption, so we'll make a translation table for the bundle id to app name by using the mdls command.

list.files(    c("/Applications", "/System/Library/CoreServices", "/Applications/Utilities", "/System/Applications"), # main places apps are stored (there are potentially more but this is sufficient for our needs)    pattern = "\\.app$",     full.names = TRUE  ) -> apps    x <- sys::exec_internal("mdls", c("-name", "kMDItemCFBundleIdentifier", "-r", apps))    # mdls null (\0) terminates each entry so we have to do some raw surgery to get it into a format we can use  x$stdout[x$stdout == as.raw(0)] <- as.raw(0x0a)    tibble(    name = gsub("\\.app$", "", basename(apps)),    app = read_lines(x$stdout)   ) -> app_trans    app_trans  ## # A tibble: 270 x 2  ##    name                    app                                      ##                                                           ##  1 1Password 7             com.agilebits.onepassword7               ##  2 Adium                   com.adiumX.adiumX                        ##  3 Agenda                  com.momenta.agenda.macos                 ##  4 Alfred 4                com.runningwithcrayons.Alfred            ##  5 Amazon Music            com.amazon.music                         ##  6 Android File Transfer   com.google.android.mtpviewer             ##  7 Awsaml                  com.rapid7.awsaml                        ##  8 Bartender 2             com.surteesstudios.Bartender             ##  9 BBEdit                  com.barebones.bbedit                     ## 10 BitdefenderVirusScanner com.bitdefender.BitdefenderVirusScanner  ## # … with 260 more rows

The usage info goes back ~30 days, so let's do a quick summary of the top 10 apps and their total usage (in hours):

usage %>%     group_by(app) %>%     summarise(first = min(start_time), last = max(end_time), total = sum(usage, na.rm=TRUE)) %>%     ungroup() %>%     mutate(total = total / 60 / 60) %>% # hours    arrange(desc(total)) %>%     left_join(app_trans) -> overall_usage    overall_usage %>%     slice(1:10) %>%     left_join(app_trans) %>%    mutate(name = fct_inorder(name) %>% fct_rev()) %>%    ggplot(aes(x=total, y=name)) +     geom_segment(aes(xend=0, yend=name), size=5, color = ft_cols$slate) +    scale_x_comma(position = "top") +    labs(      x = "Total Usage (hrs)", y = NULL,      title = glue::glue('App usage in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days')    ) +    theme_ft_rc(grid="X")

There's a YUGE flaw in the current way macOS tracks application usage. Unlike iOS where apps really don't run simultaneously (with iPadOS they kinda can/do, now), macOS apps are usually started and left open along with other apps. Apple doesn't do a great job identifying only active app usage activity so many of these usage numbers are heavily inflated. Hopefully that will be fixed by macOS 10.15.

We have more data at our disposal, so let's see when these apps get used. To do that, we'll use segments to plot individual usage tracks and color them by weekday/weekend usage (still limiting to top 10 for blog brevity):

usage %>%     filter(app %in% overall_usage$app[1:10]) %>%     left_join(app_trans) %>%    mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>%     ggplot() +    geom_segment(      aes(        x = start_time, xend = end_time, y = name, yend = name,         color = ifelse(start_dow %in% c("Saturday", "Sunday"), "Weekend", "Weekday")      ),      size = 10,    ) +    scale_x_datetime(position = "top") +    scale_colour_manual(      name = NULL,      values = c(        "Weekend" = ft_cols$light_blue,         "Weekday" = ft_cols$green      )    ) +    guides(      colour = guide_legend(override.aes = list(size = 1))    ) +    labs(      x = NULL, y = NULL,      title = glue::glue('Top 10 App usage on this Mac in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days'),      subtitle = "Each segment represents that app being 'up' (Open to Quit).\nUnfortunately, this is what Screen Time uses for its calculations on macOS"    ) +    theme_ft_rc(grid="X") +    theme(legend.position = c(1, 1.25)) +    theme(legend.justification = "right")

I'm not entirely sure "on this Mac" is completely accurate since I think this syncs across all active Screen Time devices due to this (n is in seconds):

count(usage, source, wt=usage, sort=TRUE)  ## # A tibble: 2 x 2  ##   source               n  ##                 ## 1 Other          4851610  ## 2 MacBookPro13,2 1634137

The "Other" appears to be the work-dev Mac but it doesn't have the identifier mapped so I think that means it's the local one and that the above chart is looking at Screen Time across all devices. I literally (right before this sentence) enabled Screen Time on my iPhone so we'll see if that ends up in the database and I'll post a quick update if it does.

We'll take one last look by day of week and use a heatmap to see the results:

count(usage, start_dow, app, wt=usage/60/60) %>%     left_join(app_trans) %>%    filter(app %in% overall_usage$app[1:10]) %>%     mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>%     mutate(start_dow = factor(start_dow, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>%     ggplot() +    geom_tile(aes(start_dow, name, fill = n), color = "#252a32", size = 0.75) +    scale_x_discrete(expand = c(0, 0.5), position = "top") +    scale_y_discrete(expand = c(0, 0.5)) +    scale_fill_viridis_c(direction = -1, option = "magma", name = "Usage (hrs)") +    labs(      x = NULL, y = NULL,      title = "Top 10 App usage by day of week"    ) +    theme_ft_rc(grid="")

I really need to get into the habit of using the RStudio Server access features of RSwitch over Chrome so I can get RSwitch into the top 10, but some habits (and bookmarks) die hard.

FIN

Apple's Screen Time also tracks "category", which is something we can pick up from each application's embedded metadata. We'll do that in a follow-up post along with seeing whether we can capture iOS usage now that I've enabled Screen Time on those devices as well.

Keep spelunking the knowledgeC.db table(s) and blog about or reply in the comments with any interesting nuggets you find.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now

The Chaos Game: an experiment about fractals, recursivity and creative coding

Posted: 28 Oct 2019 06:00 AM PDT

[This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Mathematics, rightly viewed, possesses not only truth, but supreme beauty (Bertrand Russell)

You have a pentagon defined by its five vertex. Now, follow these steps:

Step 0: take a point inside the pentagon (it can be its center if you want to do it easy). Keep this point in a safe place.
Step 1: choose a vertex randomly and take the midpoint between both of them (the vertex and the original point). Keep also this new point. Repeat Step 1 one more time.
Step 2: compare the last two vertex that you have chosen. If they are the same, choose another with this condition: if it's not a neighbor of the last vertex you chose, keep it. If it is a neighbor, choose another vertex randomly until you choose a not-neighbor one. Then, take the midpoint between the last point you obtained and this new vertex. Keep also this new point.
Step 3: Repeat Step 2 a number of times and after that, do a plot with the set of points that you obtained.

If you repeat these steps 10 milion times, you will obtain this stunning image:

I love the incredible ability of maths to create beauty. More concretely, I love the fact of how repeating extremely simple operations can bring you to unexpected places. Would you expect that the image created with the initial naive algorithm would be that? I wouldn't. Even knowing the result I cannot imagine how those simple steps can produce it.

The image generated by all the points repeat itself at different scales. This characteristic, called self-similarity, is property of fractals and make them extremely attractive. Step 2 is the key one to define the shape of the image. Apart of comparing two previous vertex as it's defined in the algorithm above, I implemented two other versions:

one version where the currently chosen vertex cannot be the same as the previously chosen vertex.
another one where the currently chosen vertex cannot neighbor the previously chosen vertex if the three previously chosen vertices are the same (note that this implementation is the same as the original but comparing with three previous vertex instead two).

These images are the result of applying the three versions of the algorithm to a square, a pentagon, a hexagon and a heptagon (a row for each polygon and a column for each algorithm):

From a technical point of view I used Rcppto generate the set of points. Since each iteration depends on the previous one, the loop cannot easily vectorised and C++ is a perfect option to avoid the bottleneck if you use another technique to iterate. In this case, instead of writing the C++ directly inside the R file with cppFunction(), I used a stand-alone C++ file called chaos_funcs.cpp to write the C++ code that I load into R using sourceCpp().

Some days ago, I gave a tutorial at the coding club of the University Carlos III in Madrid where we worked with the integration of C++ and R to create beautiful images of strange attractors. The tutorial and the code we developed is here. You can also find the code of this experiment here. Enjoy!

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now