[R-bloggers] Data Science on Rails: Analyzing Customer Churn (and 3 more aRticles)
- Data Science on Rails: Analyzing Customer Churn
- Spatial Data Analysis with INLA
- tidync: scientific array data from NetCDF in R
- November Thanksgiving – Data Science Style!
Data Science on Rails: Analyzing Customer Churn

Posted: 04 Nov 2019 11:00 PM PST [This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].

Customer Relationship Management (CRM) is not only about acquiring new customers but especially about retaining existing ones, because acquisition is often much more expensive than retention. In this post, we learn how to analyze the reasons for customer churn (i.e. customers leaving the company). We do this with a very convenient point-and-click interface for doing data science on top of R, so read on!

The most popular Graphical User Interface (GUI) for R is of course RStudio, but there are other interfaces as well. If you fear the terror of the blank screen, you can use the very convenient and sophisticated data science GUI Rattle. Under R you just install the rattle package and start the GUI (a short sketch follows the data description below).

The dataset we will be using is from the excellent new book Quantitative Methods for Management: A Practical Approach (authors: Miguel Angel Canela, Inés Alegre and Alberto Ibarra) and is publicly available; you can load the data directly from the GitHub repository (churn.csv) and save it to a directory of your choice. The data set is a random sample of 5,000 customers of a mobile phone services company with the following attributes:

- ID: a customer ID (the phone number).
- ACLENGTH: the number of days the account had been active at the beginning of the period monitored (September 30th).
- INTPLAN: a dummy for having an international plan during at least one month of the third quarter.
- DATAPLAN: the same for a data plan.
- DATAGB: the data allowance of the data plan (either 0, 100M, 250M, 500M, 1G, 1.5G or 2G).
- OMMIN: the total minutes in calls to Omicron mobile phone numbers during the third quarter.
- OMCALL: the total number of calls to Omicron mobile phone numbers during the third quarter.
- OTMIN: the total minutes in calls to other mobile networks during the third quarter.
- OTCALL: the total number of calls to other networks during the third quarter.
- NGMIN: the total minutes in calls to non-geographic numbers (typically used by helplines and call centers) during the third quarter.
- NGCALL: the total number of calls to non-geographic numbers during the third quarter.
- IMIN: the total minutes in international calls during the third quarter.
- ICALL: the total number of international calls during the third quarter.
- CUSCALL: the total number of calls to the customer service, up to September 30th.
- CHURN: a dummy for churning during the period monitored.

The last attribute, CHURN, is the target variable we want to explain.
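To get started, a minimal sketch (the rattle package is on CRAN; the data URL is the one used in the benchmark code at the end of this article, and saving the file to the working directory is an assumption):

install.packages("rattle")   # the Rattle GUI
library(rattle)
# download the churn data so it can be loaded from Rattle's Data tab
download.file("https://raw.githubusercontent.com/quants-book/CSV_Files/master/churn.csv",
  destfile = "churn.csv")
rattle()                     # start the point-and-click interface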
To get the same result, deactivate the data partitioning so that the whole sample is used. We can clearly see that ACLENGTH, the number of days the account had been active at the beginning of the period monitored, does not have much of an effect on churn. We also test a categorical variable; again, it doesn't seem to have much influence.

The raw text output of the tree model is not very readable, so we want a clearer representation as a tree diagram; for that we use Rattle's tree-drawing option. We can see that customers with no international plan have a very low probability of churning, while for customers with such a plan it depends on other variables whether they will churn or not. We can get the exact rules from the Rules output of the model.

In a real data science project, the marketing team would now go through all those rules to see whether they make sense and whether additional actions can be inferred. For example, rules 14 and 15 in combination might be interesting: in both rules customers have an international plan, yet one group doesn't really need it, which leads to their not churning, while the other group, which uses it, churns with high probability (remember IMIN, the total minutes in international calls).

As a last step we want to check the quality of our model. We switch to Rattle's Evaluate tab and build the error matrix. We see that by using this tree model we have, e.g., an overall error of 17.5%, which is not too bad. We can try to drive this error rate further down by building more sophisticated models.
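Rattle builds its decision trees with the rpart package, so a rough script equivalent of the point-and-click steps could look like the following sketch (the local file path and the 70/30 split are assumptions):

library(rpart)
churn <- read.csv("data/churn.csv")
churn$CHURN <- factor(churn$CHURN)
set.seed(42)                                    # reproducible train/validation split
idx <- sample(nrow(churn), 0.7 * nrow(churn))
tree <- rpart(CHURN ~ . - ID, data = churn[idx, ], method = "class")
printcp(tree)                                   # the splits found and the complexity table
pred <- predict(tree, churn[-idx, ], type = "class")
table(Predicted = pred, Actual = churn[-idx, "CHURN"])   # confusion matrix on the hold-out set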
In a practical setting, it might also be that it is more expensive to lose a customer who was predicted to stay than to keep a customer who was predicted to churn. In the first case, you lose a whole customer and all of her revenue; in the second case, you might only have offered a better plan with a smaller profit margin. Different models make different kinds of errors, and it depends on the setting which errors are preferable.

One very nice feature of Rattle is that your whole session is recorded as real, executable R code. You can use that to see what is really going on under the hood and to reproduce and reuse your analyses! Just go to the Log tab, from which you can export the whole script with a single click.

To conclude this post, as a sanity check and benchmark, we let the data set run through the OneR package:

library(OneR)
#churn <- read.csv("https://raw.githubusercontent.com/quants-book/CSV_Files/master/churn.csv")
churn <- read.csv("data/churn.csv")
data <- maxlevels(optbin(churn))

## Warning in optbin.data.frame(churn): target is numeric

model <- OneR(data, verbose = TRUE)

##
##     Attribute Accuracy
## 1 * INTPLAN   83.38%
## 2   ACLENGTH  80.64%
## 2   DATAPLAN  80.64%
## 2   DATAGB    80.64%
## 2   OMMIN     80.64%
## 2   OMCALL    80.64%
## 2   OTMIN     80.64%
## 2   OTCALL    80.64%
## 2   NGMIN     80.64%
## 2   NGCALL    80.64%
## 2   IMIN      80.64%
## 2   ICALL     80.64%
## 2   CUSCALL   80.64%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)

##
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
##
## Rules:
## If INTPLAN = 0 then CHURN = 0
## If INTPLAN = 1 then CHURN = 1
##
## Accuracy:
## 4169 of 5000 instances classified correctly (83.38%)
##
## Contingency table:
##      INTPLAN
## CHURN      0     1  Sum
##   0   * 3839   193 4032
##   1     638 * 330   968
##   Sum   4477   523 5000
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 712.58, df = 1, p-value < 2.2e-16

plot(model)

prediction <- predict(model, data)
eval_model(prediction, data)

##
## Confusion matrix (absolute):
##           Actual
## Prediction    0    1  Sum
##        0   3839  638 4477
##        1    193  330  523
##        Sum 4032  968 5000
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.77 0.13 0.90
##        1   0.04 0.07 0.10
##        Sum 0.81 0.19 1.00
##
## Accuracy:
## 0.8338 (4169/5000)
##
## Error rate:
## 0.1662 (831/5000)
##
## Error rate reduction (vs. base rate):
## 0.1415 (p-value = 3.256e-07)

Again, the problem with the international plan is corroborated with high accuracy… Dear Marketing team: Immediate action is required!
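A hedged sketch of the asymmetric-cost idea discussed above, reusing the prediction object from the OneR code (the cost figures are invented for illustration only):

conf <- table(Predicted = prediction, Actual = data$CHURN)
cost_lost_customer <- 100    # assumed cost of a churner we predicted to stay
cost_retention_offer <- 10   # assumed cost of offering a cheaper plan to a loyal customer
conf["0", "1"] * cost_lost_customer + conf["1", "0"] * cost_retention_offer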
Spatial Data Analysis with INLA

Posted: 04 Nov 2019 04:00 PM PST [This article was first published on R on Coding Club UC3M, and kindly contributed to R-bloggers].

by Virgilio Gómez Rubio
Introduction

In this session I will focus on Bayesian inference using the integrated nested Laplace approximation (INLA). This methodology provides approximate Bayesian inference for latent Gaussian models, a class that covers the spatial models used below.

Bayesian inference

A Bayesian hierarchical model can be defined as follows:

\[
y_i \mid \mathbf{x}, \mathbf{\theta} \sim f(y_i \mid \mathbf{x}, \mathbf{\theta})
\]

\[
\mathbf{x} \mid \mathbf{\theta} \sim GMRF(\mathbf{\theta})
\]

\[
\mathbf{\theta} \sim \pi(\mathbf{\theta})
\]

Here, \(\mathbf{y} = (y_1, \ldots, y_n)\) is a vector of responses, \(\mathbf{x}\) is a vector of latent effects (assigned a Gaussian Markov random field prior) and \(\mathbf{\theta}\) is a vector of hyperparameters.

Fitting a Bayesian model means computing the posterior distribution of the latent effects and hyperparameters, which by Bayes' rule is

\[
\pi(\mathbf{x}, \mathbf{\theta} \mid \mathbf{y}) \propto \pi(\mathbf{y} \mid \mathbf{x}, \mathbf{\theta})\, \pi(\mathbf{x} \mid \mathbf{\theta})\, \pi(\mathbf{\theta}).
\]

Once the joint posterior distribution is known, we could compute posterior summaries (means, credible intervals, etc.) of any quantity of interest. In practice, however, the joint posterior is seldom available in closed form.

Inference with Markov chain Monte Carlo

In the last 20-30 years some computational approaches have been proposed to estimate the joint posterior distribution, the most popular being Markov chain Monte Carlo (MCMC) methods, which draw samples from it. However, we may be interested in a single parameter or a subset of the parameters only. Using the samples it is possible to compute the posterior distribution of any function of the model parameters, but obtaining them can be computationally very expensive.

Integrated Nested Laplace Approximation

Sometimes we only need marginal inference on some parameters, i.e., the posterior marginals

\[
\pi(x_i \mid \mathbf{y}) = \int \pi(x_i \mid \mathbf{\theta}, \mathbf{y})\, \pi(\mathbf{\theta} \mid \mathbf{y})\, d\mathbf{\theta}
\]

and

\[
\pi(\theta_j \mid \mathbf{y}) = \int \pi(\mathbf{\theta} \mid \mathbf{y})\, d\mathbf{\theta}_{-j}.
\]

Now we are dealing with (many) univariate distributions, which are easier to approximate. This is the approach taken by the integrated nested Laplace approximation (INLA; Rue et al., 2009), implemented in the INLA (also known as R-INLA) package. The posterior distribution of the hyperparameters \(\pi(\mathbf{\theta}\mid \mathbf{y})\) is approximated using different methods, and the marginals of the latent effects are then obtained by numerical integration over the hyperparameters. A toy example of the basic workflow follows; the table after it lists some of the latent models that can be specified through the f() function and that are useful for spatial analysis.
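As a toy illustration of this workflow (simulated Gaussian data; not part of the spatial analysis that follows):

library(INLA)
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100, sd = 0.1)
fit <- inla(y ~ x, data = d)       # Gaussian likelihood by default
# posterior marginal of the coefficient of x
marg <- fit$marginals.fixed$x
plot(marg, type = "l", main = "Posterior marginal of the coefficient of x")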
Name in f() | Model
---|---
generic0 | \(\Sigma=\frac{1}{\tau}Q^{-1}\)
generic1 | \(\Sigma=\frac{1}{\tau}(I_n-\frac{\rho}{\lambda_{max}}C)^{-1}\)
besag | Intrinsic CAR
besagproper | Proper CAR
bym | Convolution model
rw2d | Random walk 2-D
matern2d | Matérn correlation (discrete)
slm | Spatial lag model
spde | Matérn correlation (continuous)
Dataset: Leukemia in upstate New York
To illustrate how spatial models are fitted with INLA
, the New York leukemia
dataset will be used. This has been widely analyzed in the literature (see, for
example, Waller and Gotway, 2004) and it is available in the DClusterm
package. The dataset records a number of cases of leukemia in upstate New York
at the census tract level. Some of the variables in the dataset are:
- Cases: Number of leukemia cases in the period 1978-1982.
- POP8: Population in 1980.
- PCTOWNHOME: Proportion of people who own their home.
- PCTAGE65P: Proportion of people aged 65 or more.
- AVGIDIST: Average inverse distance to the nearest Trichloroethene (TCE) site.
Note that the interest is in the exposure to TCE, using AVGIDIST as a proxy for exposure. Variables PCTOWNHOME and PCTAGE65P will act as possible confounders that should be included in the model. However, we will not include them here because we want to test how the spatial latent effects capture residual spatial variation.
The dataset can be loaded as follows:
library(spdep)
library(DClusterm)
data(NY8)
Given that the interest is in studying the risk of leukemia in upstate New York,
the expected number of cases is computed first. This is done by computing the
overall mortality rate (total cases divided by total population) and
multiplying the population by it:
rate <- sum(NY8$Cases) / sum(NY8$POP8)
NY8$Expected <- NY8$POP8 * rate
Once the expected number of cases is obtained, a raw estimate of risk can be
obtained with the standardized mortality ratio (SMR), which is computed as
the number of observed cases divided by the number of expected cases:
NY8$SMR <- NY8$Cases / NY8$Expected
Disease mapping
In Epidemiology it is important to produce maps to show the spatial distribution
of the relative risk. In this example, we will focus on Syracuse city to reduce
the computation time to produce the map. Hence, we create an index
with the areas in Syracuse city:
# Subset Syracuse city
syracuse <- which(NY8$AREANAME == "Syracuse city")
A disease map can be simply created with function spplot
(in package
sp
):
library(viridis)
## Loading required package: viridisLite
spplot(NY8[syracuse, ], "SMR",
  #at = c(0.6, 0.9801, 1.055, 1.087, 1.125, 13),
  col.regions = rev(magma(16))) #gray.colors(16, 0.9, 0.4)
Interactive maps can be easily created with the tmap
package:
library(tmap)
tmap_mode("view")
SMR_map <- tm_shape(NY8[syracuse, ]) +
  tm_fill(col = "SMR", alpha = 0.35) +
  tm_borders() +
  tm_shape(TCE) + tm_dots(col = "red") # Add TCE sites
widgetframe::frameWidget(print(SMR_map))
Note that the previous map also includes the locations of the 11
TCE-contaminated sites and that this can be seen by zooming out.
Mixed-effects models
Poisson regression
The first model that we will consider is a Poisson model with no latent random
effects as this will provide a baseline to compare to other models. The model
to fit is:
\[
O_i|\mu_i \sim Po(\mu_i)
\]
\[
\mu_i = E_i \theta_i
\]
\[
\log(\theta_i) = \beta_0 + \beta_1 AVGIDIST_i
\]
In order to fit the model with INLA
, function inla
is used:
library(INLA)
m1 <- inla(Cases ~ 1 + AVGIDIST,
  data = as.data.frame(NY8), family = "poisson",
  E = NY8$Expected,
  control.predictor = list(compute = TRUE),
  control.compute = list(dic = TRUE, waic = TRUE))
Note that it works similarly to the glm
function. Here, argument
E
is used for the expected number of cases. Alternatively, the expected counts could enter the linear predictor on the log-scale as an offset (a sketch of this alternative follows below).
Other arguments are set to compute the marginals of the model parameters
(with control.predictor
) and to compute some model choice criteria
(with control.compute
).
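A sketch of the offset alternative mentioned above (assuming INLA accepts offset() in the formula, which should give an equivalent Poisson model):

m1.off <- inla(Cases ~ 1 + AVGIDIST + offset(log(Expected)),
  data = as.data.frame(NY8), family = "poisson",
  control.predictor = list(compute = TRUE),
  control.compute = list(dic = TRUE, waic = TRUE))
summary(m1.off)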
Next, the summary of the model can be obtained:
summary(m1)
## ## Call: ## c("inla(formula = Cases ~ 1 + AVGIDIST, family = \"poisson\", data ## = as.data.frame(NY8), ", " E = NY8$Expected, control.compute = ## list(dic = TRUE, waic = TRUE), ", " control.predictor = ## list(compute = TRUE))") ## Time used: ## Pre = 0.368, Running = 0.0968, Post = 0.0587, Total = 0.524 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -0.065 0.045 -0.155 -0.065 0.023 -0.064 0 ## AVGIDIST 0.320 0.078 0.160 0.322 0.465 0.327 0 ## ## Expected number of effective parameters(stdev): 2.00(0.00) ## Number of equivalent replicates : 140.25 ## ## Deviance Information Criterion (DIC) ...............: 948.12 ## Deviance Information Criterion (DIC, saturated) ....: 418.75 ## Effective number of parameters .....................: 2.00 ## ## Watanabe-Akaike information criterion (WAIC) ...: 949.03 ## Effective number of parameters .................: 2.67 ## ## Marginal log-Likelihood: -480.28 ## Posterior marginals for the linear predictor and ## the fitted values are computed
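A small additional sketch (not in the original post): the fixed-effect marginal can be transformed to the relative-risk scale with inla.tmarginal():

marg.rr <- inla.tmarginal(exp, m1$marginals.fixed$AVGIDIST)
inla.zmarginal(marg.rr)   # posterior summary of exp(beta_1), the relative risk per unit of AVGIDIST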
Poisson regression with random effects
Latent random effects can be added to the model to account for overdispersion
by including i.i.d. Gaussian random effects in the linear predictor, as
follows:
\[
O_i|\mu_i \sim Po(\mu_i)
\]
\[
\mu_i = E_i \theta_i
\]
\[
\log(\theta_i) = \beta_0 + \beta_1 AVGIDIST_i + u_i
\]
\[
u_i \sim N(0, \tau)
\]
In order to fit the model with INLA
an index to identify
the random effects (ID
) is created first. Latent random effects
are specified with the f
-function in INLA
:
NY8$ID <- 1:nrow(NY8)
m2 <- inla(Cases ~ 1 + AVGIDIST + f(ID, model = "iid"),
  data = as.data.frame(NY8), family = "poisson",
  E = NY8$Expected,
  control.predictor = list(compute = TRUE),
  control.compute = list(dic = TRUE, waic = TRUE))
Now the summary of the model includes information about the random effects:
summary(m2)
## ## Call: ## c("inla(formula = Cases ~ 1 + AVGIDIST + f(ID, model = \"iid\"), ## family = \"poisson\", ", " data = as.data.frame(NY8), E = ## NY8$Expected, control.compute = list(dic = TRUE, ", " waic = ## TRUE), control.predictor = list(compute = TRUE))" ) ## Time used: ## Pre = 0.236, Running = 0.315, Post = 0.0744, Total = 0.625 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -0.126 0.064 -0.256 -0.125 -0.006 -0.122 0 ## AVGIDIST 0.347 0.105 0.139 0.346 0.558 0.344 0 ## ## Random effects: ## Name Model ## ID IID model ## ## Model hyperparameters: ## mean sd 0.025quant 0.5quant 0.975quant mode ## Precision for ID 3712.34 11263.70 3.52 6.94 39903.61 5.18 ## ## Expected number of effective parameters(stdev): 54.95(30.20) ## Number of equivalent replicates : 5.11 ## ## Deviance Information Criterion (DIC) ...............: 926.93 ## Deviance Information Criterion (DIC, saturated) ....: 397.56 ## Effective number of parameters .....................: 61.52 ## ## Watanabe-Akaike information criterion (WAIC) ...: 932.63 ## Effective number of parameters .................: 57.92 ## ## Marginal log-Likelihood: -478.93 ## Posterior marginals for the linear predictor and ## the fitted values are computed
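As an aside (not in the original post), the random-effect precision can be summarized on the more interpretable standard-deviation scale; the hyperparameter name is taken from the summary above:

marg.prec <- m2$marginals.hyperpar[["Precision for ID"]]
marg.sd <- inla.tmarginal(function(x) 1 / sqrt(x), marg.prec)
inla.zmarginal(marg.sd)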
Add point estimates for mapping
The posterior means estimated with these two models can be added to
the SpatialPolygonsDataFrame
NY8
to be plotted later:
NY8$FIXED.EFF <- m1$summary.fitted[, "mean"]
NY8$IID.EFF <- m2$summary.fitted[, "mean"]
spplot(NY8[syracuse, ], c("SMR", "FIXED.EFF", "IID.EFF"), col.regions = rev(magma(16)))
Note how the model smooths the relative risk.
Spatial Models for Lattice Data
Lattice data involves data measured at different areas, e.g.,
neighborhoods, cities, provinces, states, etc. Spatial dependence appears because neighbor areas will show similar
values of the variable of interest.
Models for lattice data
We have observations \(y=\{y_i\}_{i=1}^n\) from the \(n\) areas. \(\mathbf{y}\) is
assigned a multivariate distribution that accounts for spatial dependence. A
common way of describing spatial proximity in lattice data is by means of an
adjacency matrix \(W\). Element \(W_{i, j}\) is non-zero if areas \(i\) and \(j\) are
neighbors. Usually, two areas are neighbors if they share a common boundary.
There are other definitions of neighborhood (see Bivand et al., 2013).
Adjacency matrix
The adjacency matrix can be computed using function poly2nb
in package
spdep
. This function will consider two areas as neighbors if their borders
touch at least in one point (i.e., queen adjacency):
NY8.nb <- poly2nb(NY8)
This will return an nb
object with the definition of the neighborhood structure:
NY8.nb
## Neighbour list object: ## Number of regions: 281 ## Number of nonzero links: 1624 ## Percentage nonzero weights: 2.056712 ## Average number of links: 5.779359
In addition, nb objects can be plotted when the centroids of the polygons are known:
plot(NY8)
plot(NY8.nb, coordinates(NY8), add = TRUE, pch = ".", col = "gray")
Regression models
It is often the case that, in addition to \(y_i\), we have a number of covariates
\(X_i\). Hence, we may want to regress \(y_i\) on \(X_i\). In addition to the
covariates we may want to account for the spatial structure of the data.
Different types of regression models can be used to model lattice data:
- Generalized Linear Models (with spatial random effects).
- Spatial econometrics models.
Linear Mixed Models
A common approach (for Gaussian data) is to use a linear
regression with random effects:
\[
Y = X \beta+ Zu +\varepsilon
\]
The vector of random effects \(u\) is modeled as a multivariate Normal distribution:
\[
u \sim N(0, \sigma^2_u \Sigma)
\]
\(\Sigma\) is defined such that it induces higher correlation between adjacent areas, \(Z\) is a design matrix for the random effects, and \(\varepsilon_i \sim N(0, \sigma^2), i=1, \ldots, n\) is an error term.
Generalized Linear Mixed Models can be defined in a similar way by using a
different likelihood and linking the appropriate parameter to the linear
predictor.
Structure of spatial random effects
There are many different ways of including spatial dependence
in \(\Sigma\):
- Simultaneous autoregressive (SAR):
\[
\Sigma^{-1} = [(I-\rho W)' (I-\rho W)]
\]
- Conditional autoregressive (CAR):
\[
\Sigma^{-1} = (I-\rho W)
\]
-
Intrinsic CAR (ICAR):
\[
\Sigma^{-1} = diag(n_i) - W
\]
where \(n_i\) is the number of neighbors of area \(i\).
-
\(\Sigma_{i,j}\) depends on a function of \(d(i,j)\). For example:
\[
\Sigma_{i,j} = \exp\{-d(i,j)/\phi\}
\]
-
'Mixture' of matrices (Leroux et al.'s model):
\[
\Sigma = [(1 - \lambda) I_n + \lambda M]^{-1};\ \lambda \in (0,1)
\]
where \(M\) is the precision matrix of the intrinsic CAR specification.
CAR and ICAR specifications have been proposed within the Statistics field,
while the SAR specification was coined within Spatial Econometrics. Regardless
of its origin, all specifications presented here can be regarded as Gaussian
latent effects with a particular precision matrix.
ICAR model
The first example will be based on the ICAR specification. Note that the
spatial latent effect is defined using the f
-function. This will require
an index to identify the random effects in each area, the type of model
and the adjacency matrix. For this, a sparse matrix will be used.
# Create sparse adjacency matrix
NY8.mat <- as(nb2mat(NY8.nb, style = "B"), "Matrix")
# Fit model
m.icar <- inla(Cases ~ 1 + AVGIDIST + f(ID, model = "besag", graph = NY8.mat),
  data = as.data.frame(NY8), E = NY8$Expected, family = "poisson",
  control.predictor = list(compute = TRUE),
  control.compute = list(dic = TRUE, waic = TRUE))
summary(m.icar)
## ## Call: ## c("inla(formula = Cases ~ 1 + AVGIDIST + f(ID, model = \"besag\", ## ", " graph = NY8.mat), family = \"poisson\", data = ## as.data.frame(NY8), ", " E = NY8$Expected, control.compute = ## list(dic = TRUE, waic = TRUE), ", " control.predictor = ## list(compute = TRUE))") ## Time used: ## Pre = 0.298, Running = 0.305, Post = 0.0406, Total = 0.644 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -0.122 0.052 -0.226 -0.122 -0.022 -0.120 0 ## AVGIDIST 0.318 0.121 0.075 0.320 0.551 0.324 0 ## ## Random effects: ## Name Model ## ID Besags ICAR model ## ## Model hyperparameters: ## mean sd 0.025quant 0.5quant 0.975quant mode ## Precision for ID 3.20 1.67 1.41 2.79 7.56 2.27 ## ## Expected number of effective parameters(stdev): 46.68(12.67) ## Number of equivalent replicates : 6.02 ## ## Deviance Information Criterion (DIC) ...............: 904.12 ## Deviance Information Criterion (DIC, saturated) ....: 374.75 ## Effective number of parameters .....................: 48.31 ## ## Watanabe-Akaike information criterion (WAIC) ...: 906.77 ## Effective number of parameters .................: 44.27 ## ## Marginal log-Likelihood: -685.70 ## Posterior marginals for the linear predictor and ## the fitted values are computed
BYM model
The Besag, York and Mollié model includes two latent random effects: an ICAR
latent effect and a Gaussian iid latent effect. The linear predictor \(\eta_i\)
is:
\[
\eta_i = \alpha + \beta AVGIDIST_i + u_i + v_i
\]
- \(u_i\) is an i.i.d. Gaussian random effect
- \(v_i\) is an intrinsic CAR random effect
There is no need
to define these two latent effects if model
is set to "bym"
when
the latent random effect is defined with the f
function.
m.bym = inla(Cases ~ 1 + AVGIDIST + f(ID, model = "bym", graph = NY8.mat), data = as.data.frame(NY8), E = NY8$Expected, family ="poisson", control.predictor = list(compute = TRUE), control.compute = list(dic = TRUE, waic = TRUE))
summary(m.bym)
## ## Call: ## c("inla(formula = Cases ~ 1 + AVGIDIST + f(ID, model = \"bym\", ## graph = NY8.mat), ", " family = \"poisson\", data = ## as.data.frame(NY8), E = NY8$Expected, ", " control.compute = ## list(dic = TRUE, waic = TRUE), control.predictor = list(compute = ## TRUE))" ) ## Time used: ## Pre = 0.294, Running = 1, Post = 0.0652, Total = 1.36 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -0.123 0.052 -0.227 -0.122 -0.023 -0.121 0 ## AVGIDIST 0.318 0.121 0.075 0.320 0.551 0.324 0 ## ## Random effects: ## Name Model ## ID BYM model ## ## Model hyperparameters: ## mean sd 0.025quant 0.5quant ## Precision for ID (iid component) 1790.06 1769.02 115.76 1265.24 ## Precision for ID (spatial component) 3.12 1.36 1.37 2.82 ## 0.975quant mode ## Precision for ID (iid component) 6522.28 310.73 ## Precision for ID (spatial component) 6.58 2.33 ## ## Expected number of effective parameters(stdev): 47.66(12.79) ## Number of equivalent replicates : 5.90 ## ## Deviance Information Criterion (DIC) ...............: 903.41 ## Deviance Information Criterion (DIC, saturated) ....: 374.04 ## Effective number of parameters .....................: 48.75 ## ## Watanabe-Akaike information criterion (WAIC) ...: 906.61 ## Effective number of parameters .................: 45.04 ## ## Marginal log-Likelihood: -425.65 ## Posterior marginals for the linear predictor and ## the fitted values are computed
Leroux et al. model
This model is defined using a 'mixture' of matrices (Leroux et al.'s model)
to define the precision matrix of the latent effect:
\[
\Sigma^{-1} = [(1 - \lambda) I_n + \lambda M];\ \lambda \in (0,1)
\]
Here, \(M\) is the precision matrix of the intrinsic CAR specification.
This model is implemented using the generic1
latent effect, which
uses the following precision matrix:
\[
\Sigma^{-1} = \frac{1}{\tau}(I_n-\frac{\rho}{\lambda_{max}}C); \rho \in [0,1)
\]
Here, \(C\) is a matrix and \(\lambda_{max}\) its maximum eigenvalue.
In order to define the right model, we should take matrix \(C\) as follows:
\[
C = I_n - M;\ M = diag(n_i) - W
\]
Then, \(\lambda_{max} = 1\) and
\[
\Sigma^{-1} =
\frac{1}{\tau}(I_n-\frac{\rho}{\lambda_{max}}C) =
\frac{1}{\tau}(I_n-\rho(I_n - M)) = \frac{1}{\tau}((1-\rho) I_n + \rho M)
\]
To fit the model, the first step is to create matrix \(M\):
ICARmatrix <- Diagonal(nrow(NY8.mat), apply(NY8.mat, 1, sum)) - NY8.mat
Cmatrix <- Diagonal(nrow(NY8), 1) - ICARmatrix
We can check that the maximum eigenvalue, \(\lambda_{max}\), is one:
max(eigen(Cmatrix)$values)
## [1] 1
The model is fit as usual with function inla
. Note that the \(C\) matrix
is passed to the f
function using argument Cmatrix
:
m.ler = inla(Cases ~ 1 + AVGIDIST + f(ID, model = "generic1", Cmatrix = Cmatrix), data = as.data.frame(NY8), E = NY8$Expected, family ="poisson", control.predictor = list(compute = TRUE), control.compute = list(dic = TRUE, waic = TRUE))
summary(m.ler)
## ## Call: ## c("inla(formula = Cases ~ 1 + AVGIDIST + f(ID, model = ## \"generic1\", ", " Cmatrix = Cmatrix), family = \"poisson\", data ## = as.data.frame(NY8), ", " E = NY8$Expected, control.compute = ## list(dic = TRUE, waic = TRUE), ", " control.predictor = ## list(compute = TRUE))") ## Time used: ## Pre = 0.236, Running = 0.695, Post = 0.0493, Total = 0.98 ## Fixed effects: ## mean sd 0.025quant 0.5quant 0.975quant mode kld ## (Intercept) -0.128 0.448 -0.91 -0.128 0.656 -0.126 0.075 ## AVGIDIST 0.325 0.122 0.08 0.327 0.561 0.330 0.000 ## ## Random effects: ## Name Model ## ID Generic1 model ## ## Model hyperparameters: ## mean sd 0.025quant 0.5quant 0.975quant mode ## Precision for ID 2.720 1.098 1.27 2.489 5.480 2.106 ## Beta for ID 0.865 0.142 0.47 0.915 0.997 0.996 ## ## Expected number of effective parameters(stdev): 52.25(13.87) ## Number of equivalent replicates : 5.38 ## ## Deviance Information Criterion (DIC) ...............: 903.14 ## Deviance Information Criterion (DIC, saturated) ....: 373.77 ## Effective number of parameters .....................: 53.12 ## ## Watanabe-Akaike information criterion (WAIC) ...: 906.20 ## Effective number of parameters .................: 48.19 ## ## Marginal log-Likelihood: -474.94 ## Posterior marginals for the linear predictor and ## the fitted values are computed
Spatial Econometrics Models
Spatial econometrics have been developed following a slightly different
approach to spatial modeling. Instead of using latent effects, spatial
dependence is modeled explicitly. Autoregressive models are used to make the
response variable to depend on the values of its neighbors.
Simultaneous Autoregressive Model (SEM)
This model includes covariates and an autoregressive on the error term:
\[
y= X \beta+u; u=\rho Wu+e; e\sim N(0, \sigma^2)
\]
\[
y= X \beta + \varepsilon; \varepsilon\sim N(0, \sigma^2 (I-\rho W)^{-1} (I-\rho W')^{-1})
\]
Spatial Lag Model (SLM)
This model includes covariates and an autoregressive on the response:
\[
y = \rho W y + X \beta + e; e\sim N(0, \sigma^2)
\]
\[
y = (I-\rho W)^{-1}X\beta+\varepsilon;\ \varepsilon \sim N(0, \sigma^2(I-\rho W)^{-1} (I-\rho W')^{-1})
\]
New slm
Latent Class
R-INLA
includes now an experimental new latent effect called slm
to
fit the following model:
\[
\mathbf{x}= (I_n-\rho W)^{-1} (X\beta +e)
\]
The elements of this model are:
- \(W\) is a row-standardized adjacency matrix.
- \(\rho\) is a spatial autocorrelation parameter.
- \(X\) is a matrix of covariates, with coefficients \(\beta\).
- \(e\) are Gaussian i.i.d. errors with variance \(\sigma^2\).
The slm
latent effect is experimental and it can be
combined with other effects in the linear predictor.
Spatial econometrics models can be fit with the slm
latent
effect by noting that the SEM and SLM can be defined as:
- SEM
\[
y= X \beta + (I-\rho W)^{-1} (0+e);\ e \sim N(0, \sigma^2 I)
\]
- SLM
\[
y = (I-\rho W)^{-1}(X\beta+e);\ e \sim N(0, \sigma^2 I)
\]
Model definition
In order to define a model, we need:
- X: Matrix of covariates
- W: Row-standardized adjacency matrix
- Q: Precision matrix of the coefficients \(\beta\)
- Range of \(\rho\), often defined by the eigenvalues
# X
mmatrix <- model.matrix(Cases ~ 1 + AVGIDIST, NY8)
# W
W <- as(nb2mat(NY8.nb, style = "W"), "Matrix")
# Q
Q.beta = Diagonal(n = ncol(mmatrix), x = 0.001)
# Range of rho
rho.min <- -1
rho.max <- 1
The arguments of the slm latent effect are passed through argument args.slm. Here, we have created a list with the same name to keep all the required values together:
# Arguments for 'slm'
args.slm = list(
  rho.min = rho.min,
  rho.max = rho.max,
  W = W,
  X = mmatrix,
  Q.beta = Q.beta
)
In addition, the prior for the precision parameter \(\tau\) and the spatial
autocorrelation parameter \(\rho\) are set:
# Prior on rho
hyper.slm = list(
  prec = list(prior = "loggamma", param = c(0.01, 0.01)),
  rho = list(initial = 0, prior = "logitbeta", param = c(1, 1))
)
The prior definition uses a named list with different arguments. Argument
prior
defines the prior to use, and param
its parameters. Here, the
precision is assigned a gamma prior with parameters \(0.01\) and \(0.01\), while
the spatial autocorrelation parameter is given a beta prior with parameters \(1\)
and \(1\) (i.e., a uniform prior on the interval \((0, 1)\)).
Model fitting
# SLM model
m.slm <- inla(Cases ~ -1 +
    f(ID, model = "slm", args.slm = args.slm, hyper = hyper.slm),
  data = as.data.frame(NY8), family = "poisson", E = NY8$Expected,
  control.predictor = list(compute = TRUE),
  control.compute = list(dic = TRUE, waic = TRUE))
## Warning in inla.model.properties.generic(inla.trim.family(model), (mm[names(mm) == : Model 'slm' in section 'latent' is marked as 'experimental'; changes may appear at any time. ## Use this model with extra care!!! Further warnings are disabled.
summary(m.slm)
## ## Call: ## c("inla(formula = Cases ~ -1 + f(ID, model = \"slm\", args.slm = ## args.slm, ", " hyper = hyper.slm), family = \"poisson\", data = ## as.data.frame(NY8), ", " E = NY8$Expected, control.compute = ## list(dic = TRUE, waic = TRUE), ", " control.predictor = ## list(compute = TRUE))") ## Time used: ## Pre = 0.265, Running = 1.2, Post = 0.058, Total = 1.52 ## Random effects: ## Name Model ## ID SLM model ## ## Model hyperparameters: ## mean sd 0.025quant 0.5quant 0.975quant mode ## Precision for ID 8.989 4.115 3.709 8.085 19.449 6.641 ## Rho for ID 0.822 0.073 0.653 0.832 0.936 0.854 ## ## Expected number of effective parameters(stdev): 62.82(15.46) ## Number of equivalent replicates : 4.47 ## ## Deviance Information Criterion (DIC) ...............: 902.32 ## Deviance Information Criterion (DIC, saturated) ....: 372.95 ## Effective number of parameters .....................: 64.13 ## ## Watanabe-Akaike information criterion (WAIC) ...: 905.19 ## Effective number of parameters .................: 56.12 ## ## Marginal log-Likelihood: -477.30 ## Posterior marginals for the linear predictor and ## the fitted values are computed
Estimates of the coefficients appear as part of the random effects:
round(m.slm$summary.random$ID[47:48,], 4)
## ID mean sd 0.025quant 0.5quant 0.975quant mode kld ## 47 47 0.4634 0.3107 -0.1618 0.4671 1.0648 0.4726 0 ## 48 48 0.2606 0.3410 -0.4583 0.2764 0.8894 0.3063 0
Spatial autocorrelation is reported in the internal scale (i.e., between
0 and 1) and needs to be re-scaled:
marg.rho.internal <- m.slm$marginals.hyperpar[["Rho for ID"]]
marg.rho <- inla.tmarginal(function(x) {
  rho.min + x * (rho.max - rho.min)
}, marg.rho.internal)
inla.zmarginal(marg.rho, FALSE)
## Mean 0.644436 ## Stdev 0.145461 ## Quantile 0.025 0.309507 ## Quantile 0.25 0.556851 ## Quantile 0.5 0.663068 ## Quantile 0.75 0.752368 ## Quantile 0.975 0.869702
plot(marg.rho, type = "l", main = "Spatial autocorrelation")
Summary of results
NY8$ICAR <- m.icar$summary.fitted.values[, "mean"]
NY8$BYM <- m.bym$summary.fitted.values[, "mean"]
NY8$LEROUX <- m.ler$summary.fitted.values[, "mean"]
NY8$SLM <- m.slm$summary.fitted.values[, "mean"]
spplot(NY8[syracuse, ], c("FIXED.EFF", "IID.EFF", "ICAR", "BYM", "LEROUX", "SLM"), col.regions = rev(magma(16)) )
Note how spatial models produce smoother estimates of the relative risk.
In order to choose the best model, the model selection criteria computed
above can be used:
Model | Marg. Lik. | DIC | WAIC
---|---|---|---
FIXED | -480.2814 | 948.1198 | 949.0287
IID | -478.6043 | 926.9295 | 932.6333
ICAR | -685.6478 | 904.1253 | 906.7689
BYM | -425.2479 | 903.4114 | 906.6138
LEROUX | -474.4091 | 903.1453 | 906.1948
SLM | -476.8145 | 902.3201 | 905.1892
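The table above can be assembled from the fitted objects; a sketch (assuming the standard mlik, dic and waic slots of inla objects):

models <- list(FIXED = m1, IID = m2, ICAR = m.icar, BYM = m.bym, LEROUX = m.ler, SLM = m.slm)
data.frame(
  Marg.Lik = sapply(models, function(m) m$mlik[1, 1]),
  DIC = sapply(models, function(m) m$dic$dic),
  WAIC = sapply(models, function(m) m$waic$waic)
)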
References
Bivand, R., E. Pebesma and V. Gómez-Rubio (2013). Applied spatial data
analysis with R. Springer-Verlag. New York.
Gómez-Rubio, V. (2020). Bayesian inference with INLA. CRC Press. Boca Raton,
FL. https://becarioprecario.bitbucket.io/inla-gitbook
Rue, H., S. Martino and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2): 319–392.
Waller, L. A. and C. A. Gotway (2004). Applied spatial statistics for public
health data. John Wiley & Sons, Inc. Hoboken, New Jersey.
tidync: scientific array data from NetCDF in R
Posted: 04 Nov 2019 04:00 PM PST
In May 2019 version 0.2.0 of tidync was approved by rOpenSci and accepted to CRAN. Here we provide a quick overview of the typical workflow with some pseudo-code for the main functions in tidync. This overview is enough to read if you just want to try out the package on your own data. The tidync package is focussed on efficient data extraction for developing your own software, and this somewhat long post takes the time to explain the concepts in detail.
There is a section about the NetCDF data model itself. Then there is a detailed illustration of a raster data set in R, including some of the challenges faced by R users. This is followed by sections on how tidync sees metadata and coordinates in NetCDF, how we can slice a dataset, and how to control the format of the output. We then discuss some limitations and future work, and finally (most importantly) reflect on the rOpenSci process of package review.
NetCDF in R
NetCDF is a very widely used system for storing and distributing scientific array data. A NetCDF data source typically stores one or more arrays of data, along with metadata that describe the data array space (grid), and any metadata describing array coordinates, units, and interpretation. A NetCDF source may be a file or an online URL. If you want to automate your own workflow around a series of NetCDF data sources then tidync provides all the flexibility and power required with as little pain as possible.
The tidyverse has had an enormous impact on the use of R with a strict approach to variables and observations (in short, tidy data are tabular, with each variable having its own column and each observation having its own row). This tidy-data-frame form can be used for a wide range of data, but it does have some shortcomings. It can be inefficient in terms of storage, which may be problematic with large data. If the data are accompanied by additional metadata (as is generally the case with NetCDF data) there is often no neat way to store this information in the same table, and these inherent properties of the original data can be lost.
There is a tension between the tidyverse and scientific array data that comes down to this difference in data storage, and the intermediate forms used to get between one form and another. We also think this tension has been exaggerated in unproductive ways.
tidync
The official website for tidync is https://docs.ropensci.org/tidync/ and the latest release can be found on CRAN.
The tidync package provides a compromise position, allowing efficient interaction with NetCDF files. It will produce native-array or tidy-data-frame output as desired. It delays any data-reading activity until after the output format is chosen. In particular, tidync exists in order to reduce the amount of plumbing code required to get to the data. It allows an interactive way to convert between the different spaces (coordinates and indices) in which the data can be referenced.
In pseudo-code, there are only a few simple steps; at each step we can save the result and explore a summary. A combined sketch follows the list below.
-
Connect to a data source and retrieve metadata, and read a summary:
src <- tidync()
print(src)
-
By default the largest array-space (grid) is activated and usually this will be the right choice – if required we can nominate a different grid using
activate().
src <- src %>% activate()
-
Apply subsetting to slice arrays by coordinate or index, this step is optional but very important for large and complicated data sources.
## lazy subsetting by value or index
src_slc <- src %>% hyper_filter()
-
Finally, choose an output format – list of arrays, a data frame, or a
tbl_cube.
src_slc %>% hyper_array()
src_slc %>% hyper_tibble()
src_slc %>% hyper_tbl_cube()
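Putting the pseudo-code together on a real file (the example NetCDF file shipped with the stars package, which is also used below; the latitude filter is just an illustration):

library(tidync)
library(dplyr)
oisstfile <- system.file("nc/reduced.nc", package = "stars")
src <- tidync(oisstfile)                         # 1. connect and read the metadata
# 2. the largest grid is active by default, so activate() is not needed here
src_slc <- src %>% hyper_filter(lat = lat > 0)   # 3. lazy slice: northern hemisphere only
src_slc %>% hyper_tibble()                       # 4. realize the slice as a tidy data frame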
There are various other packages for NetCDF in R, the main ones being RNetCDF and ncdf4. These are both lower-level tools than tidync – they are interfaces to the underlying NetCDF library, and tidync uses both to read information and data. The raster and stars packages provide quite different approaches and stars is more general than raster, but is similarly higher-level than tidync.
Following along with the code below requires all of the following packages, and we assume that recent versions are in use, particularly ncmeta (>= 0.2.0)
,
tidync (>= 0.2.2)
, and tidyr (>= 1.0.0)
.
install.packages(c("ncmeta", "tidync", "maps", "stars", "ggplot2", "devtools", "RNetCDF", "raster", "dplyr", "tidyr"))
NetCDF
NetCDF is a very widely used file format for storing array-based data as
variables with dimensions and attributes.
The space (or grid) occupied by a variable is defined by its dimensions and their attributes. Dimensions are by definition
one-dimensional arrays (i.e. an atomic vector in R of length 1 or more). An array can include coordinate metadata, units, type, and interpretation; attributes define all of this extra information for dimensions and variables. The space of a variable (the grid it lives in) is defined by one or more of the dimensions in the file.
A given variable won't necessarily use all the available dimensions and no dimensions are mandatory or particularly special. We consider the existence of a dimension within a grid to be an instance of that dimension and call that an axis, subtly different to the dimension on its own.
NetCDF is very general and used in many different ways. It is quite common to see subcultures that rally around the way their particular domain's data are used and stored while ignoring many other valid ways of using NetCDF. The tidync approach is to be as general as possible, sacrificing high level interpretations for lower-level control if that generality is at risk.
Raster data in NetCDF
NetCDF can be used to store raster data, and very commonly data are provided as a global grid of scientific data. Here we use a snapshot of global ocean surface temperature for a single day. The file used is called reduced.nc
in the stars package, derived from the daily OISSTV2 product.
We will explore this data set in detail to give an introduction to the tidync summary and functions. The data set is very commonly used in marine research as it includes a very long time series of daily global maps of sea surface temperatures. The spatial resolution is 0.25 degree (1/4°) and the coverage is complete for the entire world's oceans because it is a blend of remote sensing and direct observations. We call it OISST (or oisst) after its official description.
The NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (or daily OISST) is an analysis constructed by combining observations from different platforms (satellites, ships, buoys) on a regular global grid. A spatially complete SST map is produced by interpolating to fill in gaps.
The file we use is a simplified version from 1981-12-31 (the series started on 1981-09-01). It has been reduced in resolution so that it can be stored in an R package. There are four variables in the data: sst (sea surface temperature, in Celsius), anom (anomaly of sst on this day from the long-term mean), err (estimated error standard deviation of sst), and ice (sea ice concentration). The ice concentration acts as a mask: if there is sea ice present in the pixel, then the temperature is undefined.
oisstfile <- system.file("nc/reduced.nc", package = "stars")
To connect to this file use tidync()
.
library(tidync)
oisst <- tidync(oisstfile)
Note: this is not a file connection, like that used by ncdf4 or RNetCDF; tidync functions always open the file in read-only mode, extract information and/or data, and then close the open file connection.
To see the available data in the file, print a summary of the source.
print(oisst)
## ## Data Source (1): reduced.nc ... ## ## Grids (5) : ## ## [1] D0,D1,D2,D3 : sst, anom, err, ice **ACTIVE GRID** ( 16200 values per variable) ## [2] D0 : lon ## [3] D1 : lat ## [4] D2 : zlev ## [5] D3 : time ## ## Dimensions 4 (all active): ## ## dim name length min max start count dmin dmax unlim coord_dim ## ## 1 D0 lon 180 0 358 1 180 0 358 FALSE TRUE ## 2 D1 lat 90 -89 89 1 90 -89 89 FALSE TRUE ## 3 D2 zlev 1 0 0 1 1 0 0 FALSE TRUE ## 4 D3 time 1 1460 1460 1 1 1460 1460 TRUE TRUE
There are three kinds of information:
- one (1) Data Source, our one file
- five (5) Grids, available spaces in the source
- four (4) Dimensions, orthogonal axes from which Grids are composed
There is only one grid available for multidimensional data in this file, the first one "D0,D1,D2,D3" – all other grids are one-dimensional. This 4D grid has four variables sst
, anom
, ice
, and err
and each 1D grid has a single variable.
Note: it's only a coincidence that this 4D grid also has 4 variables.
The 1D grids have a corresponding dimension dim
and variable name
, making these coordinate dimensions (see coord_dim
in the dimensions table). It's not necessarily true that a 1D grid will have a single 1D variable, it may have more than one variable, and it may only have an index variable, i.e. only the position values 1:length(dimension)
.
Each dimension's name, length, and valid minimum and maximum values are seen in the Dimensions table, and these values can never change; also see the flags coord_dim and unlim. The latter flags an unlimited dimension, used when a data time series is spread across multiple files.
The other Dimensions columns start
, count
, dmin
, dmax
apply when we slice into data variables with hyper_filter()
.
Metadata in tidync: ncmeta
NetCDF is such a general form for storing data that there are many ways to approach its use. We wanted to focus on a tidy approach to NetCDF and so built on existing packages to do the lower level tasks. tidync relies on the package ncmeta to extract information about NetCDF sources. There are functions to find available variables, dimensions, attributes, grids, and axes in ncmeta.
We also want functions that return information about each kind of entity in a straightforward way. Since there is a complex relationship between variables, dimensions and grids we cannot store this information well in a single structure.
See that there are 5 grids and 8 variables, with a row for each.
ncmeta::nc_grids(oisstfile)
## # A tibble: 5 x 4 ## grid ndims variables nvars ## > ## 1 D0,D1,D2,D3 4 [4 × 1] 4 ## 2 D0 1 [1 × 1] 1 ## 3 D1 1 [1 × 1] 1 ## 4 D2 1 [1 × 1] 1 ## 5 D3 1 [1 × 1] 1
ncmeta::nc_vars(oisstfile)
## # A tibble: 8 x 5 ## id name type ndims natts ## ## 1 0 lon NC_FLOAT 1 4 ## 2 1 lat NC_FLOAT 1 4 ## 3 2 zlev NC_FLOAT 1 4 ## 4 3 time NC_FLOAT 1 5 ## 5 4 sst NC_SHORT 4 6 ## 6 5 anom NC_SHORT 4 6 ## 7 6 err NC_SHORT 4 6 ## 8 7 ice NC_SHORT 4 6
Each grid has a name, dimensionality (ndims
), and set of variables. Each grid is listed only once, an important pattern when we are programming, and the same applies to variables. The relationship to the tidyverse starts here with the metadata; there are five grids observed and we have four columns of information for each grid. These are the grid's name, number of dimensions, the NetCDF-variables defined with it, and their number. When dealing with metadata we can also use tidy principles as we do with the data itself.
Some grids have more than one variable, so they are nested in the grid rows – use tidyr::unnest()
to see all variables with their parent grid.
ncmeta::nc_grids(oisstfile) %>% tidyr::unnest(cols = c(variables))
## # A tibble: 8 x 4 ## grid ndims variable nvars ## ## 1 D0,D1,D2,D3 4 sst 4 ## 2 D0,D1,D2,D3 4 anom 4 ## 3 D0,D1,D2,D3 4 err 4 ## 4 D0,D1,D2,D3 4 ice 4 ## 5 D0 1 lon 1 ## 6 D1 1 lat 1 ## 7 D2 1 zlev 1 ## 8 D3 1 time 1
Similar functions exist for dimensions and variables.
ncmeta::nc_dims(oisstfile)
## # A tibble: 4 x 4 ## id name length unlim ## ## 1 0 lon 180 FALSE ## 2 1 lat 90 FALSE ## 3 2 zlev 1 FALSE ## 4 3 time 1 TRUE
ncmeta::nc_vars(oisstfile)
## # A tibble: 8 x 5 ## id name type ndims natts ## ## 1 0 lon NC_FLOAT 1 4 ## 2 1 lat NC_FLOAT 1 4 ## 3 2 zlev NC_FLOAT 1 4 ## 4 3 time NC_FLOAT 1 5 ## 5 4 sst NC_SHORT 4 6 ## 6 5 anom NC_SHORT 4 6 ## 7 6 err NC_SHORT 4 6 ## 8 7 ice NC_SHORT 4 6
There are corresponding functions to find out more about individual variables, dimensions and attributes by name or by index.
Note that we can use the internal index (a zero-based count) of a variable as well as its name (anom is the variable with index 5, as shown by nc_vars() above).
ncmeta::nc_var(oisstfile, "anom")
## # A tibble: 1 x 5 ## id name type ndims natts ## ## 1 5 anom NC_SHORT 4 6
ncmeta::nc_var(oisstfile, 5)
## # A tibble: 1 x 5 ## id name type ndims natts ## ## 1 5 anom NC_SHORT 4 6
Similarly we can use name or index for dimensions and attributes, but attributes for a variable can only be found by name.
ncmeta::nc_dim(oisstfile, "lon")
## # A tibble: 1 x 4 ## id name length unlim ## ## 1 0 lon 180 FALSE
ncmeta::nc_dim(oisstfile, 0)
## # A tibble: 1 x 4 ## id name length unlim ## ## 1 0 lon 180 FALSE
ncmeta::nc_atts(oisstfile)
## # A tibble: 50 x 4 ## id name variable value ## ## 1 0 standard_name lon ## 2 1 long_name lon ## 3 2 units lon ## 4 3 axis lon ## 5 0 standard_name lat ## 6 1 long_name lat ## 7 2 units lat ## 8 3 axis lat ## 9 0 long_name zlev ## 10 1 units zlev ## # … with 40 more rows
ncmeta::nc_atts(oisstfile, "zlev")
## # A tibble: 4 x 4 ## id name variable value ## ## 1 0 long_name zlev ## 2 1 units zlev ## 3 2 axis zlev ## 4 3 actual_range zlev
We can find the internal metadata for each variable by expanding the value.
ncmeta::nc_atts(oisstfile, "time") %>% tidyr::unnest(cols = c(value))
## # A tibble: 5 x 4 ## id name variable value ## ## 1 0 standard_name time time ## 2 1 long_name time Center time of the day ## 3 2 units time days since 1978-01-01 00:00:00 ## 4 3 calendar time standard ## 5 4 axis time T
With this information we may now apply the right interpretation to the time values. In the print of the tidync object above we see the value 1460
, which is given without context in the dimensions table.
We can get that value by activating the right grid and extracting it; the time is a single integer value.
oisst <- tidync(oisstfile)
time_ex <- oisst %>% activate("D3") %>% hyper_array()
time_ex$time
## [1] 1460
As mentioned, tidync considers time and its metadata a bit dangerous. Time records can be wrong or inconsistent, involve time-zone issues or extra seconds to account for, and so we prefer to leave these interpretations to be validated manually before they are automated.
Obtain the time units information and then use it to convert the raw value (1460
) into a date-time understood by R.
tunit <- ncmeta::nc_atts(oisstfile, "time") %>%
  tidyr::unnest(cols = c(value)) %>%
  dplyr::filter(name == "units")
print(tunit)
## # A tibble: 1 x 4 ## id name variable value ## ## 1 2 units time days since 1978-01-01 00:00:00
time_parts <- RNetCDF::utcal.nc(tunit$value, time_ex$time)
## convert to date-time
ISOdatetime(time_parts[, "year"], time_parts[, "month"], time_parts[, "day"],
  time_parts[, "hour"], time_parts[, "minute"], time_parts[, "second"])
## [1] "1981-12-31 UTC"
Alternatively, we can do this by hard-coding the time origin. We find that different cases are best handled in different ways, especially after some careful checking.
as.POSIXct("1978-01-01 00:00:00", tz = "UTC") + time_ex$time * 24 * 3600
## [1] "1981-12-31 UTC"
Finally, we can check that other independent systems provide the same information.
raster::brick(oisstfile, varname = "anom")
## class : RasterBrick ## dimensions : 90, 180, 16200, 1 (nrow, ncol, ncell, nlayers) ## resolution : 2, 2 (x, y) ## extent : -1, 359, -90, 90 (xmin, xmax, ymin, ymax) ## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 ## source : /usr/local/lib/R/site-library/stars/nc/reduced.nc ## names : X1981.12.31 ## Date : 1981-12-31 ## varname : anom ## level : 1
stars::read_stars(oisstfile)
## sst, anom, err, ice,
## stars object with 4 dimensions and 4 attributes ## attribute(s): ## sst [°C] anom [°C] err [°C] ice [percent] ## Min. :-1.80 Min. :-10.160 Min. :0.110 Min. :0.010 ## 1st Qu.:-0.03 1st Qu.: -0.580 1st Qu.:0.160 1st Qu.:0.470 ## Median :13.65 Median : -0.080 Median :0.270 Median :0.920 ## Mean :12.99 Mean : -0.186 Mean :0.263 Mean :0.718 ## 3rd Qu.:24.81 3rd Qu.: 0.210 3rd Qu.:0.320 3rd Qu.:0.960 ## Max. :32.97 Max. : 2.990 Max. :0.840 Max. :1.000 ## NA's :4448 NA's :4448 NA's :4448 NA's :13266 ## dimension(s): ## from to offset delta refsys point values ## x 1 180 -1 2 NA NA NULL [x] ## y 1 90 90 -2 NA NA NULL [y] ## zlev 1 1 0 [m] NA NA NA NULL ## time 1 1 1981-12-31 UTC NA POSIXct NA NULL
In terms of interpreting the meaning of stored metadata, tidync shies away from doing this automatically: there are simply too many ways for automatic tools to get intentions wrong. Used in combination, the ncmeta and tidync packages provide the tools to program around the vagaries presented by NetCDF sources. If your data are pretty clean and standardized, there is higher-level software that can interpret these things automatically; some examples are the R package stars, and outside of R itself, xarray, GDAL, Ferret, and Panoply.
Axes versus dimensions
Previously we mentioned the concept of an axis as an instance of a dimension. This distinction arose early on: sometimes the dimension on its own is important, at other times we want to know where it occurs. The functions nc_axes()
and nc_dims()
make this clear, every instance of a dimension across variables is listed as an axis, but they are derived from only four dimensions.
ncmeta::nc_axes(oisstfile)
## # A tibble: 20 x 3 ## axis variable dimension ## ## 1 1 lon 0 ## 2 2 lat 1 ## 3 3 zlev 2 ## 4 4 time 3 ## 5 5 sst 0 ## 6 6 sst 1 ## 7 7 sst 2 ## 8 8 sst 3 ## 9 9 anom 0 ## 10 10 anom 1 ## 11 11 anom 2 ## 12 12 anom 3 ## 13 13 err 0 ## 14 14 err 1 ## 15 15 err 2 ## 16 16 err 3 ## 17 17 ice 0 ## 18 18 ice 1 ## 19 19 ice 2 ## 20 20 ice 3
ncmeta::nc_dims(oisstfile)
## # A tibble: 4 x 4 ## id name length unlim ## ## 1 0 lon 180 FALSE ## 2 1 lat 90 FALSE ## 3 2 zlev 1 FALSE ## 4 3 time 1 TRUE
Degenerate dimensions
See that both zlev
and time
are listed as dimensions but have length 1, and also their min and max values are constants. The zlev
tells us that this grid exists at elevation = 0 (the sea surface) and time
that the data applies to time = 1460
. The time is not expressed as a duration even though it presumably applies to the entire day. These are degenerate dimensions, i.e. the data are really 2D but we have a record of a 4D space from which they are expressed as a slice. For time, we know that the neighbouring days exist in other OISST files, but for zlev
it only records sea-level. This can cause problems as we would usually treat this data as a matrix in R, and so the ncdf4 and RNetCDF package read functions have arguments that are analogous to R's array indexing argument drop = TRUE
. If a dimension of length 1 is encountered, to 'drop' it means to remove it from the result. tidync will also drop dimensions by default when reading data; see the drop argument in ?hyper_array.
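A quick sketch of keeping the degenerate dimensions instead (drop is the hyper_array() argument mentioned above):

oisst4d <- tidync(oisstfile) %>% hyper_array(drop = FALSE)
lapply(oisst4d, dim)   # each array should now keep the length-1 zlev and time axes (180 x 90 x 1 x 1)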
Reading the OISST data
At this point only metadata has been read, so let's read some sea surface temperatures!
The fastest way to get all the data is to call the function hyper_array
; this is the lowest level and is very close to using the ncdf4 or RNetCDF package directly.
(oisst_data <- oisst %>% hyper_array())
## Class: tidync_data (list of tidync data arrays) ## Variables (4): 'sst', 'anom', 'err', 'ice' ## Dimension (2): lon,lat,zlev,time (180, 90) ## Source: /usr/local/lib/R/site-library/stars/nc/reduced.nc
What happened there? We got a classed object tidync_data
; this is a list with arrays.
length(oisst_data)
## [1] 4
names(oisst_data)
## [1] "sst" "anom" "err" "ice"
dim(oisst_data[[1]])
## [1] 180 90
image(oisst_data[[1]])
This is exactly the data provided by ncdf4::ncvar_get()
or RNetCDF::var.get.nc()
but we can do it in a single line of code. Without tidync we must find the variable names and loop over them. We automatically get variables from the largest grid that is available, which was activate()
-d by default.
oisst_data <- tidync(oisstfile) %>% hyper_array()
op <- par(mfrow = n2mfrow(length(oisst_data)))
pals <- c("YlOrRd", "viridis", "Grays", "Blues")
for (i in seq_along(oisst_data)) {
  image(oisst_data[[i]], main = names(oisst_data)[i],
    col = hcl.colors(20, pals[i], rev = i == 1))
}
par(op)
Transforms
In this context transform means the conversion between index and geographic coordinate for grid cells, and in this data set this means the longitude and latitude assigned to the centre of each cell.
We have done nothing with the spatial side of these data, ignoring the lon and lat values completely.
oisst_data
## Class: tidync_data (list of tidync data arrays) ## Variables (4): 'sst', 'anom', 'err', 'ice' ## Dimension (2): lon,lat,zlev,time (180, 90) ## Source: /usr/local/lib/R/site-library/stars/nc/reduced.nc
lapply(oisst_data, dim)
## $sst ## [1] 180 90 ## ## $anom ## [1] 180 90 ## ## $err ## [1] 180 90 ## ## $ice ## [1] 180 90
The print summary of the oisst_data
object shows that it knows there are four variables and that they each have two dimensions (zlev
and time
were dropped). This is now stored as a list of native R arrays, but there is also the transforms attribute available with hyper_transforms()
.
The values on each transform table may be used directly.
(trans <- attr(oisst_data, "transforms"))
## $lon ## # A tibble: 180 x 6 ## lon index id name coord_dim selected ## ## 1 0 1 0 lon TRUE TRUE ## 2 2 2 0 lon TRUE TRUE ## 3 4 3 0 lon TRUE TRUE ## 4 6 4 0 lon TRUE TRUE ## 5 8 5 0 lon TRUE TRUE ## 6 10 6 0 lon TRUE TRUE ## 7 12 7 0 lon TRUE TRUE ## 8 14 8 0 lon TRUE TRUE ## 9 16 9 0 lon TRUE TRUE ## 10 18 10 0 lon TRUE TRUE ## # … with 170 more rows ## ## $lat ## # A tibble: 90 x 6 ## lat index id name coord_dim selected ## ## 1 -89 1 1 lat TRUE TRUE ## 2 -87 2 1 lat TRUE TRUE ## 3 -85 3 1 lat TRUE TRUE ## 4 -83 4 1 lat TRUE TRUE ## 5 -81 5 1 lat TRUE TRUE ## 6 -79 6 1 lat TRUE TRUE ## 7 -77 7 1 lat TRUE TRUE ## 8 -75 8 1 lat TRUE TRUE ## 9 -73 9 1 lat TRUE TRUE ## 10 -71 10 1 lat TRUE TRUE ## # … with 80 more rows ## ## $zlev ## # A tibble: 1 x 6 ## zlev index id name coord_dim selected ## ## 1 0 1 2 zlev TRUE TRUE ## ## $time ## # A tibble: 1 x 6 ## time index id name coord_dim selected ## ## 1 1460 1 3 time TRUE TRUE
image(trans$lon$lon, trans$lat$lat, oisst_data[[1]])
maps::map("world2", add = TRUE)
In this case these transforms are somewhat redundant: there is a value stored for every step in lon and every step in lat. They are completely regular series, whereas the usual approach in graphics is to store an offset and scale rather than each step's coordinate. Sometimes these coordinate values are not reducible this way; we would then call them rectilinear, and we would have to store the sequence of each 1D coordinate step.
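A quick check of that regularity, using the transforms printed above (the step should be a constant 2 degrees in this reduced file):

range(diff(trans$lon$lon))   # constant longitude step
range(diff(trans$lat$lat))   # constant latitude step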
Slicing
We can slice into these dimensions using a tidyverse approach. For example, to slice out only the data for the waters of the Pacific Ocean, we need a range in longitude and in latitude.
Old style slicing
This section illustrates the old laborious way to access a subset of data from NetCDF, a subset shown in this plot.
lonrange <- c(144, 247)
latrange <- c(-46, 47)
image(trans$lon$lon, trans$lat$lat, oisst_data[[1]])
rect(lonrange[1], latrange[1], lonrange[2], latrange[2])
It's common on the internet to see posts that explain how to drive the NetCDF library with start and count indices. To do that we need to compare our ranges with the transforms of each dimension.
xs <- findInterval(lonrange, trans$lon$lon)
ys <- findInterval(latrange, trans$lat$lat)
print(xs)
## [1] 73 124
print(ys)
## [1] 22 69
start <- c(xs[1], ys[1])
count <- c(diff(xs), diff(ys))
print(start)
## [1] 73 22
print(count)
## [1] 51 47
The idea here is that xs and ys tell us the columns and rows of interest, based on our geographic input in longitude and latitude values that we understand.
Let's try to read with NetCDF. Hmmm … what goes wrong?
con <- RNetCDF::open.nc(oisstfile)
try(sst_matrix <- RNetCDF::var.get.nc(con, "sst", start = start, count = count))
## Error in RNetCDF::var.get.nc(con, "sst", start = start, count = count) :
##   length(start) == ndims is not TRUE
We have been bitten by thinking that this source data is 2D! So we just add a start and count of 1 for each extra dimension. (Consider that it could be 3D, or 5D, and maybe with a different dimension order; all of these things complicate the general case for these otherwise simple solutions.)
start <- c(start, 1, 1)
count <- c(count, 1, 1)
sst_matrix <- RNetCDF::var.get.nc(con, "sst", start = start, count = count)
And we're good! Except, we now don't have the coordinates for the mapping. We have to slice the lon and lat values as well, but let's cut to the chase and go back to tidync.
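For completeness, a minimal sketch of that extra coordinate slicing (assuming the trans, xs, ys, count, and sst_matrix objects created above), before we go back to tidync:
## slice the coordinate values to match the sst_matrix subset
lon_sub <- trans$lon$lon[xs[1]:(xs[1] + count[1] - 1)]
lat_sub <- trans$lat$lat[ys[1]:(ys[1] + count[2] - 1)]
image(lon_sub, lat_sub, sst_matrix)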
tidync style slicing
Rather than slicing arrays that have been read into memory, we can filter the object that understands the source; it does not do any data slicing at all, but records slices to be done in the future. This is the lazy beauty of the tidyverse, applied to NetCDF.
Here we use standard R inequality syntax for lon and lat. We don't have to specify the redundant slice into zlev or time.
library(dplyr)
oisst_slice <- oisst %>%
  hyper_filter(lon = lon > lonrange[1] & lon <= lonrange[2],
               lat = lat > latrange[1] & lat <= latrange[2])
oisst_slice
##
## Data Source (1): reduced.nc ...
##
## Grids (5) :
##
## [1]   D0,D1,D2,D3 : sst, anom, err, ice    **ACTIVE GRID** ( 16200 values per variable)
## [2]   D0          : lon
## [3]   D1          : lat
## [4]   D2          : zlev
## [5]   D3          : time
##
## Dimensions 4 (all active):
##
##   dim   name  length   min   max start count  dmin  dmax unlim coord_dim
##
## 1 D0    lon      180     0   358    74    51   146   246 FALSE TRUE
## 2 D1    lat       90   -89    89    23    47   -45    47 FALSE TRUE
## 3 D2    zlev       1     0     0     1     1     0     0 FALSE TRUE
## 4 D3    time       1  1460  1460     1     1  1460  1460 TRUE  TRUE
The print summary has updated the start and count columns to match our laboriously acquired versions above. The dmin and dmax (data-min, data-max) columns are also updated, reporting the coordinate value at the start and end of the slice we have specified.
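If we would rather get those values programmatically than read them off the print summary, a minimal sketch (assuming the tidync helper hyper_dims(), which returns the dimension table as a tibble):
## the active dimension table, including the slice start and count
oisst_slice %>% hyper_dims()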
Now we can break the lazy chain and call for the data.
oisst_slice_data <- oisst_slice %>% hyper_array()
trans <- attr(oisst_slice_data, "transforms")
One unfortunate issue here is that we cannot use the transforms directly: they have been updated, with the selected column changed from TRUE to FALSE for the values that were filtered out. We have to be aware of using only the values that remain selected (i.e. not filtered out).
First filter the lon and lat transforms based on the selected column.
lon <- trans$lon %>% dplyr::filter(selected)
lat <- trans$lat %>% dplyr::filter(selected)
image(lon$lon, lat$lat, oisst_slice_data[[1]])
maps::map("world2", add = TRUE)
We do have to do extra work with hyper_array(), but it gives total control over what we get. It's much easier to use other output types.
tcube <- tidync(oisstfile) %>%
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]),
               lat = lat > latrange[1] & lat <= latrange[2]) %>%
  hyper_tbl_cube()
tcube
## Source: local array [2,444 x 4]
## D: lon [dbl, 52]
## D: lat [dbl, 47]
## D: zlev [dbl, 1]
## D: time [dbl, 1]
## M: sst [dbl[,47]]
## M: anom [dbl[,47]]
## M: err [dbl[,47]]
## M: ice [dbl[,47]]
We can also read our slice in directly as a tibble data frame, and plot with geom_raster().
tdata <- tidync(oisstfile) %>%
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]),
               lat = lat > latrange[1] & lat <= latrange[2]) %>%
  hyper_tibble()
library(ggplot2)
ggplot(tdata, aes(lon, lat, fill = sst)) + geom_raster()
By default all variables are available, but we can limit them with select_var.
tidync(oisstfile) %>%
  hyper_filter(lon = between(lon, lonrange[1], lonrange[2]),
               lat = lat > latrange[1] & lat <= latrange[2]) %>%
  hyper_tibble(select_var = c("err", "ice"))
## # A tibble: 2,377 x 6
##      err   ice   lon   lat  zlev  time
##
##  1 0.230    NA   144   -45     0  1460
##  2 0.320    NA   146   -45     0  1460
##  3 0.280    NA   148   -45     0  1460
##  4 0.210    NA   150   -45     0  1460
##  5 0.200    NA   152   -45     0  1460
##  6 0.240    NA   154   -45     0  1460
##  7 0.230    NA   156   -45     0  1460
##  8 0.250    NA   158   -45     0  1460
##  9 0.260    NA   160   -45     0  1460
## 10 0.290    NA   162   -45     0  1460
## # … with 2,367 more rows
Slicing into multidimensional time series
As a further example, now open a time-series NetCDF file. We apply a spatial subset on the lon and lat dimensions, convert to a tidy data frame, and plot the tos variable over time.
tos <- tidync(system.file("nc/tos_O1_2001-2002.nc", package = "stars"))
library(dplyr)
stos <- tos %>%
  hyper_filter(lon = between(lon, 140, 220),
               lat = between(lat, -60, 0)) %>%
  hyper_tibble()
library(ggplot2)
ggplot(stos, aes(lon, lat, fill = tos)) + geom_raster() + facet_wrap(~time)
We can alternatively choose the middle value of longitude (it lies at index = 90) and plot the tos variable as a function of latitude over time. We can easily re-orient our approach to this data set, and it works just as well with more complicated multi-dimensional sources.
lon180 <- tos %>%
  hyper_filter(lon = index == 90,
               lat = between(lat, -60, 0)) %>%
  hyper_tibble()
ggplot(lon180, aes(time, lat, fill = tos)) + geom_raster()
Limitations
There are some limitations specific to the tidync R package that are unrelated to the capabilities of the latest NetCDF library.
- No groups; a group can be specified by providing the group-within-a-source as the source.
- No compound types.
- No attribute metadata; coordinates of 1D axes are stored as transform tables, but coordinates of pairs (or higher sets) of axes are not explicitly linked to their array data.
- Curvilinear coordinates are not automatically expanded, because they (usually) exist on a different grid to the active one.
- Unknowns about what is supported on which platforms. This is surprisingly tricky and unstable: there are a lot of things that are possible on one operating system at a given time but not on others. The situation changes fairly slowly but is always changing, due to library versions and releases, package and tooling support on CRAN, and operating system details.
If you have problems with a given source please get in touch (open an issue on GitHub, chat on Twitter) so we can learn more about the overall landscape.
Future helpers
Coordinate expansion
A feature being considered for an upcoming version is to expand out all available linked coordinates. This occurs when an array has a dimension but only stores its index. When a dimension stores values directly this is known as a dim-coord, which usually occurs for time values. One way to expand this out would be to include an expand_coords argument to hyper_tibble() and have it run the following code:
#' Expand coordinates stored against dimensions
#'
#' @param x tidync object
#' @param ... ignored
#'
#' @return data frame of all variables and any linked coordinates
#' @noRd
#'
#' @examples
full_expand <- function(x, ...) {
  ad <- active(x)
  spl <- strsplit(ad, ",")[[1L]]
  out <- hyper_tibble(x)
  for (i in seq_along(spl)) {
    out <- dplyr::inner_join(out, activate(x, spl[i]) %>% hyper_tibble())
  }
  out
}
It's not clear how consistently this fits in the wider variants found in the NetCDF world, so any feedback is welcome.
A real-world example is available in the ncdfgeom package. This package provides much more in terms of storing geometry within a NetCDF file, but here we only extract the lon, lat, and station name that hyper_tibble() isn't seeing by default.
huc <- system.file('extdata', 'example_huc_eta.nc', package = 'ncdfgeom')
full_expand(tidync(huc))
## Joining, by = "time"
## Joining, by = "station"
## # A tibble: 50 x 5
##       et  time station   lat   lon
##
##  1    10 10957       1  36.5 -80.4
##  2    19 10988       1  36.5 -80.4
##  3    21 11017       1  36.5 -80.4
##  4    36 11048       1  36.5 -80.4
##  5   105 11078       1  36.5 -80.4
##  6   110 11109       1  36.5 -80.4
##  7   128 11139       1  36.5 -80.4
##  8   121 11170       1  36.5 -80.4
##  9    70 11201       1  36.5 -80.4
## 10    25 11231       1  36.5 -80.4
## # … with 40 more rows
hyper_tibble(tidync(huc))
## # A tibble: 50 x 3
##       et  time station
##
##  1    10 10957       1
##  2    19 10988       1
##  3    21 11017       1
##  4    36 11048       1
##  5   105 11078       1
##  6   110 11109       1
##  7   128 11139       1
##  8   121 11170       1
##  9    70 11201       1
## 10    25 11231       1
## # … with 40 more rows
Tidy approaches to other data sources
This approach could be applied to other array-based data systems, such as the ff package, the matter package, GDAL raster or multi-dimensional data sources, and HDF5 or GRIB sources.
We have experimented with this for non-NetCDF formats, please get in touch (open an issue on Github issues, chat on twitter) if you are interested.
The stars project takes another perspective on a tidy approach to scientific array data. It is very high-level and may be a drop-in solution for well-behaved data, so it's recommended to try that as well.
rOpenSci package review
The tidync package made it to CRAN after a fairly long review process on rOpenSci. The package itself was inspired by many years of experience and discussions with Tom Remenyi, Simon Wotherspoon, Sophie Bestley, and Ben Raymond. In early 2018 I really wasn't sure if it could be finished at all in a neat way and was a bit overwhelmed, but thanks to very helpful reviewers, and also some key insights about obscure types, it was done. The package benefitted greatly from the review feedback provided by Jakub Nowosad and Tim Lucas. I really appreciated the clarity provided by these reviews; it helped to finalize some design decisions on the naming of functions and their intended use. There were various aspects that I thought were obstacles to completing the project, and having reviews that did not share my concerns, gave positive feedback, and suggested more relevant changes was extremely helpful.
Thanks also to rOpenSci community members for encouragement and support!
November Thanksgiving – Data Science Style!
Posted: 04 Nov 2019 06:53 AM PST
Hello All,
November is the month of Thanksgiving, vacations, and of course deals galore! As part of saying thanks to my loyal readers, here are some deals specific to data science professionals and students that you should definitely not miss out on.

Book deals:
- If you are exploring data science careers or preparing for interviews ahead of a winter graduation, then take a look at my ebook "Data Science Jobs". It is currently part of a Kindle countdown deal, priced 80% off its normal price. It is currently only $0.99, and the price will keep increasing until Friday morning, when it goes back to full price.
- Want a FREE book on statistics, as related to R programming and machine learning algorithms? I am currently looking to give away FREE advance reviewer copies (ARCs). You can look at the book contents here, and if it seems interesting then please sign up here to be a reviewer.
- If you are deploying machine learning models on the cloud, then chances are you work with Kubernetes or have at least heard of it. If you haven't and you are an aspiring data scientist or engineer, then you should definitely learn about these concepts. The best guide on this topic, from the creators of Kubernetes, is now available for FREE, thanks to a partnership between O'Reilly Publishers and Microsoft Azure. Get the book here.
Nov projects:
- The R-programming project for November is a sentiment analysis of song lyrics by different artists. There is lots of data wrangling involved to aggregate the different lyrics and compare the lyrics favored by two different artists. The code repository is added to the Projects page here. I've written the main code in R and used Tableau to generate some of the visuals, but this can easily be tweaked to create an awesome Shiny dashboard to add to a data science portfolio.
Until next time, Adieu for now!
The post November Thanksgiving – Data Science Style! appeared first on Journey of Analytics.