[R-bloggers] Predicting the next decade in the stock market (and 4 more aRticles)


Predicting the next decade in the stock market

Posted: 31 Dec 2019 03:17 AM PST

[This article was first published on Data based investing, and kindly contributed to R-bloggers.]

Making accurate predictions using the vast amount of data produced by the stock markets and the economy itself is difficult. In this post we will examine the performance of five different machine learning models and predict the future ten-year returns for the S&P 500 using state-of-the-art libraries such as caret, xgboostExplainer and patchwork. We will use data from Shiller, Goyal and BLS. The training data spans the years 1948 to 1991, and the test set runs from 1991 to 2009 only, because the target variable is lagged by ten years.
Different investing strategies tend to work at different times, and you should expect the accuracy of the model you are using to move in cycles; sometimes the connection with returns is very strong, and sometimes very weak. Value investing is a great example of a strategy that has not really worked for the past twelve years (source, pdf). Spurious correlations are another source of trouble, since, for example, two stocks might move in tandem by pure chance. This highlights the need for some manual selection of intuitive features.
We will use eight different predictors: P/E, P/D, P/B, the CAPE ratio, total return CAPE, inflation, the unemployment rate and the 10-year US government bond rate. All five of the valuation measures are calculated for the entire S&P 500. Let's start by inspecting the correlation clusters of the different predictors and the future ten-year return (without dividends), which is used as the target.
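As a rough sketch of how the target and the correlation matrix could be put together (the monthly data frame and column names such as sp500, PE and TR_CAPE are my assumptions, not the author's actual code):

library(dplyr)
library(corrplot)

# Assumed monthly data frame 'monthly' with an S&P 500 price column 'sp500';
# the future ten-year CAGR (without dividends) is based on the lead of the price index
monthly <- monthly %>%
  arrange(date) %>%
  mutate(ret10 = (lead(sp500, 10 * 12) / sp500)^(1 / 10) - 1)

# Correlation clusters of the predictors and the target
feats <- select(monthly, PE, PD, PB, CAPE, TR_CAPE, Infl, Unemp, Rate10, ret10)
corrplot(cor(feats, use = "pairwise.complete.obs"),
         order = "hclust", method = "color", addCoef.col = "black")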
The different valuation measures are strongly correlated with each other, as expected. All except P/B have a very strong negative correlation with the future 10-year returns. CAPE and total return CAPE, a newer measure that also accounts for reinvested dividends, are very strongly correlated with each other. Total return CAPE is also slightly less correlated with the future ten-year return than the normal CAPE.

The machine learning models

First, we will create a naïve model, which predicts the future return to be the same as the average return in the training set. After training the five models, we will also build one ensemble model from them to see if it can reach a higher accuracy than any single model, which is usually the case.
The models we are going to use are quite different from each other. The glmnet model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility and also performs automated feature selection (except when the alpha hyperparameter is exactly zero, as in ridge regression). K-nearest-neighbors makes its predictions by comparing the observation to similar observations. MARS, on the other hand, takes nonlinearities in the data into account and also considers interactions between the features. XGBoost is a tree model, which likewise handles both nonlinearities and interactions. It improves each tree by building it on the residuals of the previous tree (boosting), which may lead to better accuracy. Both MARS and SVM (support vector machines) are very flexible and may therefore overfit quite easily, especially when the dataset is small. The XGBoost model is also quite flexible but does not overfit easily, since it performs regularization and pruning.
Finally, we have the ensemble model, which simply takes the mean of the predictions of all the models. Ensembling is a popular strategy in machine learning competitions for reaching accuracies beyond that of any single model.
The models will be built using the caret wrapper, and the optimal hyperparameters are chosen using time slicing, which is a cross-validation technique suitable for time series. We will use five time slices to capture as many periods as possible while still having enough observations in each. The cross-validation is done on the training data, which consists of 70 percent of the observations, while the remaining 30 percent is kept as a test set.
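A minimal sketch of this setup with caret is shown below; the window sizes, tuning lengths and caret method names are illustrative assumptions rather than the exact values used in the post.

library(caret)

dat <- na.omit(feats)                       # predictors + target from the sketch above
n_train <- floor(0.7 * nrow(dat))           # chronological 70/30 split
train_df <- dat[1:n_train, ]
test_df  <- dat[-(1:n_train), ]

# Time-sliced cross-validation; window sizes chosen to give roughly five slices
ctrl <- trainControl(method = "timeslice", initialWindow = 240,
                     horizon = 60, skip = 60, fixedWindow = FALSE)

methods <- c("glmnet", "knn", "earth", "svmRadial", "xgbTree")
fits <- lapply(methods, function(m)
  train(ret10 ~ ., data = train_df, method = m,
        trControl = ctrl, tuneLength = 5))
names(fits) <- c("GLMNET", "KNN", "MARS", "SVM", "XGBoost")

# The ensemble is simply the mean of the five model predictions
preds <- sapply(fits, predict, newdata = test_df)
preds <- cbind(preds, Ensemble = rowMeans(preds))

# Test-set accuracy of each model and of the ensemble
apply(preds, 2, function(p)
  c(MAE = MAE(p, test_df$ret10), R2 = R2(p, test_df$ret10)))

The results are shown below: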

Results

The predictions are less accurate after the red line, which separates the training and test sets. The model has not seen the data on the right side of the line, so its accuracy there can be thought of as a proxy for how well the model would perform in the future.

We will examine the model accuracies on the test set using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:

Model        MAE       R²
Naive model  5.16 %
Ensemble     2.15 %    48.2 %
GLMNET       3.00 %    29.7 %
KNN          3.37 %    10.6 %
MARS         10.70 %   90.2 %
SVM          10.80 %   13.1 %
XGBoost      2.17 %    60.1 %

The two most flexible models, MARS and SVM, behave wildly on the test set and show signs of overfitting. Both have mean absolute errors about twice as high as the naïve model's. Even though MARS has a high R-squared, its mean absolute error is high, which is why you cannot trust R-squared alone. Glmnet produces quite plausible predictions until the year 2009, when the rapid growth of the P/E ratio most likely throws it off. K-nearest-neighbors does not react to the data very much but still achieves quite a low MAE. Of the single models, XGBoost performs the best. The ensemble model, however, performs slightly better as measured by MAE. It also seems to be the most stable model, which is expected, since it combines the predictions of the other models.

Let's then look at the feature importances. They are calculated in different ways for the different model types but should still be roughly comparable. The plotting is done using the patchwork library, which allows plots to be combined by simply adding them together with a plus sign.
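The plot construction could look roughly like the sketch below; varImp() is caret's importance extractor, while the plot_imp() helper and the fits list are assumptions carried over from the earlier sketch, not the author's code.

library(ggplot2)
library(patchwork)

# Build a simple importance bar chart for one caret model
plot_imp <- function(fit, title) {
  imp <- varImp(fit)$importance             # scaled 0-100 importances
  imp$feature <- rownames(imp)
  ggplot(imp, aes(x = reorder(feature, Overall), y = Overall)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Importance")
}

# patchwork combines the plots with `+` and lays them out with plot_layout()
plot_imp(fits$GLMNET, "GLMNET") + plot_imp(fits$KNN, "KNN") +
  plot_imp(fits$MARS, "MARS")   + plot_imp(fits$XGBoost, "XGBoost") +
  plot_layout(ncol = 2)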
Upon closer inspection of the feature importances, we see that the MARS model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.

Future predictions

Lastly, we will predict the next ten years in the stock market and compare the predictions of the different models. We will also look more closely at the best-performing single model, XGBoost, by inspecting the composition of its prediction. The current values of the features are mostly obtained from the sources listed at the beginning of this post, but also from Trading Economics and multpl.

Model     10-year CAGR prediction
Ensemble   2.20 %
GLMNET     1.47 %
KNN        4.04 %
MARS      -9.85 %
SVM        6.46 %
XGBoost    8.86 %


The MARS model is the most pessimistic, with a return prediction that is quite strongly negative. The model should however not be trusted too much since it uses only one variable and does not behave well on the test data. The XGBoost model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the MARS model.

Let's then look at the XGBoost model more closely using the xgboostExplainer library. The resulting plot is a waterfall chart which shows the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points. This is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is slightly positive at two percentage points, but the currently high P/E ratio cancels this out. The rest of the features have a very small effect on the prediction.
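A rough sketch of the xgboostExplainer calls behind such a waterfall chart is below; the objects fits$XGBoost, train_df and current_values are assumptions carried over from the earlier sketches rather than the author's actual code, and the exact arguments may need adjusting.

library(xgboost)
library(xgboostExplainer)

# xgboostExplainer works on the raw booster and xgb.DMatrix objects
x_train <- xgb.DMatrix(data.matrix(train_df[, setdiff(names(train_df), "ret10")]),
                       label = train_df$ret10)
current_x <- data.matrix(current_values)    # one row of today's feature values
x_new <- xgb.DMatrix(current_x)

explainer <- buildExplainer(fits$XGBoost$finalModel, x_train, type = "regression")
showWaterfall(fits$XGBoost$finalModel, explainer, x_new, current_x,
              idx = 1, type = "regression")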

The benefit of predicting the returns of a single stock market is mostly limited to being able to adjust your expectations for the future. However, predicting the returns of multiple stock markets and investing in the ones with the highest return predictions is most likely a very profitable strategy. Klement (2012) has shown that the CAPE ratio alone does quite a good job of predicting the returns of different stock markets. Adding more sensible variables to the model is likely to make it more stable and perhaps better at predicting the outcome.

Be sure to follow me on Twitter for updates about new blog posts like this!

The R code used in the analysis can be found here.

To leave a comment for the author, please follow the link and comment on their blog: Data based investing.


von Bertalanffy Growth Plots I

Posted: 30 Dec 2019 10:00 PM PST

[This article was first published on fishR Blog, and kindly contributed to R-bloggers.]


Introduction

library(FSAdata) # for data
library(FSA)     # for vbFuns(), vbStarts(), confint.bootCase()
library(car)     # for Boot()
library(dplyr)   # for filter(), mutate()
library(ggplot2)

I am continuing to learn ggplot2 for elegant graphics. I often make a plot to illustrate the fit of a von Bertalanffy growth function to data. In general, I want this plot to have:

  • Transparent points to address over-plotting of fish with the same length and age.
  • A fitted curve with a confidence polygon over the range of observed ages.
  • A fitted curve (without a confidence polygon) over a larger range than the observed ages (this often helps identify problematic fits).

Here I demonstrate how to produce such plots with lengths and ages of Lake Erie Walleye (Sander vitreus) captured during October-November, 2003-2014. These data are available in my FSAdata package and formed many of the examples in Chapter 12 of the Age and Growth of Fishes: Principles and Techniques book. My primary interest here is in the tl (total length in mm) and age variables (see here for more details about the data). I focus on female Walleye from location "1" captured in 2014 in this example.

data(WalleyeErie2)
wf14T <- dplyr::filter(WalleyeErie2,year==2014,sex=="female",loc==1)

The workflow below requires understanding the minimum and maximum observed ages.

agesum <- group_by(wf14T,sex) %>%
  summarize(minage=min(age),maxage=max(age))
agesum
## # A tibble: 1 x 3
##   sex    minage maxage
## 1 female      0     11

 

Fitting a von Bertalanffy Growth Function

Methods for fitting a von Bertalanffy growth function (VBGF) are detailed in my Introductory Fisheries Analyses with R book and in Chapter 12 of the Age and Growth of Fishes: Principles and Techniques book. Briefly, a function for the typical VBGF is constructed with vbFuns()¹.

( vb <- vbFuns(param="Typical") )
## function(t,Linf,K=NULL,t0=NULL) {
##   if (length(Linf)==3) { K <- Linf[[2]]
##                          t0 <- Linf[[3]]
##                          Linf <- Linf[[1]] }
##   Linf*(1-exp(-K*(t-t0)))
##   }

Reasonable starting values for the optimization algorithm may be obtained with vbStarts(), where the first argument is a formula of the form lengths~ages where lengths and ages are replaced with the actual variable names containing the observed lengths and ages, respectively, and data= is set to the data.frame containing those variables.

( f.starts <- vbStarts(tl~age,data=wf14T) )
## $Linf
## [1] 645.2099
## 
## $K
## [1] 0.3482598
## 
## $t0
## [1] -1.548925

The nls() function is typically used to estimate parameters of the VBGF from the observed data. The first argument is a formula that has lengths on the left-hand-side and the VBGF function created above on the right-hand-side. The VBGF function has the ages variable as its first argument and then Linf, K, and t0 as the remaining arguments (just as they appear here). Again, the data.frame with the observed lengths and ages is given to data= and the starting values derived above are given to start=.

f.fit <- nls(tl~vb(age,Linf,K,t0),data=wf14T,start=f.starts)

The parameter estimates are extracted from the saved nls() object with coef().

coef(f.fit)
##       Linf          K         t0 
## 648.208364   0.361540  -1.283632

Bootstrapped confidence intervals for the parameter estimates are computed by giving the saved nls() object to Boot() and giving the saved Boot() object to confint().

f.boot1 <- Boot(f.fit)  # Be patient! Be aware of some non-convergence
confint(f.boot1)
## Bootstrap bca confidence intervals
## 
##           2.5 %      97.5 %
## Linf 619.519302 686.5927399
## K      0.297934   0.4317571
## t0    -1.548261  -1.0503317

 

Preparing Predicted Values for Plotting

Predicted lengths-at-age from the fitted VBGF are needed to plot the fitted VBGF curve. The predict() function may be used to predict mean lengths at ages from the saved nls() object.

predict(f.fit,data.frame(age=2:7))
## [1] 450.4495 510.4490 552.2448 581.3599 601.6415 615.7698

What is needed, however, are the predicted mean lengths at age for each bootstrap sample, so that bootstrapped confidence intervals for each mean length-at-age can be derived. To do this with Boot(), predict() needs to be embedded into another function. For example, the function below does the same as predict() but is in a form that will work with Boot().

predict2 <- function(x) predict(x,data.frame(age=ages))
ages <- 2:7
predict2(f.fit)  # demonstrates same result as predict() above
## [1] 450.4495 510.4490 552.2448 581.3599 601.6415 615.7698

Predicted mean lengths-at-age, with bootstrapped confidence intervals, can then be constructed by giving Boot() the saved nls() object AND the new prediction function in f=. The Boot() code will thus compute the predicted mean length at all ages between -1 and 12 in increments of 0.2². I extended the age range outside the observed range of ages because I want to see the shape of the curve nearer t0 and at older ages (to better see L∞).

ages <- seq(-1,12,by=0.2)
f.boot2 <- Boot(f.fit,f=predict2)  # Be patient! Be aware of some non-convergence

The vector of ages, the predicted mean lengths-at-age (from predict()), and the associated bootstrapped confidence intervals (from confint()) are placed into a data.frame for later use.

preds1 <- data.frame(ages,
                     predict(f.fit,data.frame(age=ages)),
                     confint(f.boot2))
names(preds1) <- c("age","fit","LCI","UCI")
headtail(preds1)
##      age       fit       LCI      UCI
## V1  -1.0  63.17547  12.18055 102.3627
## V2  -0.8 103.98483  62.48577 136.5450
## V3  -0.6 141.94750 108.01521 168.4213
## V64 11.6 642.05952 615.02952 672.9536
## V65 11.8 642.48843 615.36045 673.8122
## V66 12.0 642.88743 615.56480 674.5265

For my purposes below, I also want predicted mean lengths only for observed ages. To make the code below cleaner, a new data.frame restricted to the observed ages is made here.

preds2 <- filter(preds1,age>=agesum$minage,age<=agesum$maxage)
headtail(preds2)
##     age      fit      LCI      UCI
## 1   0.0 240.6728 224.2408 253.8395
## 2   0.2 269.1007 256.7356 278.7312
## 3   0.4 295.5456 286.5712 302.3211
## 54 10.6 639.3815 613.6163 668.0091
## 55 10.8 639.9972 614.0103 669.1005
## 56 11.0 640.5700 614.2978 670.1147

 

Constructing the Plot

A ggplot2 plot often starts by defining data= and aes()thetic mappings in ggplot(). However, the data and aesthetics should not be set in ggplot() in this application because information will be drawn from three data.frames – wf14T, preds1, and preds2. Thus, the data and aesthetics will be set within specific geoms.

The plot begins with a polygon that encases the lower and upper confidence interval values for mean length at each age. This polygon is constructed with geom_ribbon() using preds2 (so the confidence polygon covers only the observed ages), where the x-axis is mapped to age and the lower and upper edges of the ribbon are mapped to LCI and UCI, respectively. The fill color of the polygon is set with fill=.³

ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90")

plot of chunk vbFit1a

Observed lengths and ages in the wf14T data.frame were then added to this plot with geom_point(). The points are slightly larger than the default (with size=) and fairly transparent (with a low alpha=) to handle considerable over-plotting.

ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90") +
  geom_point(data=wf14T,aes(y=tl,x=age),size=2,alpha=0.1)

plot of chunk vbFit1b

The fitted curve over the entire range of ages used above (i.e., using preds1) is added with geom_line(). A slightly thicker than default (size=) dashed (linetype=) line was used.

ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90") +
  geom_point(data=wf14T,aes(y=tl,x=age),size=2,alpha=0.1) +
  geom_line(data=preds1,aes(y=fit,x=age),size=1,linetype=2)

plot of chunk vbFit1c

The fitted curve for just the observed range of ages (i.e., using preds2) is added using a solid line so that the dashed line for the observed ages is covered.

ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90") +
  geom_point(data=wf14T,aes(y=tl,x=age),size=2,alpha=0.1) +
  geom_line(data=preds1,aes(y=fit,x=age),size=1,linetype=2) +
  geom_line(data=preds2,aes(y=fit,x=age),size=1)

plot of chunk vbFit1d

The y- and x-axes are labelled (name=), the expansion factor for the axis limits is removed (expand=c(0,0)) so that the point (0,0) is in the corner of the plot, and the axis limits (limits=) and breaks (breaks=) are controlled using scale_y_continuous() and scale_x_continuous().

ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90") +
  geom_point(data=wf14T,aes(y=tl,x=age),size=2,alpha=0.1) +
  geom_line(data=preds1,aes(y=fit,x=age),size=1,linetype=2) +
  geom_line(data=preds2,aes(y=fit,x=age),size=1) +
  scale_y_continuous(name="Total Length (mm)",limits=c(0,700),expand=c(0,0)) +
  scale_x_continuous(name="Age (years)",expand=c(0,0),
                     limits=c(-1,12),breaks=seq(0,12,2))

plot of chunk vbFit1e

Finally, the classic black-and-white theme (primarily to remove the gray background) was used (theme_bw()) and the grid lines were removed (panel.grid=).

vbFitPlot <- ggplot() + 
  geom_ribbon(data=preds2,aes(x=age,ymin=LCI,ymax=UCI),fill="gray90") +
  geom_point(data=wf14T,aes(y=tl,x=age),size=2,alpha=0.1) +
  geom_line(data=preds1,aes(y=fit,x=age),size=1,linetype=2) +
  geom_line(data=preds2,aes(y=fit,x=age),size=1) +
  scale_y_continuous(name="Total Length (mm)",limits=c(0,700),expand=c(0,0)) +
  scale_x_continuous(name="Age (years)",expand=c(0,0),
                     limits=c(-1,12),breaks=seq(0,12,2)) +
  theme_bw() +
  theme(panel.grid=element_blank())
vbFitPlot

plot of chunk vbFit1f

 

BONUS – Equation on Plot

Below is an undocumented bonus for how to put the equation of the best-fit VBGM on the plot. This is hacky so I would not expect it to be very general (e.g., it likely will not work across facets).

makeVBEqnLabel <- function(fit) {
  # Isolate coefficients (and control decimals)
  cfs <- coef(fit)
  Linf <- formatC(cfs[["Linf"]],format="f",digits=1)
  K <- formatC(cfs[["K"]],format="f",digits=3)
  # Handle t0 differently because of minus in the equation
  t0 <- cfs[["t0"]]
  t0 <- paste0(ifelse(t0<0,"+","-"),formatC(abs(t0),format="f",digits=3))
  # Put together and return
  paste0("TL==",Linf,"~bgroup('(',1-e^{-",K,"~(age",t0,")},')')")
}

vbFitPlot + annotate(geom="text",label=makeVBEqnLabel(f.fit),parse=TRUE,
                     size=4,x=Inf,y=-Inf,hjust=1.1,vjust=-0.5)

plot of chunk vbFit1g

 

Final Thoughts

This post is likely not news to those of you who are familiar with ggplot2. However, I am trying to post some examples here as I learn ggplot2 in hopes that it will help others. My first post was here. In my next post I will demonstrate how to show von Bertalanffy curves for two or more groups.

 

 

 

Footnotes

  1. Other parameterizations of the VBGF can be used with param= in vbFuns(). Parameterizations of the Gompertz, Richards, and Logistic growth functions are available in GompertzFuns(), RichardsFuns(), and logisticFuns() of the FSA package. See here for documentation. The Schnute four-parameter growth model is available in Schnute() and the Schnute-Richards five-parameter growth model is available in SchnuteRichards()↩

  2. Reduce the value of by= in seq() to make for a smoother VBGF curve when plotting later. ↩

  3. This polygon will look better in the final plot when the gray background is removed. Also note that the polygon could be outlined by setting color= to a color other than what is given in fill=↩

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.


RStudio Blogs 2019

Posted: 30 Dec 2019 04:00 PM PST

[This article was first published on R Views, and kindly contributed to R-bloggers.]

If you are lucky enough to have some extra time for discretionary reading during the holiday season, you may find it interesting (and rewarding) to sample some of the nearly two hundred posts written across the various RStudio blogs.

R Views

R Views, our blog devoted to the R Community and the R Language, published over sixty posts in 2019. Many of these were contributed by guest authors from the R Community who volunteered to share some outstanding work. Among my favorites are the multi-part posts that explored data science modeling issues in some detail. These include Roland Stevenson's three-part series on Multiple Hypothesis Testing and A/B Testing, the four-part series on Analyzing the HIV pandemic by Andrie de Vries and Armand Bester, and Jonathan Regenstein's two-part series on Tech Dividends.

RStudio Blog

The RStudio blog is the place to go for official information on RStudio. It includes posts on open-source and commercial products, events, and company news. Just scanning the summary paragraphs will give you a good overview of what went on at RStudio this past year. Among my favorite posts for the year is Lou Bajuk's take on the complementary roles of R and Python: R vs. Python: What's the best language for Data Science?.

TensorFlow for R Blog

The TensorFlow for R Blog provides "nuts and bolts" reading on building TensorFlow models that ought to be on the list of every data scientist working in R. The posts cover an amazingly wide range of cutting edge topics. For example, see Sigrid Keydana's recent posts Differential Privacy with TensorFlow, and Getting started with Keras from R – the 2020 edition.

Tidyverse Blog

The Tidyverse Blog offers insight into Tidyverse packages and capabilities at all levels. Scan the summaries like you would a bookshelf in your favorite technical bookstore, and pick out something new like Davis Vaughan's exposition of the new hardhat package which provides tools for developing new modeling packages, or take a deep dive into task queues with Gábor Csárdi's Multi Process Task Queue in 100 Lines of R Code.

Ursa Labs Blog

Ursa Labs, a project devoted to open-source data science and cross-language software that RStudio sponsors along with several other organizations, is one for which we have great hope. Wes McKinney's post Ursa Labs Team Report August to December 2019 provides an overview of the progress made in 2019.

Happy Reading!
and
Happy New Year!
from all of us at RStudio.

To leave a comment for the author, please follow the link and comment on their blog: R Views.


Introduction to Data Science in R, Free for 3 days

Posted: 30 Dec 2019 10:30 AM PST

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

To celebrate the new year and the recent release of Practical Data Science with R 2nd Edition, we are offering a free coupon for our video course "Introduction to Data Science."

The following URL and code should get you permanent free access to the video course, if used between now and January 1st 2020:

https://www.udemy.com/course/introduction-to-data-science/ code: PDSWR2

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Can Genealogical data be tidy?

Posted: 29 Dec 2019 04:00 PM PST

[This article was first published on R on R-house, and kindly contributed to R-bloggers.]

Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy

Like families, tidy datasets are all alike but every messy dataset is messy in its own way — Hadley Wickham

In this post, I'll be exploring how genealogical data stored in the de-facto standard format, GEDCOM, could be made tidy, and arguing that this is not really ideal.

About 6 years ago, long before I got involved with Data Science and when R was just the 18th letter of the alphabet, I started researching my family history. It was really interesting, hugely rewarding, and I rapidly found myself inundated with various pieces of information – a lot of it conflicting – from various sources. Desperate to organise it all, I discovered the Genealogical Data Communication (GEDCOM) format. I used this format to record all I had found and used some special freeware to generate family tree diagrams in PDF format.

Fast forward to today.

I now find myself in a situation where I'm keen to dig out my old GEDCOM file and see what R can do with it! I searched on GitHub for repos that manipulate GEDCOM files in R, and perhaps the most promising was one by Peter Prevos who had written a short article describing the format of the file and its limitations. I highly recommend you give it a read.

For all its faults, the GEDCOM data format has been the standard for decades, so a fundamental constraint here is that I'm not going to try to invent a whole new format, I'm just going to try to deal with the standard we have. Files contain data on more than one type of observational unit, including individuals, families, and data sources. It's inappropriate to try to fit all of that in one big dataframe, so I'll just be focusing on individuals in this post.

Peter has not only written some code to read GEDCOM files, but also code to do some simple analysis and generate some visualisations using the tidyverse. This takes data which is inherently more like a nested list structure, and creates a tidy dataframe, with a row for each individual, and fields that include name, birth date, mother and father. On the face of it, this seems intuitive, but when dealing with detailed genealogical data, this isn't entirely suitable. Part of the problem comes down to conflicting data.

One of the strengths of the GEDCOM format is the ability to record several possible values of an individual's attribute. For example, if one source tells you an ancestor was born in 1900, and another tells you they were born in 1901, you don't have to choose one as correct and dismiss the other – you can record both and capture the uncertainty – which is an absolutely crucial capability of any genealogical data format. If we were to try to capture these possible values using the dataframe format, one might imagine having a row for every combination of possible values, e.g.

ID   Name        DOB               Place_of_death
I56  Joe Bloggs  12 December 1900  Somerset, UK
I56  Joe Bloggs  12 December 1901  Somerset, UK
I56  Joe Bloggs  12 December 1900  Devon, UK
I56  Joe Bloggs  12 December 1901  Devon, UK

Unfortunately, this has two drawbacks: you could feasibly end up with hundreds of rows for a single individual as the different possibilities for dozens of fields multiply up – with only one row being 'correct' – resulting in a lot of unnecessary data duplication. You could employ nested list columns to get around this, but that would make the dataframe complex to deal with and difficult to share with non-R users. It also wouldn't solve the second issue – being able to record the data source for each conflicting piece of data.
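To see how quickly the combinations multiply, here is a small sketch with tidyr, using the hypothetical values from the table above:

library(tidyr)

# Two possible birth dates x two possible places of death = four rows already;
# a dozen uncertain fields would multiply into hundreds of rows per person
crossing(ID   = "I56",
         Name = "Joe Bloggs",
         DOB  = c("12 December 1900", "12 December 1901"),
         Place_of_death = c("Somerset, UK", "Devon, UK"))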

These limitations rapidly lead you down a path of considering an 'ultra-tidy' dataframe instead, where each row records a possible value for an individual attribute and a source can be recorded for each, e.g.

ID   attribute       value             source
I56  Name            Joe Bloggs        A
I56  DOB             12 December 1900  A
I56  Place_of_death  Somerset, UK      A
I56  DOB             12 December 1901  B
I56  Place_of_death  Devon, UK         B

This is a lot better, especially since you could add a 'notes' column (NOTE is one of the tags in a GEDCOM file) to attach a note to any data value. Unfortunately, uncertainty isn't the only reason why a field would have more than one value. Fields like occupation and address could have several values because an individual may have had several over their lifetime. So, we might consider adding further fields to the above capturing the instants or periods of time for which each value applies, as sketched below.
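As a sketch of what such an 'ultra-tidy' structure might look like in R (the occupation rows, the from/to columns and source 'C' are made up for illustration):

library(tibble)
library(dplyr)

facts <- tribble(
  ~ID,   ~attribute,       ~value,             ~source, ~from,  ~to,
  "I56", "Name",           "Joe Bloggs",       "A",     NA,     NA,
  "I56", "DOB",            "12 December 1900", "A",     NA,     NA,
  "I56", "DOB",            "12 December 1901", "B",     NA,     NA,
  "I56", "Place_of_death", "Somerset, UK",     "A",     NA,     NA,
  "I56", "Occupation",     "Farm labourer",    "C",     "1911", "1939"
)

# Attributes recorded more than once (conflicting or time-varying values)
facts %>%
  count(ID, attribute) %>%
  filter(n > 1)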

Now we encounter a real problem. There is a very good reason why the GEDCOM data structure is nested in nature – in order to handle things like name and address. The NAME field may contain the individual's full name, but child fields may decompose this into given name (GIVN) and surname (SURN), as well as other child fields not found in the parent NAME field, such as nicknames (NICK). Similarly, the address field has child fields for city, state, and country.

I have considered having something like three attribute columns (for 3 levels of nesting), but we lose the benefit of having one row per attribute, and it feels like a fudge too far.

I've therefore abandoned my intention of converting my GEDCOM files to tidy dataframes and have looked for alternatives. I know Peter has begun exploring network data structures and I can certainly see why.

I have since discovered an open source genealogy project called Gramps which seems to rely on XML data structures. Sounds promising. I intend to try installing this and seeing how it fares with converting my existing GEDCOM files.

To leave a comment for the author, please follow the link and comment on their blog: R on R-house.

