[R-bloggers] Why Machine Learning is more Practical than Econometrics in the Real World (and 4 more aRticles)

Why Machine Learning is more Practical than Econometrics in the Real World

Posted: 18 Aug 2019 08:54 PM PDT

[This article was first published on R – Remix Institute, and kindly contributed to R-bloggers.]

Motivation

I've read several studies and articles that claim Econometric models are still superior to machine learning when it comes to forecasting. In the article "Statistical and Machine Learning forecasting methods: Concerns and ways forward", the authors state that,

"After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined."

In many business environments a data scientist is responsible for generating hundreds or thousands (possibly more) of forecasts for an entire company, as opposed to a single series forecast. While it appears that Econometric methods are better at forecasting a single series (which I generally agree with), how do they compare at forecasting multiple series, which is likely a more common requirement in the real world? Some other things to consider when digesting the takeaways from that study:

  • Do the ML models benefit from building a single model to forecast all series at once, which most time series models cannot do?
  • What are the run-time differences with both approaches?
  • The authors of the linked article state that the Econometric models outperform machine learning models across all forecast horizons, but is that really the case?

Approach

In this article, I am going to show you an experiment I ran that compares machine learning models and Econometrics models for time series forecasting on an entire company's set of stores and departments.

Before I kick this off, I have to mention that I've come across several articles that describe how one can utilize ML for forecasting (typically with deep learning models), but I haven't seen any that truly give ML the best chance at outperforming traditional Econometric models. On top of that, I also haven't seen too many legitimate attempts to showcase the best that Econometric models can do either. That's where this article and evaluation differ. The suite of functions I tested are near-fully optimized versions of both ML models and Econometric models (the list of models and tuning details are below). The functions come from the R open source package RemixAutoML, which is a suite of functions for automated machine learning (AutoML), automated forecasting, automated anomaly detection, automated recommender systems, automated feature engineering, and more. I provided the R script at the bottom of this article so you can replicate this experiment. You can also utilize the functions in Python via the rpy2 package and in Julia via the RCall package.

The Data

The data I'm utilizing comes from Kaggle — weekly Walmart sales by store and department. I'm only using the store and department combinations that have complete data to minimize the noise added to the experiment, which leaves me with a total of 2,660 individual store and department time series. Each store & dept combo has 143 records of weekly sales. I also removed the "IsHoliday" column that was provided.

Preview of Walmart Store Sales Kaggle Data Set

The Experiment

Given the comments from the article linked above, I wanted to test out several forecast horizons. The performance of all models is compared on n-step-ahead forecasts, for n = {1,5,10,20,30}, with distinct model builds used for each n-step forecast test. For each run, I have 2,660 evaluation time series for comparison, represented by each store and department combination. In the Results section you can find the individual results for each of those runs.

The Models

In the experiment I used the AutoTS() function for testing out Econometric models, and I used the RemixAutoML CARMA suite (Calendar-Auto-Regressive-Moving-Average) for testing out Machine Learning. The AutoTS() function tests out every model from the list below in several ways (similar to grid tuning in ML). The ML suite contains 4 different tree-based algorithms. As a side note, the Econometric models all come from the forecast package in R. You can see a detailed breakdown of how each model is optimized below the Results section in this article.

Econometrics Models used in AutoTS()

  1. DSHW — Double-Seasonal Holt-Winters
  2. ARIMA — Autoregressive, integrated, moving average
  3. ARFIMA — Fractionally differenced ARIMA
  4. ETS — Exponential Smoothing State-Space Model
  5. NN — Feed-forward neural network with a single hidden layer and lagged inputs. I'm counting this towards Econometrics because it comes from the forecast package in R, which is an Econometrics package, and because it isn't as customizable as a TensorFlow or PyTorch model. (Besides, I've seen authors state that linear regression is machine learning, which would imply that all of the Econometrics methods are Machine Learning, but I don't want to debate that here.) If you want to consider the NN a Machine Learning model, just factor that into the results data below.
  6. TBATS (Exponential smoothing state-space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components)
  7. TSLM — time series linear model with trend and seasonal components
Example Plot from AutoTS(): Single Store Forecast with 80% & 95% Prediction Intervals

Machine Learning Algorithms

  1. AutoCatBoostCARMA() — CatBoost
  2. AutoXGBoostCARMA() — XGBoost
  3. AutoH2oGBMCARMA() — H2O Gradient Boosting Machine
  4. AutoH2oDRFCARMA() — H2O Distributed Random Forest
Example Plot from AutoCatBoostCARMA(): Aggregated Forecast

Results

The tables below show the ranks of the 11 models (7 Econometric and 4 Machine Learning) when it comes to lowest mean absolute error (MAE) for every single store and department combination (2,660 individual time series) across five different forecast horizons.

For example, in the 1-step ahead forecast table below, NN was the most accurate model on 666 of the 2,660 time series. TBATS was the most accurate 414 times out of the 2,660.

Still looking at the 1-step ahead forecast table below, the NN was the second most accurate on 397 out of 2,660 time series. TBATS was the second most accurate on 406 out of the 2,660 time series. TBATS ranked last place (11th) 14 times.

  • Winner in Single Model Accuracy — TBATS is the winner of the competition (Econometrics model) with a mean rank of 1.6 across all five forecast horizons.
  • Runner-Up in Single Model Accuracy — Catboost is the runner up of the competition (Machine Learning model) with a mean rank of 3.6 across all five forecast horizons.
  • Winner in Run Time — ML is the winner: for a single run (there were 5 total, one per forecast horizon) the automated Econometrics forecasting took an average of 33 hours(!) to run, while the automated ML models took an average of 3.5 hours combined, where each ML run included a 6-model grid tune (1 hour for CatBoost, 1 hour for XGBoost, 30 minutes for H2O GBM, and 1 hour for H2O Distributed Random Forest).
  • Winner in Shorter-Horizon Accuracy — The Econometrics models dominate on shorter forecast horizons.
  • Winner in Longer Horizon Accuracy — The Machine Learning models dominate on longer forecast horizons.

Aggregate Summaries

Mean Rank by Model
Ranks: Long Term = {20,30} Period-Ahead & Short Term = {1,5,10} Period-Ahead

The histograms below were derived from selecting the best Econometrics model for each individual store and department time series (essentially the ensemble results) and the best Machine Learning model for each individual store and department time series (ensemble). You can see that as the forecast horizon grows, the Machine Learning models catch up to and slightly overtake the Econometrics models. At the shorter forecast horizons, the Econometrics models outperform the Machine Learning models by a wider margin than the reverse.

Forecast MAE by Econometrics (Blue) and Machine Learning (Gold): 1-Period, 5-Period, 20-Period, 30-Period

Individual Forecast Horizon Summaries by Model

1-Period Ahead Forecast Model Counts by Rank (based on lowest MAE)
5-Period Ahead Forecast Model Counts by Rank (based on lowest MAE)
10-Period Ahead Forecast Model Counts by Rank (based on lowest MAE)
20-Period Ahead Forecast Model Counts by Rank (based on lowest MAE)
30-Period Ahead Forecast Model Counts by Rank (based on lowest MAE)

Conclusion

While the short-term horizon forecasts are more accurate with the Econometrics models, I tend to have a greater need for longer-term forecasts for planning purposes, and the Machine Learning models beat the Econometrics models in that category. On top of that, the run time is a pretty significant factor for me.

If your business needs are the opposite, the Econometrics models are probably your best bet, assuming the run times are not a concern.

If I had enough resources available I'd run both functions and utilize the individual models that performed best for each series, which means I'd be utilizing all 11 models.

Algorithm Tuning Details

Econometrics Model Details:

Each of the individual Econometrics models in AutoTS() is optimized based on the following treatments.

Global Optimizations (applies to all models):

A) Optimal Box-Cox Transformations are used in every run where data is strictly positive. The optimal transformation could be no transformation (artifact of Box-Cox).

B) Four different treatments are tested for each model:

  • user-specified time frequency + no outlier smoothing & no imputation
  • model-based time frequency + no outlier smoothing & no imputation
  • user-specified time frequency + outlier smoothing & imputation
  • model-based time frequency + outlier smoothing & imputation

The treatment of outlier smoothing and imputation sometimes has a beneficial effect on forecasts; sometimes it doesn't. You really need to test out both to see what generates more accurate predictions out-of-sample. The same goes for manually defining the frequency of the data: if you have daily data, you specify "day" in the AutoTS() arguments. Alternatively, if the model-based frequency is specified, spectral analysis is done to find the frequency of the data based on the dominant trend and seasonality. Sometimes this approach works better, sometimes it doesn't. That's why I test all the combinations for each model.
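
For a concrete picture of the model-based frequency option, here is a minimal sketch using forecast::findfrequency(), which estimates the dominant frequency of a series from its periodogram. AutoTS() does something similar internally when model-based frequency is requested, so treat this as an illustration rather than the package's internals; the series below is a made-up placeholder.

library(forecast)

# Placeholder weekly series with an annual (52-week) cycle plus noise
weekly_sales <- ts(20000 + 3000 * sin(2 * pi * (1:143) / 52) + rnorm(143, sd = 500))

# Spectral-based estimate of the dominant frequency (should land near 52 here)
forecast::findfrequency(weekly_sales)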

Individual Model Optimizations:

C) For ARIMA and ARFIMA, I used up to 25 lags and moving averages, the number of differences and seasonal differences (at most one of each) is determined algorithmically, and everything is selected via a stepwise procedure (all combinations can be tested and run in parallel, but that's too time-consuming for my patience).
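
For reference, here is a hedged sketch of that kind of stepwise search using forecast::auto.arima() directly; the argument values mirror the settings described above but are illustrative, not AutoTS() internals, and walmart_series is a placeholder numeric vector.

library(forecast)

y <- ts(walmart_series, frequency = 52)  # placeholder: one store/dept's weekly sales

fit <- auto.arima(
  y,
  max.p = 25, max.q = 25,  # up to 25 AR and MA terms
  max.d = 1,  max.D = 1,   # at most one difference and one seasonal difference
  stepwise = TRUE          # stepwise search rather than testing every combination
)
forecast(fit, h = 30)      # 30-step-ahead forecast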

D) For the Double Seasonal Holt-Winters model, alpha, beta, gamma, omega, and phi are determined using least-squares and the forecasts are adjusted using an AR(1) model for the errors.

E) The Exponential Smoothing State-Space model runs through an automatic selection of the error type, trend type, and season type, with the options being "none", "additive", and "multiplicative", along with testing of damped vs. non-damped trend (either additive or multiplicative). Alpha, beta, and phi are estimated.

F) The Neural Network is set up to test out every combination of lags and seasonal lags (25 lags, 1 seasonal lag) and the version with the best holdout score is selected.

G) The TBATS model utilizes 25 lags and moving averages for the errors, damped trend vs. non-damped trend are tested, trend vs. non-trend are also tested, and the model utilizes parallel processing.

H) The TSLM model utilizes simple time trend and season depending on the frequency of the data.
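
As a rough guide to what the remaining models look like when called directly from the forecast package, here is a hedged sketch. These are plain forecast-package calls under assumed settings (y is a placeholder weekly ts object as in the earlier snippet), not the AutoTS() implementation; DSHW is omitted because forecast::dshw() requires two seasonal periods.

library(forecast)

ets_fit   <- ets(y, model = "ZZZ")          # auto-selects error/trend/season types, damped and non-damped
nn_fit    <- nnetar(y, p = 25, P = 1)       # feed-forward NN with 25 lags and 1 seasonal lag
tbats_fit <- tbats(y, use.parallel = TRUE)  # Box-Cox, ARMA errors, trend and seasonal components
tslm_fit  <- tslm(y ~ trend + season)       # simple time trend plus seasonal dummies

lapply(list(ets_fit, nn_fit, tbats_fit, tslm_fit), forecast, h = 30)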

Machine Learning Model Details:

The CARMA suite utilizes several features to ensure proper models are built to generate the best possible out-of-sample forecasts.

A) Feature engineering: I use a time trend, calendar variables, holiday counts, and 25 lags and moving averages, along with 51-, 52-, and 53-week lags and moving averages (all specified as arguments in the CARMA function suite); a hedged sketch of the grouped lag and moving-average idea follows the list below. Internally, the CARMA functions utilize several RemixAutoML functions, all written using data.table for fast and memory-efficient processing:

  • DT_GDL_Feature_Engineering() — creates lags and moving average features by grouping variables (also creates lags and moving averages off of time between records)
  • Scoring_GDL_Feature_Engineering() — creates lags and moving average features for a single record by grouping variables (along with the time between features)
  • CreateCalendarVariables() — creates numeric features identifying various time units based on date columns
  • CreateHolidayFeatures() — creates count features based on the specified holiday groups you want to track and the date columns you supply
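
The following is a minimal data.table sketch of the grouped lag and moving-average idea, not the DT_GDL_Feature_Engineering() implementation itself; it assumes the Walmart data.table with Store, Dept, Date, and Weekly_Sales columns.

library(data.table)

setorder(data, Store, Dept, Date)
for (k in c(1:25, 51, 52, 53)) {
  # lag-k of the target within each store/department
  data[, (paste0("Lag_", k)) := shift(Weekly_Sales, n = k, type = "lag"), by = .(Store, Dept)]
  # k-period moving average of the lag-1 target (avoids using the current value)
  data[, (paste0("MA_", k)) := frollmean(shift(Weekly_Sales, 1L), n = k), by = .(Store, Dept)]
}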

B) Optimal transformations: the target variable, along with the associated lag and moving-average features, is transformed. This is really useful for regression models with categorical features whose associated target values differ significantly from each other. The transformation options that are tested (using a Pearson test for normality) are listed below, followed by a small sketch of the selection idea:

  • YeoJohnson, Box-Cox, Arcsinh, Identity,
  • arcsin(sqrt(x)), logit(x) — for proportion data, not used in experiment
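
Here is a small sketch of the "pick whichever transformation looks most normal" idea. It uses shapiro.test() as a stand-in for the Pearson normality test the package actually applies, covers only a subset of the candidates, and walmart_sales_vector is a placeholder.

library(forecast)  # for BoxCox() / BoxCox.lambda()

candidates <- list(
  Identity = function(x) x,
  Arcsinh  = function(x) asinh(x),
  BoxCox   = function(x) BoxCox(x, BoxCox.lambda(x))
)
y <- walmart_sales_vector  # placeholder: strictly positive weekly sales
scores <- sapply(candidates, function(f) shapiro.test(f(y))$statistic)
names(which.max(scores))   # transformation whose result looks most Gaussian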

The functions used to create the transformations throughout the process and then back-transform them after the forecasts have been generated come from RemixAutoML :

  • AutoTransformationCreate()
  • AutoTransformationScore()

C) Models: there are four CARMA functions and each use a different algorithm for the model fitting. The models used to fit the time series data come from RemixAutoML and include:

  • AutoCatBoostRegression()
  • AutoXGBoostRegression()
  • AutoH2oDRFRegression()
  • AutoH2oGBMRegression()

You can view all of the 21 process steps in those functions on my GitHub page README under the section titled, "Supervised Learning Models" in the "Regression" sub-section (you can also view the source code directly of course).

D) GPU: With the CatBoost and XGBoost functions, you can build the models utilizing a GPU (I ran them with a GeForce 1080ti), which results in an average 10x speedup in model training time (compared to running on a CPU with 8 threads). I should also note that the lag and moving average features by store and department are pretty intensive to compute and are built with data.table exclusively, which means that if you have a CPU with a lot of threads, those calculations will be faster, as data.table is parallelized.

E) One model for all series: I built the forecasts for all the store and department combinations with a single model by simply specifying c("Store","Dept") in the GroupVariables argument, which provides superior results compared to building a separate model for each series. The group variables are used as categorical features and do not require one-hot encoding beforehand, as CatBoost and H2O handle them internally. The AutoXGBoostCARMA() version utilizes the DummifyDT() function from RemixAutoML to handle the categorical features.

F) The max number of trees used for each model was (early stopping is used internally):

  • AutoCatBoostCARMA() = 20,000
  • AutoXGBoostCARMA() = 5,000
  • AutoH2oDRFCARMA() = 2,000
  • AutoH2oGBMCARMA() = 2,000

G) Grid tuning: I ran a 6 model random hyper-parameter grid tune for each algorithm. Essentially, a baseline model is built and then 5 other models are built and compared with the lowest MAE model being selected. This is all done internally in the CARMA function suite.

H) Data partitioning: for creating the training, validation, and test data, the CARMA functions utilize the RemixAutoML::AutoDataPartition() function with the "timeseries" option for the PartitionType argument, which ensures that the training data covers the data points furthest back in time, followed by the validation data, and then the test data, which covers the most recent data points in time. For the experiment, I used 10/143 as the percent holdout for validation data. The test data varied by which n-step-ahead holdout was being tested, and the remaining data went to the training set.
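
A minimal sketch of a time-ordered split for a single series is shown below; it mirrors the "timeseries" PartitionType idea (oldest rows for training, newest for testing) but is an illustration, not AutoDataPartition() itself, and the holdout sizes are assumptions for a 30-step test.

setorder(data, Date)       # one store/department's 143 weekly records (data.table)
n      <- nrow(data)
n_test <- 30               # the n-step-ahead holdout being evaluated
n_val  <- 10               # validation weeks
train <- data[1:(n - n_val - n_test)]
valid <- data[(n - n_val - n_test + 1):(n - n_test)]
test  <- data[(n - n_test + 1):n]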

I) Forecasting: Once the regression model is built, the forecast process replicates an ARIMA process. First, a single step-ahead forecast is made. Next, the lag and moving average features are updated, making use of the predicted values from the previous step. Next, the other features are updated (trend, calendar, holiday). Then the next forecast step is made; rinse and repeat for the remaining forecasting steps. A simplified sketch of this loop follows the list below. The process utilizes the RemixAutoML functions:

  • AutoCatBoostScoring()
  • AutoXGBoostScoring()
  • AutoH2oMLScoring()
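
The loop below is a simplified sketch of that recursive process, not the CARMA scoring code; fit, build_features(), and history are hypothetical stand-ins for a trained regression model, a feature-building helper, and a data.table of actuals with Date and Weekly_Sales columns.

library(data.table)

horizon <- 30
for (step in seq_len(horizon)) {
  next_date <- max(history$Date) + 7               # weekly data, so step forward 7 days
  score_row <- build_features(history, next_date)  # hypothetical helper: lags, MAs, trend, calendar, holidays
  yhat      <- predict(fit, newdata = score_row)   # one-step-ahead prediction
  # append the prediction so the next iteration's lag/MA features can use it
  history <- rbind(history, data.table(Date = next_date, Weekly_Sales = yhat))
}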

Contact

If anyone is interested in testing out other models, utilizing different data sets, or just need to set up automated forecasts for their company, contact me on LinkedIn.

If you'd like to learn how to utilize the RemixAutoML package check out the free course on Remyx Courses.

P.S.

I have plans to continue enhancing and adding capabilities to the automated time series functions discussed above. For example, I plan to:

  • Test Fourier features for both AutoTS() and the CARMA suite
  • Test other ML algorithms
  • Add ensemble methods for combining forecasts
  • Add Croston Econometric model for intermittent demand forecasting
  • Create machine learning based intermittent demand forecasting models (in a similar fashion to the Croston method) utilizing RemixAutoML Generalized Hurdle Models

R Script

Code to reproduce: https://gist.github.com/AdrianAntico/8e1bbf63f26835756348d7c67a930227

library(RemixAutoML)
library(data.table)

###########################################
# Prepare data for AutoTS()----
###########################################

# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)

# Subset for Stores / Departments with Full Series Available: (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

# Group Concatenation----
data[, GroupVar := do.call(paste, c(.SD, sep = " ")), .SDcols = c("Store","Dept")]
data[, c("Store","Dept") := NULL]

# Grab Unique List of GroupVar----
StoreDept <- unique(data[["GroupVar"]])

###########################################
# AutoTS() Builds----
###########################################

for(z in c(1,5,10,20,30)) {
  TimerList <- list()
  OutputList <- list()
  l <- 0
  for(i in StoreDept) {
    l <- l + 1
    temp <- data[GroupVar == eval(i)]
    temp[, GroupVar := NULL]
    TimerList[[i]] <- system.time(
      OutputList[[i]] <- tryCatch({
        RemixAutoML::AutoTS(
          temp,
          TargetName = "Weekly_Sales", DateName = "Date",
          FCPeriods = 1, HoldOutPeriods = z,
          EvaluationMetric = "MAPE", TimeUnit = "week",
          Lags = 25, SLags = 1, NumCores = 4,
          SkipModels = NULL, StepWise = TRUE,
          TSClean = TRUE, ModelFreq = TRUE, PrintUpdates = FALSE)},
        error = function(x) "Error in AutoTS run"))
    print(l)
  }

  # Save Results When Done and Pull Them in After AutoCatBoostCARMA() Run----
  save(TimerList, file = paste0(getwd(),"/TimerList_FC_",z,"_.R"))
  save(OutputList, file = paste0(getwd(),"/OutputList_FC_",z,".R"))
  rm(OutputList, TimerList)
}

###########################################
# Prepare data for AutoCatBoostCARMA()----
###########################################

# Load Walmart Data----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)

# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

# Build AutoCatBoostCARMA Models----
for(z in c(1,5,10,20,30)) {
  CatBoostResults <- RemixAutoML::AutoCatBoostCARMA(
    data,
    TargetColumnName = "Weekly_Sales", DateColumnName = "Date",
    GroupVariables = c("Store","Dept"),
    FC_Periods = 10, TimeUnit = "week",
    TargetTransformation = TRUE,
    Lags = c(1:25,51,52,53), MA_Periods = c(1:25,51,52,53),
    CalendarVariables = TRUE, TimeTrendVariable = TRUE, HolidayVariable = TRUE,
    DataTruncate = FALSE,
    SplitRatios = c(1 - 60/143, 30/143, 30/143),
    TaskType = "GPU", EvalMetric = "RMSE",
    GridTune = FALSE, GridEvalMetric = "r2", ModelCount = 2,
    NTrees = 1500, PartitionType = "timeseries", Timer = TRUE)

  # Output----
  CatBoostResults$TimeSeriesPlot
  CatBoost_Results <- CatBoostResults$ModelInformation$EvaluationMetricsByGroup
  data.table::fwrite(CatBoost_Results, paste0(getwd(),"/CatBoost_Results_",30,".csv"))
  rm(CatBoost_Results,CatBoostResults)
}

###########################################
# Prepare data for AutoXGBoostCARMA()----
###########################################

# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)

# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

for(z in c(1,5,10,20,30)) {
  XGBoostResults <- RemixAutoML::AutoXGBoostCARMA(
    data,
    TargetColumnName = "Weekly_Sales", DateColumnName = "Date",
    GroupVariables = c("Store","Dept"),
    FC_Periods = 2, TimeUnit = "week",
    TargetTransformation = TRUE,
    Lags = c(1:25, 51, 52, 53), MA_Periods = c(1:25, 51, 52, 53),
    CalendarVariables = TRUE, HolidayVariable = TRUE, TimeTrendVariable = TRUE,
    DataTruncate = FALSE,
    SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
    TreeMethod = "hist", EvalMetric = "MAE",
    GridTune = FALSE, GridEvalMetric = "mae", ModelCount = 1,
    NTrees = 5000, PartitionType = "timeseries", Timer = TRUE)

  XGBoostResults$TimeSeriesPlot
  XGBoost_Results <- XGBoostResults$ModelInformation$EvaluationMetricsByGroup
  data.table::fwrite(XGBoost_Results, paste0(getwd(),"/XGBoost_Results",z,".csv"))
  rm(XGBoost_Results)
}

###########################################
# Prepare data for AutoH2oDRFCARMA()----
###########################################

# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)

# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

for(z in c(1,5,10,20,30)) {
  H2oDRFResults <- AutoH2oDRFCARMA(
    data,
    TargetColumnName = "Weekly_Sales", DateColumnName = "Date",
    GroupVariables = c("Store","Dept"),
    FC_Periods = 2, TimeUnit = "week",
    TargetTransformation = TRUE,
    Lags = c(1:5, 51,52,53), MA_Periods = c(1:5, 51,52,53),
    CalendarVariables = TRUE, HolidayVariable = TRUE, TimeTrendVariable = TRUE,
    DataTruncate = FALSE,
    SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
    EvalMetric = "MAE", GridTune = FALSE, ModelCount = 1,
    NTrees = 2000, PartitionType = "timeseries",
    MaxMem = "28G", NThreads = 8, Timer = TRUE)

  # Plot aggregate sales forecast (Stores and Departments rolled up into Total)----
  H2oDRFResults$TimeSeriesPlot
  H2oDRF_Results <- H2oDRFResults$ModelInformation$EvaluationMetricsByGroup
  data.table::fwrite(H2oDRF_Results, paste0(getwd(),"/H2oDRF_Results",z,".csv"))
  rm(H2oDRF_Results)
}

###########################################
# Prepare data for AutoH2OGBMCARMA()----
###########################################

# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)

# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

for(z in c(1,5,10,20,30)) {
  H2oGBMResults <- AutoH2oGBMCARMA(
    data,
    TargetColumnName = "Weekly_Sales", DateColumnName = "Date",
    GroupVariables = c("Store","Dept"),
    FC_Periods = 2, TimeUnit = "week",
    TargetTransformation = TRUE,
    Lags = c(1:5, 51,52,53), MA_Periods = c(1:5, 51,52,53),
    CalendarVariables = TRUE, HolidayVariable = TRUE, TimeTrendVariable = TRUE,
    DataTruncate = FALSE,
    SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
    EvalMetric = "MAE", GridTune = FALSE, ModelCount = 1,
    NTrees = 2000, PartitionType = "timeseries",
    MaxMem = "28G", NThreads = 8, Timer = TRUE)

  # Plot aggregate sales forecast (Stores and Departments rolled up into Total)----
  H2oGBMResults$TimeSeriesPlot
  H2oGBM_Results <- H2oGBMResults$ModelInformation$EvaluationMetricsByGroup
  data.table::fwrite(H2oGBM_Results, paste0(getwd(),"/H2oGBM_Results",z,".csv"))
  rm(H2oGBM_Results)
}

##################################################
# AutoTS() and AutoCatBoostCARMA() Comparison----
##################################################

# Gather results----
for(i in c(1,5,10,20,30)) {
  load(paste0("C:/Users/aantico/Desktop/Work/Remix/RemixAutoML/TimerList_",i,"_.R"))
  load(paste0("C:/Users/aantico/Desktop/Work/Remix/RemixAutoML/OutputList_",i,"_.R"))

  # Assemble TS Data
  TimeList <- names(TimerList)
  results <- list()
  for(j in 1:2660) {
    results[[j]] <- cbind(
      StoreDept = TimeList[j],
      tryCatch({OutputList[[j]]$EvaluationMetrics[, .(ModelName,MAE)][
        , ModelName := gsub("_.*","",ModelName)][
        , ID := 1:.N, by = "ModelName"][
        ID == 1][
        , ID := NULL]},
        error = function(x) return(
          data.table::data.table(ModelName = "NONE", MAE = NA))))
  }

  # AutoTS() Results----
  Results <- data.table::rbindlist(results)

  # Remove ModelName == NONE
  Results <- Results[ModelName != "NONE"]

  # Average out values: one per store and dept so straight avg works----
  Results <- Results[, .(MAE = mean(MAE, na.rm = TRUE)), by = c("StoreDept","ModelName")]

  # Group Concatenation----
  Results[, c("Store","Dept") := data.table::tstrsplit(StoreDept, " ")][, StoreDept := NULL]
  data.table::setcolorder(Results, c(3,4,1,2))

  ##################################
  # Machine Learning Results----
  ##################################

  # Load up CatBoost Results----
  CatBoost_Results <- data.table::fread(paste0(getwd(),"/CatBoost_Results_",i,".csv"))
  CatBoost_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
  data.table::setnames(CatBoost_Results, "MAE_Metric", "MAE")
  CatBoost_Results[, ModelName := "CatBoost"]
  data.table::setcolorder(CatBoost_Results, c(1,2,4,3))

  # Load up XGBoost Results----
  XGBoost_Results <- data.table::fread(paste0(getwd(),"/XGBoost_Results",i,".csv"))
  XGBoost_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
  data.table::setnames(XGBoost_Results, "MAE_Metric", "MAE")
  XGBoost_Results[, ModelName := "XGBoost"]
  data.table::setcolorder(XGBoost_Results, c(1,2,4,3))

  # Load up H2oDRF Results----
  H2oDRF_Results <- data.table::fread(paste0(getwd(),"/H2oDRF_Results",i,".csv"))
  H2oDRF_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
  data.table::setnames(H2oDRF_Results, "MAE_Metric", "MAE")
  H2oDRF_Results[, ModelName := "H2oDRF"]
  data.table::setcolorder(H2oDRF_Results, c(1,2,4,3))

  # Load up H2oGBM Results----
  H2oGBM_Results <- data.table::fread(paste0(getwd(),"/H2oGBM_Results",i,".csv"))
  H2oGBM_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
  data.table::setnames(H2oGBM_Results, "MAE_Metric", "MAE")
  H2oGBM_Results[, ModelName := "H2oGBM"]
  data.table::setcolorder(H2oGBM_Results, c(1,2,4,3))

  ##################################
  # Combine Data----
  ##################################

  # Stack Files----
  ModelDataEval <- data.table::rbindlist(
    list(Results, CatBoost_Results, XGBoost_Results, H2oGBM_Results, H2oDRF_Results))
  data.table::setorderv(ModelDataEval, cols = c("Store","Dept","MAE"))

  # Add rank----
  ModelDataEval[, Rank := 1:.N, by = c("Store","Dept")]

  # Get Frequencies----
  RankResults <- ModelDataEval[, .(Counts = .N), by = c("ModelName","Rank")]
  data.table::setorderv(RankResults, c("Rank", "Counts"), order = c(1,-1))

  # Final table----
  FinalResultsTable <- data.table::dcast(RankResults, formula = ModelName ~ Rank, value.var = "Counts")
  data.table::setorderv(FinalResultsTable, "1", -1, na.last = TRUE)

  # Rename Columns----
  for(k in 2:ncol(FinalResultsTable)) {
    data.table::setnames(FinalResultsTable,
                         old = names(FinalResultsTable)[k],
                         new = paste0("Rank_",names(FinalResultsTable)[k]))
  }

  # Print
  print(i)
  print(knitr::kable(FinalResultsTable))
}


No visible binding for global variable

Posted: 18 Aug 2019 07:13 PM PDT

[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers.]

Recently I have been working on a very large legacy project which utilises the excellent data.table package throughout. What this has resulted in is an R CMD check containing literally thousands of NOTEs similar to the following:

❯ checking R code for possible problems ... NOTE
  my_fn: no visible binding for global variable 'mpg'

There are several reasons why you might see these NOTEs and, for our code base, some of the NOTEs were potentially more damaging than others. This was a problem because these NOTEs were initially hidden by suppressing them via the _R_CHECK_CODETOOLS_PROFILE_ option in the .Renviron file. Once this suppression was removed, we discovered that the more damaging NOTEs were buried within the sheer number of NOTEs in the R CMD check.

Non-standard Evaluation

If we have a function where we are using data.table's modification-by-reference features, i.e. we are using a variable in an unquoted fashion (also known as non-standard evaluation, or NSE), then this issue will occur. Take the following function as an example.

my_fn <- function() {
  mtcars <- data.table::data.table(mtcars)
  mtcars[, mpg_div_hp := mpg / hp]
  mtcars[]
}

Here, we would find the following NOTEs:

❯ checking R code for possible problems ... NOTE
  my_fn: no visible binding for global variable 'mpg_div_hp'
  my_fn: no visible binding for global variable 'mpg'
  my_fn: no visible binding for global variable 'hp'
  Undefined global functions or variables:
    hp mpg mpg_div_hp

Sometimes you may also see these NOTEs for syntactic sugar such as !! or := if you haven't correctly imported the package they come from.

This is a well-discussed issue on the internet, which only became a problem after a change introduced to the core R code in version 2.15.1. There are two solutions to this problem.

Option One

Include all variable names within a globalVariables() call in the package documentation file.

globalVariables(c("mpg", "hp", "mpg_div_hp"))

For our package, as there are literally thousands of variables to list in this file, this makes the list very difficult to maintain and the file very long. If, however, the variables belong to data which are stored within your package, then this can be greatly simplified to

globalVariables(names(my_data))

You may wish to import any syntactic sugar functionality here as well. For example

globalVariables(c(":=", "!!"))

Option Two

The second option involves binding the variable locally to the function. At the top of your function you can define the variable as a NULL value.

my_fn <- function() {
  mpg <- hp <- mpg_div_hp <- NULL
  mtcars <- data.table::data.table(mtcars)
  mtcars[, mpg_div_hp := mpg / hp]
  mtcars[]
}

Your variable(s) are now bound to object(s), so the R CMD check has nothing to complain about. This is the method that the data.table team recommend and, to me, it feels like a much neater and, more importantly, more maintainable solution than the first option.

A Note on the Tidyverse

You may also come across this problem whilst programming using the tidyverse for which there is a very neat solution. You simply need to be more explicit within your function by using the .data pronoun.

#' @importFrom rlang .data
my_fn <- function() {
  mtcars %>%
    mutate(mpg_div_hp = .data$mpg / .data$hp)
}

Note the import!

Selecting Variables with the data.table .. Prefix

NOTEs can occur when we are using the .. syntax of data.table, for example

double_dot <- function() {
  mtcars <- data.table::data.table(mtcars)
  select_cols <- c("cyl", "wt")
  mtcars[, ..select_cols]
}

This will yield

❯ checking R code for possible problems ... NOTE
  Undefined global functions or variables:
    ..select_cols

In this instance, this can be solved by avoiding the .. syntax and using the alternative with = FALSE notation.

double_dot <- function() {
  mtcars <- data.table::data.table(mtcars)
  select_cols <- c("cyl", "wt")
  mtcars[, select_cols, with = FALSE]
}

Even though the .. prefix is syntactic sugar, we cannot use globalVariables(c("..")) since the actual variable in this case is ..select_cols; we would therefore need to use globalVariables(c("..select_cols")) if we wanted to use the globalVariables() approach.

Missing Imports

In our code base, I also found NOTEs for functions or datasets which were not correctly imported. For example, consider the following simple function.

Rversion <- function() {
  info <- sessionInfo()
  info$R.version
}

This gives the following NOTEs:

❯ checking R code for possible problems ... NOTE
  Rversion: no visible global function definition for 'sessionInfo'
  Consider adding
    importFrom("utils", "sessionInfo")
  to your NAMESPACE file.

Here the R CMD check is rather helpful and tells us the solution: we need to ensure that we explicitly import the function from the utils package in the documentation. This can easily be done with the roxygen2 package by including an @importFrom utils sessionInfo tag.
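
For example, a minimal roxygen2 block might look like the following; when you re-document the package, roxygen2 writes importFrom(utils, sessionInfo) into the NAMESPACE and the NOTE disappears.

#' Return details of the running R version
#'
#' @importFrom utils sessionInfo
Rversion <- function() {
  info <- sessionInfo()
  info$R.version
}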

Trying to Call Removed Functionality

If you have a function which has been removed from your package but attempt to call it from another function, R will only give you a NOTE about this.

use_non_existent_function <- function() {
  this_function_doesnt_exist()
}

This will give the NOTE

❯ checking R code for possible problems ... NOTE
  use_non_existent_function: no visible global function definition for
    'this_function_doesnt_exist'

Of course it goes without saying that you should make sure to remove any calls to functions which have been removed from your package. As a side note, when I first started working on the project, I was initially unaware that within our package we had the option _R_CHECK_CODETOOLS_PROFILE_="suppressUndefined=TRUE" set within our .Renviron file, which suppresses all unbound global variable NOTEs from appearing in the R CMD check. Given that this can mask deeper issues within your package, such as not recognising when a function calls functionality which has been removed from the package, the end user can end up facing nasty and confusing error messages. Therefore I would not recommend using this setting and would suggest tackling each of your package's NOTEs individually to remove them all.

I actually discovered all of our package NOTEs when introducing the lintr package to our CI pipeline. lintr will pick up on some – but not all – of these unbound global variable problems (lintr of course does not take the _R_CHECK_CODETOOLS_PROFILE_ into account). Take our original function as an example

my_fn <- function() {
  mtcars <- data.table::data.table(mtcars)
  mtcars[, mpg_div_hp := mpg / hp]
  mtcars[]
}

Here, lintr will highlight the variables mpg and hp as problems but it currently won't highlight the variables on the LHS of :=, i.e. mpg_div_hp.

Conclusion

When developing your package, if you are experiencing these unbound global variables NOTEs you should

  1. Strive to define any unbound variables locally within a function.
  2. Ensure that any functions or data from external packages (including utils, stats, etc.) have the correct @importFrom tag.
  3. Do not suppress this check in the .Renviron file; the solutions proposed here should remove any need to do so.
  4. Define any package-wide unbound variables, which are typically syntactic sugar (e.g. :=), within the package documentation file inside a globalVariables() call, which should then be a very short and maintainable list.


Mueller Report Volume 1: Network Analysis

Posted: 18 Aug 2019 05:46 PM PDT

[This article was first published on sweissblaug, and kindly contributed to R-bloggers.]

settle down and have another cup of coffee

TLDR

There are a lot of Russians talking to a lot of Trump campaign members in the Mueller report. There are so many that it's tough to get your head around it all. In this post I attempted some network analysis on the relations between campaign officials and Russians. I found that one can 'compress' Russian involvement into 9 (mostly) distinct groups. I then summarize these points of contact.

Introduction to Mueller Report

Volume 1 of the Mueller report starts with Russian interference in the 2016 US Presidential Election. Russia interfered in two ways.

The first was a campaign by the IRA that used social media tools like Facebook and Twitter with the goal of changing public opinion. While there were some retweets by Trump and his campaign officials from these accounts, there wasn't much direct communication.

The second was the use of Russian intelligence to hack Hillary Clinton's emails. These hacked emails were released with the help of WikiLeaks and Guccifer 2.0. Trump's campaign deliberately tried to find other hacked emails and encouraged Russia to do so publicly. However, the campaign could not find additional information on these emails.

The rest of Volume 1 discusses the numerous relationships between Trump campaign officials and Russians. It's this part that will be the basis for most of the results below.

The data

Volume 1 consists of 199 pages including footnotes and appendices. I found a machine-readable version here. I split the text into sentences and looked at whether a person's name was included in each sentence. This left me with a sentence-by-name matrix that is the starting point of my analysis. There are some drawbacks to this: OCR does not cleanly distinguish sentences, and it often groups footnotes with the last line of a page. But it seemed like a good starting point, so I went ahead.
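
Here is a rough sketch of that step, assuming report_text is a character vector holding the report's text and names_list is a lower-cased vector of the names to track (both placeholders); it conveys the idea rather than the exact parsing code used.

sentences <- unlist(strsplit(tolower(paste(report_text, collapse = " ")),
                             split = "(?<=[.!?])\\s+", perl = TRUE))

# sentence-by-name indicator matrix: 1 if the name appears in the sentence
doc_name <- sapply(names_list, function(nm) as.integer(grepl(nm, sentences, fixed = TRUE)))
dim(doc_name)  # rows = sentences, columns = names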

Below are the top 20 most commonly occurring names.

Papadopoulos, Manafort, Kushner, Cohen, Trump Jr., and Flynn are all near the top. Considering they all, to varying degrees, worked in the Trump campaign, this makes sense. We also see some Russian names, such as Dmitriev, Kilimnik, and Kislyak. I'll explain their contacts below.
I then created a person-by-person matrix that counts the number of times each pair of names co-occurs. I'm treating this as a weighted, undirected graph. I transformed it into a Laplacian matrix and performed an eigendecomposition; this is known as spectral analysis of a network. Basically, it tries to find an embedding that minimizes the squared error of the relations. Below is the resulting image of the 2nd-to-last and 3rd-to-last eigenvectors.
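
A short sketch of that spectral step follows, building on the sentence-by-name matrix (doc_name) from the earlier snippet; again, this is the idea rather than the exact code.

A <- t(doc_name) %*% doc_name            # name-by-name co-occurrence counts
diag(A) <- 0                             # drop self-co-occurrence
L <- diag(rowSums(A)) - A                # unnormalized graph Laplacian
eig <- eigen(L, symmetric = TRUE)        # eigenvalues returned in decreasing order

k <- ncol(A)
coords <- eig$vectors[, c(k - 1, k - 2)] # 2nd-to-last and 3rd-to-last eigenvectors
plot(coords, xlab = "2nd-to-last eigenvector", ylab = "3rd-to-last eigenvector", type = "n")
text(coords, labels = rownames(A), cex = 0.6)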

WHOA … I'm getting a headache looking at this.

But it definitely looks like there is structure in the graph. There appear to be some clusters forming, and these do correspond to particular events described in the report. In the lower left you can see Papadopoulos-related characters, in the upper right some Cohen acquaintances, and around (0, 0.1) there's the Trump Tower meeting. Not bad, but still messy. I'm looking for distinct clusters.

What if we look at only the Russians in the graph?

Ok! Now we're talking. There are 6 distinct clusters of Russians here. That means there are no relations between these clusters, and each corresponds to a unique set of relations with Trump campaign officials. I played around with this some more, but the text data was too messy for robust analysis: co-occurring names do not pick up everything, and sentence-parsing errors sometimes lead to erroneous relations.

Finally, I gave up on trying to only use text analysis, read volume one, and manually created a network found here. With that I created groupings using the above chart as a starting point. I found 9 fairly distinct clusters of Russians. Below you can see the relationships between those groups and various members of the Trump campaign.

I then further grouped them into 4 broad categories, which I've named Trump Business, The Opportunists, The Professionals, and Russian Officials and Lackeys. I also noted whether a Trump campaign official's interaction was of the first degree (they were in a meeting or talked directly with the Russian group in question) or the second degree (they were aware of a meeting). Below are my summaries for each.

Trump Business

  • Group 1
    • agalarov, aras, goldstone, samochornov, veselnitskaya, kaveladze, akhmetshin
  • Group 2
    • klokov, erchova
  • Group 3
    • rtskhiladze, rozov
  • Group 5
    • peskov
Aras Agalarov (he has a son, Emin; I did not disambiguate between them) is a billionaire Russian property developer who worked with Trump to put on the Miss Universe pageant in 2013. They discussed creating a Trump Tower in Moscow in late 2013, including with Donald Trump Jr. (DTJ) and Ivanka Trump, but it did not progress.

In the summer of 2015, Group 3 signed a letter of intent to build the Trump Tower in Moscow and met with Ivanka and DTJ.

While this was happening, Group 2 contacted Cohen to discuss a Trump Tower in Moscow and a meeting with Trump. Cohen thought this person was a pro-wrestler, but that did not seem to bother him and he agreed to talk business. They wanted to set up a meeting between Trump and Putin, but Cohen wanted to stay clear of politics and it went nowhere.

Finally, due to the slow progress on the Trump Tower Moscow deal through Group 2, Cohen reached out to Peskov, press secretary for Putin, to try to get in touch with Putin directly and begin building. Cohen worked on the Moscow deal through the summer of 2016, but it went nowhere.
During the campaign, Emin Agalarov, at the behest of his father, set up a meeting with DTJ to discuss hacked emails. This led to the infamous Trump Tower meeting that involved DTJ, Kushner, Manafort, and other Russians in Group 1. DTJ discussed this meeting with others in the campaign as well, including Gates. Kushner showed up late to the meeting, texted Manafort during it that it was a 'waste of time', texted others to call him so he could get out, and subsequently left early. The meeting did not provide any information to the Trump campaign.

The Opportunists

  • Group 4
    • mifsud, polonskaya, timofeev, millian
  • Group 5
    • klimentov, poliakova, peskov, dvorkovich
Papadopoulos and Page had similar experiences with the Trump campaign, and both seemed to be in it for the opportunity it presented them. Both padded their resumes to look more important than they were in order to get the job, and both held foreign policy advisory roles.

Papadopoulos got the job of foreign policy advisor in March 2016. Shortly after, he met Mifsud, a Maltese professor, in Rome at a meeting of the London Centre of International Law Practice. Upon learning that Papadopoulos was employed by the campaign, Mifsud took an interest and spoke of his Russian connections. Papadopoulos, thinking that having more Russian connections could help his stature in the Trump campaign, pursued the relationship. They met the following week in London, where Mifsud introduced him to Polonskaya. Papadopoulos relayed his new contacts to Clovis and received an approving response. The relationship continued, and during a meeting in late April Mifsud said Russia had 'dirt' on Clinton. Ten days later Papadopoulos told a foreign official about his contacts and his knowledge of dirt on Clinton. He then pitched a Trump meeting with Putin to Lewandowski, Miller, and Manafort. Manafort made clear that Trump should not meet with Putin directly.

Page also joined the campaign in March 2016 as a foreign policy advisor. He had previously lived and worked in Russia and had several Russian contacts. He was invited to speak at the New Economic School in Russia in July and asked the campaign for permission. Clovis responded that if he went, he could not speak for the Trump campaign. His talk was critical of US policy towards Russia and was received warmly by a Russian Deputy Prime Minister and others. Afterwards, he met Kislyak in July in Cleveland. These activities drew the attention of the media, and he was removed from the campaign in late September.

After the election, Page went to Russia in an unofficial role in late 2016, where he again met with Russians in Group 5.

The Professionals

  • Group 6
    • oknyansky, rasin
  • Group 7
    • kilimnik, deripaska, boyarkin, oganov
Paul Manafort and Roger Stone are political consultants who previously worked together. Roger Stone worked alongside the campaign to help but was never officially a part of it. Manafort joined in March 2016 and was chairman between June and August.

Caputo set up a meeting between Stone and Group 6, Oknyansky and Rasin, to get dirt on Clinton in May 2016. Rasin claimed to have information on money laundering activities by Clinton. Stone refused the offer because they asked for too much money.

Also, Stone had some contact with the Twitter account Guccifer 2.0 (not shown above). This was the front used by the GRU to release stolen documents. Curiously, his name was redacted on page 45 of the Mueller report because of 'Harm to ongoing matter'. It seems a little weird to redact something when it's public information.

From March 2016 until his departure, Manafort gave campaign updates to Kilimnik and ordered Gates to do the same. Kilimnik is thought to be a Russian spy and has connections with Deripaska, a Russian billionaire to whom Manafort owed money. Manafort gave polling data on the Trump campaign to Kilimnik and met with him twice in person, once in May and again in August. It's not clear why Manafort gave this data to Kilimnik, although Gates thought it was to ingratiate himself with Deripaska. Deripaska and his deputy Boyarkin were subsequently sanctioned by the US Treasury.

Russian Officials and Lackeys

  • Group 8
    • kislyak, gorkov
  • Group 9
    • aven, dmitriev
The final groups deal with Russian officials and Putin's billionaires.
Sessions and Kushner first met Kislyak, the Russian Ambassador to the US, in April at a Trump foreign policy conference. These were brief handshake affairs that lasted about a couple of minutes. Sessions does not recall seeing Kislyak.

Sessions, Gordon, and Page met with Kislyak at the Republican National Convention in July. He was one of approximately 80 foreign ambassadors to the US who were invited. Gordon and Sessions met with Kislyak for a few minutes after their speeches. Gordon, Page, and Kislyak later sat at the same table and discussed improving US-Russian relations for a few minutes.
Gordon received an email in August asking to meet with Kislyak but declined, citing a 'constant stream of false media stories', and offered to take a rain check on the meeting.

In August the Russian Embassy set up a meeting between Sessions and Kislyak, and the two met in September at Sessions's Senate office. The meeting lasted 30 minutes, and Kislyak tried to set up another one, but Sessions didn't follow up. Sessions got into trouble for not disclosing his meetings with Kislyak, which was part of the reason he recused himself from the investigation that became known as the Mueller report.

Following the election in November, Kislyak reached out to Kushner, but Kushner did not think Kislyak had a direct line to Putin and therefore considered him not important enough to talk to. Nevertheless, Kushner met with Kislyak in November at Trump Tower, invited Flynn, and spoke for about 30 minutes about repairing US-Russian relations. Kislyak suggested using a secure line to talk to Russian generals about the Syrian war. Kushner said he had no secure lines to use and asked if they could use Russian facilities, but Kislyak rejected that idea.

Kislyak tried to get another meeting with Kushner, but Kushner sent his assistant instead. Kislyak then proposed a meeting with Gorkov, the head of a Russian-owned bank. Kushner agreed and they met in December. Kushner said that meeting was about restoring US-Russian relations; Gorkov said it was about Kushner's personal business. They did not have any follow-up meetings.

In December, Flynn talked with Kislyak about two separate topics. The first was to convince Russia to veto an anti-Israel resolution on settlements in the UN, where it was thought the Obama administration would abstain. Russia did not vote against it. The second was to convince Russia not to retaliate against new sanctions for meddling in US elections. McFarland and Bannon were aware of Flynn's discussions about the sanctions. Russia did not apply retaliatory sanctions.
Finally, there were two billionaires whom Putin 'deputized' to create contacts with the Trump campaign after the election: Aven and Dmitriev. Aven recalled that Putin did not know whom to contact to get in touch with President-elect Trump. Aven did not make direct contact with the campaign, but Dmitriev did, through two avenues. One was to try to convince a friend of Kushner's to set up a meeting; Kushner circulated this opportunity internally, but it went nowhere. The other was a meeting with Erik Prince, a supporter of Trump but not officially part of the campaign, in the Seychelles. Prince discussed his meeting with Bannon, but Bannon has no recollection of it.

Some notable connections

In general these Russian Groupings were distinct in the people they talked to and had little obvious contact with one another. Some notable exceptions are:
  • Peskov talked to Cohen and Page independently
  • Dmitriev and Peskov might have talked to each other (p. 149), but there were some 'investigative technique' redactions so I'm not sure
  • Kilimnik was aware of Page's December visit to Russia and discussed it with Manafort, saying "Carter Page is in Moscow today, sending messages he is authorized to talk to Russia on behalf of DT on a range of issues of mutual interest, including Ukraine" (p. 166). This leads me to ask: who would know the whereabouts and discussions of other people? Spies. That's who.

Conclusions on Volume 1

Overall, I get the impression that the Trump campaign did not have the 'best people'. Cohen tried to make a deal but couldn't find the right people to talk to. Papadopoulos and DTJ tried to get dirt on Clinton but couldn't find anything. Page seemed to use the campaign as a platform to create more connections with Russians. A few 'friends' (Stone and Prince) lent a hand but probably hurt Trump's credibility by dealing with Russians more than they helped him. Manafort, a seasoned campaigner, wasn't obviously working for Trump… he worked for free, after all. It seemed like a group that was willing to do shady things for its own personal gain, but without the ability to follow through.
SAD!

All Together Graph

Conclusions on Analysis

Running text analysis before reading report was very helpful to understanding it. There are just so many connections going on it's hard to keep track. Running some basic clustering techniques as described above helped me zone into what to look for while reading the report.


Dash with golem: The beginning

Posted: 18 Aug 2019 03:16 PM PDT

[This article was first published on Rtask, and kindly contributed to R-bloggers.]

{golem} has been developed to help build big Shiny applications to put in production. What if {golem} could be used to build another popular kind of interactive web application recently made available to R programmers: Dash?

Dash, a newcomer in interactive web applications

A few days ago, Plotly announced that Dash is now available for R. After reading this announcement, I thought this…

The article Dash with golem: The beginning appeared first on Rtask.


noaastorms R package now supports NOAA IBTrACS v4

Posted: 17 Aug 2019 05:00 PM PDT

[This article was first published on Blog - BS, and kindly contributed to R-bloggers.]

Earlier this year, I released a simple R package (available at basilesimon/noaastorms) that downloads, cleans and parses NOAA IBTrACS data for you.

As NOAA has updated its datasets, noaastorms now uses these!

How to install

library(devtools)
install_github("basilesimon/noaastorms")

Available functions

getStorms: Fetch NOAA historical best track storms data

> df <- getStorms(c('EP'))
> head(df[1:5])
     Serial_Num Season Num Basin Sub_basin    Name
2 1902276N14266   1902  01    EP        MM UNNAMED
3 1902276N14266   1902  01    EP        MM UNNAMED
4 1902276N14266   1902  01    EP        MM UNNAMED
5 1902276N14266   1902  01    EP        MM UNNAMED
6 1902276N14266   1902  01    EP        MM UNNAMED

The first argument is a vector of basin codes from this list:

  • NA: North Atlantic
  • SA: South Atlantic
  • NI: North Indian
  • SI: South Indian
  • EP: East Pacific
  • SP: South Pacific
  • WP: West Pacific

NOAA basins map

To get storms that took place in the Atlantic for example, run getStorms(c('NA', 'SA')).

The second (optional) argument is a date range to filter data with. For example:

dateRange <- c(as.Date('2010-01-01'), as.Date('2012-12-31'))
getStorms(c('NA', 'SA'), dateRange = dateRange)

This will query storms that took place in the Atlantic from 2010 through 2012.

Usage

# load a map of the world and
# use `clipPolys` to avoid issues
# when zooming in with `coord_map`
wm <- map_data("world")
library("PBSmapping")
data.table::setnames(wm, c("X","Y","PID","POS","region","subregion"))
worldmap <- clipPolys(wm,
  xlim = c(20,110), ylim = c(0, 45),
  keepExtra = TRUE)

# load storms for the Atlantic ocean
spStorms <- getStorms(c('NA', 'SA'))

ggplot(spStorms,
  aes(x = Longitude, y = Latitude,
      group = Serial_Num)) +
  geom_polygon(data = worldmap,
    aes(x = X, y = Y, group = PID),
    fill = "whitesmoke",
    colour = "gray10",
    size = 0.2) +
  geom_path(alpha = 0.1, size = 0.8,
    color = "red") +
  coord_map(xlim = c(20,110),
    ylim = c(0, 45))

Screenshot of storms

Official changelog (retrieved Aug 16, 2019)

https://www.ncdc.noaa.gov/ibtracs/index.php?name=status

This is the first release of IBTrACS version 04. It is updated weekly.
Release date: March 2019

New features (improvements from v03):
* Best track data updated daily and contain provisional tracks of recent storms.
* Reduced formats – Version 4 is available in 3 formats (netCDF, CSV, shapefiles)
* Consistent formats – The data presented in each format is completely interconsistent (identical).
* More parameters – More parameters provided by the agencies are provided in IBTrACS
* Basin assignment – Any system occuring in a basin is included in that basin file (in version 3, the storm was only included in the basin in which it had its genesis)
* New derived parameters – We provide storm translation speed and direction and other variables requested by users.

