[R-bloggers] How to Backtest your Crypto Trading Strategies in R (and 5 more aRticles)


How to Backtest your Crypto Trading Strategies in R

Posted: 04 Sep 2020 03:29 AM PDT

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A Few Words About Trading Strategies

One of the biggest challenges is to predict the market. Many people have developed their own trading strategies: some are advanced, based on Machine Learning and Artificial Intelligence algorithms like LSTMs, XGBoost, and Random Forests; some are based on statistical models like ARIMA; and others are based on technical analysis.

Whatever trading strategy we are about to apply, we need to backtest it, meaning simulate it and then evaluate it. Today, we will provide an example of how you can easily backtest your own trading strategy in R.


Define the Trading Strategy

For illustrative purposes, we define an arbitrary trading strategy which does not make much sense in practice, but serves well as an example. Let's define the rules:

When the close price of the cryptocurrency has moved in the same direction for X consecutive days (e.g. 7 consecutive days "up" or 7 consecutive days "down"), we open a position as follows:

  • When the close price is down X consecutive days, the next day we buy (go long) the cryptocurrency at the open price
  • When the close price is up X consecutive days, the next day we sell (go short) the cryptocurrency at the open price. We assume that short selling is allowed via CFDs

Once we have opened a position, we use the following alerts:

  • TP: Take profit when the P/L is above TP%
  • SL: Stop loss when the P/L is below -SL%

Every position is closed at the daily open price. Of course, we could relax this assumption by considering hourly or even per-minute data. Every trade is 1 unit, but we could change this to a multiple or a fraction. Notice that the 1-unit assumption does not affect our outcome, since we report the ROI, which is the P/L as a percentage.


R Code to Backtest the Trading Strategy

You can have a look at how to get cryptocurrency prices in R and how to count consecutive events in R. Below we build a function which takes the following parameters:

  • symbol: The cryptocurrency symbol. For example, BTC is Bitcoin.
  • consecutive: The number of consecutive same-direction daily closes required to trigger a signal.
  • SL: The stop-loss percentage.
  • TP: The take-profit percentage.
  • start_date: The date from which we want to start the backtest.

Notice that open positions that have not met the SL or TP alert criteria are still "Active"; we return them with an "Active" status, and as "Profit" we report their current profit.


library(tidyverse)
library(crypto)

back_testing <- function(symbol = "BTC", consecutive = 7, SL = 0.1, TP = 0.1, start_date = "20180101") {

  df <- crypto_history(coin = symbol, start_date = start_date)

  df <- df %>%
    mutate(Sign = ifelse(close > lag(close), "up", "down")) %>%
    mutate(Streak = sequence(rle(Sign)$lengths))

  df <- df %>%
    select(symbol, date, open, high, low, close, Sign, Streak) %>%
    na.omit() %>%
    mutate(Signal = case_when(lag(Sign) == "up" & lag(Streak) %% consecutive == 0 ~ 'short',
                              lag(Sign) == "down" & lag(Streak) %% consecutive == 0 ~ 'long',
                              TRUE ~ ""),
           Dummy = TRUE)

  Trades <- df %>%
    filter(Signal != "") %>%
    select(Open_Position_Date = date, Open_Position_Price = open, Dummy, Signal)

  Portfolios <- Trades %>%
    inner_join(df %>% select(-Signal), by = "Dummy") %>%
    filter(date > Open_Position_Date) %>%
    select(-Dummy) %>%
    mutate(Pct_Change = open / Open_Position_Price - 1) %>%
    mutate(Alert = case_when(Signal == 'long' & Pct_Change > TP ~ 'TP',
                             Signal == 'long' & Pct_Change < -SL ~ 'SL',
                             Signal == 'short' & Pct_Change > TP ~ 'SL',
                             Signal == 'short' & Pct_Change < -SL ~ 'TP')) %>%
    group_by(Open_Position_Date) %>%
    mutate(Status = ifelse(sum(!is.na(Alert)) > 0, 'Closed', 'Active'))

  Active <- Portfolios %>%
    filter(Status == 'Active') %>%
    group_by(Open_Position_Date) %>%
    arrange(date) %>%
    slice(n()) %>%
    mutate(Profit = case_when(Signal == 'short' ~ Open_Position_Price - open,
                              Signal == 'long' ~ open - Open_Position_Price)) %>%
    select(symbol, Status, Signal, date, Open_Position_Date, Open_Position_Price, open, Profit)

  Closed <- Portfolios %>%
    filter(Status == 'Closed') %>%
    na.omit() %>%
    group_by(Open_Position_Date) %>%
    arrange(date) %>%
    slice(1) %>%
    mutate(Profit = case_when(Signal == 'short' ~ Open_Position_Price - open,
                              Signal == 'long' ~ open - Open_Position_Price)) %>%
    select(symbol, Status, Signal, date, Open_Position_Date, Open_Position_Price, open, Profit)

  final <- bind_rows(Closed, Active) %>%
    ungroup() %>%
    arrange(date) %>%
    mutate(ROI = Profit / Open_Position_Price,
           Running_ROI = cumsum(Profit) / cumsum(Open_Position_Price))

  return(final)
}

Results of the Backtest

Let's assume that we want to backtest the trading strategy that we described earlier with the following parameters:

symbol="BTC", consecutive=5, SL=0.1, TP=0.5, start_date = "20180101"
  ttt<-back_testing(symbol="BTC", consecutive=5, SL=0.1, TP=0.15, start_date = "20180101")    
symbol | Status | Signal | Closing_Date | Open_Position_Date | Open_Position_Price | Closing_Price | Profit | ROI | Running_ROI
------ | ------ | ------ | ------------ | ------------------ | ------------------- | ------------- | ------ | ------- | -----------
BTC | Closed | short | 1/9/2018 | 1/7/2018 | 17527 | 15124 | 2404 | 13.70% | 13.70%
BTC | Closed | short | 3/8/2018 | 3/6/2018 | 11500 | 9951 | 1549 | 13.50% | 13.60%
BTC | Closed | long | 3/18/2018 | 3/11/2018 | 8853 | 7891 | -962 | -10.90% | 7.89%
BTC | Closed | short | 4/25/2018 | 4/15/2018 | 7999 | 9701 | -1702 | -21.30% | 2.81%
BTC | Closed | short | 8/9/2018 | 7/18/2018 | 7315 | 6306 | 1010 | 13.80% | 4.32%
BTC | Closed | long | 8/9/2018 | 8/4/2018 | 7439 | 6306 | -1133 | -15.20% | 1.92%
BTC | Closed | long | 11/20/2018 | 11/17/2018 | 5579 | 4864 | -715 | -12.80% | 0.68%
BTC | Closed | short | 12/28/2018 | 12/21/2018 | 4134 | 3653 | 481 | 11.60% | 1.32%
BTC | Closed | short | 4/3/2019 | 2/20/2019 | 3947 | 4880 | -933 | -23.60% | 0.00%
BTC | Closed | short | 5/10/2019 | 4/21/2019 | 5336 | 6176 | -840 | -15.70% | -1.06%
BTC | Closed | short | 5/12/2019 | 5/5/2019 | 5831 | 7204 | -1372 | -23.50% | -2.59%
BTC | Closed | short | 5/27/2019 | 5/12/2019 | 7204 | 8674 | -1471 | -20.40% | -3.98%
BTC | Closed | short | 6/5/2019 | 5/28/2019 | 8803 | 7704 | 1098 | 12.50% | -2.55%
BTC | Closed | short | 6/23/2019 | 6/17/2019 | 8989 | 10697 | -1708 | -19.00% | -3.89%
BTC | Closed | short | 6/27/2019 | 6/24/2019 | 10854 | 13017 | -2163 | -19.90% | -5.32%
BTC | Closed | short | 8/30/2019 | 8/4/2019 | 10822 | 9515 | 1307 | 12.10% | -3.90%
BTC | Closed | short | 9/25/2019 | 9/4/2019 | 10621 | 8603 | 2018 | 19.00% | -2.20%
BTC | Closed | long | 9/25/2019 | 9/18/2019 | 10248 | 8603 | -1644 | -16.00% | -3.12%
BTC | Closed | long | 1/15/2020 | 11/23/2019 | 7296 | 8825 | 1529 | 21.00% | -2.03%
BTC | Closed | long | 1/15/2020 | 12/5/2019 | 7253 | 8825 | 1572 | 21.70% | -1.00%
BTC | Closed | short | 1/31/2020 | 1/8/2020 | 8162 | 9508 | -1346 | -16.50% | -1.72%
BTC | Closed | short | 2/27/2020 | 2/10/2020 | 10116 | 8825 | 1290 | 12.80% | -0.93%
BTC | Closed | long | 3/13/2020 | 2/29/2020 | 8671 | 5018 | -3653 | -42.10% | -2.77%
BTC | Closed | short | 5/2/2020 | 4/27/2020 | 7679 | 8869 | -1190 | -15.50% | -3.25%
BTC | Closed | long | 7/28/2020 | 6/28/2020 | 9048 | 11017 | 1969 | 21.80% | -2.18%

Discussion

As we can see, the trading strategy ended with a -2.18% P/L, without taking transaction fees into account. You can also see that in some periods it was profitable (2018) and in others it was not (2019), which is why it is very important to backtest a trading strategy over many different periods.

Notice that with some small tweaks this function can backtest hundreds of trading strategies. Instead of this simple trading strategy, the signals could come from an advanced machine learning model. Once we define the trading signals, the backtest is the same.
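As a rough sketch of what that scaling-up might look like (assuming the back_testing() function above is defined and the crypto package can fetch the data), we can sweep a small grid of hypothetical parameter combinations and compare the final Running_ROI of each:

# Sketch only: sweep a hypothetical grid of strategy parameters and
# compare the final Running_ROI of each combination
grid <- expand.grid(consecutive = c(5, 7),
                    SL = c(0.10, 0.15),
                    TP = c(0.15, 0.30))

sweep_results <- purrr::pmap_dfr(grid, function(consecutive, SL, TP) {
  out <- back_testing(symbol = "BTC", consecutive = consecutive,
                      SL = SL, TP = TP, start_date = "20180101")
  data.frame(consecutive = consecutive, SL = SL, TP = TP,
             Final_Running_ROI = tail(out$Running_ROI, 1))
})

# Rank the parameter combinations from best to worst
sweep_results[order(-sweep_results$Final_Running_ROI), ]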

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post How to Backtest your Crypto Trading Strategies in R first appeared on R-bloggers.


Recession Forecasting with a Neural Net in R

Posted: 04 Sep 2020 12:30 AM PDT

[This article was first published on R – Franklin J. Parker, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I spend quite a bit of time at work trying to understand where we are in the business cycle. That analysis informs our capital market expectations, and, by extension, our asset allocation and portfolio risk controls.

For years now I have used a trusty old linear regression model, held together with baling wire, duct tape, and love. It's a good model. It predicted the 2020 and 2008 recessions (yes, out of sample).

Even so, a couple of years ago I began experimenting with a neural net to (a) increase the amount of data I could shove at it and (b) possibly get a better model. Of course, I wouldn't retire my old model just yet; I'd just run the two alongside each other.

So, here is the framework for forecasting a recession using a neural net in R.

What are We Trying to Predict, Anyway?

The first question we need to answer is what we want our output to be. Typically, markets see recessions coming 3 to 6 months before they officially hit. If I'm going to add value to a portfolio, I need to be rebalancing out of risk assets before everyone else. So, for my purposes, I'd like to have 9 to 12 months of lead time to get ahead of a recession.

Ideally, then, our model would output the USREC recession indicator (1 during a recession, 0 otherwise), except that, instead of producing 1s concurrently with a recession, it would produce them 12 months in advance. This, then, will be our output: USREC data with a 12-month lead.

The next challenge is finding the data points which reliably predict recessionary environments. Now let's pause here for a second and acknowledge some danger…

We could data mine this thing and swap indicators in and out until we have an over-fitted model that doesn't work going forward. Personally, I'm not a fan of that. It may be a bird-dog style analysis, pointing the way to potential information, but to me that is too dangerous. I could construct any story to "justify" random inputs to a model.

I would rather begin with indicators that (1) have been at least acknowledged and studied by others, and (2) I can build a convincing a priori theory for how they would influence or indicate the business cycle. Then I'll plug them in and see if the model is effective.

I'd rather have an imperfect hammer than a brittle one that breaks on first impact.

Inputs

For the sake of this demonstration, I've settled on three inputs. Each of these has academic research behind it and has worked ex post. What's more, I can build strong justifications for why each of these should both indicate and influence the economy as a whole. Of course there are others, but this is just a demonstration after all!

  • Headline Unemployment
  • Effective Federal Funds Rate
  • The Yield Curve

Now, we won't be using the raw figure for each of these. We are going to convert them into "signal" form. The idea is to boil down the essence of the signal so the neural net has to do as little work as possible. That is, in my opinion, the most effective way to increase your accuracy without increasing model fragility.

The Code – Get and Wrangle Data

We are going to need five libraries for this one:

library(quantmod)
library(tidyverse)
library(nnet)
library(DALEX)
library(breakDown)

Note that DALEX and breakDown are more about visualization and understanding of the model later.

Next, let's load our raw data. By the way, you can get these symbols from a search on the St. Louis Fed website. The symbol is immediately after the title. If you haven't been to the FRED database… well, then we can't be friends anymore and I can't invite you to my birthday party.

# Get data
getSymbols( c('USREC', 'UNRATE', 'T10Y3M', 'DFF'), src = 'FRED')

# Note the differing periods of data
USREC %>% head() # Monthly data

           USREC
1854-12-01     1
1855-01-01     0
1855-02-01     0
1855-03-01     0
1855-04-01     0
1855-05-01     0

UNRATE %>% head() # Monthly data

           UNRATE
1948-01-01    3.4
1948-02-01    3.8
1948-03-01    4.0
1948-04-01    3.9
1948-05-01    3.5
1948-06-01    3.6

DFF %>% head() # Daily data

            DFF
1954-07-01 1.13
1954-07-02 1.25
1954-07-03 1.25
1954-07-04 1.25
1954-07-05 0.88
1954-07-06 0.25

T10Y3M %>% head() # Daily data (sort of)

           T10Y3M
1982-01-04   2.32
1982-01-05   2.24
1982-01-06   2.43
1982-01-07   2.46
1982-01-08   2.50
1982-01-11   2.32

Our data is not aligned, so we have a touch of work to do to wrangle it, then signal-fy it. Though the other data starts much earlier, the yield curve data doesn't start until 1982, so we'll cut everything before then.

# Wrangle Data, Convert to Signals
# Get all data into monthly format, starting in 1982
recessions <- USREC
unemployment <- UNRATE
fedfunds <- to.monthly(DFF)$DFF.Open
yieldcurve <- to.monthly(T10Y3M)$T10Y3M.Open

Again, I won't cover all of the theory behind the signalification of our data, but just know that this isn't me making random stuff up. (You should never trust anyone who says that, by the way.)

The other important component of these signals is how they are shifting through time. Because the nnet package doesn't look backward at the data (it only sees the current moment), we need to add inputs that look backward for it. I'm doing that by lagging each signal by 3 months and 6 months.

# Unemployment signal: 12-month slow moving average minus
# the headline unemployment rate.
signal_unemployment <- SMA(unemployment, n = 12) - unemployment
signal_unemployment_lag3 <- lag(signal_unemployment, n = 3)
signal_unemployment_lag6 <- lag(signal_unemployment, n = 6)

# FedFunds signal: effective federal funds rate minus the
# maximum fedfunds rate over the past 12 months
signal_fedfunds <- fedfunds - rollmax(fedfunds, k = 12, align = 'right')
signal_fedfunds_lag3 <- lag(signal_fedfunds, n = 3)
signal_fedfunds_lag6 <- lag(signal_fedfunds, n = 6)

# Yield curve signal: 10-year US treasury yield minus the 3-month
# US treasury yield.
signal_yieldcurve <- yieldcurve
signal_yieldcurve_lag3 <- lag(signal_yieldcurve, n = 3)
signal_yieldcurve_lag6 <- lag(signal_yieldcurve, n = 6)

Recall that we want a 12-month lead time on our recession indicator, so we are going to adjust USREC data by 12-months for the purposes of training our model.

# Since we want to know if a recession is starting in the next 12 months,
# need to lag USREC by NEGATIVE 12 months = lead(12)
signal_recessions <- recessions %>%
  coredata() %>% # Extract core data
  lead(12) %>% # Move it 12 months earlier
  as.xts(., order.by = index(recessions)) # Rebuild as an xts, with Recessions adjusted

If you remember from neural net training 101, a neural net has hyperparameters, i.e. how big the net is and how we manage weight decay. This means we can't simply train a neural net. We have to tune it also.

Which brings us to the really sucky part: we need to divide our data into a train set, tune set, and out-of-sample test set. This means tedious "window" work. Ugh.

# Separate data into train, test, and out-of-sample
# Training data will be through Dec 1998
# Testing data will be Jan 1999 through Dec 2012
# Out of sample will be Jan 2013 through July 2020
start_train <- as.Date('1982-08-01')
end_train <- as.Date('1998-12-31')
start_test <- as.Date('1999-01-01')
end_test <- as.Date('2012-12-31')
start_oos <- as.Date('2013-01-01')
end_oos <- as.Date('2020-07-31')

train_data <- data.frame(
  'unemployment'      = window( signal_unemployment,      start = start_train, end = end_train ),
  'unemployment_lag3' = window( signal_unemployment_lag3, start = start_train, end = end_train ),
  'unemployment_lag6' = window( signal_unemployment_lag6, start = start_train, end = end_train ),
  'fedfunds'          = window( signal_fedfunds,          start = start_train, end = end_train ),
  'fedfunds_lag3'     = window( signal_fedfunds_lag3,     start = start_train, end = end_train ),
  'fedfunds_lag6'     = window( signal_fedfunds_lag6,     start = start_train, end = end_train ),
  'yieldcurve'        = window( signal_yieldcurve,        start = start_train, end = end_train ),
  'yieldcurve_lag3'   = window( signal_yieldcurve_lag3,   start = start_train, end = end_train ),
  'yieldcurve_lag6'   = window( signal_yieldcurve_lag6,   start = start_train, end = end_train ),
  'recessions'        = window( signal_recessions,        start = start_train, end = end_train ) )

test_data <- data.frame(
  'unemployment'      = window( signal_unemployment,      start = start_test, end = end_test ),
  'unemployment_lag3' = window( signal_unemployment_lag3, start = start_test, end = end_test ),
  'unemployment_lag6' = window( signal_unemployment_lag6, start = start_test, end = end_test ),
  'fedfunds'          = window( signal_fedfunds,          start = start_test, end = end_test ),
  'fedfunds_lag3'     = window( signal_fedfunds_lag3,     start = start_test, end = end_test ),
  'fedfunds_lag6'     = window( signal_fedfunds_lag6,     start = start_test, end = end_test ),
  'yieldcurve'        = window( signal_yieldcurve,        start = start_test, end = end_test ),
  'yieldcurve_lag3'   = window( signal_yieldcurve_lag3,   start = start_test, end = end_test ),
  'yieldcurve_lag6'   = window( signal_yieldcurve_lag6,   start = start_test, end = end_test ),
  'recessions'        = window( signal_recessions,        start = start_test, end = end_test ) )

oos_data <- data.frame(
  'unemployment'      = window( signal_unemployment,      start = start_oos, end = end_oos ),
  'unemployment_lag3' = window( signal_unemployment_lag3, start = start_oos, end = end_oos ),
  'unemployment_lag6' = window( signal_unemployment_lag6, start = start_oos, end = end_oos ),
  'fedfunds'          = window( signal_fedfunds,          start = start_oos, end = end_oos ),
  'fedfunds_lag3'     = window( signal_fedfunds_lag3,     start = start_oos, end = end_oos ),
  'fedfunds_lag6'     = window( signal_fedfunds_lag6,     start = start_oos, end = end_oos ),
  'yieldcurve'        = window( signal_yieldcurve,        start = start_oos, end = end_oos ),
  'yieldcurve_lag3'   = window( signal_yieldcurve_lag3,   start = start_oos, end = end_oos ),
  'yieldcurve_lag6'   = window( signal_yieldcurve_lag6,   start = start_oos, end = end_oos ),
  'recessions'        = window( signal_recessions,        start = start_oos, end = end_oos ) )

colnames(train_data) <- colnames(test_data) <- colnames(oos_data) <- c( 'unemployment',
                                                                        'unemployment_lag3',
                                                                        'unemployment_lag6',
                                                                        'fedfunds',
                                                                        'fedfunds_lag3',
                                                                        'fedfunds_lag6',
                                                                        'yieldcurve',
                                                                        'yieldcurve_lag3',
                                                                        'yieldcurve_lag6',
                                                                        'recessions' )

Comment below if you have a more efficient way of doing this!
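For what it's worth, here is one possible tidier sketch of my own (untested; it assumes all of the signal series share a compatible monthly index, so the to.monthly() outputs may need their yearmon indexes converted with as.Date() first). The idea is to merge all of the signals into a single xts object once, then slice each date window with a small helper:

# Sketch: merge all signals once, then take each window with one helper call
all_signals <- merge(signal_unemployment, signal_unemployment_lag3, signal_unemployment_lag6,
                     signal_fedfunds, signal_fedfunds_lag3, signal_fedfunds_lag6,
                     signal_yieldcurve, signal_yieldcurve_lag3, signal_yieldcurve_lag6,
                     signal_recessions)
colnames(all_signals) <- c('unemployment', 'unemployment_lag3', 'unemployment_lag6',
                           'fedfunds', 'fedfunds_lag3', 'fedfunds_lag6',
                           'yieldcurve', 'yieldcurve_lag3', 'yieldcurve_lag6',
                           'recessions')

# Helper: window an xts object between two dates and return a plain data frame
slice_window <- function(x, start, end) data.frame(window(x, start = start, end = end))

train_data <- slice_window(all_signals, start_train, end_train)
test_data  <- slice_window(all_signals, start_test,  end_test)
oos_data   <- slice_window(all_signals, start_oos,   end_oos)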

The Code – Train & Tune Our Neural Net

From here we need to make some educated, but mostly guess-like, decisions.

We must specify a minimum and maximum neural net size. I remember reading once that a neural network hidden layer should be no smaller than about 35% of the number of inputs, and no larger than 1.5 times the total number of inputs. I'm open to critique here, but that's what I'll specify for the minimum and maximum.

# Hyperparameters for tuning
# Minimum net size
size_min <- (ncol(train_data) * 0.35) %>% round(0)
# Maximum net size
size_max <- (ncol(train_data) * 1.5) %>% round(0)
size <- seq(size_min, size_max, 1)

The next hyperparameter is weight decay. As far as I know, there is really no rule of thumb for this, other than it should be 0 < w < 1. For illustration, I'm going to sequence it between 0.1 and 1.5.

# Weight decay settings, start at seq(2, 50, 2) then up it.
# This figure will get divided by 100 in the coming code.
w_decay <- seq(10, 150, 2)

Yay! Now the fun part begins!

We want the neural network that delivers the best possible Brier score, and we want to output the optimal model and hyperparameters for inspection and use later. Normally I wouldn't log the Brier scores as they go, but I want to see how net size and various weight decays affect our model, just for fun.

best_brier <- 1 # Prime variable
net_size_optimal <- 5 # Prime variable
w_decay_optimal <- 0.2 # Prime variable
# Log the evolution of Brier scores; reshaped below into a matrix
# with rows = weight decay and columns = net size
brier_evolution <- 0
i <- 1
for(n in size){
  for(w in w_decay){

    # Train model
    model_fit <- nnet( as.matrix(train_data$recessions) ~
                         unemployment + unemployment_lag3 + unemployment_lag6 +
                         fedfunds + fedfunds_lag3 + fedfunds_lag6 +
                         yieldcurve + yieldcurve_lag3 + yieldcurve_lag6,
                       data = as.matrix(train_data),
                       maxit = 500,
                       size = n,
                       decay = w/100,
                       linout = 1 )

    # Test model (for hyperparameter tuning), return Brier
    model_predict <- predict( model_fit, newdata = as.matrix(test_data) )
    brier <- mean( (model_predict - test_data$recessions)^2 )

    # If this is the best model, store the hyperparams and the model
    if( brier < best_brier ){
      net_size_optimal <- n
      w_decay_optimal <- w/100
      best_brier <- brier
      model <- model_fit
    }
    brier_evolution[i] <- brier
    i <- i + 1
  }
}

And BOOM! We've got ourselves an artificial neural network!

The Code – Results

First, let's see how our hyperparameters affected our Brier score.

# Drop Brier score into a matrix
brier_evolution <- matrix( brier_evolution, nrow = length(w_decay), ncol = length(size), byrow = FALSE )
rownames(brier_evolution) <- w_decay/100
colnames(brier_evolution) <- size
# View Brier score results
persp( x = w_decay/100,
       y = size,
       z = brier_evolution,
       ticktype = 'detailed',
       xlab = 'Weight Decay',
       ylab = 'Net Size',
       zlab = 'Brier Score',
       shade = 0.5,
       main = 'Brier Score, Effect of Net Size and Weight Decay')

Yielding

It is so interesting to me how weight decay is, by far, the most influential hyperparameter. Size matters, but not nearly as much. Also, we should note there is not too much difference between the worst model we tested and the best, as the Brier score drops from 0.15 to 0.10.

Just to put Brier scores into perspective: a score of 0 is perfect prescience, and a score of 0.50 is no better than chance. Tetlock claims that we need a Brier score of 0.30 or better to generate portfolio alpha (this is a firm-wide metric, see plot from his paper below), but it looks like 0.20 and below is really where you want to be.
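As a quick sanity check on those reference points (a minimal sketch of my own, not from the post), the Brier score is just the mean squared difference between forecast probabilities and the 0/1 outcomes:

# Brier score: mean squared error between forecasts and binary outcomes
brier <- function(forecast, outcome) mean((forecast - outcome)^2)

outcome <- rbinom(10000, 1, 0.5)
brier(rbinom(10000, 1, 0.5), outcome) # categorical coin-flip guessing: ~0.50
brier(outcome, outcome)               # perfect prescience: 0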

[Figure: plot from Tetlock's paper "Accelerating Learning in Active Management: The Alpha-Brier Process", The Journal of Portfolio Management]
At 0.10, we are on track! Of course, this is in-sample. How did we do out-of-sample?

# Use model Out of Sample, plot results
model_predict_oos <- predict( model, newdata = as.matrix(oos_data) )
data.frame( 'Date' = index( as.xts(oos_data)),
            'Probability' = coredata(model_predict_oos) ) %>%
  ggplot(., aes(x = Date, y = Probability) )+
  geom_line(size = 2, col = 'dodgerblue4')+
  labs(title = 'Probability of Recession in the Next 12 Months')+
  xlab('')+
  ylab('')+
  theme_light()+
  theme( axis.text = element_text(size = 12),
         plot.title = element_text(size = 18))

Which looks pretty good! Here we see that the traditional metrics produced a pretty reliable indication that 2020 would see a recession. My trusty-rusty old linear regression model predicted the same, by the way!

I am very interested to know our out-of-sample Brier score.

> brier_oos <- mean( (model_predict_oos - oos_data$recessions)^2, na.rm = T )
> brier_oos
[1] 0.04061196

0.04 is about as close to prescience as you can get. Honestly, it scares me a little because it is a bit too good. It is also suspicious to me that the model produced negative probabilities of a recession for 2014 through 2016. Before putting this into production I'd really like to know why that is.
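My guess about those negative probabilities (an assumption on my part, not something verified above) is that they come from fitting with linout = 1, a linear output unit, so nothing constrains predictions to [0, 1]. Two hedged options: refit with nnet's default logistic output, or simply clamp the existing predictions.

# Option 1 (sketch): refit with a logistic output unit so predictions stay
# inside (0, 1). nnet() uses a sigmoid output when linout is left at its
# default of FALSE.
model_fit_logit <- nnet( as.matrix(train_data$recessions) ~
                           unemployment + unemployment_lag3 + unemployment_lag6 +
                           fedfunds + fedfunds_lag3 + fedfunds_lag6 +
                           yieldcurve + yieldcurve_lag3 + yieldcurve_lag6,
                         data = as.matrix(train_data),
                         maxit = 500,
                         size = net_size_optimal,
                         decay = w_decay_optimal )

# Option 2: keep the tuned model and clamp its raw output into [0, 1]
model_predict_oos_clamped <- pmin(pmax(model_predict_oos, 0), 1)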

What's Driving These Results

Using the DALEX and breakDown packages, we can examine what is driving the results.

# Try to understand what drives the model using DALEX package
model_explanation <- explain( model, data = oos_data, predict_function = predict, y = oos_data$recessions )
bd <- break_down( model_explanation,
                  # nrow(), not length(): length() of a data frame is its
                  # column count, and we want the most recent observation
                  oos_data[ nrow(oos_data), 1:(ncol(oos_data)-1) ],
                  keep_distributions = T )
plot(bd)

Giving us

This illustrates the largest factors affecting the outcome in the most recent data point. It looks like the model has a standard bias of 0.081 (the intercept), and the most recent 3-month lagged yield curve data is what moved the needle most. All-in, the yield curve is the strongest signal in this data point (which doesn't surprise me at all; it is the most well-known and widely studied indicator because it has accurately predicted every recession since nineteensixtysomething).

We can also see how changes in any of the input variables affect the output of the model:

profile <- model_profile( model_explanation,
                          variables = c( 'unemployment',
                                         'unemployment_lag3',
                                         'unemployment_lag6',
                                         'fedfunds',
                                         'fedfunds_lag3',
                                         'fedfunds_lag6',
                                         'yieldcurve',
                                         'yieldcurve_lag3',
                                         'yieldcurve_lag6' ) )
plot(profile, geom = 'aggregates')

Again, the yield curve appears to be the most influential variable as changes there (x-axis) dramatically affect the output (y-axis). That said, unemployment is also a significant factor, except when lagged by 6 months. Changes there seem to have no effect.

In the end, I'd say this is a productive model. It appears robust across the various inputs, and it has an excellent out-of-sample Brier score for predicting recessions (though 2020 is our only data point for that). I am concerned that the Brier score is too good and that we have created a fragile model that may not work with future inputs, and I am concerned about the negative probabilities produced in 2013 through 2016 (this may be a function of the fragility inherent in the model; those years carried unprecedented inputs).

Ultimately, there is enough here for me to dig and refine further, but I wouldn't put this exact model into production without addressing my concerns.

As always, if you have critiques of anything here, please share them so that I can correct errors and learn!

To leave a comment for the author, please follow the link and comment on their blog: R – Franklin J. Parker.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Recession Forecasting with a Neural Net in R first appeared on R-bloggers.

#FunDataFriday – The Big Book of R

Posted: 04 Sep 2020 12:11 AM PDT

[This article was first published on #FunDataFriday - Little Miss Data, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

WHAT IS IT?

The Big Book of R is an open-source web page created by #rstats community member Oscar Baruffa. The page functions as an easy-to-navigate, one-stop shop for available books on the R programming language.

WHY IS IT AWESOME?

It's awesome because it's so easy to browse for new content on our beloved R. Oscar jumpstarted the website with over 100 books, and since the launch there has been an ongoing stream of contributions from the open-source community.

Best of all, most of the books are either free or very low cost.

HOW TO GET STARTED?

It's very simple: go to the website and start browsing for new R content! If you have a favorite book that you don't see listed, please contribute it by submitting a pull request or an issue to the GitHub repo.

To leave a comment for the author, please follow the link and comment on their blog: #FunDataFriday - Little Miss Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post #FunDataFriday - The Big Book of R first appeared on R-bloggers.


High School Swimming State-Off Tournament Texas (2) vs. Florida (3)

Posted: 03 Sep 2020 11:00 AM PDT

[This article was first published on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Welcome to round two of the State-Off. Today we have Texas (2) taking on Florida (3) for the right to compete in the State-Off championships! Don't forget to update your version of SwimmeR to 0.4.1, because we'll be using some newly released functions. We'll also take a look at how to plot swimming times while maintaining their format.
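If you need to update, the standard CRAN install works, and packageVersion() confirms what you have (assuming 0.4.1 has reached your CRAN mirror):

install.packages("SwimmeR") # grab the latest released version
packageVersion("SwimmeR")   # should report 0.4.1 or later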

library(SwimmeR)
library(dplyr)
library(stringr)
library(flextable)
library(ggplot2)

Since I like to use the same style flextable to display a lot of results, rather than retyping the associated flextable calls let's just define a function now, which we can reuse every time.

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bolds header
    bg(bg = "#D3D3D3", part = "header") %>% # puts gray background behind the header row
    autofit()
}


Getting Results

As I mentioned in the SwimmeR 0.4.1 post last week, these posts are no longer going to focus on using SwimmeR to read in results. That's been covered extensively in previous State-Off posts. Instead, I've already read in the results for each state and I'm hosting them on GitHub.

We'll just pull results for Texas and Florida, and then stick them together with bind_rows.

Texas_Link <-
  "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/TX_States_2020.csv"
Texas_Results <- read.csv(url(Texas_Link)) %>%
  mutate(State = "TX",
         Grade = as.character(Grade))

Florida_Link <-
  "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/FL_States_2020.csv"
Florida_Results <- read.csv(url(Florida_Link)) %>%
  mutate(State = "FL")

Results <- Texas_Results %>%
  bind_rows(Florida_Results) %>%
  mutate(Gender = case_when(
    str_detect(Event, "Girls") == TRUE ~ "Girls",
    str_detect(Event, "Boys") == TRUE ~ "Boys"
  ))


Scoring the Meet

Here's one of the new functions for SwimmeR 0.4.1. If you followed along with previous State-Off posts then you remember we developed the results_score function together during the first round. Now we get to use it in all of its released glory. results_score takes several arguments: results, the results to score; events, the events of interest; and meet_type, either "timed_finals" or "prelims_finals". The last three, lanes, scoring_heats, and point_values, are fairly self-explanatory: the number of lanes, the number of scoring heats, and the point values for each place, respectively.

For the State-Off we've been using a "timed_finals" meet type, with 8 lanes, 2 scoring heats, and point values per the NFHS.

Results_Final <- results_score(
  results = Results,
  events = unique(Results$Event),
  meet_type = "timed_finals",
  lanes = 8,
  scoring_heats = 2,
  point_values = c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1)
)

Scores <- Results_Final %>%
  group_by(State, Gender) %>%
  summarise(Score = sum(Points))

Scores %>%
  arrange(Gender, desc(Score)) %>%
  ungroup() %>%
  flextable_style()

State | Gender | Score
----- | ------ | ------
TX    | Boys   | 1395.5
FL    | Boys   | 929.5
FL    | Girls  | 1261.0
TX    | Girls  | 1064.0

We have our first split result of the State-Off, with the Texas boys and Florida girls each winning. This is a combined affair though, so let's see which state will advance…

Scores %>%
  group_by(State) %>%
  summarise(Score = sum(Score)) %>%
  arrange(desc(Score)) %>%
  ungroup() %>%
  flextable_style()

State | Score
----- | ------
TX    | 2459.5
FL    | 2190.5

And it's Texas, living up to its higher seed!


Swimmers of the Meet

The Swimmer of the Meet criteria are the same as they've been for the entire State-Off. First we'll look for athletes who won two events, thereby scoring the maximum possible forty points. We'll also grab the All-American cuts to use as a tiebreaker, in case multiple athletes win two events. It's possible this week we'll have our first multiple Swimmer of the Meet winner – the suspense!

Cuts_Link <-
  "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/State_Cuts.csv"
Cuts <- read.csv(url(Cuts_Link))

Cuts <- Cuts %>% # clean up Cuts
  filter(Stroke %!in% c("MR", "FR", "11 Dives")) %>% # %!in% is now included in SwimmeR
  rename(Gender = Sex) %>%
  mutate(
    Event = case_when((Distance == 200 & # match events
                         Stroke == 'Free') ~ "200 Yard Freestyle",
                      (Distance == 200 &
                         Stroke == 'IM') ~ "200 Yard IM",
                      (Distance == 50 &
                         Stroke == 'Free') ~ "50 Yard Freestyle",
                      (Distance == 100 &
                         Stroke == 'Fly') ~ "100 Yard Butterfly",
                      (Distance == 100 &
                         Stroke == 'Free') ~ "100 Yard Freestyle",
                      (Distance == 500 &
                         Stroke == 'Free') ~ "500 Yard Freestyle",
                      (Distance == 100 &
                         Stroke == 'Back') ~ "100 Yard Backstroke",
                      (Distance == 100 &
                         Stroke == 'Breast') ~ "100 Yard Breaststroke",
                      TRUE ~ paste(Distance, "Yard", Stroke, sep = " ")
    ),

    Event = case_when(
      Gender == "M" ~ paste("Boys", Event, sep = " "),
      Gender == "F" ~ paste("Girls", Event, sep = " ")
    )
  )

Ind_Swimming_Results <- Results_Final %>%
  filter(str_detect(Event, "Diving|Relay") == FALSE) %>% # join Ind_Swimming_Results and Cuts
  left_join(Cuts %>% filter((Gender == "M" &
                               Year == 2020) |
                              (Gender == "F" &
                                 Year == 2019)) %>%
              select(AAC_Cut, AA_Cut, Event),
            by = 'Event')

Swimmer_Of_Meet <- Ind_Swimming_Results %>%
  mutate(
    AA_Diff = (Finals_Time_sec - sec_format(AA_Cut)) / sec_format(AA_Cut),
    Name = str_to_title(Name)
  ) %>%
  group_by(Name) %>%
  filter(n() == 2) %>% # get swimmers that competed in two events
  summarise(
    Avg_Place = sum(Place) / 2,
    AA_Diff_Avg = round(mean(AA_Diff, na.rm = TRUE), 3),
    Gender = unique(Gender),
    State = unique(State)
  ) %>%
  arrange(Avg_Place, AA_Diff_Avg) %>%
  group_split(Gender) # split out a dataframe for boys (1) and girls (2)


Boys

Swimmer_Of_Meet[[1]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  ungroup() %>%
  flextable_style()

Name              | Avg_Place | AA_Diff_Avg | State
----------------- | --------- | ----------- | -----
Zuchowski, Joshua | 1.5       | -0.025      | FL
Vincent Ribeiro   | 1.5       | -0.023      | TX
Matthew Tannenb   | 1.5       | -0.019      | TX
David Oderinde    | 2.0       | -0.015      | TX
Dalton Lowe       | 2.5       | -0.016      | TX


Joshua Zuchowski is the boys swimmer of the meet. He's a new face for this particular award – in Florida's first round meet he finished third, behind two guys from Illinois. Note also that no boy won two events, so the All-American tiebreaker came into play.

Results_Final %>%
  filter(Name == "Zuchowski, Joshua") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  ungroup() %>%
  flextable_style()

Place | Name              | School         | Finals_Time | Event
----- | ----------------- | -------------- | ----------- | ------------------------
2     | Zuchowski, Joshua | King's Academy | 1:47.44     | Boys 200 Yard IM
1     | Zuchowski, Joshua | King's Academy | 47.85       | Boys 100 Yard Backstroke


Girls

Swimmer_Of_Meet[[2]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  ungroup() %>%
  flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header") %>%
  autofit()

Name            | Avg_Place | AA_Diff_Avg | State
--------------- | --------- | ----------- | -----
Lillie Nordmann | 1.0       | -0.046      | TX
Weyant, Emma    | 1.0       | -0.033      | FL
Cronk, Micayla  | 1.5       | -0.040      | FL
Emma Sticklen   | 1.5       | -0.032      | TX
Kit Kat Zenick  | 1.5       | -0.029      | TX


Lillie Nordmann continues her winning ways, being named girls swimmer of the meet for the second meet in a row. While she hasn't written in to answer my questions about swimming outdoors in cold weather, she did tell SwimSwam that she's postponing her collegiate career because of the ongoing pandemic. I don't blame her. As these results attest, she's a champ, and frankly this way she'll be able to focus more on trying to make the Olympic team. Also, it's warmer in Texas than it is in Palo Alto (or nearby San Jose). Just saying. Actually, I'm not just saying – I have data on the temperatures in San Jose and Houston from the National Weather Service. Let me show you what I'm talking about, in both Fahrenheit and Celsius (the State-Off has a surprisingly international audience).

San_Jose_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/San_Jose_Temps.csv"
San_Jose <- read.csv(url(San_Jose_Link)) %>%
  mutate(City = "San Jose")
Houston_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/Houston_Temps.csv"
Houston <- read.csv(url(Houston_Link)) %>%
  mutate(City = "Houston")

Temps <- bind_rows(San_Jose, Houston)

names(Temps) <- c("Month", "Mean_Max", "Mean_Min", "Mean_Avg", "City")

Temps$Month <- factor(Temps$Month, ordered = TRUE,
                      levels = unique(Temps$Month))

Temps %>%
  ggplot() +
  geom_line(aes(
    x = Month,
    y = Mean_Avg,
    group = City,
    color = City
  )) +
  scale_y_continuous(sec.axis = sec_axis( ~ (. - 32) * (5/9), name = "Monthly Mean Average Temperature (C)")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Month", y = "Monthly Mean Average Temperature (F)")

Those are the monthly average temperatures. Plotting Mean_Min instead is even colder, and closer to what it feels like in those early mornings. If I were going to swim outdoors, I'd much rather do it in Texas than shiver in NorCal. I am a delicate flower.

Results_Final %>%
  filter(Name == "Lillie Nordmann") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  ungroup() %>%
  flextable_style()

Place | Name            | School | Finals_Time | Event
----- | --------------- | ------ | ----------- | ------------------------
1     | Lillie Nordmann | COTW   | 1:43.62     | Girls 200 Yard Freestyle
1     | Lillie Nordmann | COTW   | 52.00       | Girls 100 Yard Butterfly


The DQ Column

A new feature in SwimmeR 0.4.1 is that results now include a column called DQ. When DQ == 1 a swimmer or relay has been disqualified. Recording DQs is of course important to the overall mission of SwimmeR (making swimming results available in usable formats etc. etc.), but it's also of personal interest to me as an official.

Results %>%
  group_by(State) %>%
  summarise(DQs = sum(DQ, na.rm = TRUE) / n()) %>%
  flextable_style()

State | DQs
----- | ----------
FL    | 0.01050119
TX    | 0.02785030

Results %>%
  mutate(Event = str_remove(Event, "Girls |Boys ")) %>%
  group_by(Event) %>%
  summarise(DQs = sum(DQ, na.rm = TRUE)) %>%
  filter(DQs > 0) %>%
  arrange(desc(DQs)) %>%
  head(5) %>%
  flextable_style()

Event                    | DQs
------------------------ | ---
200 Yard Medley Relay    | 14
200 Yard Freestyle Relay | 10
100 Yard Breaststroke    | 9
400 Yard Freestyle Relay | 8
200 Yard IM              | 3

Not surprisingly, most DQs are seen in relays. False starts can and do happen as teams push for every advantage. The line between a fast start and a false start is exceedingly fine. No one is trying to false start, it just happens sometimes. Breaststroke, on the other hand, is a discipline where intentional cheating can and sometimes does prosper. Recall Cameron van der Burgh admitting to taking extra dolphin kicks after winning gold in London, or Kosuke Kitajima having "no response" to Aaron Peirsol's accusation (backed up by video) that he took an illegal dolphin kick en route to gold in Beijing. Breaststrokers also seem to struggle with the modern "brush" turn, and ring up a lot of one-hand touches. The Texas results (5A, 6A) actually say what infraction was committed, and it's the usual mix of kicking violations and one-hand touches. Florida doesn't share reasons for DQs, but just from my own experience: kicking and one-hand touches. Must be in the water. Or not, because butterflyers don't seem to suffer from the same issues. Could it be that the butterfly recovery is more difficult to stop short, or to do in an unbalanced fashion such that only one hand makes contact with the wall? I don't know, it's a genuine mystery, but it is nice that my personal observations from officiating are borne out in the data.


Plotting Times With SwimmeR

We might be interested in seeing a distribution of times for a particular event. There's a problem though. Swimming times are traditionally reported as minutes:seconds.hundredths. In R that means they're stored as character strings, which can be confirmed by calling str(Results$Finals_Time). We'd like to be able to look at times from fastest to slowest though – that is, we'd like them to have an order. One option would be to convert the character strings in Results$Finals_Time to an ordered factor, but that would be awful. We'd likely need to manually define the ordering – no fun at all. SwimmeR instead offers a pair of functions to make this easier. The function sec_format will take a time written as minutes:seconds.hundredths, like the ones in Results$Finals_Time, and convert it to a numeric value – the total duration in seconds. Ordering numerics is easy; R handles it automatically. While that's very nice, we as swimming aficionados would still like to look at times in the standard swimming format. This is where mmss_format comes in – it simply converts times back to the standard minutes:seconds.hundredths for our viewing pleasure. Below is an example of converting times to numeric with sec_format for plotting, and then converting the axis labels to swimming format with mmss_format for viewing.
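As a quick illustration of the round trip (a sketch of my own; the printed values are what I'd expect, not output copied from the post):

sec_format("1:43.62") # 103.62 -- total seconds, a numeric that orders naturally
mmss_format(103.62)   # "1:43.62" -- back to swimming format for display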

Results %>%
  filter(Gender == 'Girls',
         str_detect(Event, "Relay") == FALSE) %>%
  left_join(Cuts %>% filter(Year == 2019, Gender == "F"), by = c("Event")) %>% # add in All-American cuts
  filter(Event == "Girls 200 Yard Freestyle") %>% # pick event
  mutate(Finals_Time = sec_format(Finals_Time)) %>% # change time format to seconds
  ggplot() +
  geom_histogram(aes(x = Finals_Time, fill = State), binwidth = 0.5) +
  geom_vline(aes(xintercept = sec_format(AA_Cut))) + # line for All-American cut
  geom_vline(aes(xintercept = sec_format(AAC_Cut))) + # line for All-American consideration cut
  scale_y_continuous(breaks = seq(0, 10, 2)) +
  scale_x_continuous(
    breaks = seq(100, 150, 5),
    labels = scales::trans_format("identity", mmss_format) # labels in swimming format
  ) +
  geom_label(
    label = "All-American",
    x = 107,
    y = 7,
    color = "black"
  ) +
  geom_label(
    label = "All-American Cons.",
    x = 110.5,
    y = 10,
    color = "black"
  ) +
  theme_bw() +
  labs(y = "No. of Swimmers",
       x = "Finals Time")

Here we see classic "bell" curves for each state, with a few very fast or very slow (comparatively speaking) swimmers, and a bunch of athletes clumped right in the middle.


In Closing

Texas wins the overall, despite a winning effort from the Florida girls. Next week at Swimming + Data Science we'll have number one seed California (1) vs. number five Pennsylvania (5). We'll also dig further into new SwimmeR 0.4.1 functionality. Hope to see you then!

To leave a comment for the author, please follow the link and comment on their blog: Swimming + Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post High School Swimming State-Off Tournament Texas (2) vs. Florida (3) first appeared on R-bloggers.


Correcting for confounded variables with GLMs

Posted: 03 Sep 2020 11:00 AM PDT

[This article was first published on Bluecology blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

General (and generalized) linear models can be useful for analyzing field data, where sampling is often distributed unevenly across different environmental gradients or treatment groups. They help us correct for confounded gradients and discover effects that are hidden in plots of raw data.

For instance, we used GLMs in a meta-analysis of the rates species are shifting their ranges under climate change. We used the GLMs to correct for differences in ways different studies had measured species ranges, so we could then study the unique effects of ecological variables.

You can also think of these GLMs with multiple covariates as 'statistically' (rather than experimentally) controlling for the effects of each variable when looking at the effects of the other variables.

In this post I'll demonstrate this application for statistical controls with a simple example.

Simulate data

We'll start by simulating some data. Let's say the data represent fish abundance at sites. Fish vary in the area of fish habitat available at each site. There are also two bioregions we've sampled. We'll assume the fish only occur in one bioregion, even though habitat is available in both bioregions.

We'll assume the data are Poisson distributed, to account for the fact that fish numbers can't be <0 and that they should be integers.

To simulate the data we will specify an intercept and a habitat effect b_habitat. We'll assume two bioregions, so we only need to specify one bioregion effect (which is the mean difference between bioregions B and A).

nsites <- 88

# Effects
intercept <- 0
b_habitat <- 1
b_region <- -20

Then we need to make up the covariate data.

set.seed(50)
region <- c(rep(0, nsites/2), rep(1, nsites/2))

habitat_area <- rnorm(nsites, 1.5, 0.5)
habitat_area <- habitat_area*(region + 1)

We made it so habitat area was on average twice as big in the second region.

Now assemble a 'design matrix' (X) and a matrix of effects (i.e. our coefficients, β). These help us then make predictions of mean abundance with a simple matrix multiplication:

coefs_true <- matrix(c(intercept, b_habitat, b_region))
X <- matrix(cbind(1, habitat_area, region), ncol = 3)

Now make the 'true' means for our simulation and the fish abundance data:

y_mean <- exp(X %*% coefs_true)
y <- rpois(nsites, y_mean)

We took an exponent to ensure positive values.

If this confuses you so far, you can read more about linear models and then generalized linear models on my other blogs.

Plot simulations

plot(habitat_area, y)  

So, no strong evidence of any effect of habitat area on abundance. What about if we colour points by bioregion?

plot(habitat_area, y, col = region + 1)  

So it appears the red region (the region with 1 in the region variable) has no habitat relationship, whereas the black region (the region with 0 in the region variable) has a positive relationship. This occurs because we set the region effect to -20 (and a Poisson with mean exp(-20) will predict basically always zero).
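You can convince yourself of that with a one-liner (a trivial check of my own, not part of the original post):

exp(-20)            # ~2.06e-09, a vanishingly small Poisson mean
rpois(10, exp(-20)) # essentially guaranteed to be ten zeros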

Naive GLM

Now let's fit a GLM assuming we don't know about bioregion

m1 <- glm(y ~ habitat_area, family = "poisson")
summary(m1)

##
## Call:
## glm(formula = y ~ habitat_area, family = "poisson")
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -3.071  -1.783  -1.281   0.229   7.194
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.91396    0.15321  12.493  < 2e-16 ***
## habitat_area -0.53476    0.08272  -6.464 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 401.78  on 87  degrees of freedom
## Residual deviance: 354.11  on 86  degrees of freedom
## AIC: 501.23
##
## Number of Fisher Scoring iterations: 7

Habitat is significant (p < 0.001), but the estimate is in the wrong direction! It is saying there are fewer fish with more habitat. This is because the second region has on average twice as much habitat, but our fish doesn't ever occur there.

To see this, we can do a plot with the visreg package:

library(visreg)
visreg(m1, scale = "response")

So the model is predicting an abundance that decreases with habitat area.

Region adjusted GLM

Ok, so now try including region:

m2 <- glm(y ~ habitat_area + region, family = "poisson")
summary(m2)

##
## Call:
## glm(formula = y ~ habitat_area + region, family = "poisson")
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.43472  -0.26179  -0.00006  -0.00002   2.46518
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     0.2340     0.2294   1.020    0.308
## habitat_area    0.8851     0.1319   6.713 1.91e-11 ***
## region        -23.0326  2100.7940  -0.011    0.991
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 401.777  on 87  degrees of freedom
## Residual deviance:  54.311  on 85  degrees of freedom
## AIC: 203.43
##
## Number of Fisher Scoring iterations: 18

Habitat area is significant and positive now. Note also that the effect size (0.885) is within the error bounds (SE = 0.132) for the true effect size (=1). Let's also confirm our better (most parsimonious) model with the AIC (you can read more about what the AIC is on this blog).

AIC(m1, m2)

##    df      AIC
## m1  2 501.2268
## m2  3 203.4282

So there's basically no evidence for the model without bioregion compared to the model with that effect.

But notice that while the region effect is approximately correct (-23,
against the true -20), its standard error is huge and, consequently, the
region effect is non-significant (p = 0.99). This happens because the
abundance is always zero in the second region, so the Poisson model can't
get a good estimate of the mean abundance there. Put another way, the
Poisson mean must be > 0, but the model can't distinguish between a mean
of 0.000000001 and, say, 0.0001 (they both predict essentially all zeros),
so the SEs are very broad.
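We can see both points directly from the fitted coefficients; here is a quick check, assuming the m2 fit from above (confint.default returns the Wald interval built from the same SEs that summary() reports):

# Fitted mean abundance in the region coded 1: effectively zero
b <- coef(m2)
exp(b["(Intercept)"] + b["region"])  # roughly 1e-10

# The Wald confidence interval for the region effect is enormous
confint.default(m2, parm = "region")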

We can see this if we plot the model, where habitat area appears insignificant:

visreg(m2, xvar = "habitat_area")  

Now replot the predictions, asking for habitat area just in the first
bioregion (labelled 0 in the data):

visreg(m2, xvar = "habitat_area",
       cond = list(region = 0), scale = "response")

Now we see the trend.

So our GLM has corrected for the absence of fish in the second bioregion and
correctly identified the positive relationship between abundance and
habitat area. The GLM with the region effect also gets the magnitude of
the trend approximately right (within the margin of error of the true
value).

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.


The post Correcting for confounded variables with GLMs first appeared on R-bloggers.


Deploying flexdashboard on Github Pages

Posted: 03 Sep 2020 11:00 AM PDT

[This article was first published on Rami Krispin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One of Github's coolest features is Github Pages, which enables you to create and deploy websites under the github.io domain for free. The most common usage of Github Pages is the deployment of project documentation. The R community uses it widely to deploy different R Markdown formats, such as package documentation with pkgdown, blogs with blogdown, books with bookdown, etc. Recently, as I was looking for a way to share Covid19 data, I had this eureka moment, realizing that it should be straightforward to deploy flexdashboard, or any other (non-Shiny) R Markdown format, on Github Pages. After a short googling, I came across this example by Phil Batey. Since then, I have started to utilize this feature in some of my packages (visualizing Covid19 datasets). Here are some use cases of such dashboards:

This post provides a short tutorial for deploying flexdashboard on Github Pages.

Flexdashboard on Github Pages

The flexdashboard package provides a customized format for building interactive dashboards. It is a simple, useful, and fast method for developing a static dashboard that does not require big data or a back-end server (although you can use flexdashboard with Shiny to create a dynamic dashboard with back-end server support). To deploy flexdashboard on Github Pages, you will need:

  • A Github repository
  • The flexdashboard and rmarkdown packages
  • A flexdashboard file (i.e., an Rmd file)
  • A _site.yml file

Simple example

The following example demonstrates the deployment process of flexdashboard on Github Pages. We will start by creating a new repository for the deployment (you can also deploy from an existing repository) using the Github web interface:

Note that the repository must be public to expose it on a Github page. We will set the name, keep all the checkboxes unchecked, and create the repository.

The next step is to create a local folder and sync it with the origin on Github. There are a few methods for doing this, such as a local file manager (e.g., Finder on Mac, File Explorer on Windows, etc.), the terminal (or command line), RStudio, etc. For simplicity, I will use RStudio here, creating a new project by following these steps:

  • Click the Project: drop-down button on the top right (see the blue rectangle on the first screenshot)
  • Select the New Project… option
  • Select the first option, New Directory, on the New Project window (screenshot below)

  • Set the folder path and name, leave the checkboxes unchecked, and click the Create Project button

Before we sync the new folder with the origin on Github, let's create the dashboard. We will start with the _site.yml file, a settings file that enables us to configure the rendering options of the R Markdown file. To create it, click on the top left menu File -> New File -> Text File, which will pop up a new file on the source pane, and save it as _site.yml. As Github Pages typically renders the website from the docs folder, we will set the R Markdown HTML output to the docs folder. Paste the following code and save it:

name: "flexdashboard-example"  output_dir: docs

More details about the _site.yml file available here.

Next, we will use the built-in flexdashboard template by selecting, on the top left of RStudio, File -> New File -> R Markdown. The New R Markdown window should pop up; select the From Template option on the left, and then choose the Flex Dashboard template. Note that this template won't be available on the template menu if the flexdashboard package is not installed.

Once the template is created, a new file named Untitled1 will pop up on the source pane (or Untitledn if n-1 unnamed files are already open). To serve the dashboard as a website, we will name the file index.Rmd (saving it in RStudio as index will automatically name it index.Rmd), which will render into an index.html file. By default, the dashboard template comes with a layout of two empty columns (where the second column is split into two rows). To make this example more appealing, I dropped in three interactive plots from the Plotly package documentation.
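For reference, a minimal index.Rmd might look something like the sketch below; the Plotly histogram is just a stand-in for the documentation plots, and any chunk output would work:

---
title: "Flexdashboard example"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
---

Column
-------------------------------------

### Chart A

```{r}
# Any plot or htmlwidget works here; plotly is used for illustration
plotly::plot_ly(x = rnorm(100), type = "histogram")
```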

Now we are ready to render the dashboard! Click the Knit option (under the file name tab on the source pane), and this is the output you should expect (just without the plots, which I added for the example):

You can notice on the screenshot above that the dashboard file name (on the dashboard's top right) is index.html. In addition, the HTML output was rendered automatically into the docs folder (see the folder on the bottom right of the screenshot), as we set in the _site.yml file before.

The last step of the deployment is to sync the local repo with the origin and set up the website. To set up git, you can use the repository details from Github to link it with the local folder:

We will initialize a git workspace in the local folder we created for the project, and commit the folder content using the origin address from the red box above:

git init
git add *
git commit -m "init commit"
git remote add origin git@github.com:RamiKrispin/flexdashboard_example.git
git push -u origin master

Now, if all went well, you should see the content of the local folder in the Github repository. The last step is to set the Github Page. Go to the repository Settings -> Options -> Github Pages, set the page source by selecting the master branch and the docs folder, and save the changes. Once saved, you should get the deployed Github Page URL on top:

And… it's deployed!

https://ramikrispin.github.io/flexdashboard_example/

Note that the page URL follows the pattern https://Your_Github_User_Name.github.io/Repository_Name/

Resources

Here are some useful resources for deploying flexdashboard on Github Pages:

To leave a comment for the author, please follow the link and comment on their blog: Rami Krispin.


The post Deploying flexdashboard on Github Pages first appeared on R-bloggers.

