[R-bloggers] Three ways of visualizing a graph on a map (and 6 more aRticles)


Three ways of visualizing a graph on a map

Posted: 31 May 2018 05:32 AM PDT

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

When visualizing a network whose nodes refer to geographic places, it is often useful to put these nodes on a map and draw the connections (edges) between them. This way, we can directly see the geographic distribution of the nodes and their connections in our network. This is different from a traditional network plot, where the placement of the nodes depends on the layout algorithm that is used (which may, for example, form clusters of strongly interconnected nodes).

In this blog post, I'll present three ways of visualizing network graphs on a map using R with the packages igraph, ggplot2 and optionally ggraph. Several properties of our graph should be visualized along with the positions on the map and the connections between them. Specifically, the size of a node on the map should reflect its degree, the width of an edge between two nodes should represent the weight (strength) of this connection (since we can't use proximity to illustrate the strength of a connection when we place the nodes on a map), and the color of an edge should illustrate the type of connection (some categorical variable, e.g. a type of treaty between two international partners).

Preparation

We'll need to load the following libraries first:

library(assertthat)
library(dplyr)
library(purrr)
library(igraph)
library(ggplot2)
library(ggraph)
library(ggmap)

Now, let's load some example nodes. I've picked some random countries with their geo-coordinates:

country_coords_txt <- "
 1     3.00000  28.00000       Algeria
 2    54.00000  24.00000           UAE
 3   139.75309  35.68536         Japan
 4    45.00000  25.00000 'Saudi Arabia'
 5     9.00000  34.00000       Tunisia
 6     5.75000  52.50000   Netherlands
 7   103.80000   1.36667     Singapore
 8   124.10000  -8.36667         Korea
 9    -2.69531  54.75844            UK
10    34.91155  39.05901        Turkey
11  -113.64258  60.10867        Canada
12    77.00000  20.00000         India
13    25.00000  46.00000       Romania
14   135.00000 -25.00000     Australia
15    10.00000  62.00000        Norway"

# nodes come from the above table and contain geo-coordinates for some
# randomly picked countries
nodes <- read.delim(text = country_coords_txt, header = FALSE,
                    quote = "'", sep = "",
                    col.names = c('id', 'lon', 'lat', 'name'))

So we now have 15 countries, each with an ID, geo-coordinates (lon and lat) and a name. These are our graph nodes. We'll now create some random connections (edges) between our nodes:

set.seed(123)  # set random generator state for the same output

N_EDGES_PER_NODE_MIN <- 1
N_EDGES_PER_NODE_MAX <- 4
N_CATEGORIES <- 4

# edges: create random connections between countries (nodes)
edges <- map_dfr(nodes$id, function(id) {
  n <- floor(runif(1, N_EDGES_PER_NODE_MIN, N_EDGES_PER_NODE_MAX+1))
  to <- sample(1:max(nodes$id), n, replace = FALSE)
  to <- to[to != id]
  categories <- sample(1:N_CATEGORIES, length(to), replace = TRUE)
  weights <- runif(length(to))
  data_frame(from = id, to = to, weight = weights, category = categories)
})

edges <- edges %>% mutate(category = as.factor(category))

Each of these edges defines a connection via the node IDs in the from and to columns and additionally we generated random connection categories and weights. Such properties are often used in graph analysis and will later be visualized too.

Our nodes and edges fully describe a graph so we can now generate a graph structure g with the igraph library. This is especially necessary for fast calculation of the degree or other properties of each node later.

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)  

We now create some data structures that will be needed for all the plots that we will generate. First, we create a data frame for plotting the edges. This data frame will be the same as the edges data frame, but with four additional columns that define the start and end points for each edge (x, y and xend, yend):

edges_for_plot <- edges %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  rename(x = lon, y = lat) %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  rename(xend = lon, yend = lat)

assert_that(nrow(edges_for_plot) == nrow(edges))

Let's give each node a weight and use the degree metric for this. This will be reflected by the node sizes on the map later.

nodes$weight = degree(g)  

Now we define a common ggplot2 theme that is suitable for displaying maps (sans axes and grids):

maptheme <- theme(panel.grid = element_blank()) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  theme(panel.grid = element_blank()) +
  theme(panel.background = element_rect(fill = "#596673")) +
  theme(plot.margin = unit(c(0, 0, 0.5, 0), 'cm'))

Not only will the theme be the same for all plots; they will also share the same world map as "background" (using map_data('world')) and the same fixed-ratio coordinate system, which also specifies the limits of the longitude and latitude coordinates.

country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data('world'),
                               fill = "#CECECE", color = "#515151",
                               size = 0.15)

mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))

Plot 1: Pure ggplot2

Let's start simple by using ggplot2. We'll need three geometric objects (geoms) additional to the country polygons from the world map (country_shapes): Nodes can be drawn as points using geom_point and their labels with geom_text; edges between nodes can be realized as curves using geom_curve. For each geom we need to define aesthetic mappings that "describe how variables in the data are mapped to visual properties" in the plot. For the nodes we map the geo-coordinates to the x and y positions in the plot and make the node size dependent on its weight (aes(x = lon, y = lat, size = weight)). For the edges, we pass our edges_for_plot data frame and use the x, y and xend, yend as start and end points of the curves. Additionally, we make each edge's color dependent on its category, and its "size" (which refers to its line width) dependent on the edges' weights (we will see that the latter will fail). Note that the order of the geoms is important as it defines which object is drawn first and can be occluded by an object that is drawn later in the next geom layer. Hence we draw the edges first and then the node points and finally the labels on top:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33,
             alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat, size = weight),           # draw nodes
             shape = 21, fill = 'white',
             color = 'black', stroke = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(1, 6)) +    # scale for node size
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

A warning will be displayed in the console saying "Scale for 'size' is already present. Adding another scale for 'size', which will replace the existing scale.". This is because we used the "size" aesthetic and its scale twice: once for the node size and once for the line width of the curves. Unfortunately, you cannot use two different scales for the same aesthetic, even when they are used for different geoms (here, "size" for both the node size and the edges' line widths). There is also no alternative to "size" that I know of for controlling a line's width in ggplot2.

With ggplot2, we're left with deciding which geom's size we want to scale. Here, I go for a static node size and a dynamic line width for the edges:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33,
             alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat),                          # draw nodes
             shape = 21, size = 3, fill = 'white',
             color = 'black', stroke = 0.5) +
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

Plot 2: ggplot2 + ggraph

Luckily, there is an extension to ggplot2 called ggraph with geoms and aesthetics added specifically for plotting network graphs. This allows us to use separate scales for the nodes and edges. By default, ggraph will place the nodes according to a layout algorithm that you can specify. However, we can also define our own custom layout using the geo-coordinates as node positions:

node_pos <- nodes %>%
  select(lon, lat) %>%
  rename(x = lon, y = lat)   # node positions must be called x, y

lay <- create_layout(g, 'manual',
                     node.positions = node_pos)

assert_that(nrow(lay) == nrow(nodes))

# add node degree for scaling the node sizes
lay$weight <- degree(g)

We pass the layout lay and use ggraph's geoms geom_edge_arc and geom_node_point for plotting:

ggraph(lay) + country_shapes +
  geom_edge_arc(aes(color = category, edge_width = weight,   # draw edges as arcs
                    circular = FALSE),
                data = edges_for_plot, curvature = 0.33,
                alpha = 0.5) +
  scale_edge_width_continuous(range = c(0.5, 2),             # scale for edge widths
                              guide = FALSE) +
  geom_node_point(aes(size = weight), shape = 21,            # draw nodes
                  fill = "white", color = "black",
                  stroke = 0.5) +
  scale_size_continuous(range = c(1, 6), guide = FALSE) +    # scale for node sizes
  geom_node_text(aes(label = name), repel = TRUE, size = 3,
                 color = "white", fontface = "bold") +
  mapcoords + maptheme

The edges' widths can be controlled with the edge_width aesthetic and its scale functions scale_edge_width_*. The nodes' sizes are controlled with size as before. Another nice feature is that geom_node_text has an option to distribute node labels with repel = TRUE so that they do not occlude each other that much.

Note that the edges are drawn differently than in the ggplot2 plots before. The connections are still the same; only the placement is different due to the different drawing algorithms used by ggraph. For example, the turquoise edge between Canada and Japan has moved from the very north and now runs south across the center of Africa.

Plot 3: the hacky way (overlay several ggplot2 "plot grobs")

I do not want to withhold another option, which may be considered a dirty hack: you can overlay several separately created plots (with transparent backgrounds) by annotating them as "grobs" (short for "graphical objects"). This is probably not how grob annotations are meant to be used, but it can come in handy when you really need to overcome the aesthetics limitation of ggplot2 described above in plot 1.

As explained, we will produce separate plots and "stack" them. The first plot will be the "background" which displays the world map as before. The second plot will be an overlay that only displays the edges. Finally, a third overlay shows only the points for the nodes and their labels. With this setup, we can control the edges' line widths and the nodes' point sizes separately because they are generated in separate plots.

The two overlays need to have a transparent background so we define it with a theme:

theme_transp_overlay <- theme(
  panel.background = element_rect(fill = "transparent", color = NA),
  plot.background = element_rect(fill = "transparent", color = NA)
)

The base or "background" plot is easy to make and only shows the map:

p_base <- ggplot() + country_shapes + mapcoords + maptheme  

Now we create the first overlay with the edges whose line width is scaled according to the edges' weights:

p_edges <- ggplot(edges_for_plot) +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             curvature = 0.33, alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.5, 2)) +  # scale for edge widths
  mapcoords + maptheme + theme_transp_overlay +
  theme(legend.position = c(0.5, -0.1),
        legend.direction = "horizontal")

The second overlay shows the node points and their labels:

p_nodes <- ggplot(nodes) +
  geom_point(aes(x = lon, y = lat, size = weight),
             shape = 21, fill = "white", color = "black",    # draw nodes
             stroke = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(1, 6)) +    # scale for node size
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme + theme_transp_overlay

Finally we combine the overlays using grob annotations. Note that proper positioning of the grobs can be tedious. I found that using ymin works quite well but manual tweaking of the parameter seems necessary.

p <- p_base +
  annotation_custom(ggplotGrob(p_edges), ymin = -74) +
  annotation_custom(ggplotGrob(p_nodes), ymin = -74)

print(p)

As explained before, this is a hacky solution and should be used with care. Still, it can be useful in other circumstances, too: for example, when you need different scales for point sizes and line widths in a line graph, or different color scales in a single plot, this approach might be an option to consider.

All in all, network graphs displayed on maps can be useful to show connections between the nodes in your graph on a geographic scale. A downside is that it can look quite cluttered when you have many geographically close points and many overlapping connections. It can be useful then to show only certain details of a map or add some jitter to the edges' anchor points.
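As a minimal sketch of the jitter idea (not from the original post; the amount of jitter, here up to one degree in each direction, is an arbitrary choice), the edge anchor points could be perturbed slightly before plotting so that overlapping connections between nearby nodes become easier to tell apart:

set.seed(42)

# add a small random offset to each edge's start and end coordinates
edges_for_plot_jittered <- edges_for_plot %>%
  mutate(
    x    = x    + runif(n(), -1, 1),
    y    = y    + runif(n(), -1, 1),
    xend = xend + runif(n(), -1, 1),
    yend = yend + runif(n(), -1, 1)
  )

# pass edges_for_plot_jittered instead of edges_for_plot to geom_curve() or geom_edge_arc()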

The full R script is available as gist on github.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog.


Defining Marketing with the Rvest and Tidytext Packages

Posted: 30 May 2018 05:00 PM PDT

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I am preparing to facilitate another session of the marketing course for the La Trobe University MBA. The first lecture delves into the definition of marketing. Like most other social phenomena, marketing is tough to define. Definitions of social constructs often rely on the perspective taken by the person or group writing the definition. As such, definitions also change over time. While a few decades ago, definitions of marketing revolved around sales and advertising, contemporary definitions are more holistic and reference creating value.

Heidi Cohen wrote a blog post where she collated 72 definitions of marketing. So rather than arguing over which definition is the best, why not use all definitions simultaneously? This article attempts to construct a new definition of marketing using a data science approach. We can use the R language to scrape the 72 definitions from Heidi's website and apply text analysis to extract the essence of marketing from this data set.

I have mentioned in a previous post about qualitative data science that automated text analysis is not always a useful method to extract meaning from a text. I decided to delve a little deeper into automated text analysis to see if we find out anything useful about marketing using the rvest and tidytext packages.

The presentation below shows the slides I use in my introductory lecture into marketing. The code and analyses are shown below the slideshow. You can download the most recent version of the code from my GitHub Repository.

 

Scraping text with Rvest

Web scraping is a common technique to download data from websites where this data is not available as a clean data source. Web scraping starts with downloading the HTML code from the website and then filtering the wanted text from this file. The rvest package makes this process very easy.

The code for this article uses a pipe (%>%) with three rvest commands. The first step is to download the wanted HTML code from the web using the read_html function. The output of this function is piped to the html_nodes function, which does all the hard work. In this case, the code picks all lines of text that are embedded in an <ol> container, i.e. ordered lists. You can use the SelectorGadget to target the text you would like to scrape. The last scraping step cleans the text by piping the output of the previous commands to the html_text function.

    The result of the scraping process is converted to a Tibble, which is a type of data frame used in the Tidyverse. The definition number is added to the data, and the Tibble is converted to the format required by the Tidytext package. The resulting data frame is much longer than the 72 definitions because there are other lists on the page. Unfortunately, I could not find a way to filter only the 72 definitions.

library(tidyverse)
library(rvest)

definitions <- read_html("https://heidicohen.com/marketing-definition/") %>%
    html_nodes("ol li") %>%
    html_text() %>%
    as_data_frame() %>%
    mutate(No = 1:nrow(.)) %>%
    select(No, Definition = value)

    Tidying the Text

The Tidytext package extends the tidy data philosophy to text. In this approach to text analysis, a corpus consists of a data frame where each word is a separate item. The code snippet below takes the first 72 rows, and the unnest_tokens function extracts each word from the 72 definitions. This function can also extract ngrams and other word groups from the text. The Tidytext package is an extremely versatile piece of software which goes far beyond the scope of this article. Julia Silge and David Robinson have written a book about text mining using this package, which provides a very clear introduction to the craft of analysing text.

    The last section of the pipe removes the trailing "s" from each word to convert plurals into single words. The mutate function in the Tidyverse creates or recreates a new variable in a data frame.

library(tidytext)

def_words <- definitions[1:72, ] %>%
    unnest_tokens(word, Definition) %>%
    mutate(word = gsub("s$", "", word))

This section creates a data frame with two variables. The No variable indicates the definition number (1–72) and the word variable holds a word within the definition. The order of the words is preserved in the row names. To check the data frame, you can run unique(def_words$No[which(def_words$word == "marketing")]). This line finds all definition numbers with the word "marketing", which results, as expected, in the numbers 1 to 72.
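As an aside, the unnest_tokens() function mentioned above can also split the definitions into n-grams instead of single words. A minimal sketch for bigrams (the two-word setting is an illustration, not part of the original analysis):

# each row becomes a two-word combination instead of a single word
def_bigrams <- definitions[1:72, ] %>%
    unnest_tokens(bigram, Definition, token = "ngrams", n = 2)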

    Using Rvest and Tidytext to define marketing

    We can now proceed to analyse the definitions scraped from the website with Rvest and cleaned with Tidytext. The first step is to create a word cloud, which is a popular way to visualise word frequencies. This code creates a data frame for each unique word, excluding the word marketing itself, and uses the wordcloud package to visualise the fifty most common words.

library(wordcloud)
library(RColorBrewer)

word_freq <- def_words %>%
    anti_join(stop_words) %>%
    count(word) %>%
    filter(word != "marketing")

word_freq %>%
    with(wordcloud(word, n, max.words = 50, rot.per = .5,
                   colors = rev(brewer.pal(5, "Dark2"))))

While a word cloud is certainly a pretty way to visualise the bag of words in a text, it is not the most useful way to get the reader to understand the data. The words are jumbled, and the reader needs to search for meaning. A better way to visualise word frequencies is a bar chart. This code takes the data frame created in the previous snippet and determines the top ten occurring words. The mutate statement reorders the factor levels so that the words are plotted in order.

word_freq %>%
    top_n(10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) + 
        geom_col(fill = "dodgerblue4") +
        coord_flip() +
        theme(text = element_text(size=20))

    A first look at the word cloud and bar chart suggests that marketing is about customers and products and services. Marketing is a process that includes branding and communication; a simplistic but functional definition.

    Topic Modeling using Tidytext

Word frequencies are a weak method to analyse text because they interpret each word as a solitary unit. Topic modelling is a more advanced method that analyses the relationships between words, i.e. the distance between them. The first step is to create a Document-Term Matrix, which is a matrix that indicates how often a word appears in a text. As each of the 72 texts is very short, I decided to treat the collection of definitions as one text about marketing. The cast_dtm function converts the data frame to a Document-Term Matrix.

    The following pipe determines the top words in the topics. Just like k-means clustering, the analyst needs to choose the number of topics before analysing the text. In this case, I have opted for 4 topics. The code determines the contribution of each word to the four topics and selects the five most common words in each topic. The faceted bar chart shows each of the words in the four topics.

library(topicmodels)  # provides the LDA() function used below

marketing_dtm <- word_freq %>%
    mutate(doc = 1) %>%
    cast_dtm(doc, word, n)

marketing_lda <- LDA(marketing_dtm, k = 4) %>%
    tidy(matrix = "beta") %>%
    group_by(topic) %>%
    top_n(5, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

marketing_lda %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) + 
        geom_col(show.legend = FALSE) + 
        facet_wrap(~topic, scales = "free") + 
        coord_flip() + 
        theme(text = element_text(size=20))

    This example also does not tell me much more about what marketing is, other than giving a slightly more sophisticated version of the word frequency charts. This chart shows me that marketing is about customers that enjoy a service and a product. Perhaps the original definitions are not distinctive enough to be separated from each other. The persistence of the word "president" is interesting as it seems to suggest that marketing is something that occurs at the highest levels in the business.


    This article is only a weak summary of the great work by Julia Silge who coauthored the Tidytext package. The video below provides a comprehensive introduction to topic modelling.

    What have we learnt?

This excursion into text analysis using rvest and Tidytext shows that data science can help us to make some sense out of an unread text. If I did not know what this page was about, then perhaps this analysis would enlighten me. This kind of analysis can assist us in wading through large amounts of text to select the pieces we want to read. I am still not convinced that this type of analysis will provide any knowledge beyond what can be obtained from actually reading and engaging with a text.

    Although I am a data scientist and want to maximise the use of code in analysing data, I am very much in favour of developing human intelligence before we worry about the artificial kind.


    To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.


Harry Potter and rankings with comperank

Posted: 30 May 2018 05:00 PM PDT

(This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

Ranking Harry Potter books with comperank package.

Prologue

Package comperank is on CRAN now. It offers consistent implementations of several ranking and rating methods. Originally, it was intended to be my first CRAN package when I started to build it 13 months ago. Back then I was very curious to learn about different ranking and rating methods that are used in sport. This led me to two conclusions:

  • There is an amazing book "Who's #1" by Langville and Meyer which describes several ideas in great detail.
  • Although there are some CRAN packages dedicated specifically to ranking methods (for example, elo, mvglmmRank), I didn't find them to be tidy enough.

These discoveries motivated me to write my first ever CRAN package. Things didn't turn out the way I was planning, and comperank is now actually my fourth. After spending some time writing it, I realized that most of the package would be about storing and manipulating competition results in consistent ways. That is how comperes was born.

After diverging into creating this site and writing ruler in pair with keyholder, a few months ago I returned to competition results and rankings. The experience I gained helped me improve the functional API of both packages, which eventually resulted in submitting them to CRAN.

Overview

This post, like some of the previous ones, covers the following topics:

  • Short notes about functionality of comperank.
  • Exploration ranking based on mean book score. No comperank functionality is required for this.
  • Rankings with fixed Head-to-Head structure. This will cover Massey and Colley ranking methods.
  • Rankings with variable Head-to-Head structure. This will cover Keener, Markov and Offense-Defense ranking methods.
  • Combined rankings in which average ranks will be computed using all described comperank methods.

Another very interesting set of ranking methods implemented in comperank are the methods with iterative nature. However, using them with the Harry Potter Books Survey dataset mentioned above is meaningless, as the temporal ordering of games (acts of book scoring by one person) should be meaningful, which it isn't here.

The idea behind converting survey results into competition results is described in the aforementioned post. We will need the following setup:

library(dplyr)
library(purrr)
library(rlang)

# This will automatically load {comperes}
library(comperank)

# Create competition results from hp_survey
hp_cr <- hp_survey %>%
  transmute(
    game = person, player = book,
    score = as.integer(gsub("[^0-9].*$", "", score))
  ) %>%
  as_longcr()

Functionality of comperank

Rating is considered to be a list (in the ordinary sense) of numerical values, one for each player, or the numerical value itself. Its interpretation depends on the rating method: either a bigger value indicates better player performance, or the other way around.

Ranking is considered to be a rank-ordered list (in the ordinary sense) of players: rank 1 indicates player with best performance.

comperank leverages the tidyverse ecosystem of R packages. Among other things, it means that the main output format is tibble.

There are three sets of functions:

  • rate_*() (* stands for ranking method short name). Its output is a tibble with columns player (player identifier) and at least one rating_* (rating value). Names of rating columns depend on rating method.
  • rank_*(). Its default output is similar to previous one, but with ranking_* instead of rating columns. It runs rate_*() and does ranking with correct direction. One can use option keep_rating = TRUE to keep rating columns in the output.
  • add_*_ratings(). These functions are present only for algorithms with iterative nature and competition results with games only between two players. They return tibble with row corresponding to a game and extra columns indicating ratings of players before and after the game.

Exploration ranking

Previously we established that "Harry Potter and the Prisoner of Azkaban" seems to be "the best" book and "Harry Potter and the Chamber of Secrets" comes last. This was evaluated by mean score:

hp_rank_explore <- hp_cr %>%
  summarise_player(rating_explore = mean(score)) %>%
  # round_rank() is a function from {comperank} package for doing ranking
  mutate(ranking_explore = round_rank(rating_explore))

hp_rank_explore
## # A tibble: 7 x 3
##   player rating_explore ranking_explore
## 1 HP_1             3.91               5
## 2 HP_2             3.55               7
## 3 HP_3             4.19               1
## 4 HP_4             4                  3
## 5 HP_5             3.90               6
## 6 HP_6             4.13               2
## 7 HP_7             3.96               4

As simple as it is, this approach might leave some available information unused. The survey was originally designed to obtain information not only about the books' performance as separate objects, but also about possible pairwise relationships between them. Maybe some book is generally considered "not the best" but it "outperforms" some other "better" book. This was partially studied in "Harry Potter and competition results with comperes" by computing different Head-to-Head values and studying them manually.

Here we will attempt to summarise the books' performance based on their Head-to-Head relationships.

Rankings with fixed H2H structure

In comperank there are two methods which operate on fixed Head-to-Head structure: Massey and Colley. Both of them are designed for competitions where:

  • Games are held only between two players.
  • It is assumed that score is numeric and higher values indicate better player performance in a game.

Being very upset for a moment, we realize that the dataset under study contains games with different numbers of players. Fortunately, the comperes package comes to the rescue: it has the function to_pairgames() just for this situation. It takes competition results as input and returns (strictly speaking) completely different competition results in which "crowded" games are split into smaller ones. More precisely, games with one player are removed, and games with three or more players are converted into multiple games between all unordered pairs of players. The result is in wide format (as opposed to the long format of hp_cr):

hp_cr_paired <- to_pairgames(hp_cr)

# For example, second game was converted to a set of 10 games
hp_cr %>% filter(game == 2)
## # A longcr object:
## # A tibble: 5 x 3
##    game player score
## 1     2 HP_1       3
## 2     2 HP_4       5
## 3     2 HP_5       2
## 4     2 HP_6       4
## 5     2 HP_7       5

hp_cr_paired %>% slice(2:11)
## # A widecr object:
## # A tibble: 10 x 5
##     game player1 score1 player2 score2
##  1     2 HP_1         3 HP_4         5
##  2     3 HP_1         3 HP_5         2
##  3     4 HP_1         3 HP_6         4
##  4     5 HP_1         3 HP_7         5
##  5     6 HP_4         5 HP_5         2
##  6     7 HP_4         5 HP_6         4
##  7     8 HP_4         5 HP_7         5
##  8     9 HP_5         2 HP_6         4
##  9    10 HP_5         2 HP_7         5
## 10    11 HP_6         4 HP_7         5

Massey method

The idea of the Massey method is that differences in ratings should be proportional to score differences in direct confrontations. A bigger value indicates better competition performance of a player.

hp_cr_massey <- hp_cr_paired %>% rank_massey(keep_rating = TRUE)
hp_cr_massey
## # A tibble: 7 x 3
##   player rating_massey ranking_massey
## 1 HP_1        -0.00870              5
## 2 HP_2        -0.514                7
## 3 HP_3         0.293                1
## 4 HP_4         0.114                3
## 5 HP_5         0.00195              4
## 6 HP_6         0.124                2
## 7 HP_7        -0.00948              6

Colley method

The idea of the Colley method is that ratings should be proportional to the share of games a player has won. A bigger value indicates better player performance.

hp_cr_colley <- hp_cr_paired %>% rank_colley(keep_rating = TRUE)
hp_cr_colley
## # A tibble: 7 x 3
##   player rating_colley ranking_colley
## 1 HP_1           0.497              5
## 2 HP_2           0.326              7
## 3 HP_3           0.599              1
## 4 HP_4           0.534              3
## 5 HP_5           0.505              4
## 6 HP_6           0.542              2
## 7 HP_7           0.497              6

Both Massey and Colley give the same result, which differs from the exploration ranking in how "HP_5" ("Order of the Phoenix") and "HP_7" ("Deathly Hallows") are treated: "HP_5" moved up from 6th to 4th place.

Rankings with variable H2H structure

All algorithms with variable Head-to-Head structure depend on the user supplying a custom Head-to-Head expression for computing the quality of direct confrontations between all pairs of players of interest.

There is much freedom in choosing Head-to-Head structure appropriate for ranking. For example, it can be "number of wins plus half the number of ties" (implemented in h2h_funs[["num_wins2"]] from comperes) or "mean score difference from direct matchups" (h2h_funs[["mean_score_diff"]]). In this post we will use the latter one. Corresponding Head-to-Head matrix looks like this:

hp_h2h <- hp_cr %>%
  h2h_mat(!!! h2h_funs[["mean_score_diff"]]) %>%
  round(digits = 2)

# Value indicates mean score difference between "row-player" and
# "column-player". Positive - "row-player" is better.
hp_h2h
## # A matrix format of Head-to-Head values:
##       HP_1 HP_2  HP_3  HP_4  HP_5  HP_6  HP_7
## HP_1  0.00 0.50 -0.39  0.04  0.00 -0.14 -0.06
## HP_2 -0.50 0.00 -0.77 -0.58 -0.72 -0.62 -0.45
## HP_3  0.39 0.77  0.00  0.05  0.51  0.11  0.25
## HP_4 -0.04 0.58 -0.05  0.00 -0.04  0.09  0.20
## HP_5  0.00 0.72 -0.51  0.04  0.00 -0.17 -0.04
## HP_6  0.14 0.62 -0.11 -0.09  0.17  0.00  0.15
## HP_7  0.06 0.45 -0.25 -0.20  0.04 -0.15  0.00

Keener method

The Keener method is based on the idea of "relative strength" – the strength of a player relative to the strength of the players he/she has played against. This is computed from the provided Head-to-Head values with some flexible algorithmic adjustments that make the method more robust. A bigger value indicates better player performance.

hp_cr_keener <- hp_cr %>%
  rank_keener(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
hp_cr_keener
## # A tibble: 7 x 3
##   player rating_keener ranking_keener
## 1 HP_1          0.147               5
## 2 HP_2          0.0816              7
## 3 HP_3          0.191               1
## 4 HP_4          0.150               4
## 5 HP_5          0.153               3
## 6 HP_6          0.155               2
## 7 HP_7          0.122               6

Results for Keener method again raised "HP_5" one step up to third place.

Markov method

The main idea of the Markov method is that players "vote" for other players' performance. Voting is done with Head-to-Head values: the higher the value, the more "votes" player2 (the "column-player") gives to player1 (the "row-player"). For example, if the Head-to-Head value is "number of wins", then player2 "votes" for player1 proportionally to the number of times player1 won in matchups with player2.

The actual "voting" is done in Markov chain fashion: the Head-to-Head values are organized into a stochastic matrix whose vector of stationary probabilities is declared to be the output ratings. A bigger value indicates better player performance.

hp_cr_markov <- hp_cr %>%
  rank_markov(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
hp_cr_markov
## # A tibble: 7 x 3
##   player rating_markov ranking_markov
## 1 HP_1          0.140               5
## 2 HP_2          0.0500              7
## 3 HP_3          0.196               1
## 4 HP_4          0.168               2
## 5 HP_5          0.135               6
## 6 HP_6          0.167               3
## 7 HP_7          0.143               4

We can see that the Markov method put "HP_4" ("Goblet of Fire") in second place. This is due to its reasonably good performance against the leader "HP_3" ("Prisoner of Azkaban"): the mean score difference is only 0.05 in "HP_3"'s favour. Doing well against the leader has a great impact on the output ranking in the Markov method, which somewhat resonates with common sense.

Offense-Defense method

The idea of Offense-Defense (OD) method is to account for different abilities of players by combining different ratings:

  • A player who can achieve a high Head-to-Head value (even against a player with strong defense) is said to have a strong offense, which results in a high offensive rating.
  • A player who can force opponents into achieving a low Head-to-Head value (even if they have a strong offense) is said to have a strong defense, which results in a low defensive rating.

Offensive and defensive ratings describe different skills of players. In order to rate players fully, OD ratings are computed: the offensive rating divided by the defensive one. The higher the OD rating, the better the player's performance.

hp_cr_od <- hp_cr %>%
  rank_od(!!! h2h_funs["mean_score_diff"], keep_rating = TRUE)
print(hp_cr_od, width = Inf)
## # A tibble: 7 x 7
##   player rating_off rating_def rating_od ranking_off ranking_def
## 1 HP_1         5.42      1.03      5.29            5           5
## 2 HP_2         1.45      1.88      0.771           7           7
## 3 HP_3         7.91      0.522    15.1             1           1
## 4 HP_4         6.51      0.869     7.49            3           3
## 5 HP_5         5.30      0.888     5.97            6           4
## 6 HP_6         6.59      0.809     8.14            2           2
## 7 HP_7         5.54      1.05      5.29            4           6
##   ranking_od
## 1          5
## 2          7
## 3          1
## 4          3
## 5          4
## 6          2
## 7          6

All methods give almost identical results, again differing only in the ranks of "HP_5" and "HP_7".

Combined rankings

To obtain averaged, and hopefully less "noisy", rankings we will combine rankings produced with comperank by computing their mean.

list(hp_cr_massey, hp_cr_colley, hp_cr_keener, hp_cr_markov, hp_cr_od) %>%
  # Extract ranking column
  map(. %>% select(., player, starts_with("ranking"))) %>%
  # Join all ranking data in one tibble
  reduce(left_join, by = "player") %>%
  # Compute mean ranking
  transmute(player, ranking_combined = rowMeans(select(., -player))) %>%
  # Join exploration rankings for easy comparison
  left_join(y = hp_rank_explore %>% select(-rating_explore), by = "player")
## # A tibble: 7 x 3
##   player ranking_combined ranking_explore
## 1 HP_1               5                  5
## 2 HP_2               7                  7
## 3 HP_3               1                  1
## 4 HP_4               3                  3
## 5 HP_5               4.43               6
## 6 HP_6               2.14               2
## 7 HP_7               5.43               4

As we can see, although the different ranking methods handle books with "middle performance" differently, the combined rankings are only slightly different from the exploration ones. The only notable difference is the switched rankings of "Order of the Phoenix" and "Deathly Hallows".

Conclusion

  • "Harry Potter and the Prisoner of Azkaban" still seems to be considered "best" among R users. And yet "Harry Potter and the Chamber of Secrets" still suffers the opposite fate.
  • Using different ranking methods is a powerful tool in analyzing Head-to-Head performance. This can be done in very straightforward manner with new addition to CRAN – comperank package.

sessionInfo()

sessionInfo()  ## R version 3.4.4 (2018-03-15)  ## Platform: x86_64-pc-linux-gnu (64-bit)  ## Running under: Ubuntu 16.04.4 LTS  ##   ## Matrix products: default  ## BLAS: /usr/lib/openblas-base/libblas.so.3  ## LAPACK: /usr/lib/libopenblasp-r0.2.18.so  ##   ## locale:  ##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C                ##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8      ##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8     ##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                   ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C              ## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C         ##   ## attached base packages:  ## [1] methods   stats     graphics  grDevices utils     datasets  base       ##   ## other attached packages:  ## [1] bindrcpp_0.2.2  comperank_0.1.0 comperes_0.2.0  rlang_0.2.0      ## [5] purrr_0.2.4     dplyr_0.7.5      ##   ## loaded via a namespace (and not attached):  ##  [1] Rcpp_0.12.17     knitr_1.20       bindr_0.1.1      magrittr_1.5      ##  [5] tidyselect_0.2.4 R6_2.2.2         stringr_1.3.1    tools_3.4.4       ##  [9] xfun_0.1         utf8_1.1.3       cli_1.0.0        htmltools_0.3.6   ## [13] yaml_2.1.19      rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0  ## [17] tibble_1.4.2     crayon_1.3.4     bookdown_0.7     tidyr_0.8.1       ## [21] glue_1.2.0       evaluate_0.10.1  rmarkdown_1.9    blogdown_0.6      ## [25] stringi_1.2.2    compiler_3.4.4   pillar_1.2.2     backports_1.1.2   ## [29] pkgconfig_2.0.1

To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow .


Algorithmic Trading: Using Quantopian’s Zipline Python Library In R And Backtest Optimizations By Grid Search And Parallel Processing

Posted: 30 May 2018 05:00 PM PDT

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We are ready to demo our new experimental package for algorithmic trading, flyingfox, which uses reticulate to bring Quantopian's open source algorithmic trading Python library, Zipline, to R. The flyingfox library is part of our new Business Science Labs innovation lab, which is dedicated to bringing experimental packages to our followers early on so they can test them out and let us know what they think before they make their way to CRAN. This article is a long-form code tutorial on how to perform backtest optimizations of trading algorithms via grid search and parallel processing. We'll show you how to use the combination of tibbletime (a time-based extension of tibble) + furrr (a parallel-processing complement to purrr) + flyingfox (Zipline in R) to develop a backtested trading algorithm that can be optimized via grid search and parallel processing. We are releasing this article as a complement to the R/Finance Conference presentation "A Time Series Platform For The Tidyverse", which Matt will present on Saturday (June 2nd, 2018). Enjoy!

New: Business Science Labs

I (Davis) am excited to introduce a new open source initiative called Business Science Labs. A lot of the experimental work we do is done behind the scenes, and much of it you don't see early on. What you do see is a "refined" version of what we think you need based on our perception, which is not always reality. We aim to change this. Starting today, we have created Business Science Labs, which is aimed at bringing our experimental software to you earlier so you can test it out and let us know your thoughts!

Our first initiative is to bring Quantopian's Open Source algorithmic trading Python library, Zipline, to R via an experimental package called flyingfox (built using the awesome reticulate package).

What We're Going To Learn

Introducing Business Science Labs is exciting, but we really want to educate you on some new packages! In this tutorial, we are going to go over how to backtest algorithmic trading strategies using parallel processing and Quantopian's Zipline infrastructure in R. You'll gain exposure to tibbletime, furrr, and our experimental flyingfox package. The general progression is:

  • tibbletime: What it is and why it's essential to performing scalable time-based calculations in the tidyverse

  • furrr: Why you need to know this package for speeding up code by processing purrr in parallel

  • flyingfox: The story behind the package, and how you can use it to test algorithmic trading strategies

  • tibbletime + furrr + flyingfox: Putting it all together to perform parallelized algorithmic trading strategies and analyze time-based performance

Here's an example of the grid search we perform to determine which are the best combinations of short and long moving averages for the stock symbol JPM (JP Morgan).

Portfolio Value Grid

Here's an example of the time series showing the order (buy/sell) points determined by the moving average crossovers, and the effect on the portfolio value.

Portfolio Over Time With Buy/Sell Points

Algorithmic Trading Strategies And Backtesting

Algorithmic trading is nothing new. Financial companies have been performing algorithmic trading for years as a way of attempting to "beat" the market. It can be very difficult to do, but some traders have successfully applied advanced algorithms to yield significant profits.

Using an algorithm to trade boils down to buying and selling. In the simplest case, when an algorithm detects an asset (a stock) is going to go higher, a buy order is placed. Conversely, when the algorithm detects that an asset is going to go lower, a sell order is placed. Positions are managed by buying and selling all or part of the portfolio of assets. To keep things simple, we'll focus on just the full buy/sell orders.

One very basic method of algorithmic trading is using short and long moving averages to detect shifts in trend. The crossover is the point where a buy/sell order would take place. The figure below shows the price of Halliburton (symbol "HAL"), in which a trader would have an initial position of, say, 10,000 shares. In a hypothetical case, the trader could use a combination of a 20-day short moving average and a 150-day long moving average and look for buy/sell points at the crossovers. If the trader sold his/her position in full at the sell signal and bought it back in full at the buy signal, the trader would stand to avoid a loss of approximately $5/share during the downswing, or $50,000.

Halliburton (HAL) Asset Price With 20-Day And 150-Day Moving Averages And Buy/Sell Signals At The Crossovers
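As a rough illustration of the crossover logic described above, here is a minimal sketch (not from the original article) of how such buy/sell signals could be detected programmatically. It uses tq_get() and rollify(), both of which are introduced later in this post, together with the 20/150-day windows from the example:

library(tidyverse)
library(tidyquant)
library(tibbletime)

roll_20  <- rollify(mean, window = 20)
roll_150 <- rollify(mean, window = 150)

hal_signals <- tq_get("HAL", from = "2016-01-01", to = "2017-12-31") %>%
    mutate(
        ma_20  = roll_20(adjusted),
        ma_150 = roll_150(adjusted),
        signal = case_when(
            lag(ma_20) < lag(ma_150) & ma_20 >= ma_150 ~ "buy",   # short MA crosses above long MA
            lag(ma_20) > lag(ma_150) & ma_20 <= ma_150 ~ "sell",  # short MA crosses below long MA
            TRUE                                       ~ NA_character_
        )
    )

hal_signals %>% filter(!is.na(signal))   # dates of the crossover signals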

Backtesting is a technique used to evaluate how a trading strategy would have performed in the past. It's impossible to know what the future will bring, but using trading strategies that have worked in the past helps to instill confidence in an algorithm.

Quantopian is a platform designed to enable anyone to develop algorithmic trading strategies. To help its community, Quantopian provides several open source tools. The one we'll focus on is Zipline for backtesting. There's one downside: it's only available in Python.

With the advent of the reticulate package, which enables porting any Python library to R, we took it upon ourselves to test out the viability of porting Zipline to R. Our experiment is called flyingfox.

RStudio Cloud Experiment Sandbox

In this code-based tutorial, we'll use an experimental package called flyingfox. It has several dependencies including Python that require setup time and effort. For those that want to test out flyingfox quickly, we've created a FREE RStudio Cloud Sandbox for running experiments. You can access the Cloud Sandbox here for FREE: https://rstudio.cloud/project/38291

RStudio Cloud Sandbox

Packages Needed For Backtest Optimization

The meat of this code-tutorial is the section Backtest Optimization Using tibbletime + furrr + flyingfox . However, before we get to it, we'll go over the three main packages used to do high-performance backtesting optimizations:

  1. tibbletime: What it is, and why it's essential to performing scalable time-based calculations in the tidyverse

  2. furrr: Why you need to know this package for speeding up code by processing purrr in parallel

  3. flyingfox: How to use it to test algorithmic trading strategies

Putting It All Together: tibbletime + furrr + flyingfox for backtesting optimizations performed using parallel processing and grid search!

Install & Load Libraries

Install Packages

For this post, you'll need to install the development version of flyingfox.

devtools::install_github("DavisVaughan/flyingfox")

If you are on windows, you should also install the development version of furrr.

devtools::install_github("DavisVaughan/furrr")

Other packages you'll need include tibbletime, furrr, and tidyverse. We'll also load tidyquant mainly for the ggplot2 themes. We'll install ggrepel to repel overlapping plot labels. You can install these from CRAN using install.packages().

Load Packages

library(tidyverse)
library(tibbletime)
library(furrr)
library(flyingfox)
library(tidyquant)
library(ggrepel)

We'll cover how a few packages work before jumping into backtesting and optimizations.

1. tibbletime

The tibbletime package is a cornerstone for future time series software development efforts at Business Science. We have major plans for this package. Here are some key benefits:

  • Time periods down to milliseconds are supported
  • Because this is a tibble under the hood, we are able to leverage existing packages for analysis without reinventing the wheel
  • Scalable grouped analysis is at your fingertips because of collapse_by() and integration with group_by()

It's best to learn now, and we'll go over the basics along with a few commonly used functions: collapse_by(), rollify(), filter_time(), and as_period().

First, let's get some data. We'll use the FANG data set that comes with tibbletime, which includes stock prices for FB, AMZN, NFLX, and GOOG. We recommend using the tidyquant package to get this or other stock data.

data("FANG")

FANG
## # A tibble: 4,032 x 8
##    symbol date        open  high   low close    volume adjusted
##  1 FB     2013-01-02  27.4  28.2  27.4  28    69846400     28  
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600     30.1
## # ... with 4,022 more rows

Next, you'll need to convert this tbl_df object to a tbl_time object using the tbl_time() function.

FANG_time <- FANG %>%
  tbl_time(date) %>%
  group_by(symbol)

FANG_time
## # A time tibble: 4,032 x 8
## # Index:  date
## # Groups: symbol [4]
##    symbol date        open  high   low close    volume adjusted
##  1 FB     2013-01-02  27.4  28.2  27.4  28    69846400     28  
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600     30.1
## # ... with 4,022 more rows

collapse_by()

Beautiful. Now we have a time-aware tibble. Let's test out some functions. First, let's take a look at collapse_by(), which is used for grouped operations. We'll collapse by "year", and calculate the average price for each of the stocks.

FANG_time %>%
  collapse_by(period = "year") %>%
  group_by(symbol, date) %>%
  summarise(mean_adj = mean(adjusted))
## # A time tibble: 16 x 3
## # Index:  date
## # Groups: symbol [?]
##    symbol date       mean_adj
##  1 AMZN   2013-12-31    298. 
##  2 AMZN   2014-12-31    333. 
##  3 AMZN   2015-12-31    478. 
##  4 AMZN   2016-12-30    700. 
##  5 FB     2013-12-31     35.5
##  6 FB     2014-12-31     68.8
##  7 FB     2015-12-31     88.8
##  8 FB     2016-12-30    117. 
##  9 GOOG   2013-12-31    442. 
## 10 GOOG   2014-12-31    561. 
## 11 GOOG   2015-12-31    602. 
## 12 GOOG   2016-12-30    743. 
## 13 NFLX   2013-12-31     35.3
## 14 NFLX   2014-12-31     57.5
## 15 NFLX   2015-12-31     91.9
## 16 NFLX   2016-12-30    102. 

rollify()

Next, let's take a look at rollify(). Remember the chart of Halliburton prices at the beginning. It was created using rollify(), which turns any function into a rolling function. Here's the code for the chart. Notice how we create two rolling functions using mean() and supplying the appropriate window argument.

hal_tbl <- tq_get("HAL", from = "2016-01-01", to = "2017-12-31")

roll_20  <- rollify(mean, window = 20)
roll_150 <- rollify(mean, window = 150)

hal_ma_tbl <- hal_tbl %>%
    select(date, adjusted) %>%
    mutate(
        ma_20  = roll_20(adjusted),
        ma_150 = roll_150(adjusted)
    ) %>%
    gather(key = key, value = value, -date, factor_key = T)

hal_ma_tbl %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)", 
         subtitle = "Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Adjusted Stock Price") +
    annotate("point", x = as_date("2017-04-07"), y = 49.25, colour = "red", size = 5) +
    annotate("text", label = "Sell Signal", 
             x = as_date("2017-04-07"), y = 49.25 + 2.5, color = "red", hjust = 0) +
    annotate("point", x = as_date("2017-10-09"), y = 44.25, colour = "blue", size = 5) +
    annotate("text", label = "Buy Signal", 
             x = as_date("2017-10-09"), y = 44.25 + 2.5, color = "blue")

Halliburton Asset Price (HAL): Strategy With Short MA = 20, Long MA = 150

filter_time()

Let's check out filter_time(), which enables easier subsetting of time-based indexes. Let's redo the chart above, instead focusing in on sell and buy signals, which occur after February 2017. We can convert the previously stored hal_ma_tbl to a tbl_time object, group by the "key" column, and then filter using the time function format filter_time("2017-03-01" ~ "end"). We then reuse the plotting code above.

hal_ma_time <- hal_ma_tbl %>%
    as_tbl_time(date) %>%
    group_by(key)

hal_ma_time %>%
    filter_time("2017-03-01" ~ "end") %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)", subtitle = "2016 through 2017",
         x = "Date", y = "Adjusted Stock Price") +
    annotate("point", x = as_date("2017-04-07"), y = 49.25, colour = "red", size = 5) +
    annotate("text", label = "Sell Signal", 
             x = as_date("2017-04-07"), y = 49.25 + 2.5, color = "red", hjust = 0) +
    annotate("point", x = as_date("2017-10-09"), y = 44.25, colour = "blue", size = 5) +
    annotate("text", label = "Buy Signal", 
             x = as_date("2017-10-09"), y = 44.25 + 2.5, color = "blue")

plot of chunk unnamed-chunk-9

as_period()

We can use the as_period() function to change the periodicity to a less granular level (e.g. going from daily to monthly). Here we convert the HAL share prices from daily periodicity to monthly periodicity.

hal_ma_time %>%
    as_period(period = "monthly", side = "end") %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)",
         subtitle = "Periodicity Changed To Monthly",
         x = "Date", y = "Adjusted Stock Price")

plot of chunk unnamed-chunk-10

Next, let's check out a new package for parallel processing using purrr: furrr!

2. furrr

The furrr package combines the awesome powers of future for parallel processing with purrr for iteration. Let's break this down into its pieces: purrr, future, and furrr.

purrr

The purrr package is used for iteration over many types of objects in R, including vectors, lists, and tibbles. The main function is map(), which comes in several varieties (e.g. map(), map2(), pmap(), map_chr(), map_lgl(), map_df(), etc.). Here's a basic example that gets the class() of each column of the FANG_time variable. Calling map() on a data frame iterates over its columns and returns a list containing the result of applying the function to each column.

FANG_time %>%
    map(class)
## $symbol  ## [1] "character"  ##   ## $date  ## [1] "Date"  ##   ## $open  ## [1] "numeric"  ##   ## $high  ## [1] "numeric"  ##   ## $low  ## [1] "numeric"  ##   ## $close  ## [1] "numeric"  ##   ## $volume  ## [1] "numeric"  ##   ## $adjusted  ## [1] "numeric"
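As a small companion sketch (not from the original post), the typed map_*() variants return atomic vectors instead of lists, which is often handier for quick summaries:

FANG_time %>%
    map_chr(class)                    # named character vector of column classes

FANG_time %>%
    map_int(~ length(unique(.x)))     # number of distinct values per column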

future

The future package enables parallel processing. Here are a few important points:

  • future is a unified interface for parallel programming in R.

  • You set a "plan" for how code should be executed, call future() with an expression to evaluate, and call value() to retrieve the value. The first future() call sends the code off to another R process and is non-blocking, so you can keep running R code in this session. It only blocks once you call value(). (A minimal sketch of this workflow follows this list.)

  • Global variables and packages are automatically identified and exported for you!
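Here's a minimal sketch of that plan()/future()/value() workflow (my illustration, not code from the original post), assuming the future package is installed:

library(future)

plan("multiprocess")          # choose how futures are resolved (parallel here)

f <- future({                 # non-blocking: evaluated in another R process
    Sys.sleep(2)              # stand-in for an expensive computation
    mean(rnorm(1e6))
})

# ... keep working in this session while the future runs ...

value(f)                      # blocks only here, until the result is ready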

Now, the major point: If you're familiar with purrr, you can take advantage of future parallel processing with furrr!

furrr = future + purrr

furrr takes the best parts of purrr and combines them with the parallel processing capabilities of future to create a powerful package with an interface instantly familiar to anyone who has used purrr before. You only need to do two things:

  1. Set up a plan(), such as plan("multiprocess")

  2. Change map() (or other purrr function) to future_map() (or other compatible furrr function)

Every purrr function has a compatible furrr function. For example, map_df() has future_map_df(). Set a plan, run future_map_df(), and that is all there is to it!
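A quick sketch of the swap (my example, not from the original post; assumes furrr and future are installed):

library(purrr)
library(furrr)

plan("multiprocess")

map_dbl(1:4, ~ .x ^ 2)            # sequential, purrr
future_map_dbl(1:4, ~ .x ^ 2)     # parallel, furrr -- same arguments, same result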

furrr Example: Multiple Moving Averages

Say you would like to compute not a single moving average but multiple moving averages for a given data set. We can create a custom function, multi_roller(), that uses map_dfc() to iteratively map a rollify()-ed mean() over a sequence of window sizes. Here's the function; it takes a data frame, the column on which to compute the moving averages, an aggregation function ..f, and a vector of window sizes.

multi_roller <- function(data, col, ..f = mean,
                         window = c(10, 20, 30, 100, 200), sep = "V_") {

    col_expr <- enquo(col)

    ret_tbl <- window %>%
        set_names(~ paste0(sep, window)) %>%
        map_dfc(~ rollify(..f, window = .x)(pull(data, !! col_expr))) %>%
        bind_cols(data, .)

    return(ret_tbl)

}

We can test the function out with the FB stock prices from FANG. We'll ungroup, filter by FB, and select the important columns, then pass the data frame to the multi_roller() function with window = seq(10, 100, by = 10). Great, it's working!

FANG_time %>%
    ungroup() %>%
    filter(symbol == "FB") %>%
    select(symbol, date, adjusted) %>%
    multi_roller(adjusted, mean, window = seq(10, 100, by = 10), sep = "ma_")
## # A time tibble: 1,008 x 13  ## # Index: date  ##    symbol date       adjusted ma_10 ma_20 ma_30 ma_40 ma_50 ma_60  ##                      ##  1 FB     2013-01-02     28    NA      NA    NA    NA    NA    NA  ##  2 FB     2013-01-03     27.8  NA      NA    NA    NA    NA    NA  ##  3 FB     2013-01-04     28.8  NA      NA    NA    NA    NA    NA  ##  4 FB     2013-01-07     29.4  NA      NA    NA    NA    NA    NA  ##  5 FB     2013-01-08     29.1  NA      NA    NA    NA    NA    NA  ##  6 FB     2013-01-09     30.6  NA      NA    NA    NA    NA    NA  ##  7 FB     2013-01-10     31.3  NA      NA    NA    NA    NA    NA  ##  8 FB     2013-01-11     31.7  NA      NA    NA    NA    NA    NA  ##  9 FB     2013-01-14     31.0  NA      NA    NA    NA    NA    NA  ## 10 FB     2013-01-15     30.1  29.8    NA    NA    NA    NA    NA  ## # ... with 998 more rows, and 4 more variables: ma_70 ,  ## #   ma_80 , ma_90 , ma_100 

We can apply this multi_roller() function to a nested data frame. Let's try it on our FANG_time data set. We'll select the columns of interest (symbol, date, and adjusted), group by symbol, and use the nest() function to get the data nested.

FANG_time_nested <- FANG_time %>%
    select(symbol, date, adjusted) %>%
    group_by(symbol) %>%
    nest()

FANG_time_nested
## # A tibble: 4 x 2  ##   symbol data                  ##                     ## 1 FB       ## 2 AMZN     ## 3 NFLX     ## 4 GOOG   

Next, we can perform a rowwise map using the combination of mutate() and map(). We pass multi_roller() to map() along with data (the variable being mapped over) and the additional static arguments col, ..f, window, and sep.

FANG_time_nested %>%
    mutate(ma = map(data, multi_roller, col = adjusted, ..f = mean,
                    window = seq(10, 100, by = 10), sep = "ma_")) %>%
    select(-data) %>%
    unnest()
## # A time tibble: 4,032 x 13  ## # Index: date  ##    symbol date       adjusted ma_10 ma_20 ma_30 ma_40 ma_50 ma_60  ##                      ##  1 FB     2013-01-02     28    NA      NA    NA    NA    NA    NA  ##  2 FB     2013-01-03     27.8  NA      NA    NA    NA    NA    NA  ##  3 FB     2013-01-04     28.8  NA      NA    NA    NA    NA    NA  ##  4 FB     2013-01-07     29.4  NA      NA    NA    NA    NA    NA  ##  5 FB     2013-01-08     29.1  NA      NA    NA    NA    NA    NA  ##  6 FB     2013-01-09     30.6  NA      NA    NA    NA    NA    NA  ##  7 FB     2013-01-10     31.3  NA      NA    NA    NA    NA    NA  ##  8 FB     2013-01-11     31.7  NA      NA    NA    NA    NA    NA  ##  9 FB     2013-01-14     31.0  NA      NA    NA    NA    NA    NA  ## 10 FB     2013-01-15     30.1  29.8    NA    NA    NA    NA    NA  ## # ... with 4,022 more rows, and 4 more variables: ma_70 ,  ## #   ma_80 , ma_90 , ma_100 

Great, we have our moving averages. But…

What if instead of 10 moving averages, we had 500? This would take a really long time to run on many stocks. Solution: Parallelize with furrr!

There are two ways we could do this since there are two maps:

  1. Parallelize the map() internal to the multi_roller() function
  2. Parallelize the rowwise map() applied to each symbol

We'll choose the former (1) to show off furrr.

To make the multi_roller_parallel() function, copy the multi_roller() function and do 2 things:

  1. Add plan("multiprocess") at the beginning
  2. Change map_dfc() to future_map_dfc()

That's it!

multi_roller_parallel <- function(data, col, ..f = mean,
                                  window = c(10, 20, 30, 100, 200),
                                  sep = "V_") {

    plan("multiprocess")

    col_expr <- enquo(col)

    ret_tbl <- window %>%
        set_names(~ paste0(sep, window)) %>%
        future_map_dfc(~ rollify(..f, window = .x)(pull(data, !! col_expr))) %>%
        bind_cols(data, .)

    return(ret_tbl)

}

In the previous rowwise map, switch out multi_roller() for multi_roller_parallel() and change window to 2:500. Sit back and let the function run in parallel across your computer's cores.

FANG_time_nested %>%
    mutate(ma = map(data, multi_roller_parallel, col = adjusted,
                    ..f = mean, window = 2:500, sep = "ma_")) %>%
    select(-data) %>%
    unnest()
## # A time tibble: 4,032 x 502  ## # Index: date  ##    symbol date       adjusted  ma_2  ma_3  ma_4  ma_5  ma_6  ma_7  ##                      ##  1 FB     2013-01-02     28    NA    NA    NA    NA    NA    NA    ##  2 FB     2013-01-03     27.8  27.9  NA    NA    NA    NA    NA    ##  3 FB     2013-01-04     28.8  28.3  28.2  NA    NA    NA    NA    ##  4 FB     2013-01-07     29.4  29.1  28.6  28.5  NA    NA    NA    ##  5 FB     2013-01-08     29.1  29.2  29.1  28.8  28.6  NA    NA    ##  6 FB     2013-01-09     30.6  29.8  29.7  29.5  29.1  28.9  NA    ##  7 FB     2013-01-10     31.3  30.9  30.3  30.1  29.8  29.5  29.3  ##  8 FB     2013-01-11     31.7  31.5  31.2  30.7  30.4  30.1  29.8  ##  9 FB     2013-01-14     31.0  31.3  31.3  31.1  30.7  30.5  30.3  ## 10 FB     2013-01-15     30.1  30.5  30.9  31.0  30.9  30.6  30.4  ## # ... with 4,022 more rows, and 493 more variables: ma_8 ,  ## #   ma_9 , ma_10 , ma_11 , ma_12 , ma_13 ,  ## #   ma_14 , ma_15 , ma_16 , ma_17 , ma_18 ,  ## #   ma_19 , ma_20 , ma_21 , ma_22 , ma_23 ,  ## #   ma_24 , ma_25 , ma_26 , ma_27 , ma_28 ,  ## #   ma_29 , ma_30 , ma_31 , ma_32 , ma_33 ,  ## #   ma_34 , ma_35 , ma_36 , ma_37 , ma_38 ,  ## #   ma_39 , ma_40 , ma_41 , ma_42 , ma_43 ,  ## #   ma_44 , ma_45 , ma_46 , ma_47 , ma_48 ,  ## #   ma_49 , ma_50 , ma_51 , ma_52 , ma_53 ,  ## #   ma_54 , ma_55 , ma_56 , ma_57 , ma_58 ,  ## #   ma_59 , ma_60 , ma_61 , ma_62 , ma_63 ,  ## #   ma_64 , ma_65 , ma_66 , ma_67 , ma_68 ,  ## #   ma_69 , ma_70 , ma_71 , ma_72 , ma_73 ,  ## #   ma_74 , ma_75 , ma_76 , ma_77 , ma_78 ,  ## #   ma_79 , ma_80 , ma_81 , ma_82 , ma_83 ,  ## #   ma_84 , ma_85 , ma_86 , ma_87 , ma_88 ,  ## #   ma_89 , ma_90 , ma_91 , ma_92 , ma_93 ,  ## #   ma_94 , ma_95 , ma_96 , ma_97 , ma_98 ,  ## #   ma_99 , ma_100 , ma_101 , ma_102 ,  ## #   ma_103 , ma_104 , ma_105 , ma_106 ,  ## #   ma_107 , ...

Bam! 500 moving averages run in parallel in a fraction of the time it would take to run them in series.

3. flyingfox

We have one final package we need to demo prior to jumping into our Algorithmic Trading Backtest Optimization: flyingfox.

What is Quantopian?

Quantopian is a company that has set up a community-driven platform enabling everyone (from professional traders to home-gamers) to develop algorithmic trading strategies. The one downside is that the platform is Python-only.

What is Zipline?

Zipline is a Python module open-sourced by Quantopian to help traders backtest their trading algorithms. Here are some quick facts about it:

  • It is used to develop and backtest financial algorithms using Python.

  • It includes an event-driven backtester (really good at preventing look-ahead bias)

  • Algorithms consist of two main functions:

    1. initialize(): You write an initialize() function that is called once at the beginning of the backtest. This sets up variables for use in the backtest, schedules functions to be called daily, monthly, etc., and lets you set slippage or commission for the backtest.

    2. handle_data(): You then write a handle_data() function that is called once per day (or minute) and implements the trading logic. You can place orders, retrieve historical pricing data, record metrics for performance evaluation, and more.

  • Extra facts: You can use any Python module inside the handle_data() function, so you have a lot of flexibility here.

What is reticulate?

The reticulate package from RStudio is an interface to Python. It smartly takes care of (most of) the conversion between R and Python objects.
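As a minimal illustration (mine, not from the original post, and assuming a Python install with numpy visible to reticulate), calling a Python module from R looks like this:

library(reticulate)

np <- import("numpy")       # import a Python module
x  <- np$arange(5L)         # call a Python function; the result converts to R
mean(x)                     # 2 -- use it like any other R object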

Can you combine them?

Yes, and that's exactly what we did. We used reticulate to access the Zipline Python module from R!

What is the benefit to R users?

What if you could write your initialize() and handle_data() functions in R utilizing any financial or time series R package for your analysis and then have them called from Python and Zipline?

Introducing flyingfox: An R interface to Zipline

flyingfox integrates the Zipline backtesting module with R! Further, it takes care of the overhead of creating the main infrastructure functions initialize() and handle_data() by enabling the user to write:

  • fly_initialize(): R version of Zipline's initialize()
  • fly_handle_data(): R version of Zipline's handle_data()

flyingfox takes care of passing these functions to Python and Zipline.

Why "Flying Fox"?

"Zipline" just doesn't quite make for a good hex sticker. A flying fox is another name for a zipline, and it's hard to argue that this majestic animal wouldn't make a killer hex sticker.

Flying Fox

Getting Started With flyingfox: Moving Average Crossover

Let's do a Moving Average Crossover example using the following strategy:

  • Using JP Morgan (JPM) stock prices
  • If the 20 day short-term moving average crosses above the 150 day long-term moving average, buy 100% into JP Morgan
  • If 20 day crosses below the 150 day, sell all of the current JPM position

Setup

Setup can take a while and use a fair amount of disk space because of the data ingest step (where Zipline saves pricing data for every major asset to your computer). We recommend one of two options:

  1. Lightweight option (for people who just want to try it out): Use our flyingfox sandbox on RStudio Cloud. You can connect here: https://rstudio.cloud/project/38291

  2. Heavyweight option (for people who want to expand and really test it): Follow the instructions on my GitHub page to fly_ingest() data (a hypothetical sketch follows this list). The setup and data-ingest process are discussed here: https://github.com/DavisVaughan/flyingfox. Keep in mind that this is still a work in progress. We recommend starting with the lightweight option.
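For the heavyweight route, the ingest step looks roughly like the sketch below. fly_ingest() is named above, but calling it with no arguments is an assumption on my part, so check the flyingfox README for the exact usage:

library(flyingfox)

# Downloads and caches the pricing data that Zipline backtests against
# (assumed default usage -- see the flyingfox GitHub README for specifics)
fly_ingest()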

Initialize

First, write the R function for initialize(). It must take context as an argument. This is where you store variables used later, which are accessed via context$variable. We'll store context$i = 0L to initialize the tracking of days, and context$asset = fly_symbol("JPM") to trade the JPM symbol. You can select any symbol that you'd like (provided Quantopian pulls it from Quandl).

fly_initialize <- function(context) {

    # We want to track what day we are on.
    context$i = 0L

    # We want to trade JP Morgan stock
    context$asset = fly_symbol("JPM")
}

Handle Data

Next, write a handle_data() function:

  • This implements the crossover trading algorithm logic

  • In this example we also use fly_data_history() to retrieve historical data each day for JP Morgan

  • We use fly_order_target_percent() to move to a new percentage amount invested in JP Morgan (if we order 1, we want to move to be 100% invested in JP Morgan, no matter where we were before)

  • We use fly_record() to store arbitrary metrics for review later

fly_handle_data <- function(context, data) {

    # Increment day
    context$i <- context$i + 1L

    # While < 20 days of data, return
    if(context$i < 20L) {
        return()
    }

    # Calculate a short term (20 day) moving average
    # by pulling history for the asset (JPM) and taking an average
    short_hist <- fly_data_history(data, context$asset, "price",
                                   bar_count = 20L, frequency = "1d")
    short_mavg <- mean(short_hist)

    # Calculate a long term (150 day) moving average
    long_hist <- fly_data_history(data, context$asset, "price",
                                  bar_count = 150L, frequency = "1d")
    long_mavg <- mean(long_hist)

    # If short > long, go 100% in JPM
    if(short_mavg > long_mavg) {
        fly_order_target_percent(asset = context$asset, target = 1)
    }
    # Else if we hit the crossover, dump all of JPM
    else if (short_mavg < long_mavg) {
        fly_order_target_percent(asset = context$asset, target = 0)
    }

    # Record today's data
    # We record the current JPM price, along with the value of the short and long
    # term moving average
    fly_record(
        JPM = fly_data_current(data, context$asset, "price"),
        short_mavg = short_mavg,
        long_mavg  = long_mavg
    )

}

Run The Algorithm

Finally, run the algorithm from 2013-2016 using fly_run_algorithm().

performance_time <- fly_run_algorithm(
  initialize  = fly_initialize,
  handle_data = fly_handle_data,
  start       = as.Date("2013-01-01"),
  end         = as.Date("2016-01-01")
)

If you got to this point, you've just successfully run a single backtest. Let's review the performance output.

Reviewing The Performance

Let's glimpse performance_time to see what the results show. It's a tbl_time time series data frame organized by the "date" column, and there is a ton of information. We'll focus on:

  • date: Time stamp for each point in the performance analysis
  • JPM: This is the price of the asset
  • short_mavg and long_mavg: These are our moving averages we are using for the buy/sell crossover signals
  • portfolio_value: The value of the portfolio at each time point
  • transactions: Transactions stored as a list column. Each nested tibble contains information that is useful in determining what happened; more on this below.
glimpse(performance_time)
## Observations: 756  ## Variables: 41  ## $ date                     2013-01-02 21:00:00, 2013-01-03 ...  ## $ JPM                      NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ algo_volatility          NaN, 0.00000000, 0.00000000, 0.00...  ## $ algorithm_period_return  0.00000000, 0.00000000, 0.0000000...  ## $ alpha                    NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ benchmark_period_return  0.003691629, 0.007396885, 0.01111...  ## $ benchmark_volatility     NaN, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...  ## $ beta                     NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ capital_used             0.000, 0.000, 0.000, 0.000, 0.000...  ## $ ending_cash              10000.0000, 10000.0000, 10000.000...  ## $ ending_exposure          0.00, 0.00, 0.00, 0.00, 0.00, 0.0...  ## $ ending_value             0.00, 0.00, 0.00, 0.00, 0.00, 0.0...  ## $ excess_return            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  ## $ gross_leverage           0.0000000, 0.0000000, 0.0000000, ...  ## $ long_exposure            0.00, 0.00, 0.00, 0.00, 0.00, 0.0...  ## $ long_mavg                NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ long_value               0.00, 0.00, 0.00, 0.00, 0.00, 0.0...  ## $ longs_count              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  ## $ max_drawdown             0.000000000, 0.000000000, 0.00000...  ## $ max_leverage             0.0000000, 0.0000000, 0.0000000, ...  ## $ net_leverage             0.0000000, 0.0000000, 0.0000000, ...  ## $ orders                   [[], [], [], [], [], [], [], [],...  ## $ period_close             2013-01-02 21:00:00, 2013-01-03 ...  ## $ period_label             "2013-01", "2013-01", "2013-01", ...  ## $ period_open              2013-01-02 14:31:00, 2013-01-03 ...  ## $ pnl                      0.0000, 0.0000, 0.0000, 0.0000, 0...  ## $ portfolio_value          10000.000, 10000.000, 10000.000, ...  ## $ positions                [[], [], [], [], [], [], [], [],...  ## $ returns                  0.000000000, 0.000000000, 0.00000...  ## $ sharpe                   NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ short_exposure           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  ## $ short_mavg               NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ short_value              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  ## $ shorts_count             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  ## $ sortino                  NaN, NaN, NaN, NaN, NaN, NaN, NaN...  ## $ starting_cash            10000.0000, 10000.0000, 10000.000...  ## $ starting_exposure        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...  ## $ starting_value           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...  ## $ trading_days             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11...  ## $ transactions             [[], [], [], [], [], [], [], [],...  ## $ treasury_period_return   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

First, let's plot the asset (JPM) along with the short and long moving averages. We can see there are a few crossovers.

performance_time %>%
    select(date, JPM, short_mavg, long_mavg) %>%
    gather(key = "key", value = "value", -date, factor_key = T) %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    scale_y_continuous(labels = scales::dollar) +
    labs(
        title = "JPM Stock Price",
        subtitle = "Strategy: 20 Day Short MA, 150 Day Long MA",
        x = "Date", y = "Share Price"
    )

plot of chunk unnamed-chunk-24

Next, we can investigate the transactions. Within the performance_time output, transaction information is stored as nested tibbles. We can get these values by flagging which time points contain tibbles and then filtering and unnesting. The transaction type can be determined from whether the "amount" of the transaction (the number of shares bought or sold) is positive or negative.

transactions_flagged_time <- performance_time %>%
    select(date, portfolio_value, transactions) %>%
    mutate(flag_transactions = map_lgl(transactions, is.tibble)) %>%
    filter(flag_transactions == TRUE) %>%
    unnest() %>%
    mutate(transaction_type = ifelse(amount >= 0, "buy", "sell"))

transactions_flagged_time %>%
    glimpse()
## Observations: 9  ## Variables: 10  ## $ date               2013-01-31 21:00:00, 2014-05-07 20:00:...  ## $ portfolio_value    9994.801, 11472.859, 11467.007, 11488.6...  ## $ flag_transactions  TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...  ## $ commission         NA, NA, NA, NA, NA, NA, NA, NA, NA  ## $ amount             212, -212, 197, 2, -199, 179, -179, 166, 3  ## $ sid                [,  "573c67da9fb6481c8a10cc55cddf57eb", "79...  ## $ price              47.07353, 54.02298, 57.44871, 57.55877,...  ## $ dt                 2013-01-31 21:00:00, 2014-05-07 21:00:...  ## $ transaction_type   "buy", "sell", "buy", "buy", "sell", "b...

Finally, we can visualize the results using ggplot2. We can see that the ending portfolio value is just under $11.5K.

performance_time %>%
    ggplot(aes(x = date, y = portfolio_value)) +
    geom_line(color = palette_light()[[1]]) +
    geom_point(aes(color = transaction_type), size = 5,
               data = transactions_flagged_time) +
    geom_label_repel(
        aes(label = amount), vjust = -1, color = palette_light()[[1]],
        data = transactions_flagged_time) +
    theme_tq() +
    scale_color_tq() +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "Portfolio Value And Transactions",
         subtitle = "JPM Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Portfolio Value"
    )

plot of chunk unnamed-chunk-26

Last, let's use tibbletime to see what happened to our portfolio towards the end. We'll use the portfolio value as a proxy for the stock price, visualizing the crossover of the 20 and 150-day moving averages of the portfolio. Note that the actual algorithm is run with moving averages based on the adjusted stock price, not the portfolio value.

performance_time %>%
    # Data manipulation with tibbletime & multi_roller() function
    select(date, portfolio_value) %>%
    multi_roller(portfolio_value, mean, window = c(20, 150), sep = "ma_") %>%
    filter_time("2015-05-01" ~ "end") %>%
    gather(key = key, value = value, -date, factor_key = T) %>%

    # Visualization with ggplot2
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "Portfolio Value And Transactions",
         subtitle = "JPM Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Portfolio Value"
    )

plot of chunk unnamed-chunk-27

Backtest Optimization Via Grid Search

Now for the main course: Optimizing our algorithm using the backtested performance. To do so, we'll combine what we learned from our three packages: tibbletime + furrr + flyingfox.

Let's say we want to use backtesting to find the optimal combination, or the several best combinations, of short and long term moving averages for our strategy. We can do this using Cartesian grid search, which simply means backtesting every combination of the "hyperparameters" (the parameters we wish to adjust). Since running many backtests can take some time, we'll parallelize the operation too.

Preparation

Before we can do grid search, we need to adjust our fly_handle_data() function to enable our parameters to be adjusted. The two parameters we are concerned with are the short and long moving averages. We'll add these as arguments of a new function fly_handle_data_parameterized().

fly_handle_data_parameterized <- function(context, data,
                                          short_ma = 20L, long_ma = 150L) {

    # Increment day
    context$i <- context$i + 1L

    # While < 300 days of data, return
    if(context$i < 300) {
        return()
    }

    # Calculate a short term moving average
    # by pulling history for the asset (JPM) and taking an average
    short_hist <- fly_data_history(data, context$asset, "price",
                                   bar_count = short_ma, frequency = "1d")
    short_mavg <- mean(short_hist)

    # Calculate a long term moving average
    long_hist <- fly_data_history(data, context$asset, "price",
                                  bar_count = long_ma, frequency = "1d")
    long_mavg <- mean(long_hist)

    # If short > long, go 100% in JPM
    if(short_mavg > long_mavg) {
        fly_order_target_percent(asset = context$asset, target = 1)
    }
    # Else if we hit the crossover, dump all of JPM
    else if (short_mavg < long_mavg) {
        fly_order_target_percent(asset = context$asset, target = 0)
    }

    # Record today's data
    # We record the current JPM price, along with the value of the short and long
    # term moving average
    fly_record(
        JPM = fly_data_current(data, context$asset, "price"),
        short_mavg = short_mavg,
        long_mavg  = long_mavg
    )

}

Making The Grid

Next, we can create a grid of values from a list() containing the hyperparameter values. We can turn this into a tibble of all cross-product combinations using the cross_df() function.

ma_grid_tbl <- list(
    short_ma = seq(20, 100, by = 20),
    long_ma  = seq(150, 300, by = 50)
) %>%
    cross_df()

ma_grid_tbl
## # A tibble: 20 x 2  ##    short_ma long_ma  ##            ##  1       20     150  ##  2       40     150  ##  3       60     150  ##  4       80     150  ##  5      100     150  ##  6       20     200  ##  7       40     200  ##  8       60     200  ##  9       80     200  ## 10      100     200  ## 11       20     250  ## 12       40     250  ## 13       60     250  ## 14       80     250  ## 15      100     250  ## 16       20     300  ## 17       40     300  ## 18       60     300  ## 19       80     300  ## 20      100     300

Now that we have the hyperparameters, let's create a new column with the function we wish to run. We'll use the partial() function to pre-fill the function with each pair of hyperparameters.

ma_grid_tbl <- ma_grid_tbl %>%
    mutate(f = map2(.x = short_ma,
                    .y = long_ma,
                    .f = ~ partial(fly_handle_data_parameterized,
                                   short_ma = .x, long_ma = .y)))

ma_grid_tbl
## # A tibble: 20 x 3  ##    short_ma long_ma f       ##             ##  1       20     150     ##  2       40     150     ##  3       60     150     ##  4       80     150     ##  5      100     150     ##  6       20     200     ##  7       40     200     ##  8       60     200     ##  9       80     200     ## 10      100     200     ## 11       20     250     ## 12       40     250     ## 13       60     250     ## 14       80     250     ## 15      100     250     ## 16       20     300     ## 17       40     300     ## 18       60     300     ## 19       80     300     ## 20      100     300 

Running Grid Search In Parallel Using furrr

Now for the grid search. We use the future_map() function to process in parallel. Make sure to set up a plan() first. The following code runs fly_run_algorithm() for each handle-data function stored in the "f" column.

plan("multiprocess")    results_tbl <- ma_grid_tbl %>%      mutate(results = future_map(          .x = f,          .f = ~ fly_run_algorithm(              initialize  = fly_initialize,              handle_data = .x,              start       = as.Date("2013-01-01"),              end         = as.Date("2016-01-01")              )      ))    results_tbl
## # A tibble: 20 x 4  ##    short_ma long_ma f      results              ##                           ##  1       20     150      ##  2       40     150      ##  3       60     150      ##  4       80     150      ##  5      100     150      ##  6       20     200      ##  7       40     200      ##  8       60     200      ##  9       80     200      ## 10      100     200      ## 11       20     250      ## 12       40     250      ## 13       60     250      ## 14       80     250      ## 15      100     250      ## 16       20     300      ## 17       40     300      ## 18       60     300      ## 19       80     300      ## 20      100     300    

Inspecting The Backtest Performance Results

The performance results are stored in the "results" column as tbl_time objects. We can examine the first result.

results_tbl$results[[1]]
## # A time tibble: 756 x 41  ## # Index: date  ##    date                  JPM algo_volatility algorithm_period_~ alpha  ##                                              ##  1 2013-01-02 21:00:00   NaN             NaN                  0   NaN  ##  2 2013-01-03 21:00:00   NaN               0                  0   NaN  ##  3 2013-01-04 21:00:00   NaN               0                  0   NaN  ##  4 2013-01-07 21:00:00   NaN               0                  0   NaN  ##  5 2013-01-08 21:00:00   NaN               0                  0   NaN  ##  6 2013-01-09 21:00:00   NaN               0                  0   NaN  ##  7 2013-01-10 21:00:00   NaN               0                  0   NaN  ##  8 2013-01-11 21:00:00   NaN               0                  0   NaN  ##  9 2013-01-14 21:00:00   NaN               0                  0   NaN  ## 10 2013-01-15 21:00:00   NaN               0                  0   NaN  ## # ... with 746 more rows, and 36 more variables:  ## #   benchmark_period_return , benchmark_volatility ,  ## #   beta , capital_used , ending_cash ,  ## #   ending_exposure , ending_value , excess_return ,  ## #   gross_leverage , long_exposure , long_mavg ,  ## #   long_value , longs_count , max_drawdown ,  ## #   max_leverage , net_leverage , orders ,  ## #   period_close , period_label , period_open ,  ## #   pnl , portfolio_value , positions ,  ## #   returns , sharpe , short_exposure ,  ## #   short_mavg , short_value , shorts_count ,  ## #   sortino , starting_cash , starting_exposure ,  ## #   starting_value , trading_days , transactions ,  ## #   treasury_period_return 

We can also get the final portfolio value using a combination of pull() and tail().

results_tbl$results[[1]] %>%
    pull(portfolio_value) %>%
    tail(1)
## [1] 9187.849

We can turn this into a function and map it over the entire "results" column to obtain the "final_portfolio_value" for each of the grid-search combinations.

results_tbl <- results_tbl %>%
    mutate(final_portfolio_value = map_dbl(results,
                                           ~ pull(.x, portfolio_value) %>%
                                               tail(1)))

results_tbl
## # A tibble: 20 x 5  ##    short_ma long_ma f      results             final_portfolio_value  ##                                            ##  1       20     150                     9188.  ##  2       40     150                     9277.  ##  3       60     150                     9823.  ##  4       80     150                    10816.  ##  5      100     150                    10697.  ##  6       20     200                     9065.  ##  7       40     200                    10612.  ##  8       60     200                    11444.  ##  9       80     200                    11366.  ## 10      100     200                    11473.  ## 11       20     250                     9522.  ## 12       40     250                    10898.  ## 13       60     250                    11494.  ## 14       80     250                    11494.  ## 15      100     250                    11494.  ## 16       20     300                    10951.  ## 17       40     300                    11494.  ## 18       60     300                    11494.  ## 19       80     300                    11494.  ## 20      100     300                    11494.

Visualizing The Backtest Performance Results

Now let's visualize the results to see which combinations of short and long moving averages maximize the portfolio value. It's clear that short >= 60 days and long >= 200 days maximize the return. But, why?

results_tbl %>%
    ggplot(aes(long_ma, short_ma, color = final_portfolio_value)) +
    geom_point(aes(size = final_portfolio_value)) +
    geom_label_repel(
        aes(label = scales::dollar(round(final_portfolio_value, 0))),
        vjust = -1, color = palette_light()[[1]]) +
    theme_tq() +
    scale_color_gradient(low = palette_light()[[1]], high = palette_light()[[3]]) +
    scale_size(range = c(1, 12)) +
    labs(
        title = "JPM Strategy: Backtest Grid Search Optimization",
        subtitle = "Short and Fast Moving Averages: Optimum = 60 days (Short MA), 300 Days (Long MA)",
        x = "Long Moving Average (Days)", y = "Short Moving Average (Days)",
        color = "Final Portfolio Value", size = "Final Portfolio Value"
    ) +
    theme(legend.direction = "vertical",
          legend.position = "right")

plot of chunk unnamed-chunk-36

Let's get the transaction information (buy/sell) by unnesting the results and determining which transactions are buys and sells.

transactions_tbl <- results_tbl %>%
    unnest(results) %>%
    select(short_ma:date, portfolio_value, transactions) %>%
    mutate(flag_transactions = map_lgl(transactions, is.tibble)) %>%
    filter(flag_transactions == TRUE) %>%
    unnest() %>%
    mutate(transaction_type = ifelse(amount >= 0, "buy", "sell")) %>%
    mutate(
        short_ma = as.factor(short_ma) %>% fct_rev(),
        long_ma  = as.factor(long_ma)
    )

transactions_tbl %>%
    glimpse()
## Observations: 105  ## Variables: 13  ## $ short_ma               20, 20, 20, 20, 20, 20, 20, 20, 20,...  ## $ long_ma                150, 150, 150, 150, 150, 150, 150, ...  ## $ final_portfolio_value  9187.849, 9187.849, 9187.849, 9187....  ## $ date                   2014-03-13 20:00:00, 2014-03-14 20...  ## $ portfolio_value        9994.890, 9888.191, 9404.815, 9400....  ## $ flag_transactions      TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,...  ## $ commission             NA, NA, NA, NA, NA, NA, NA, NA, NA,...  ## $ amount                 172, 2, -174, 161, 2, -163, 147, -1...  ## $ sid                    [...  ## $ order_id               "d8a3551cb36c4ff985c42e4d78e00ef8",...  ## $ price                  57.44871, 56.82840, 54.02298, 57.44...  ## $ dt                     2014-03-13 21:00:00, 2014-03-14 21...  ## $ transaction_type       "buy", "buy", "sell", "buy", "buy",...

Finally, we can visualize the portfolio value over time for each combination of short and long moving averages. By plotting the buy/sell transactions, we can see the effect on a stock with a bullish trend. The portfolios with the optimal performance are those that were bought and held rather than sold using the moving average crossover. For this particular stock, the benefit of downside protection via the moving average crossover costs the portfolio during the bullish uptrend.

results_tbl %>%
    unnest(results) %>%
    mutate(
        short_ma = as.factor(short_ma) %>% fct_rev(),
        long_ma  = as.factor(long_ma)
    ) %>%
    group_by(short_ma, long_ma) %>%
    filter_time("2014" ~ "end") %>%
    ggplot(aes(date, portfolio_value)) +
    geom_line(color = palette_light()[[1]]) +
    geom_vline(aes(xintercept = date), color = "blue",
               data = filter(transactions_tbl, transaction_type == "buy")) +
    geom_vline(aes(xintercept = date), color = "red",
               data = filter(transactions_tbl, transaction_type == "sell")) +
    facet_grid(short_ma ~ long_ma, labeller = label_both) +
    theme_tq() +
    scale_y_continuous(labels = scales::dollar) +
    scale_x_datetime(date_labels = "%Y", date_breaks = "1 year") +
    labs(
        title = "JPM Strategy: Portfolio Value Over Time With Transactions",
        subtitle = "Bull Market: Buy and Hold Strategy is Optimal Vs Short/Long Moving Average Strategies",
        x = "Date", y = "Portfolio Value"
    )

plot of chunk unnamed-chunk-38

Conclusions

We've covered a lot of ground in this article. Congrats if you've made it through. You've now been exposed to three cool packages:

  1. tibbletime: For time-series in the tidyverse
  2. furrr: Our parallel processing extension to purrr
  3. flyingfox: Our experimental package brought to you as part of our Business Science Labs initiative

Further, you've seen how to apply all three of these packages to perform grid search backtest optimization of your trading algorithm.

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), you should consider Business Science University. Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You'll learn:

  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement

All while solving a REAL WORLD CHURN PROBLEM: Employee Turnover!


Get 15% OFF in June!

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don't realize this because it's an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here's an example of a Shiny app you will create.


Get 15% OFF in June!

HR 301 Shiny Application: Employee Prediction

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It's code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get 15% OFF in June!

Don't Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.


OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package

Posted: 30 May 2018 10:19 AM PDT

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it's fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you're in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

  $ cd ~/Downloads  $ ls -l  total 264856  -rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf  -rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf  -rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf  -rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf  -rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf  -rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg  -rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf  -rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz  -rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg  

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn't think so. Thus begat the xattrs🔗 package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let's use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you've chosen a different default for your browsers, use that directory).

We're not going to work with the entire package in this post (it's really straightforward to use and has a README on the GitHub site along with extensive examples) but I'll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let's see what is hidden with the RStudio disk image:

library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse)  # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let's take a look:

  get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")  ##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61  ##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64  ##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36  ##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74  ## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a  ## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00  ## [169] 00 00 00 89  

Why "raw"? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:

  get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")  ## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"  

So, we can kinda figure out the URL but it's definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but for now we can work with that binary plist data in two ways: first, via the read_bplist() function that comes with the xattrs package, which wraps Linux/BSD or macOS system utilities (and is expensive since it means writing the data out to a file each time); or second, by turning to Python, which already has this capability. We're going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python installs, don't judge, and use the RETICULATE_PYTHON env var for the one I use with R). You'll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x over 2.x, though.

biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/"

That's much better.

Let's work with metadata for the whole directory:

  list.files("~/Downloads", full.names = TRUE) %>%     keep(has_xattrs) %>%     set_names(basename(.)) %>%     map_df(read_xattrs, .id="file") -> xdf    xdf  ## # A tibble: 24 x 4  ##    file            name                                  size contents     ##                                                       ##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16    ##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110   ##  3 1109.1968.pdf   com.apple.quarantine                    74    ##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16    ##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116   ##  6 1109.1968v1.pdf com.apple.quarantine                    74    ##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86    ##  8 1804.09970.pdf  com.apple.quarantine                    82    ##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16    ## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104   ## # ... with 14 more rows    ## count(xdf, name, sort=TRUE)  ## # A tibble: 5 x 2  ##   name                                     n  ##                                     ## 1 com.apple.metadata:kMDItemWhereFroms     9  ## 2 com.apple.quarantine                     9  ## 3 com.apple.lastuseddate#PS                4  ## 4 com.apple.diskimages.fsck                1  ## 5 com.apple.diskimages.recentcksum         1  

Now we can focus on the task at hand: recovering the URLs:

  list.files("~/Downloads", full.names = TRUE) %>%     keep(has_xattrs) %>%     set_names(basename(.)) %>%     map_df(read_xattrs, .id="file") %>%     filter(name == "com.apple.metadata:kMDItemWhereFroms") %>%     mutate(where_from = map(contents, biplist$readPlistFromString)) %>%     select(file, where_from) %>%     unnest() %>%     filter(!where_from == "")  ## # A tibble: 15 x 2  ##    file                where_from                                                         ##                                                                                 ##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                                ##  2 1109.1968.pdf       https://www.google.com/                                            ##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                          ##  4 1109.1968v1.pdf     https://www.google.com/                                            ##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                               ##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf             ##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                               ##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf          ##  9 athena-ug.pdf       https://www.google.com/                                            ## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz   ## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                      ## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…  ## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                        ## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…  ## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/               

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.
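For completeness, the Python-free fallback might look like the sketch below. This is a guess at the call pattern (I'm assuming read_bplist() accepts the raw attribute value), so treat it as a sketch and check the package documentation:

# Hypothetical fallback without Python; read_bplist() shells out to system
# plist utilities, so expect it to be slower and to return a list object
read_bplist(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)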

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
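For example, tagging one of your own files might look something like this sketch. The function names come from the package, but the exact signatures used here (path, name, value) are my assumption, so double-check the documentation before relying on them:

# Hypothetical tagging example (assumed signatures -- verify against the docs)
set_xattr("analysis.csv", "user.source_url", "https://example.com/data")
get_xattr("analysis.csv", "user.source_url")   # should return the URL we just set

# ... and remove the tag again when it's no longer wanted
rm_xattr("analysis.csv", "user.source_url")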

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


New Version of ggplot2

Posted: 30 May 2018 08:30 AM PDT

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here's what choroplethr users need to know about this new version of ggplot2.

Choroplethr Update Required

The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

This change is the third or fourth time I've had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: "More time is spent maintaining old software than writing new software."

Simple Features Support

One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been "I can't do that until ggplot2 adds support for Simple Features." Thankfully, this new version of ggplot2 introduces that support!
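To make that concrete, here's a minimal example of the new Simple Features support (assuming ggplot2 >= 3.0.0 and the sf package, and using the North Carolina shapefile that ships with sf). This is my illustration, not code from the choroplethr project:

library(ggplot2)
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

ggplot(nc) +
    geom_sf(aes(fill = AREA)) +
    labs(title = "Simple Features rendered directly with geom_sf()")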

Currently all maps in the choroplethr ecosystem are stored as ggplot2 "fortified" dataframes. This is a format unique to ggplot2. Storing the maps in this format makes it possible to render the maps as quickly as possible. The downside is that:

  1. ggplot2 does not support interactive graphics.
  2. The main interactive mapping library in CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

  1. Updating choroplethr to be able to render maps stored in the Simple Features format.
  2. Migrating choroplethr's existing maps to the Simple Features format.

After that, I can start experimenting with adding interactive graphics support to choroplethr.

Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

Thoughts on the CRAN Ecosystem

This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. 

ggplot2 is used by about 1,700 packages. It's unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?

And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user's code.

The post New Version of ggplot2 appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.


Does financial support in Australia favour residents born elsewhere? Responding to racism with data

Posted: 30 May 2018 08:09 AM PDT

(This article was first published on blogR, and kindly contributed to R-bloggers)

Seeing a racist outburst made me wonder whether the Australian Government unfairly supports people based on their background. Using data from the Australian Government and Bureau of Statistics, I couldn't find compelling evidence of this being true. Don't believe me? Read on and see what you make of the data.

 Australian racism goes viral, again

Australian racism went viral again this year when a man was filmed abusing staff at Centrelink, which delivers social security payments and services to Australians (see story here). The man yells that he didn't vote for multiculturalism and that Centrelink is supporting everyone except "Australians". It is distressing to watch, especially as someone whose ancestors found a home in Australia having escaped persecution. He can't take it back, but the man did publicly apologise and may be suffering from mental illness (see story here).

This topic is still polarising. Many of us want to vilify this man while others may applaud him. But hate begets hate, and fighting fire with fire makes things worse. As a data scientist, the best way I know to respond to the assumptions and stereotypes that fuel racism is with evidence. So, without prejudice, let us investigate the data and uncover to whom the Australian government provides support through Centrelink.

With rare exceptions, Centrelink supports Australian residents living in Australia (see here and here). So, the claim that Centrelink supports everyone but Australians is misguided. Perhaps the reference to "multiculturalism" can direct us to a more testable question. Centrelink offers support to Australian residents, who can be born anywhere in the world. So in this article, I'll use publicly accessible data to investigate differences in support given to residents born in Australia or elsewhere.

Estimated Residential Population

The Figure below shows changes in Australia's Estimated Residential Population, which is an official value published by the Australian Bureau of Statistics and used for policy formation and decision making.

[Figure: Australia's Estimated Residential Population over time (erp_over_time-1.png)]

The residential population increased from about 17 million in 1992 to over 24 million in 2016. Over the same period, the percentage of residents who are Australian-born decreased from 76.9% to 71.5%. These population shares will guide our sense of whether Centrelink payments are unbiased.
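
The original plotting code is on the author's GitHub; as a rough sketch, the two quoted endpoints could be plotted like this (values taken from the text above, not the full ABS series):

library(ggplot2)

# Only the two endpoints quoted above; the full ABS series is on GitHub.
erp <- data.frame(
  year         = c(1992, 2016),
  population_m = c(17, 24),      # approximate residential population, millions
  pct_aus_born = c(76.9, 71.5)   # % of residents born in Australia
)

ggplot(erp, aes(x = year, y = population_m)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Estimated Residential Population (millions)")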

As a side note, Census statistics reported that the percentage of Australian-born residents in 2016 was 66.7% (4.8 percentage points lower than the official estimate above). This discrepancy is the result of the Australian Bureau of Statistics making adjustments that you can learn about here.

Centrelink data is published quarterly and has included country-of-birth breakdowns since December 2016 (which aligns with the last available population data reported above). At that time, Centrelink made 15 million payments to Australian residents.

In December 2016, 71.5% of Australia's Estimated Residential Population was Australian-born. By comparison, 68.8% of all Centrelink payments went to Australian-born residents.

The data shows that Centrelink payments are made to residents born in Australia or elsewhere in approximately the same proportions as these groups are represented in the population. The gap of a couple of percentage points indicates that slightly fewer payments were going to Australian-born residents than we'd expect. As we'll see in the following section, this difference can be almost wholly accounted for by the Age Pension. Still, the difference is small enough to negate the claim that Centrelink substantially favours residents born outside Australia.
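
To make the size of that gap concrete, here is a back-of-the-envelope calculation using only the figures quoted above (a sketch, not the published Centrelink tables):

total_payments   <- 15e6    # Centrelink payments in December 2016
share_payments   <- 0.688   # share of payments going to Australian-born residents
share_population <- 0.715   # Australian-born share of the residential population

# Payments to Australian-born residents, observed vs expected if payments
# simply tracked the population share
observed <- total_payments * share_payments
expected <- total_payments * share_population
expected - observed   # roughly 400,000 payments, about 2.7% of the total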

Breakdown by Type

It's also possible to break down these total numbers into the specific payment types shown below (detailed list here).

[Figure: total number of recipients for each Centrelink payment type (payment_type_total-1.png)]

It's expected that these payment types, which support specific needs, will show biases in favour of certain groups. For example, ABSTUDY supports study costs and housing for Aboriginal or Torres Strait Islander residents. This should mostly go to Australian-born residents. To investigate, we can extend the Figure above to include the number of Australian-born recipients:

[Figure: total recipients versus Australian-born recipients for each payment type (payment_type_total_v_aus-1.png)]

In this Figure, most payment types fall along the dotted line, which is what we'd expect from a fair system in which 71.5% of recipients are Australian-born.

The outlier is the Age Pension, which falls below the line. More recipients of the Age Pension are born outside Australia than is reflected in the total population. I cannot say from this data alone why there is some bias in the Age Pension; perhaps a knowledgeable reader can comment. Nonetheless, this discrepancy is large enough that removing the Age Pension from consideration results in 70.5% of all other Centrelink payments going to Australian-born residents – almost exactly the proportion in the population.
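
A plot along these lines can be sketched as follows; the payments data frame here is hypothetical (placeholder names and counts), standing in for the real per-payment-type recipient counts used in the post:

library(ggplot2)

# Hypothetical data: one row per payment type, with total recipients and
# the number of Australian-born recipients (placeholder values only).
payments <- data.frame(
  type     = c("Payment A", "Payment B", "Payment C"),
  total    = c(2500000, 800000, 150000),
  aus_born = c(1700000, 590000, 120000)
)

# Points on the dotted line correspond to payment types where 71.5% of
# recipients are Australian-born (the population benchmark).
ggplot(payments, aes(x = total, y = aus_born)) +
  geom_point() +
  geom_abline(slope = 0.715, intercept = 0, linetype = "dotted") +
  labs(x = "Total recipients", y = "Australian-born recipients")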

Ignoring Total Numbers

The Figure below shows the percentage of Australian-born recipients for each payment type, ignoring totals.

[Figure: percentage of Australian-born recipients for each payment type (payment_type_p_aus-1.png)]

At the upper end of the scale, we can see Australian-born recipients being over-represented for ABSTUDY and Youth Allowance payments. At the lower end, residents who are born outside Australia are over-represented for Wife and Widow pension and allowance.

These payments with large biases (in either direction) have some common features. They have very specific eligibility criteria and are among the least-awarded services (shown in earlier Figures). Furthermore, in some cases, such as the Wife Pension, payments are no longer granted to new recipients.

These findings are consistent with the expectation that specific types of payments should be biased in specific ways. They also show that substantial biases arise only for specific payments that are awarded to very few individuals.
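
A per-payment-type percentage plot like the one above can be sketched in the same way, again with a hypothetical payments data frame (placeholder values only) and a dashed line at the 71.5% population benchmark:

library(ggplot2)

# Hypothetical per-payment-type counts (placeholder values only).
payments <- data.frame(
  type     = c("Payment A", "Payment B", "Payment C"),
  total    = c(2500000, 800000, 150000),
  aus_born = c(1700000, 590000, 120000)
)
payments$pct_aus_born <- 100 * payments$aus_born / payments$total

ggplot(payments, aes(x = reorder(type, pct_aus_born), y = pct_aus_born)) +
  geom_col() +
  geom_hline(yintercept = 71.5, linetype = "dashed") +
  coord_flip() +
  labs(x = NULL, y = "% of recipients who are Australian-born")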

Concluding remarks

In response to a racist outburst, I sought out publicly available data to investigate whether there was evidence that the Australian Government unfairly supported residents based on their country of origin. I found that the percentage of residents born outside Australia has increased over time. However, with the minor exception of the Age Pension (which the outraged man was not eligible for), residents born in Australia or elsewhere were fairly represented in the total number of Centrelink payments.

I'd like to thank the Australian Government and Australian Bureau of Statistics for publishing this data and making it possible for me to respond to racism with evidence. If you'd like to reproduce this work or dig into the data yourself, everything, from where I got the data to the code used to create this article, is freely available on GitHub. You can also keep in touch with me on LinkedIn or by following @drsimonj on Twitter.

To leave a comment for the author, please follow the link and comment on their blog: blogR.
