[R-bloggers] Tips for R to Python and Vice-Versa seamlessly (and 8 more aRticles)

[R-bloggers] Tips for R to Python and Vice-Versa seamlessly (and 8 more aRticles)

Link to R-bloggers

Tips for R to Python and Vice-Versa seamlessly

Posted: 31 Mar 2019 04:56 AM PDT

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

When we TATVA AI visit our clients, often both data scientists and higher management ask us, how we deal with both  Python and R simultaneously for client requests; as there is no universal preference among clients.


Though solution is not straight forward, however, I suggest to exploit common libraries for quick deployments, such as, dfply (python) and dplyr (R). Below is a quick example:

## R Code
library(dplyr)
testdata %>% filter(col.1==10)

## Python Code
from dfply import *
testdata >> filter_by(X.col.1==10)

Those who have not yet experience this can try out now and let me know your experiences at either info@tatvaai.com or mavuluri.pradeep@gmail.com


Happy R and Python Programming!

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Using R: plotting the genome on a line

Posted: 31 Mar 2019 01:08 AM PDT

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out on one axis after one another. If it's actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into one, two and three windows, respectively. Each window has a value that we'll put on the y-axis.

  library(dplyr)  library(ggplot2)    data <- data_frame(contig = c("a", "a", "a", "b", "b", "c"),                     start = c(0, 500, 1000, 0, 500, 0),                     end = c(500, 1000, 1500, 500, 1000, 200),                     value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))  

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don't go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

  contig_lengths <- summarise(group_by(data, contig), length = max(end))  

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contings after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we've layed down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and also the table of lengths we just made.

  flatten_coordinates <- function(contig, coord, contig_lengths) {      coord_flat <- coord      offset <- 0        for (contig_ix in 1:nrow(contig_lengths)) {          on_contig <- contig == contig_lengths$contig[contig_ix]          coord_flat[on_contig] <- coord[on_contig] + offset          offset <- offset + contig_lengths$length[contig_ix]      }        coord_flat  }  

Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.

  data$start_flat <- flatten_coordinates(data$contig,                                         data$start,                                         contig_lengths)  data$end_flat <- flatten_coordinates(data$contig,                                       data$end,                                       contig_lengths)  contig_lengths$length_flat <- flatten_coordinates(contig_lengths$contig,                                                    contig_lengths$length,                                                    contig_lengths)  

It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:

  axis_coord <- c(0, contig_lengths$length_flat[-nrow(contig_lengths)])  

Now it's time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

  plot_genome <- ggplot() +      geom_point(aes(x = (start_flat + end_flat)/2,                     y = value),                 data = data) +      geom_vline(aes(xintercept = length_flat),                 data = contig_lengths) +      scale_x_continuous(breaks = axis_coord,                         labels = contig_lengths$contig,                         limits = c(0, max(contig_lengths$length_flat))) +      xlab("Contig") + ylim(0, 1) + theme_bw()  

And this is what we get:

I'm sure your plot will look more impressive, but you get the idea.

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

See you at useR! Toulouse

Posted: 30 Mar 2019 05:00 PM PDT

(This article was first published on Colin Fay, and kindly contributed to R-bloggers)

Hey all, just a quick post to give you some details about my workshop at useR! 2019, in Toulouse!

Hacking RStudio: Advanced Use of your Favorite IDE

About

Have you ever wanted to become more productive with RStudio? Then this workshop is made for you!

You've been wandering the web for a while now, reading about all the things the cool kids do with RStudio:

  • building addins & playing around with the {rstudioapi} package, like for example what we do in {remedy},
  • using project templates, like the one in {golem},
  • managing connections, like the (just for fun) package I wrote called {fryingpane}

All this seems pretty nifty but so far you have never find the time to sit down and learn how to master these skills. Then, you've come to the right place — my useR workshop will teach you how to push the boundaries of your basic RStudio usage, in order to become more efficient in your day to day usage of RStudio.

Target audience

We expect people to be familiar with RStudio, and have a little knowledge about programming with R. Knowing how to build a package will be better, but is not mandatory.

Instruction

Please come with a recent version of RStudio installed.

Sign up for useR

useR! 2019 will be hold from the 9th to the 12th of July, in Toulouse.

Early birds close soon (May 7th), so hurry 😉

http://www.user2019.fr/registration/

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

mapedit 0.5.0 and Leaflet.pm

Posted: 30 Mar 2019 05:00 PM PDT

(This article was first published on r-spatial, and kindly contributed to R-bloggers)

[view raw
Rmd
]

In our last post mapedit and leaflet.js >
1.0
we
discussed remaining tasks for the
RConsortium funded project
mapedit. mapedit 0.5.0 fixes
a couple of lingering issues, but primarily focuses on bringing the
power of Leaflet.pm as an
alternate editor.
Leaflet.draw,
the original editor in mapedit provided by leaflet.extras, is a
wonderful tool but struggles with snapping and those pesky holes that we
commonly face in geospatial tasks. Depending on the task, a user might
prefer to continue using Leaflet.draw, so we will maintain full
support for both editors. We'll spend the rest of the post demonstrating
where Leaflet.pm excels to help illustrate when you might want to
choose editor = "leafpm".

Install/Update

At a minimum, to follow along with the rest of this post, please update
mapedit and install the new standalone package leafpm. While we are
it, we highly recommend updating your other geospatial dependencies.

install.packages(c("sf", "leaflet", "leafpm", "mapview", "mapedit"))  # lwgeom is optional but nice when working with holes in leaflet.pm  # install.packages("lwgeom")  

Holes

mapedit now supports holes. Let's look at a quick example in which we
add, edit, and delete holes.

library(sf)  library(leaflet)  library(mapview)  library(mapedit)  library(leafpm)  # make a contrived polygon with holes for testing  outer1 = matrix(c(0,0,10,0,10,10,0,10,0,0),ncol=2, byrow=TRUE)  hole1 = matrix(c(1,1,1,2,2,2,2,1,1,1),ncol=2, byrow=TRUE)  hole2 = matrix(c(5,5,5,6,6,6,6,5,5,5),ncol=2, byrow=TRUE)  outer2 = matrix(c(11,0,11,1,12,1,12,0,11,0),ncol=2, byrow=TRUE)  pts1 = list(outer1, hole1, hole2)  pts2 = list(outer2)  pl1 = st_sf(geom = st_sfc(st_polygon(pts1)))  pl2 = st_sf(geom = st_sfc(st_polygon(pts2)))  mpl = st_sf(geom = st_combine(rbind(pl1, pl2)), crs=4326)  tst = editFeatures(mpl, editor = "leafpm")  # look at our creation  mapview(tst)  

screenshot of hole editing

Please note that right mouse click deletes vertexes. For a more real
world application franconia[5,] from mapview has a hole. Try to edit
it with the following code.

library(sf)  library(leaflet)  library(mapview)  library(mapedit)  library(leafpm)  editFeatures(franconia[5,], editor="leafpm")  

Snapping

Leaflet.pm gives us a very pleasant snapping experience, so if you
want to snap, set editor = "leafpm" and snap away. Snapping is
particular important when drawing/digitizing features from scratch. Here
is how it looks with the example from above.

screenshot of snapping

Snapping is enabled by default.

Fixes For Lingering Issues

GeoJSON Precision

Robin Lovelace discovered that at
leaflet zoom level > 17 we lose coordinate precision. Of course,
this is not good enough, so we will prioritize a fix as discussed in
issue. Hopefully,
this leaflet.js pull
request
will make this
fix fairly straightforward.

I am happy to report that we have found a solution for the loss of
precision. Please let us know if you discover any remaining problems.

Mulitlinestring Editing

Leaflet.js and multilinestrings don't get along as Tim
Appelhans
reported in
issue.
For complete support of sf, mapedit should work with
multilinestring, so we have promoted this to issue
62
.

We backed into a solution with MULTILINESTRING since Leaflet.pm's
approach fits better with MULTI* features. As an example, let's edit
one of the trails from mapview.

library(sf)  library(leaflet)  library(mapview)  library(mapedit)  library(leafpm)  editFeatures(trails[4,], editor="leafpm")  

screenshot of MULTILINESTRING editing

Conclusion and Thanks

As of this post we have reached the end of the extremely generous
RConsortium funding of mapedit.
Although the funding is over, we still expect to actively maintain and
improve mapedit. One feature that we had hoped to implement as part of
the mapedit toolset was editing of feature attributes. This turned out
to be very ambitious, and unfortunately we were not able to implement a
satisfactory solution for this feature during the funding period. We
plan, however, to develop a solution. Your participation, ideas, and
feedback are as vital as ever, so please continue to engage. Thanks to
all those who have contributed so far and thanks to all open source
contributors in the R and JavaScript communities.

To leave a comment for the author, please follow the link and comment on their blog: r-spatial.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick}

Posted: 30 Mar 2019 05:00 PM PDT

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

In this blog post I'm going to show you how you can extract text from scanned pdf files, or pdf files
where no text recognition was performed. (For pdfs where text recognition was performed, you can
read my other blog post).

The pdf I'm going to use can be downloaded from here.
It's a poem titled, D'Léierchen (Dem Léiweckerche säi Lidd),
written by Michel Rodange, arguably Luxembourg's most well known writer and poet. Michel Rodange is
mostly known for his fable, Renert oder De Fuuß am Frack an a Ma'nsgrëßt, starring a central European
trickster anthropomorphic red fox.

Anyway, back to the point of this blog post. How can be get data from a pdf where no text recognition
was performed (or, how can we get text from an image)? The pdf we need the text from looks like
this:

To get the text from the pdf, we can use the {tesseract} package, which provides bindings to the tesseract program.
tesseract is an open source OCR engine developed by Google. But before that, let's use the {pdftools}
package to convert the pdf to png. This is because {tesseract} requires images as input (if you
provide a pdf file, it will converted on the fly). Let's first load the needed packages:

library(tidyverse)  library(tesseract)  library(pdftools)  library(magick)

And now let's convert the pdf to png files (in plural, because we'll get one image per page of the pdf):

pngfile <- pdftools::pdf_convert("path/to/pdf", dpi = 600)

This will generate 14 png files. I erase the ones that are not needed, such as the title page. Now,
let's read in all the image files:

path <- dir(path = "path/to/pngs", pattern = "*.png", full.names = TRUE)    images <- map(path, magick::image_read)

The images object is a list of magick-images, which we can parse. BUUUUUT! There's a problem.
The text is laid out in two columns. Which means that the first line after performing OCR will be
the first line of the first column, and the first line of the second column joined together. Same
for the other lines of course. So ideally, I'd need to split the file in the middle, and then
perform OCR. This is easily done with the {magick} package:

first_half <- map(images, ~image_crop(., geometry = "2307x6462"))    second_half <- map(images, ~image_crop(., geometry = "2307x6462+2307+0"))

Because the pngs are 4614 by 6962 pixels, I can get the first half of the png by cropping at
"2307×6462" (I decrease the height a bit to get rid of the page number), and the second half by
applying the same logic, but starting the cropping at the "2307+0" position. The result looks like
this:

Much better! Now I need to join these two lists together. I cannot simply join them. Consider
the following example:

one <- list(1, 3, 5)    two <- list(2, 4, 6)

This is the setup I currently have; first_half contains odd pages, and second_half contains
even pages. The result I want would look like this:

list(1, 2, 3, 4, 5, 6)
## [[1]]  ## [1] 1  ##   ## [[2]]  ## [1] 2  ##   ## [[3]]  ## [1] 3  ##   ## [[4]]  ## [1] 4  ##   ## [[5]]  ## [1] 5  ##   ## [[6]]  ## [1] 6

There is a very elegant solution, with reduce2() from the {purrr} package. reduce() takes one
list and a function, and … reduces the list to a single element. For instance:

reduce(list(1, 2, 3), paste)
## [1] "1 2 3"

reduce2() is very similar, but takes in two lists, but the second list must be one element shorter:

reduce2(list(1, 2, 3), list("a", "b"), paste)
## [1] "1 2 a 3 b"

So we cannot simply use reduce2() on lists one and two, because they're the same length. So let's
prepend a value to one, using the prepend() function of {purrr}:

prepend(one, 0) %>%       reduce2(two, c)
## [1] 0 1 2 3 4 5 6

Exactly what we need! Let's apply this trick to our lists:

merged_list <- prepend(first_half, NA) %>%       reduce2(second_half, c) %>%       discard(is.na)

I've prepended NA to the first list, and then used reduce2() and then used discard(is.na) to
remove the NA I've added at the start. Now, we can use OCR to get the text:

text_list <- map(merged_list, ocr)

ocr() uses a model trained on English by default, and even though there is a model trained on
Luxembourguish, the one trained on English works better! Very likely because the English model was trained
on a lot more data than the Luxembourguish one. I was worried the English model was not going to
recognize characters such as é, but no, it worked quite well.

This how it looks like:

text_list    [[1]]  [1] "Lhe\n| Kaum huet d'Feld dat fréndlecht Feier\nVun der Aussentssonn gesunn\nAs mam Plou aus Stall a Scheier\n* D'lescht e Bauer ausgezunn.\nFir de Plou em nach ze dreiwen\nWar sai Jéngelchen alaert,\nDeen nét wéllt doheem méi bleiwen\n8 An esouz um viischte Paerd.\nOp der Schéllche stoung ze denken\nD'Léierche mam Hierz voll Lidder\nFir de Béifchen nach ze zanken\n12 Duckelt s'an de Som sech nidder.\nBis e laascht war, an du stémmt se\nUn e Liddchen, datt et kraacht\nOp der Nouteleder klémmt se\n16 Datt dem Béifchen d'Haerz alt laacht.\nAn du sot en: Papp, ech mengen\nBal de Vull dee kénnt och schwatzen.\nLauschter, sot de Papp zum Klengen,\n20 Ech kann d'Liddchen iwersetzen.\nI\nBas de do, mii léiwe Fréndchen\nMa de Wanter dee war laang!\nKuck, ech hat keng fréilech Sténnchen\n24 *T war fir dech a mech mer baang.\nAn du koum ech dech besichen\nWell du goungs nét méi eraus\nMann wat hues jo du eng Kichen\n28 Wat eng Scheier wat en Haus.\nWi zerguttster, a wat Saachen!\nAn déng Frache gouf mer Brout.\nAn déng Kanner, wi se laachen,\n32, An hir Backelcher, wi rout!\nJo, bei dir as Rot nét deier!\nJo a kuck mer wat eng Méscht.\nDat gét Saache fir an d'Scheier\n36 An och Sué fir an d'Késcht.\nMuerges waars de schuns um Dreschen\nIr der Daudes d'Schung sech stréckt\nBas am Do duurch Wis a Paschen\n40 Laascht all Waassergruef geschréckt.\n"  ....  ....

We still need to split at the "" character:

text_list <- text_list %>%       map(., ~str_split(., "\n"))

The end result:

text_list    [[1]]  [[1]][[1]]   [1] "Lhe"                                      "| Kaum huet d'Feld dat fréndlecht Feier"    [3] "Vun der Aussentssonn gesunn"              "As mam Plou aus Stall a Scheier"            [5] "* D'lescht e Bauer ausgezunn."            "Fir de Plou em nach ze dreiwen"             [7] "War sai Jéngelchen alaert,"               "Deen nét wéllt doheem méi bleiwen"          [9] "8 An esouz um viischte Paerd."            "Op der Schéllche stoung ze denken"         [11] "D'Léierche mam Hierz voll Lidder"         "Fir de Béifchen nach ze zanken"            [13] "12 Duckelt s'an de Som sech nidder."      "Bis e laascht war, an du stémmt se"        [15] "Un e Liddchen, datt et kraacht"           "Op der Nouteleder klémmt se"               [17] "16 Datt dem Béifchen d'Haerz alt laacht." "An du sot en: Papp, ech mengen"            [19] "Bal de Vull dee kénnt och schwatzen."     "Lauschter, sot de Papp zum Klengen,"       [21] "20 Ech kann d'Liddchen iwersetzen."       "I"                                         [23] "Bas de do, mii léiwe Fréndchen"           "Ma de Wanter dee war laang!"               [25] "Kuck, ech hat keng fréilech Sténnchen"    "24 *T war fir dech a mech mer baang."      [27] "An du koum ech dech besichen"             "Well du goungs nét méi eraus"              [29] "Mann wat hues jo du eng Kichen"           "28 Wat eng Scheier wat en Haus."           [31] "Wi zerguttster, a wat Saachen!"           "An déng Frache gouf mer Brout."            [33] "An déng Kanner, wi se laachen,"           "32, An hir Backelcher, wi rout!"           [35] "Jo, bei dir as Rot nét deier!"            "Jo a kuck mer wat eng Méscht."             [37] "Dat gét Saache fir an d'Scheier"          "36 An och Sué fir an d'Késcht."            [39] "Muerges waars de schuns um Dreschen"      "Ir der Daudes d'Schung sech stréckt"       [41] "Bas am Do duurch Wis a Paschen"           "40 Laascht all Waassergruef geschréckt."   [43] ""    ...  ...

Perfect! Some more cleaning would be needed though. For example, I need to remove the little
annotations that are included:

I don't know yet how I'm going to do that.I also need to remove the line numbers at the beginning
of every fourth line, but this is easily done with a simple regular expression:

str_remove_all(c("12 bla", "blb", "123 blc"), "^\\d{1,}\\s+")
## [1] "bla" "blb" "blc"

But this will be left for a future blog post!

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.

Buy me an EspressoBuy me an Espresso

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Website with Australian federal election forecasts by @ellis2013nz

Posted: 30 Mar 2019 06:00 AM PDT

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

The election forecasts

Building on my recent blog posts, I've put up a page dedicated to forecasts of the coming Australian federal election. It takes the state space model of two-party-preferred vote from my first blog on polls leading up to this election, and combines it with a more nuanced understanding of the seats actually up for grabs in this election than I'd been able to develop for my second blog, on swings. For the current margins at divisional level after the various electoral distributions, I'm drawing on this excellent summary by the ABC.

To cut to the chase, here's my prediction – 76 percent chance of ALP being able to form government by themselves:

Because the forecast is built on division-level simulations of what will happen when local randomness adds to an uncertain prediction of two-party-preferred swing, I also get probabilistic forecasts for each individual seat. This lets me produce charts like this one:

…which shows what is likely to happen seat by seat when we get the actual election.

The ozfedelect R package

The ozfedelect R package continues to grow. Just today, I've added to it:

  • a vector of colours for the Australian political parties involved in my forecasting
  • a useful data frame oz_pendulum_2019 of the margins of the various House of Representatives seats going in to the 2019 election.
  • update of polling data.

All this is in addition to the historical polling and division-level election results it already contains.

Code for these forecasts

The code for conducting the forecasts or just installing ozfedelect for other purposes is available in GitHub. The ozfedelect GitHub repository not only will build the ozfedelect R package from scratch (ie downloading polling data from Wikipedia and election results from the Australian Electoral Commission) it also has the R and Stan code for fitting the model of two-party-preferred vote and turning it into division-level simulated results.

It should be regarded as a work in progress. Comments and suggestions are warmly welcomed!

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Wrapping up the stars project

Posted: 29 Mar 2019 05:00 PM PDT

(This article was first published on r-spatial, and kindly contributed to R-bloggers)

[view raw
Rmd
]

Summary

This is the fourth blog on the
stars project, an it completes the
R-Consortium funded project for spatiotemporal tidy arrays with R. It
reports on the current status of the project, and current development
directions. Although this project ends, with the release of stars 0.3 on
CRAN, the
adoption, update, enthusiasm and participation in the development of the
stars project have really only started, and will without doubt increase
and continue.

Status

The stars package has now five
vignettes (called "Articles" on
the pkgdown site) that explain its main features. Besides writing these
vignettes, a lot of work over the past few months went into

  • writing support for stars_proxy objects, objects for which the
    metadata has been read but for which the payload is still on disk.
    This allows handling raster files or data cubes that do not fit into
    memory. Manipulating them uses lazy evaluation: only when pixel
    values are really needed they are read and processed: this is for
    instance when a plot is needed, or results are to be written with
    write_stars. In case of plotting, no more pixels are processed
    than can be seen on the device.
  • making rectilinear and curvilinear grids work, by better parsing
    NetCDF files directly (rather than through GDAL), reading their
    bounds, and by writing conversions to sf objects so that they can
    be plotted;
  • writing a tighter integration with GDAL, e.g. for warping grids,
    contouring grids, and rasterizing polygons;
  • supporting 360-day and 365-day (noleap) calendars, which are used
    often in climate model data;
  • providing an off-cran starsdata package, with around 1 Gb of real
    imagery, too large for submitting to CRAN or GitHub, but used for
    testing and demonstration;
  • resolving issues (we're at 154 now) and managing pull requests;
  • adding stars support to
    gstat,
    a package for spatial and spatiotemporal geostatistical modelling,
    interpolation and simulation.

I have used stars and sf successfully last week in a two-day course
at Munich Re on Spatial Data Science with
R
(online material), focusing on
data handling and geostatistics. Both packages worked out beautifully
(with a minor amount of rough edges), in particular in conjunction with
each other and with the tidyverse.

Further resources on the status of the project are found in

  • the
    video
    of my rstudio::conf presentation on "Spatial data science in the
    Tidyverse"
  • chapter 4 of
    the Spatial Data Science book (under development)

Future

Near future development will entail experiments with very large
datasets, such as the entire Sentinel-2
archive
. We secured earlier
some
funding
from the R Consortium for doing this, and first outcomes will be
presented shortly in a follow-up blog. A large challenge here is the
handling of multi-resolution imagery, imagery spread over different
coordinate reference systems (e.g., crossing multiple UTM zones) and the
temporal resampling needed to form space-time raster cubes. This is
being handled gracefully by the
gdalcubes C++ library and R
package developed by Marius Appel. The gdalcubes package has been
submitted to CRAN.

Earlier stars blogs

To leave a comment for the author, please follow the link and comment on their blog: r-spatial.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Quick Control Charts for AFL

Posted: 29 Mar 2019 05:00 PM PDT

(This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

Who doesn't like a wikipedia entry control chart If analysis of the control chart indicates that the process is currently under control (i.e., is stable, with variation only coming from sources common to the process), then no corrections or changes to process control parameters are needed or desired I mean gee whiz this sure could relate to something like I don't know AFL total game scores?

There seems to always be talk about the scores in AFLM see AFL website, foxsports just to name a couple. Of course you could find more if you searched out for it as well.

Let's use fitzRoy and the good people over at statsinsder who have kindly provided me with the expected score data you can get from the herald sun.

First thing, lets use fitzRoy

library(fitzRoy)  library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.3.2    ## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1  ## ✔ tidyr   0.8.3       ✔ stringr 1.4.0    ## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──  ## ✖ dplyr::filter() masks stats::filter()  ## ✖ dplyr::lag()    masks stats::lag()
library(ggQC)  fitzRoy::match_results%>%    mutate(total=Home.Points+Away.Points)%>%    group_by(Season,Round)%>%    summarise(meantotal=mean(total))%>%  filter(Season>1989 &  Round=="R1")%>%  ggplot(aes(x=Season,y=meantotal))+geom_point()+geom_line()+stat_QC(method="XmR")+ylab("Mean Round 1 Total for Each Game") +ggtitle("Stop Freaking OUT over ONE ROUND")

So if we were to look at the control chart just for round 1 in each AFLM season since the 90s it would seem as though that even though this round was lower scoring that there isn't much too see here.

After all we can and should expect natural variation in scores, wouldn't footy be boring if scores were the same every week.

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Better predictions for AFL from adjusted Elo ratings by @ellis2013nz

Posted: 29 Mar 2019 06:00 AM PDT

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Warning – this post discusses gambling odds and even describes me placing small $5 bets, which I can easily afford to lose. In no way should this be interpreted as advice to anyone else to do the same, and I accept absolutely no liability for anyone who treats this blog post as a basis for gambling advice. If you find yourself losing money at gambling or suspect you or someone close to you has a gambling problem, please seek help from https://www.gamblinghelponline.org.au/ or other services.

Appreciating home team advantage

Last week I blogged about using Elo ratings to predict the winners in the Australian Football League (AFL) to help me blend in with the locals in my Melbourne workplace footy-tipping competition. Several people pointed out that my approach ignored the home team advantage, which is known to be a significant factor in the AFL. How significant? Well, here's the proportion of games won by the home team in each season since the AFL's beginning:

Overall, 59% of games in the AFL history have been won by the home team, although that proportion varies materially from season to season. Another way of putting this – if my AFL prediction system was as simple as "always pick the home team" I would expect (overall) to score a respectable 59% of successful picks.

My first inclination in response to this was just to add 0.09 to the chance of the home team winning from my model's base probability, but first I thought I should check whether this adjustment varies by team. Well, it does, somewhat dramatically. This next chart, based on just modern era games (1997 and onwards), defines the home advantage as the proportion of home games won minus the proportion of away games won, divided by 2. The all-teams all-seasons average for this figure would be 0.09.

The teams that are conspicuously high on this measure are all non-Melbourne teams:

  • Geelong
  • Adelaide
  • West Coast
  • Fremantle
  • Greater Western Sydney

Gold Coast (another non-Melbourne team, for any non-Australians reading) would show up as material if a success ratio measure were used instead of my simple (and crude) additive indicator; because their overall win rate is so low. Sydney and perhaps Brisbane stand out as good performers overall that don't have as marked a home town advantage (or away-town disadvantage) as their non-Melbourne peers.

I presume this issue is well known to AFL afficianados. With the majority (or at least plurality – I haven't counted) of games played in Melbourne, Melbourne-based clubs generally play many of their "away" matches still relatively close to players' homes. Whereas (for example) the West Coast Eagles flying across the Nullarbor for a match is bound to cost their players, compared to the alternative situation of being at home and making the opponents fly west instead.

Geelong surprised me – Geelong is much closer to Melbourne than the inter-state teams, so no long flights are involved. But simply travelling even an hour up the M1, added to the enthusiastic partisanship of Geelong-Melbourne fan rivalry, perhaps explains the strong advantage.

Here's the R code that performs these steps:

  • load in functionality for the analysis
  • download the AFL results from 1897 onwards from afltables using the fitzRoy package and store in the object r (for "results")
  • draw the chart of home win rates per season
  • estimate and plot teams' recent-decades home advantage
  • create a new r2 object with home and away advantage and disadvantage adjustments for probabilities

Post continues after code extract

#================Setup and get data============    library(tidyverse)  library(scales)  library(fitzRoy)   # for reading AFL scores  library(frs)       # for Elo functions  library(Cairo)  library(ggrepel)  library(lubridate)  library(foreach)  library(doParallel)  library(knitr)  library(clipr)    the_caption <- "Source: afltables via fitZroy; analysis by freerangestats.info"  update_geom_defaults("line", list(colour = "steelblue"))    results <- get_match_results()    # reshape and organise the data:  r <- results %>%    rename_all(tolower) %>%    rename_all(function(x){gsub("\\.", "_", x)}) %>%    select(home_team, away_team, date, game, margin) %>%    gather(location, team, -date, -game, -margin) %>%    mutate(location = gsub("_team", "", location),           margin = ifelse(location == "home", margin, -margin)) %>%    mutate(winner = margin > 0) %>%    arrange(date) %>%    mutate(starting_elo = 1500,           new_elo = 1500) %>%    group_by(game) %>%    # sequence so the winner is the first listed each time for each game:    arrange(game, desc(winner)) %>%    ungroup() %>%    mutate(season = year(date)) %>%    group_by(season, team) %>%    mutate(round = 1:n()) %>%    ungroup()    #==================analysis of the home team advantage===================    # Proportion of home winners over full range of history:  r %>%    filter(location == "home") %>%    group_by(season) %>%    summarise(prop_home = mean(winner)) %>%    ggplot(aes(x = season, y = prop_home)) +    geom_line() +    scale_y_continuous("Percentage of games won by the home team", label = percent_format(accuracy = 1)) +      labs(caption = the_caption,         x = "") +    ggtitle("Home teams win around 59 percent of the time in the AFL")    # Home team advantage from 1997 onwards - the modern era:  d <- r %>%    group_by(team) %>%    summarise(start = min(date),              end = max(date),              home_wins = mean(winner[location == "home" & date > "1997-01-01"]),              away_wins = mean(winner[location == "away" & date > "1997-01-01"]),              home_adv = (home_wins - away_wins) / 2) %>%    arrange(desc(home_adv)) %>%    filter(end > "2000-01-01")    d %>%    ggplot(aes(x = away_wins, y = home_wins, label = team, colour = home_adv)) +    geom_abline(intercept = 0, slope = 1) +    geom_segment(aes(xend = away_wins, yend = away_wins)) +    geom_text(hjust = 0) +    scale_x_continuous(label = percent_format(accuracy = 1)) +    scale_y_continuous(label = percent_format(accuracy = 1)) +    scale_color_viridis_c("Home\nadvantage",      option = "D", direction = -1, label = percent_format(accuracy = 1),      breaks = c(6,8,10,12,14) / 100) +    ggtitle("Wins at home compared to away in the AFL, 1997 to present",            "All teams win more at home than away.") +    labs(x = "Proportion of away games won",         y = "Proportion of home games won",         caption = the_caption) +    theme(legend.position = "right") +    coord_flip() +    expand_limits(y = c(0.2, 0.85),                  x= c(0.15, 0.55))    adjustments <- d %>%    mutate(home =  home_adv / 2,           away = -home_adv / 2) %>%    select(team, home, away) %>%    gather(location, location_adjustment, -team)    r2 <- r %>%    left_join(adjustments, by = c("team", "location")) %>%    mutate(location_adjustment = replace_na(location_adjustment, 0),           predicted_winner = NA) 

Choosing a better set of parameters

Seeing as I had to revisit my predictive method to adjust for home and away advantage/disadvantage, I decided to also take a more systematic approach to some parameters that I had set either arbitrarily or without noticing they were parameters in last week's post. These fit into two areas:

  • The FIBS-based Elo rating method I use requires a "match length" parameter. Longer matches mean the better player is more likely to win, and also lead to larger adjustments in Elo ratings once the result is known. I had used the winning margin as a proxy/equivalent for match length, reasoning that a team that won by a large amount had shown their superiority in a similar way to a backgammon player winning a longer match. But should a 30 point margin count as match length of 30 (ie equivalent to points) or 5 (equivalent to goals) or something in-between? And what margin or "match length" should I use for predicting future games, where even a margin of 1 is enough to win? And even more philosophically, is margin really related to the backgammon concept of match length in a linear way, or should there be discounting of larger margins? Today, I created three parameters to scale the margin, to choose a margin for predicting future matches, and put the scaled margin to the power of a number from zero to one for non-linearity.
  • Elo ratings are by the nature sticky and based on a player/team's whole career. I had noted that I got better performance in predicting the 2018 season by restarting all teams' ratings at 1500 at the beginning of the 2017 season. But this is fairly crude. The FIBS Elo rating method has a parameter for teams' past experience which in effect lets you control how responsive the rating is to new information (the idea being to help new players in a competition quickly move from 1500 up or down to their natural level). I have now added to this with a "new round factor" which shrinks ratings for the first game of the season towards 1500, effectively discounting past experience

Here's the code that describes and defines those parameters a bit better, in an enhanced version of my afl_elos() function I introduced last week.

Post continues after code extract

#---------------Function to determine ratings for one particular set of parameters-----------------    #' Calculate Elo ratings for a historical set of AFL results  #'   #' @param r a data frame with columns including date, team, game, starting_elo, new_elo  #' @param sc scaling factor for dividing the winning margin by to convert it to "match length"   #' for Elo calculation purposes (FIBS method requires a match length)  #' @param pred_margin the prediction margin in points to use as match length for prediction purposes.  #' Note that this gets divided by sc so it is on the same scale as the historical margins.  #' @param margin_power parameter that scaled prediction margin is put to the power of in the match length  #' calculation. 0 will mean all match lengths are 1 (so sc and pred_margin make no difference)  #' @param experience how much experience (sum of previous match length) teams are presumed to have, for the  #' FIBS-style Elo rating. Numbers below 400 make the rating adjustments more responsive to recent results.  #' @param new_round_factor how much to shrink Elo ratings towards 1500 in the first match of each season.  #' 1 means no shrinkage, 0 means every team starts the season fresh with a rating of 1500 regardless of   #' past performance.  afl_elos <- function(r, sc = 1, pred_margin = 30, margin_power = 1, experience = 100, new_round_factor = 1){    r <- r %>%      group_by(game) %>%      arrange(date, game, desc(winner))        all_games <- unique(r$game)        # This seems inherently iterative so perhaps a loop is the logical way to do it:    for(g in all_games){            this_game <- r[r$game == g, ] %>%        mutate(starting_elo = ifelse(round == 1,                                      (starting_elo - 1500) * new_round_factor + 1500,                                     starting_elo))            # er: calculate elo rating arising from this game to use for changes in ratings      er <- elo_rating(a = this_game[1, "starting_elo"],                        b = this_game[2, "starting_elo"],                       winner = "a",                       ml = (round(this_game[1, "margin"] / sc) ^ margin_power),                       axp = experience,                       bxp = experience,                       a_adv = this_game[1, "location_adjustment"],                       b_adv = this_game[2, "location_adjustment"])            # er: calculate elo rating arising from this game to use for the prediction, for use in      # measureing prediction success      er2 <- elo_rating(a = this_game[1, "starting_elo"],                        b = this_game[2, "starting_elo"],                       winner = "a",                       ml = (round(pred_margin / sc) ^ margin_power),                       axp = experience,                       bxp = experience,                       a_adv = this_game[1, "location_adjustment"],                       b_adv = this_game[2, "location_adjustment"])                  r[r$game == g, "new_elo"]  <- unlist(c(er$a, er$b))      r[r$game == g, "predicted_winner"] <- c(er2$winproba >= 0.5, er2$winproba < 0.5)      r <- r %>%        group_by(team) %>%        mutate(starting_elo = lag(new_elo, default = 1500)) %>%        ungroup()    }    r <- r %>%      mutate(season = lubridate::year(date)) %>%      group_by(game) %>%      mutate(successful_prediction = predicted_winner == winner) %>%      ungroup()        return(r)  }

This function is now pretty slow when run on the whole AFL history form 1897. Unlike last week, it calls the underlying frs::elo_rating() function for each game twice – once (the object er above) to determine the ratings after the match outcome is known, and once to determine the prediction of the match's result, for benchmarking purposes (the object er2). Last week I didn't need to use elo_rating() twice, because the prediction of the winner was as simple as choosing the team with the highest Elo rating going in to the match. Now, we have to calculate the actual probability of winning, adjusted for home and away advantage and disadvantage. This calculation depends on the parameter choices that impact on converting margin to "match length" and what winning margin we base our predictions on, so the calculation is an additional one to the change in rating that came about from the actual result of the game.

There are doubtless efficiencies that could be made, but I'm not enthused to spend too much time refactoring at this point…

I have no theories and hardly any hunches on what parameter combinations will give the best performance, so the only way to choose a set is to try many and pick the one that would have worked best in predicting AFL games to date. I defined about 2,500 combinations of parameters, removed some that were effective duplicates (because if margin_power is 0, then the value of sc and pred_margin are immaterial) and for the purposes of this blog ran just 100 random prediction competitions, based on the games from 1950 onwards. With each individual run taking five minutes or more, I used parallel processing to do 7 runs simultaneously and get the total time for those 100 runs down to about 90 minutes, all I was prepared to do today for the purposes of this blog. I might run it for a longer period of time overnight later.

The top twenty parameter sets from this competition are in the table below. The best combination of factors led to an overall prediction success of about 69%, which is better than last week's 65%, the crude "always pick the home team" success of 59% and a coin flip of 50%; but not as much better as I hoped. Clearly picking these winners is hard – AFL is more like backgammon or poker in terms of predicting outcomes than it is like chess.

sc pred_margin margin_power experience new_round_factor success_rate
1 30 0.8333333 100 0.4 0.6853485
1 20 0.8333333 200 0.8 0.6839260
1 20 1.0000000 100 0.6 0.6828829
1 10 0.8333333 0 0.4 0.6822191
1 20 0.6666667 100 0.8 0.6812707
3 20 0.8333333 0 0.8 0.6806069
1 20 1.0000000 400 0.8 0.6785206
3 10 1.0000000 0 0.8 0.6777620
1 20 0.5000000 0 0.8 0.6776671
1 10 1.0000000 300 0.8 0.6774775
3 30 0.8333333 200 0.6 0.6762447
1 10 1.0000000 300 0.6 0.6747274
1 20 0.8333333 400 0.6 0.6676150
3 10 0.6666667 100 0.8 0.6668563
6 20 1.0000000 100 1.0 0.6661925
1 30 0.5000000 100 1.0 0.6634424
1 10 0.5000000 0 0.4 0.6629682
3 20 0.6666667 100 1.0 0.6617354
6 20 0.8333333 100 0.6 0.6592698
6 30 0.6666667 200 0.8 0.6570887

The best models had a modest shrinkage of ratings towards 1500 (new_round_factor of 0.4 to 0.8, compared to 0 which would mean everyone starting at 1500 in each round 1); and modest if any non-linearity in the conversion of winning margin to a notional "match length". They had relatively low levels of "experience", effectively increasing the importance of recent results and downplaying long term momentum; while treating match results in points (sc = 1) and predicting based on a relatively large margin.

I only had time to try a random sample of parameter combinations, and would be very lucky indeed if I have ended up with the best set. How confident can I be that I've got something close enough? Here's the distribution of success rates for that post 1950 series:

Without over-thinking it, it's reasonable to infer a few more extreme values on the right are possible if we looked at the full set of parameters; but that they wouldn't be that much more successful. It's certainly good enough for a workplace footy tipping competition.

Here's the predictive success of the best model over time, now applied to the full range of data not just the post 1950 period for which it was optimised:

… and the code that did the above "parameter competition", using the foreach and doParallel R packages for parallel processing to bring the elapsed time down to reasonable levels:

Post continues after code extract

#---------------Define all possible combinations of parameters---------------------    set.seed(123)  params <- expand.grid(    sc = c(1, 3, 6),    pred_margin = c(1, 10, 20, 30),    margin_power = 0:6 / 6,    experience = 0:4 * 100,    new_round_factor = 0:5 / 5  ) %>%    # if margin_power is 0 then sc and pred_margin make no difference, so we can drop a few    # paramter combinations:    mutate(sc = ifelse(margin_power == 0, 1, sc),           pred_margin = ifelse(margin_power == 0, 1, margin_power)) %>%    distinct() %>%    # sort in random order, as we're not going to have time to run all combinations    mutate(rnd = runif(n())) %>%    arrange(rnd) %>%    mutate(row_num = 1:n()) %>%    select(-rnd)      #-------------------------Run the model with a subset of the combinations of parameters------------------  cluster <- makeCluster(7) # only any good if you have at least 7 processors :)  registerDoParallel(cluster)    clusterEvalQ(cluster, {    library(foreach)    library(tidyverse)    library(frs)  })    clusterExport(cluster, c("r2", "afl_elos", "params"))    # 519 seconds for 10 sets of parameters. So can do about 1.1 per minute (with 7 processors); 180 will take 3 hours.  system.time({    suppressWarnings(rm(res))    res60 <- foreach(i = 1:100, .combine = rbind) %dopar% {      x <- params[i, ]            success_rate <- r2 %>%              filter(season >= 1950) %>%  			afl_elos(sc = x$sc,                 pred_margin = x$pred_margin,                 margin_power = x$margin_power,                 experience = x$experience,                 new_round_factor = x$new_round_factor) %>%        filter(location == "home") %>%        summarise(sp = mean(successful_prediction)) %>%        pull(sp)            return(data.frame = data.frame(row_num = i, success_rate = success_rate))    }  })    stopCluster(cluster)    #------------------Examine results------------------------------    params_with_res <- params %>%    left_join(res, by = "row_num") %>%    filter(!is.na(success_rate)) %>%    arrange(desc(success_rate)) %>%    select(-row_num)     params_with_res%>%    slice(1:20) %>%    kable() %>%    write_clip()    ggplot(params_with_res, aes(x = success_rate)) +    geom_density(fill = "steelblue", alpha = 0.5) +    geom_rug() +    ggtitle("Distribution of success rates in parameter contest")    best_params <- params_with_res[1, ]    elos_best <- r2 %>%    afl_elos(sc               = best_params$sc,             pred_margin      = best_params$pred_margin,             margin_power     = best_params$margin_power,             experience       = best_params$experience,             new_round_factor = best_params$new_round_factor)    elos_best %>%    filter(location == "away" & date < "2019-01-01") %>%    group_by(season) %>%    summarise(successful_prediction = mean(successful_prediction)) %>%    ggplot(aes(x = season, y = successful_prediction)) +    geom_line() +    scale_y_continuous("Percentage of successful predictions", label = percent_format(accuracy = 1)) +    ggtitle("Predictions of past AFL games",            "Using the best combination found of parameters for Elo ratings and shrinkage during the off-season.  Prediction success of 70% is difficult to achieve in the modern era.") +    labs(caption = the_caption, x = "")

This weeks' predictions are…

To turn this model into my tips for this week, I need to extract the final Elo ratings from the best model, join them with the actual fixture and then use the model to predict actual probabilities of winning. Here's what I get:

home away home_elo away_elo home_adjustment away_adjustment final_prob winner fair_returns_home fair_returns_away
Richmond Collingwood 1623.362 1552.782 0.0392339 -0.0305072 0.6527701 Richmond 1.531933 2.879936
Sydney Adelaide 1506.489 1476.328 0.0467485 -0.0632875 0.6457876 Sydney 1.548497 2.823164
Essendon St Kilda 1507.538 1404.657 0.0505882 -0.0428714 0.7132448 Essendon 1.402043 3.487295
Port Adelaide Carlton 1544.746 1340.797 0.0571146 -0.0257644 0.8077331 Port Adelaide 1.238033 5.201104
Geelong Melbourne 1552.818 1521.317 0.0747884 -0.0402286 0.6523510 Geelong 1.532917 2.876464
West Coast GWS 1543.355 1565.494 0.0699069 -0.0648148 0.6084577 West Coast 1.643499 2.554003
North Melbourne Brisbane Lions 1461.086 1483.270 0.0406327 -0.0558655 0.5701814 North Melbourne 1.753828 2.326563
Hawthorn Footscray 1567.933 1482.579 0.0490421 -0.0383632 0.6873887 Hawthorn 1.454781 3.198861
Gold Coast Fremantle 1382.276 1483.175 0.0402515 -0.0729052 0.4955912 Fremantle 2.017792 1.982519

That final_prob column is the estimated probability of the home team winning.

As you can see, I translate my probabilities into a "fair return", which I'm using to scan for opportunities with poorly chosen odds from the bookies. These opportunities don't arrive very often as the bookies are professionals, but when they are paying 50% more than the model predicts to be "fair" I'm going to punt $5 and we'll see how we go at the end of the season. So far I'm $26 up from this strategy but it's early days and I'm far from assured the luck will continue.

Judging from the tips and odds by the public, the only controversial picks in the above are for North Melbourne to beat Brisbane and Gold Coast to be nearly a coin flip in contest with Fremantle. In both cases my algorithm is tipping a home advantage to equalise the comparative relative strength of the away team. For the North Melbourne match, the bookies agree with me, whereas the tippers on tipping.afl.com.au are going for a Brisbane win, so I think we can say that reasonable people disagree about the outcome there and it is uncertain. For the other match, I have grave doubts about Gold Coast's chances against Fremantle (who had a stellar victory last weekend), but am inclined to think the $3.50 return bookies are offering to pay for a Gold Coast win is over-generous and underestimating how much Fremantle struggle when playing away from home. So that's my recommended match to watch for a potential surprise outcome.

At the time of writing, the first two of these predictions in my table above have already gone astray (for me, the average punters and the average tippers) in the Thursday and Friday night matchs, as 2019 continues its run of surprise results. Collingwood and Adelaide both pulled off against-the-odds wins against teams that were both stronger on paper and playing at home. I won't say my predictions were "wrong", because when you say something has a 0.6 chance of happening and it doesn't, there's a good chance you were just unlucky, not wrong.

But as they say, prediction is hard, particularly about the future.

Final chunk of R code for today – converting the model into predictions for this round:

#==================Predictions for this round!=====================  elos_latest <- elos_best %>%    group_by(team) %>%    arrange(desc(date)) %>%    slice(1) %>%    ungroup() %>%    arrange(desc(new_elo)) %>%    select(team, elo = new_elo)    fixture <- tibble(    home = c("Richmond", "Sydney", "Essendon", "Port Adelaide", "Geelong", "West Coast",             "North Melbourne", "Hawthorn", "Gold Coast"),    away = c("Collingwood", "Adelaide", "St Kilda", "Carlton", "Melbourne", "GWS", "Brisbane Lions",              "Footscray", "Fremantle")  )    fixture %>%    left_join(elos_latest, by = c("home" = "team")) %>%    rename(home_elo = elo) %>%    left_join(elos_latest, by = c("away" = "team")) %>%    rename(away_elo = elo)  %>%    left_join(filter(adjustments, location == "home"), by = c("home" = "team")) %>%    rename(home_adjustment = location_adjustment) %>%    select(-location)  %>%    left_join(filter(adjustments, location == "away"), by = c("away" = "team")) %>%    rename(away_adjustment = location_adjustment) %>%    select(-location) %>%    mutate(final_prob = elo_prob(home_elo, away_elo,                                  ml = (best_params$pred_margin / best_params$sc) ^ best_params$margin_power,                                  a_adv = home_adjustment,                                  b_adv = away_adjustment),           winner = ifelse(final_prob > 0.5, home, away),           fair_returns_home = 1/final_prob,           fair_returns_away = 1/ (1- final_prob)) %>%    kable() %>%    write_clip()

That's all.

Here's the R packages used in producing this post:

thankr::shoulders() %>%     mutate(maintainer = str_squish(gsub("<.+>", "", maintainer))) %>%    group_by(maintainer) %>%    summarise(`Number packages` = sum(no_packages),              packages = paste(packages, collapse = ", ")) %>%    knitr::kable() %>%     clipr::write_clip()
maintainer Number packages packages
Hadley Wickham 15 assertthat, dplyr, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse
R Core Team 12 base, compiler, datasets, graphics, grDevices, grid, methods, parallel, stats, tools, utils, nlme
Yihui Xie 5 evaluate, highr, knitr, rmarkdown, xfun
Kirill Müller 4 DBI, hms, pillar, tibble
Winston Chang 4 extrafont, extrafontdb, R6, Rttf2pt1
Gábor Csárdi 3 cli, crayon, pkgconfig
Jim Hester 3 glue, withr, readr
Lionel Henry 3 purrr, rlang, tidyselect
Rich Calaway 3 doParallel, foreach, iterators
Yixuan Qiu 3 showtext, showtextdb, sysfonts
Dirk Eddelbuettel 2 digest, Rcpp
Jennifer Bryan 2 cellranger, readxl
Jeroen Ooms 2 curl, jsonlite
Simon Urbanek 2 audio, Cairo
Achim Zeileis 1 colorspace
Alex Hayes 1 broom
Brodie Gaslam 1 fansi
Charlotte Wickham 1 munsell
Deepayan Sarkar 1 lattice
James Day 1 fitzRoy
James Hester 1 xml2
Jeremy Stephens 1 yaml
Joe Cheng 1 htmltools
Justin Talbot 1 labeling
Kamil Slowikowski 1 ggrepel
Kevin Ushey 1 rstudioapi
Luke Tierney 1 codetools
Marek Gagolewski 1 stringi
Matthew Lincoln 1 clipr
Max Kuhn 1 generics
Michel Lang 1 backports
Patrick O. Perry 1 utf8
Peter Ellis 1 frs
Rasmus Bååth 1 beepr
Simon Garnier 1 viridisLite
Stefan Milton Bache 1 magrittr
Vitalie Spinu 1 lubridate

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

This posting includes an audio/video/photo media file: Download Now

Comments