When we at TATVA AI visit our clients, both data scientists and senior management often ask us how we deal with both Python and R simultaneously for client requests, as there is no universal preference among clients.
The solution is not straightforward, but I suggest exploiting libraries with a common grammar for quick deployments, such as dfply (Python) and dplyr (R). Below is a quick example:
## R Code
library(dplyr)
testdata %>% filter(col.1 == 10)
Imagine you want to make a Manhattan-style plot, or anything else where you want a series of intervals laid out along one axis after one another. If it's actually a Manhattan plot you may have a friendly R package that does it for you, but here is how we can cobble the plot together ourselves with ggplot2.
We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into one, two and three windows, respectively. Each window has a value that we'll put on the y-axis.
We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don't go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.
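To make these steps concrete, here is a minimal sketch of the setup; the object and column names (dat, contig_lengths, start, end, value) are illustrative, not necessarily those of the original code.

library(dplyr)
library(tibble)

# three contigs with one, two and three windows respectively, each window with a value
dat <- tibble(
  contig = c("contig1", "contig2", "contig2", "contig3", "contig3", "contig3"),
  start  = c(1, 1, 101, 1, 101, 201),
  end    = c(100, 100, 200, 100, 200, 300),
  value  = rnorm(6)
)

# contig lengths derived from the windows themselves (assuming the windows cover
# each contig fully); the row order here decides the order along the new axis
contig_lengths <- dat %>%
  group_by(contig) %>%
  summarise(length = max(end)) %>%
  ungroup()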
Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contigs after one another. What we need to do is add an offset to each point, where the offset is the sum of the lengths of the contigs we've laid down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and the table of lengths we just made.
Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.
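Continuing with the same assumed names, a sketch of the flattening function and the transformed coordinates:

# offset of each contig = total length of the contigs laid down before it
flatten_coords <- function(contig, coord, contig_lengths) {
  offsets <- c(0, cumsum(contig_lengths$length))[seq_len(nrow(contig_lengths))]
  names(offsets) <- contig_lengths$contig
  unname(offsets[contig]) + coord
}

dat <- dat %>%
  mutate(flat_start = flatten_coords(contig, start, contig_lengths),
         flat_end   = flatten_coords(contig, end, contig_lengths))

# the flattened contig ends, used for the vertical lines between contigs
contig_breaks <- flatten_coords(contig_lengths$contig, contig_lengths$length, contig_lengths)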
It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:
Now it's time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.
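Putting it together, a sketch of the custom axis and the plot itself (the exact window dressing is a matter of taste):

library(ggplot2)

# x-axis label positions: the vertical-line coordinates with a zero prepended
# and shifted one position, so each label sits at the start of its contig
axis_coord <- c(0, head(contig_breaks, -1))

ggplot(dat) +
  geom_point(aes(x = (flat_start + flat_end) / 2, y = value)) +
  geom_vline(xintercept = contig_breaks) +
  scale_x_continuous(breaks = axis_coord,
                     labels = contig_lengths$contig,
                     name = "contig") +
  ylab("value") +
  theme_bw()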
managing connections, like the (just for fun) package I wrote called {fryingpane}…
All this seems pretty nifty, but so far you have never found the time to sit down and learn how to master these skills. Then you've come to the right place: my useR workshop will teach you how to push the boundaries of your basic RStudio usage, in order to become more efficient in your day-to-day use of RStudio.
Target audience
We expect people to be familiar with RStudio and to have a little knowledge of programming with R. Knowing how to build a package is a plus, but is not mandatory.
Instructions
Please come with a recent version of RStudio installed.
Sign up for useR
useR! 2019 will be held from the 9th to the 12th of July in Toulouse.
In our last post mapedit and leaflet.js > 1.0 we discussed remaining tasks for the RConsortium funded project mapedit. mapedit 0.5.0 fixes a couple of lingering issues, but primarily focuses on bringing the power of Leaflet.pm as an alternate editor. Leaflet.draw, the original editor in mapedit provided by leaflet.extras, is a wonderful tool but struggles with snapping and those pesky holes that we commonly face in geospatial tasks. Depending on the task, a user might prefer to continue using Leaflet.draw, so we will maintain full support for both editors. We'll spend the rest of the post demonstrating where Leaflet.pm excels to help illustrate when you might want to choose editor = "leafpm".
Install/Update
At a minimum, to follow along with the rest of this post, please update mapedit and install the new standalone package leafpm. While we are at it, we highly recommend updating your other geospatial dependencies.
install.packages(c("sf", "leaflet", "leafpm", "mapview", "mapedit"))
# lwgeom is optional but nice when working with holes in leaflet.pm
# install.packages("lwgeom")
Holes
mapedit now supports holes. Let's look at a quick example in which we add, edit, and delete holes.
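A minimal sketch of such a session with the new editor (illustrative, not necessarily the exact code from the post):

library(mapedit)
library(leaflet)

# start from a blank map, draw a polygon, then add, edit and delete holes with the leafpm toolbar
drawn <- leaflet() %>%
  addTiles() %>%
  editMap(editor = "leafpm")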
Please note that a right mouse click deletes vertices. For a more real-world application, franconia[5,] from mapview has a hole. Try to edit it with the following code.
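A hedged sketch of that call, based on the description above:

library(mapedit)
library(mapview)

# franconia ships with mapview; the fifth feature contains a hole we can edit
edited <- editFeatures(franconia[5, ], editor = "leafpm")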
Leaflet.pm gives us a very pleasant snapping experience, so if you want to snap, set editor = "leafpm" and snap away. Snapping is particularly important when drawing/digitizing features from scratch. Here is how it looks with the example from above.
Snapping is enabled by default.
Fixes For Lingering Issues
GeoJSON Precision
Robin Lovelace discovered that at leaflet zoom levels > 17 we lose coordinate precision. Of course, this is not good enough, so we will prioritize a fix, as discussed in the issue. Hopefully, this leaflet.js pull request will make the fix fairly straightforward.
I am happy to report that we have found a solution for the loss of precision. Please let us know if you discover any remaining problems.
Multilinestring Editing
Leaflet.js and multilinestrings don't get along, as Tim Appelhans reported in an issue. For complete support of sf, mapedit should work with multilinestrings, so we have promoted this to issue 62.
We backed into a solution with MULTILINESTRING since Leaflet.pm's approach fits better with MULTI* features. As an example, let's edit one of the trails from mapview.
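As a sketch (the choice of feature is arbitrary):

library(mapedit)
library(mapview)

# trails, shipped with mapview, is a MULTILINESTRING dataset
edited_trail <- editFeatures(trails[1, ], editor = "leafpm")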
As of this post we have reached the end of the extremely generous RConsortium funding of mapedit. Although the funding is over, we still expect to actively maintain and improve mapedit. One feature that we had hoped to implement as part of the mapedit toolset was editing of feature attributes. This turned out to be very ambitious, and unfortunately we were not able to implement a satisfactory solution for this feature during the funding period. We plan, however, to develop a solution. Your participation, ideas, and feedback are as vital as ever, so please continue to engage. Thanks to all those who have contributed so far and thanks to all open source contributors in the R and JavaScript communities.
In this blog post I'm going to show you how you can extract text from scanned pdf files, or pdf files where no text recognition was performed. (For pdfs where text recognition was performed, you can read my other blog post).
The pdf I'm going to use can be downloaded from here. It's a poem titled, D'Léierchen (Dem Léiweckerche säi Lidd), written by Michel Rodange, arguably Luxembourg's most well known writer and poet. Michel Rodange is mostly known for his fable, Renert oder De Fuuß am Frack an a Ma'nsgrëßt, starring a central European trickster anthropomorphic red fox.
Anyway, back to the point of this blog post. How can we get data from a pdf where no text recognition was performed (or, how can we get text from an image)? The pdf we need the text from looks like this:
To get the text from the pdf, we can use the {tesseract} package, which provides bindings to the tesseract program. tesseract is an open source OCR engine developed by Google. But before that, let's use the {pdftools} package to convert the pdf to png. This is because {tesseract} requires images as input (if you provide a pdf file, it will be converted on the fly). Let's first load the needed packages:
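The set of packages and the conversion step might look something like this; the file name and dpi here are placeholders rather than the exact values used in the original post.

library(tesseract)
library(pdftools)
library(magick)
library(purrr)

# convert each page of the pdf to a high-resolution png; returns a vector of file names
png_files <- pdf_convert("dleierchen.pdf", format = "png", dpi = 600)

# read the pngs back in as a list of magick images
images <- map(png_files, image_read)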
The images object is a list of magick-images, which we can parse. BUUUUUT! There's a problem. The text is laid out in two columns. Which means that the first line after performing OCR will be the first line of the first column, and the first line of the second column joined together. Same for the other lines of course. So ideally, I'd need to split the file in the middle, and then perform OCR. This is easily done with the {magick} package:
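A sketch of that split; the object names first_half and second_half are assumptions, and the crop geometries follow the dimensions discussed just below.

# left column of each page (full width split in two, height trimmed a little), then right column
first_half  <- map(images, ~ image_crop(.x, geometry = "2307x6462"))
second_half <- map(images, ~ image_crop(.x, geometry = "2307x6462+2307+0"))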
Because the pngs are 4614 by 6962 pixels, I can get the first half of the png by cropping at "2307×6462" (I decrease the height a bit to get rid of the page number), and the second half by applying the same logic, but starting the cropping at the "2307+0" position. The result looks like this:
Much better! Now I need to join these two lists together. I cannot simply join them. Consider the following example:
one <- list(1, 3, 5)
two <- list(2, 4, 6)
This is the setup I currently have; first_half contains odd pages, and second_half contains even pages. The result I want would look like this:
There is a very elegant solution, with reduce2() from the {purrr} package. reduce() takes one list and a function, and … reduces the list to a single element. For instance:
reduce(list(1, 2, 3), paste)
## [1] "1 2 3"
reduce2() is very similar, but takes two lists, where the second list must be one element shorter:
reduce2(list(1, 2, 3), list("a", "b"), paste)
## [1] "1 2 a 3 b"
So we cannot simply use reduce2() on lists one and two, because they're the same length. So let's prepend a value to one, using the prepend() function of {purrr}:
prepend(one, 0) %>% reduce2(two, c)
## [1] 0 1 2 3 4 5 6
Exactly what we need! Let's apply this trick to our lists:
merged_list <- prepend(first_half, NA) %>%
  reduce2(second_half, c) %>%
  discard(is.na)
I've prepended NA to the first list, then used reduce2(), and finally discard(is.na) to remove the NA I added at the start. Now, we can use OCR to get the text:
text_list <- map(merged_list, ocr)
ocr() uses a model trained on English by default, and even though there is a model trained on Luxembourgish, the one trained on English works better! Very likely because the English model was trained on a lot more data than the Luxembourgish one. I was worried the English model was not going to recognize characters such as é, but no, it worked quite well.
This is how it looks:
text_list [[1]] [1] "Lhe\n| Kaum huet d'Feld dat fréndlecht Feier\nVun der Aussentssonn gesunn\nAs mam Plou aus Stall a Scheier\n* D'lescht e Bauer ausgezunn.\nFir de Plou em nach ze dreiwen\nWar sai Jéngelchen alaert,\nDeen nét wéllt doheem méi bleiwen\n8 An esouz um viischte Paerd.\nOp der Schéllche stoung ze denken\nD'Léierche mam Hierz voll Lidder\nFir de Béifchen nach ze zanken\n12 Duckelt s'an de Som sech nidder.\nBis e laascht war, an du stémmt se\nUn e Liddchen, datt et kraacht\nOp der Nouteleder klémmt se\n16 Datt dem Béifchen d'Haerz alt laacht.\nAn du sot en: Papp, ech mengen\nBal de Vull dee kénnt och schwatzen.\nLauschter, sot de Papp zum Klengen,\n20 Ech kann d'Liddchen iwersetzen.\nI\nBas de do, mii léiwe Fréndchen\nMa de Wanter dee war laang!\nKuck, ech hat keng fréilech Sténnchen\n24 *T war fir dech a mech mer baang.\nAn du koum ech dech besichen\nWell du goungs nét méi eraus\nMann wat hues jo du eng Kichen\n28 Wat eng Scheier wat en Haus.\nWi zerguttster, a wat Saachen!\nAn déng Frache gouf mer Brout.\nAn déng Kanner, wi se laachen,\n32, An hir Backelcher, wi rout!\nJo, bei dir as Rot nét deier!\nJo a kuck mer wat eng Méscht.\nDat gét Saache fir an d'Scheier\n36 An och Sué fir an d'Késcht.\nMuerges waars de schuns um Dreschen\nIr der Daudes d'Schung sech stréckt\nBas am Do duurch Wis a Paschen\n40 Laascht all Waassergruef geschréckt.\n" .... ....
text_list [[1]] [[1]][[1]] [1] "Lhe" "| Kaum huet d'Feld dat fréndlecht Feier" [3] "Vun der Aussentssonn gesunn" "As mam Plou aus Stall a Scheier" [5] "* D'lescht e Bauer ausgezunn." "Fir de Plou em nach ze dreiwen" [7] "War sai Jéngelchen alaert," "Deen nét wéllt doheem méi bleiwen" [9] "8 An esouz um viischte Paerd." "Op der Schéllche stoung ze denken" [11] "D'Léierche mam Hierz voll Lidder" "Fir de Béifchen nach ze zanken" [13] "12 Duckelt s'an de Som sech nidder." "Bis e laascht war, an du stémmt se" [15] "Un e Liddchen, datt et kraacht" "Op der Nouteleder klémmt se" [17] "16 Datt dem Béifchen d'Haerz alt laacht." "An du sot en: Papp, ech mengen" [19] "Bal de Vull dee kénnt och schwatzen." "Lauschter, sot de Papp zum Klengen," [21] "20 Ech kann d'Liddchen iwersetzen." "I" [23] "Bas de do, mii léiwe Fréndchen" "Ma de Wanter dee war laang!" [25] "Kuck, ech hat keng fréilech Sténnchen" "24 *T war fir dech a mech mer baang." [27] "An du koum ech dech besichen" "Well du goungs nét méi eraus" [29] "Mann wat hues jo du eng Kichen" "28 Wat eng Scheier wat en Haus." [31] "Wi zerguttster, a wat Saachen!" "An déng Frache gouf mer Brout." [33] "An déng Kanner, wi se laachen," "32, An hir Backelcher, wi rout!" [35] "Jo, bei dir as Rot nét deier!" "Jo a kuck mer wat eng Méscht." [37] "Dat gét Saache fir an d'Scheier" "36 An och Sué fir an d'Késcht." [39] "Muerges waars de schuns um Dreschen" "Ir der Daudes d'Schung sech stréckt" [41] "Bas am Do duurch Wis a Paschen" "40 Laascht all Waassergruef geschréckt." [43] "" ... ...
Perfect! Some more cleaning would be needed though. For example, I need to remove the little annotations that are included:
I don't know yet how I'm going to do that. I also need to remove the line numbers at the beginning of every fourth line, but this is easily done with a simple regular expression:
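For instance, something along these lines; the exact pattern is one possibility, not the original code.

library(stringr)
library(purrr)

# drop a leading number (optionally followed by a comma or full stop) at the start of any line
text_cleaned <- map(text_list, ~ str_remove_all(.x, "(?m)^\\d+[,.]?\\s*"))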
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
To cut to the chase, here's my prediction – 76 percent chance of ALP being able to form government by themselves:
Because the forecast is built on division-level simulations of what will happen when local randomness adds to an uncertain prediction of two-party-preferred swing, I also get probabilistic forecasts for each individual seat. This lets me produce charts like this one:
…which shows what is likely to happen seat by seat when we get the actual election.
The ozfedelect R package
The ozfedelect R package continues to grow. Just today, I've added to it:
a vector of colours for the Australian political parties involved in my forecasting
a useful data frame oz_pendulum_2019 of the margins of the various House of Representatives seats going in to the 2019 election.
an update of the polling data.
All this is in addition to the historical polling and division-level election results it already contains.
Code for these forecasts
The code for conducting the forecasts, or just installing ozfedelect for other purposes, is available on GitHub. The ozfedelect GitHub repository will not only build the ozfedelect R package from scratch (i.e. downloading polling data from Wikipedia and election results from the Australian Electoral Commission), it also has the R and Stan code for fitting the model of two-party-preferred vote and turning it into division-level simulated results.
It should be regarded as a work in progress. Comments and suggestions are warmly welcomed!
This is the fourth blog on the stars project, and it completes the R-Consortium funded project for spatiotemporal tidy arrays with R. It reports on the current status of the project and on current development directions. Although this project ends with the release of stars 0.3 on CRAN, the adoption, uptake, enthusiasm and participation in the development of the stars project have really only started, and will without doubt increase and continue.
Status
The stars package now has five vignettes (called "Articles" on the pkgdown site) that explain its main features. Besides writing these vignettes, a lot of work over the past few months went into
writing support for stars_proxy objects, objects for which the metadata has been read but for which the payload is still on disk. This allows handling raster files or data cubes that do not fit into memory. Manipulating them uses lazy evaluation: only when pixel values are really needed are they read and processed, for instance when a plot is needed or when results are to be written with write_stars (see the sketch after this list). In the case of plotting, no more pixels are processed than can be seen on the device.
making rectilinear and curvilinear grids work, by better parsing NetCDF files directly (rather than through GDAL), reading their bounds, and by writing conversions to sf objects so that they can be plotted;
writing a tighter integration with GDAL, e.g. for warping grids, contouring grids, and rasterizing polygons;
supporting 360-day and 365-day (noleap) calendars, which are used often in climate model data;
providing an off-CRAN starsdata package, with around 1 Gb of real imagery, too large for submitting to CRAN or GitHub, but used for testing and demonstration;
resolving issues (we're at 154 now) and managing pull requests;
adding stars support to gstat, a package for spatial and spatiotemporal geostatistical modelling, interpolation and simulation.
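A minimal sketch of the proxy behaviour described in the first item above, using the small GeoTIFF shipped with stars:

library(stars)

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
p <- read_stars(tif, proxy = TRUE)          # only metadata is read; the payload stays on disk

plot(p)                                      # pixels are read here, at no more than screen resolution
write_stars(p, tempfile(fileext = ".tif"))   # here the full payload is read and written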
Last week I used stars and sf successfully in a two-day course at Munich Re on Spatial Data Science with R (online material), focusing on data handling and geostatistics. Both packages worked out beautifully (with a few rough edges), in particular in conjunction with each other and with the tidyverse.
Further resources on the status of the project are found in
the video of my rstudio::conf presentation on "Spatial data science in the Tidyverse"
chapter 4 of the Spatial Data Science book (under development)
Future
Near-future development will entail experiments with very large datasets, such as the entire Sentinel-2 archive. We earlier secured some funding from the R Consortium for doing this, and the first outcomes will be presented shortly in a follow-up blog. A large challenge here is the handling of multi-resolution imagery, imagery spread over different coordinate reference systems (e.g., crossing multiple UTM zones), and the temporal resampling needed to form space-time raster cubes. This is handled gracefully by the gdalcubes C++ library and R package developed by Marius Appel. The gdalcubes package has been submitted to CRAN.
Who doesn't like a Wikipedia entry on control charts: "If analysis of the control chart indicates that the process is currently under control (i.e., is stable, with variation only coming from sources common to the process), then no corrections or changes to process control parameters are needed or desired." I mean, gee whiz, this sure could relate to something like, I don't know, AFL total game scores?
There always seems to be talk about the scores in AFLM; see the AFL website and Fox Sports, just to name a couple. Of course you could find more if you searched for it as well.
Let's use fitzRoy and the good people over at Stats Insider, who have kindly provided me with the expected score data you can get from the Herald Sun.
library(ggQC)
library(dplyr)    # for mutate(), group_by(), summarise(), filter()
library(ggplot2)  # for ggplot() and friends

fitzRoy::match_results %>%
  mutate(total = Home.Points + Away.Points) %>%
  group_by(Season, Round) %>%
  summarise(meantotal = mean(total)) %>%
  filter(Season > 1989 & Round == "R1") %>%
  ggplot(aes(x = Season, y = meantotal)) +
  geom_point() +
  geom_line() +
  stat_QC(method = "XmR") +
  ylab("Mean Round 1 Total for Each Game") +
  ggtitle("Stop Freaking OUT over ONE ROUND")
So if we were to look at the control chart just for round 1 in each AFLM season since the 90s, it would seem that even though this round was lower scoring, there isn't much to see here.
After all, we can and should expect natural variation in scores; wouldn't footy be boring if scores were the same every week?
Warning – this post discusses gambling odds and even describes me placing small $5 bets, which I can easily afford to lose. In no way should this be interpreted as advice to anyone else to do the same, and I accept absolutely no liability for anyone who treats this blog post as a basis for gambling advice. If you find yourself losing money at gambling or suspect you or someone close to you has a gambling problem, please seek help from https://www.gamblinghelponline.org.au/ or other services.
Appreciating home team advantage
Last week I blogged about using Elo ratings to predict the winners in the Australian Football League (AFL) to help me blend in with the locals in my Melbourne workplace footy-tipping competition. Several people pointed out that my approach ignored the home team advantage, which is known to be a significant factor in the AFL. How significant? Well, here's the proportion of games won by the home team in each season since the AFL's beginning:
Overall, 59% of games in AFL history have been won by the home team, although that proportion varies materially from season to season. Another way of putting this: if my AFL prediction system were as simple as "always pick the home team" I would expect (overall) to score a respectable 59% of successful picks.
My first inclination in response to this was just to add 0.09 to the chance of the home team winning from my model's base probability, but first I thought I should check whether this adjustment varies by team. Well, it does, somewhat dramatically. This next chart, based on just modern era games (1997 and onwards), defines the home advantage as the proportion of home games won minus the proportion of away games won, divided by 2. The all-teams all-seasons average for this figure would be 0.09.
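For concreteness, a sketch of that per-team calculation using the fitzRoy match results referred to elsewhere in this post; the Home.Team/Away.Team column names and the treatment of draws are assumptions.

library(dplyr)

results <- fitzRoy::match_results %>%
  filter(Season >= 1997)

# win rate at home, per team (draws counted as non-wins for simplicity)
home_rates <- results %>%
  group_by(team = Home.Team) %>%
  summarise(home_rate = mean(Home.Points > Away.Points))

# win rate away, per team
away_rates <- results %>%
  group_by(team = Away.Team) %>%
  summarise(away_rate = mean(Away.Points > Home.Points))

# home advantage as defined above: (home win rate - away win rate) / 2
home_adv <- home_rates %>%
  inner_join(away_rates, by = "team") %>%
  mutate(home_advantage = (home_rate - away_rate) / 2) %>%
  arrange(desc(home_advantage))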
The teams that are conspicuously high on this measure are all non-Melbourne teams:
Geelong
Adelaide
West Coast
Fremantle
Greater Western Sydney
Gold Coast (another non-Melbourne team, for any non-Australians reading) would show up as material if a success-ratio measure were used instead of my simple (and crude) additive indicator, because their overall win rate is so low. Sydney and perhaps Brisbane stand out as good performers overall that don't have as marked a home-town advantage (or away-town disadvantage) as their non-Melbourne peers.
I presume this issue is well known to AFL aficionados. With the majority (or at least plurality – I haven't counted) of games played in Melbourne, Melbourne-based clubs generally play many of their "away" matches still relatively close to players' homes. Whereas (for example) the West Coast Eagles flying across the Nullarbor for a match is bound to take a toll on their players, compared to the alternative of being at home and making the opponents fly west instead.
Geelong surprised me – Geelong is much closer to Melbourne than the inter-state teams, so no long flights are involved. But simply travelling even an hour up the M1, added to the enthusiastic partisanship of Geelong-Melbourne fan rivalry, perhaps explains the strong advantage.
Here's the R code that performs these steps:
load in functionality for the analysis
download the AFL results from 1897 onwards from afltables using the fitzRoy package and store in the object r (for "results")
draw the chart of home win rates per season
estimate and plot teams' recent-decades home advantage
create a new r2 object with home and away advantage and disadvantage adjustments for probabilities
Post continues after code extract
Choosing a better set of parameters
Seeing as I had to revisit my predictive method to adjust for home and away advantage/disadvantage, I decided to also take a more systematic approach to some parameters that I had set either arbitrarily or without noticing they were parameters in last week's post. These fit into two areas:
The FIBS-based Elo rating method I use requires a "match length" parameter. Longer matches mean the better player is more likely to win, and also lead to larger adjustments in Elo ratings once the result is known. I had used the winning margin as a proxy for match length, reasoning that a team that won by a large amount had shown their superiority in a similar way to a backgammon player winning a longer match. But should a 30-point margin count as a match length of 30 (i.e. equivalent to points) or 5 (equivalent to goals) or something in between? And what margin or "match length" should I use for predicting future games, where even a margin of 1 is enough to win? And even more philosophically, is margin really related to the backgammon concept of match length in a linear way, or should there be discounting of larger margins? Today, I created three parameters: to scale the margin, to choose a margin for predicting future matches, and to raise the scaled margin to a power between zero and one for non-linearity.
Elo ratings are by their nature sticky and based on a player's or team's whole career. I had noted that I got better performance in predicting the 2018 season by restarting all teams' ratings at 1500 at the beginning of the 2017 season, but this is fairly crude. The FIBS Elo rating method has a parameter for teams' past experience, which in effect lets you control how responsive the rating is to new information (the idea being to help new players in a competition quickly move from 1500 up or down to their natural level). I have now added to this a "new round factor" which shrinks ratings for the first game of the season towards 1500, effectively discounting past experience (see the sketch after this list).
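A rough sketch of how these two ideas might enter the calculation; these are illustrative formulas based on the description above, not necessarily the exact implementation inside afl_elos().

# convert a winning margin (in points) to a notional "match length"
margin_to_length <- function(margin, sc = 1, margin_power = 1) {
  (margin / sc) ^ margin_power
}

# shrink a rating towards 1500 for the first game of a season;
# new_round_factor = 1 keeps the rating as-is, 0 restarts everyone at 1500
shrink_to_1500 <- function(rating, new_round_factor = 1) {
  1500 + (rating - 1500) * new_round_factor
}

margin_to_length(30, sc = 3, margin_power = 5/6)
shrink_to_1500(1620, new_round_factor = 0.4)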
Here's the code that describes and defines those parameters a bit better, in an enhanced version of my afl_elos() function I introduced last week.
Post continues after code extract
This function is now pretty slow when run on the whole AFL history from 1897. Unlike last week, it calls the underlying frs::elo_rating() function for each game twice – once (the object er above) to determine the ratings after the match outcome is known, and once to determine the prediction of the match's result, for benchmarking purposes (the object er2). Last week I didn't need to use elo_rating() twice, because the prediction of the winner was as simple as choosing the team with the highest Elo rating going into the match. Now, we have to calculate the actual probability of winning, adjusted for home and away advantage and disadvantage. This calculation depends on the parameter choices that affect converting margin to "match length" and on what winning margin we base our predictions on, so it is an additional calculation to the change in rating that came about from the actual result of the game.
There are doubtless efficiencies that could be made, but I'm not enthused to spend too much time refactoring at this point…
I have no theories and hardly any hunches on what parameter combinations will give the best performance, so the only way to choose a set is to try many and pick the one that would have worked best in predicting AFL games to date. I defined about 2,500 combinations of parameters, removed some that were effective duplicates (because if margin_power is 0, then the values of sc and pred_margin are immaterial) and for the purposes of this blog ran just 100 random prediction competitions, based on the games from 1950 onwards. With each individual run taking five minutes or more, I used parallel processing to do 7 runs simultaneously and get the total time for those 100 runs down to about 90 minutes, all I was prepared to do today for the purposes of this blog. I might run it for a longer period of time overnight later.
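To make the shape of that search concrete, here is a rough, self-contained sketch of a parallelised parameter competition using foreach and doParallel; the grid values echo the table below, but fit_one_set() is just a stand-in for the real model run.

library(foreach)
library(doParallel)

# candidate parameter values (an illustrative grid, smaller than the real ~2,500 combinations)
param_grid <- expand.grid(
  sc               = c(1, 3, 6),
  pred_margin      = c(10, 20, 30),
  margin_power     = c(1/2, 2/3, 5/6, 1),
  experience       = c(0, 100, 200, 300, 400),
  new_round_factor = c(0.4, 0.6, 0.8, 1)
)

# stand-in for running the full Elo prediction competition with one parameter set
fit_one_set <- function(pars) runif(1)

cl <- makeCluster(7)
registerDoParallel(cl)

res <- foreach(i = sample(nrow(param_grid), 100), .combine = rbind) %dopar% {
  data.frame(param_grid[i, ], success_rate = fit_one_set(param_grid[i, ]))
}

stopCluster(cl)
head(res[order(-res$success_rate), ])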
The top twenty parameter sets from this competition are in the table below. The best combination of factors led to an overall prediction success of about 69%, which is better than last week's 65%, the crude "always pick the home team" success of 59% and a coin flip of 50%; but not as much better as I hoped. Clearly picking these winners is hard – AFL is more like backgammon or poker in terms of predicting outcomes than it is like chess.
sc   pred_margin   margin_power   experience   new_round_factor   success_rate
1    30            0.8333333      100          0.4                0.6853485
1    20            0.8333333      200          0.8                0.6839260
1    20            1.0000000      100          0.6                0.6828829
1    10            0.8333333      0            0.4                0.6822191
1    20            0.6666667      100          0.8                0.6812707
3    20            0.8333333      0            0.8                0.6806069
1    20            1.0000000      400          0.8                0.6785206
3    10            1.0000000      0            0.8                0.6777620
1    20            0.5000000      0            0.8                0.6776671
1    10            1.0000000      300          0.8                0.6774775
3    30            0.8333333      200          0.6                0.6762447
1    10            1.0000000      300          0.6                0.6747274
1    20            0.8333333      400          0.6                0.6676150
3    10            0.6666667      100          0.8                0.6668563
6    20            1.0000000      100          1.0                0.6661925
1    30            0.5000000      100          1.0                0.6634424
1    10            0.5000000      0            0.4                0.6629682
3    20            0.6666667      100          1.0                0.6617354
6    20            0.8333333      100          0.6                0.6592698
6    30            0.6666667      200          0.8                0.6570887
The best models had a modest shrinkage of ratings towards 1500 (new_round_factor of 0.4 to 0.8, compared to 0 which would mean everyone starting at 1500 in each round 1); and modest if any non-linearity in the conversion of winning margin to a notional "match length". They had relatively low levels of "experience", effectively increasing the importance of recent results and downplaying long term momentum; while treating match results in points (sc = 1) and predicting based on a relatively large margin.
I only had time to try a random sample of parameter combinations, and would be very lucky indeed to have ended up with the best set. How confident can I be that I've got something close enough? Here's the distribution of success rates for that post-1950 series:
Without over-thinking it, it's reasonable to infer a few more extreme values on the right are possible if we looked at the full set of parameters; but that they wouldn't be that much more successful. It's certainly good enough for a workplace footy tipping competition.
Here's the predictive success of the best model over time, now applied to the full range of data not just the post 1950 period for which it was optimised:
… and the code that did the above "parameter competition", using the foreach and doParallel R packages for parallel processing to bring the elapsed time down to reasonable levels:
Post continues after code extract
This week's predictions are…
To turn this model into my tips for this week, I need to extract the final Elo ratings from the best model, join them with the actual fixture and then use the model to predict actual probabilities of winning. Here's what I get:
home             away            home_elo  away_elo  home_adjustment  away_adjustment  final_prob  winner           fair_returns_home  fair_returns_away
Richmond         Collingwood     1623.362  1552.782  0.0392339        -0.0305072       0.6527701   Richmond         1.531933           2.879936
Sydney           Adelaide        1506.489  1476.328  0.0467485        -0.0632875       0.6457876   Sydney           1.548497           2.823164
Essendon         St Kilda        1507.538  1404.657  0.0505882        -0.0428714       0.7132448   Essendon         1.402043           3.487295
Port Adelaide    Carlton         1544.746  1340.797  0.0571146        -0.0257644       0.8077331   Port Adelaide    1.238033           5.201104
Geelong          Melbourne       1552.818  1521.317  0.0747884        -0.0402286       0.6523510   Geelong          1.532917           2.876464
West Coast       GWS             1543.355  1565.494  0.0699069        -0.0648148       0.6084577   West Coast       1.643499           2.554003
North Melbourne  Brisbane Lions  1461.086  1483.270  0.0406327        -0.0558655       0.5701814   North Melbourne  1.753828           2.326563
Hawthorn         Footscray       1567.933  1482.579  0.0490421        -0.0383632       0.6873887   Hawthorn         1.454781           3.198861
Gold Coast       Fremantle       1382.276  1483.175  0.0402515        -0.0729052       0.4955912   Fremantle        2.017792           1.982519
That final_prob column is the estimated probability of the home team winning.
As you can see, I translate my probabilities into a "fair return", which I'm using to scan for opportunities with poorly chosen odds from the bookies. These opportunities don't arrive very often as the bookies are professionals, but when they are paying 50% more than the model predicts to be "fair" I'm going to punt $5 and we'll see how we go at the end of the season. So far I'm $26 up from this strategy but it's early days and I'm far from assured the luck will continue.
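The fair-return columns are simply the reciprocals of the estimated win probabilities, which the table above is consistent with; for example, for the Richmond v Collingwood game:

final_prob <- 0.6527701
fair_returns_home <- 1 / final_prob        # about 1.53
fair_returns_away <- 1 / (1 - final_prob)  # about 2.88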
Judging from the tips and odds by the public, the only controversial picks in the above are for North Melbourne to beat Brisbane, and for Gold Coast to be nearly a coin flip in their contest with Fremantle. In both cases my algorithm is tipping a home advantage that offsets the relative strength of the away team. For the North Melbourne match, the bookies agree with me, whereas the tippers on tipping.afl.com.au are going for a Brisbane win, so I think we can say that reasonable people disagree about the outcome there and it is uncertain. For the other match, I have grave doubts about Gold Coast's chances against Fremantle (who had a stellar victory last weekend), but am inclined to think the $3.50 return bookies are offering to pay for a Gold Coast win is over-generous and underestimates how much Fremantle struggle when playing away from home. So that's my recommended match to watch for a potential surprise outcome.
At the time of writing, the first two of these predictions in my table above have already gone astray (for me, the average punters and the average tippers) in the Thursday and Friday night matches, as 2019 continues its run of surprise results. Collingwood and Adelaide both pulled off against-the-odds wins against teams that were both stronger on paper and playing at home. I won't say my predictions were "wrong", because when you say something has a 0.6 chance of happening and it doesn't, there's a good chance you were just unlucky, not wrong.
But as they say, prediction is hard, particularly about the future.
Final chunk of R code for today – converting the model into predictions for this round:
That's all.
Here are the R packages used in producing this post: