[R-bloggers] Predicting the next decade in the stock market (and 4 more aRticles)
- Predicting the next decade in the stock market
- von Bertalanffy Growth Plots I
- RStudio Blogs 2019
- Introduction to Data Science in R, Free for 3 days
- Can Genealogical data be tidy?
Predicting the next decade in the stock market
Posted: 31 Dec 2019 03:17 AM PST
[This article was first published on Data based investing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Making accurate predictions using the vast amount of data produced by the stock markets and the economy itself is difficult. In this post we will examine the performance of five different machine learning models and predict the future ten-year returns for the S&P 500 using state-of-the-art libraries such as caret, xgboostExplainer and patchwork. We will use data from Shiller, Goyal and BLS. The training data covers the years 1948 to 1991, and the test set runs from 1991 only until 2009, because the target variable is the return over the following ten years and is therefore not yet known for later dates.
Different investing strategies tend to work at different times, and you should expect the accuracy of the model you are using to move in cycles; sometimes the connection with returns is very strong, and sometimes very weak. Value investing is a great example of a strategy that has not really worked for the past twelve years (source, pdf). Spurious correlations are another cause of trouble, since, for example, two stocks might move in tandem purely by chance. This highlights the need for some manual selection of intuitive features.
We will use eight different predictors: P/E, P/D, P/B, the CAPE ratio, total return CAPE, inflation, unemployment rate and the 10-year US government bond rate. All five of the valuation measures are calculated for the entire S&P 500. Let's start by inspecting the correlation clusters of the different predictors and the future ten-year return (without dividends), which is used as the target. The different valuation measures are strongly correlated with each other, as expected.
All except P/B have a very strong negative correlation with the future ten-year returns. CAPE and total return CAPE, a new measure that also considers reinvested dividends, are very strongly correlated with each other. Total return CAPE is also slightly less correlated with the future ten-year return than the normal CAPE.
The machine learning models
First, we will create a naïve model which predicts the future return to be the same as the average return in the training set. After training the five models we will also build an ensemble of them to see if it can reach a higher accuracy than any of the five models, which is usually the case.
The models we are going to use are quite different from each other. The glmnet model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility and also performs automated feature selection (except if the alpha hyperparameter is exactly zero, as in ridge regression). K-nearest-neighbors makes its predictions by comparing the observation to similar observations. MARS, on the other hand, takes into account nonlinearities in the data and also considers interactions between the features. XGBoost is a tree model which also takes into account both nonlinearities and interactions. However, it builds each new tree on the residuals left by the previous trees (boosting), which may lead to better accuracy. Both MARS and SVM (support vector machines) are really flexible and may therefore overfit quite easily, especially when the data set is small. The XGBoost model is also quite flexible but does not overfit easily, since it performs regularization and pruning.
Finally, we have the ensemble model, which simply gives the mean of the predictions of all the models. Ensembles are a quite popular strategy in machine learning competitions for reaching accuracies beyond that of any single model.
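The naïve baseline and the mean ensemble described above are simple enough to sketch in a few lines. The Python snippet below is an illustration only, not the author's R code: the model names come from the text, while the function names and every return value are invented placeholders.

```python
from statistics import mean


def naive_prediction(train_returns, n_test):
    """Predict the training-set average return for every test observation."""
    average = mean(train_returns)
    return [average] * n_test


def ensemble_mean(predictions_by_model):
    """Average the models' predictions, observation by observation."""
    per_observation = zip(*predictions_by_model.values())
    return [mean(values) for values in per_observation]


# Invented annualised return predictions for three test observations.
predictions = {
    "glmnet": [0.05, 0.04, 0.06],
    "knn": [0.07, 0.03, 0.05],
    "mars": [0.02, 0.01, 0.04],
    "xgboost": [0.08, 0.06, 0.07],
    "svm": [0.06, 0.05, 0.03],
}

# Invented training-set returns for the naive baseline.
baseline = naive_prediction([0.04, 0.10, 0.07], n_test=3)
ensemble = ensemble_mean(predictions)
print("naive:", baseline)
print("ensemble:", ensemble)
```

Because the ensemble is an unweighted mean, a single badly behaved model pulls every ensemble prediction in its direction.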
The models will be built using the caret wrapper, and the optimal hyperparameters are chosen using time slicing, which is a cross-validation technique that is suitable for time series. We will use five time slices to capture as many periods as possible while having enough observations in each of them. We will do the cross-validation on the training data, which consists of 70 percent of the data, while keeping the remaining 30 percent as a test set. The results are shown below:
Results
The predictions are less accurate after the red line, which separates the training and test sets. The models have not seen the data on the right side of the line, so their accuracy there can be thought of as a proxy for how well they would perform in the future.
We will examine the model accuracies on the test set by using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:
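As a rough sketch of the evaluation setup described above, the Python snippet below builds expanding-window time slices and computes the two accuracy measures. It is a simplified analogue, not caret's implementation or the author's code: the function names, slice sizes and data are invented, and R² is assumed here to be the squared correlation between observed and predicted values.

```python
def time_slices(n_obs, n_slices, horizon):
    """Expanding-window splits for time-ordered rows.

    Each slice trains on every row before its validation block of
    `horizon` rows, so validation data always lies in the future.
    """
    first_valid = n_obs - n_slices * horizon
    if first_valid <= 0:
        raise ValueError("not enough observations for the requested slices")
    slices = []
    for k in range(n_slices):
        start = first_valid + k * horizon
        train_idx = list(range(start))
        valid_idx = list(range(start, start + horizon))
        slices.append((train_idx, valid_idx))
    return slices


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation (undefined for constant predictions)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


slices = time_slices(n_obs=20, n_slices=5, horizon=2)
for train_idx, valid_idx in slices:
    print(len(train_idx), "training rows -> validate on rows", valid_idx)

# Invented observed and predicted returns.
observed = [0.05, 0.02, 0.08, 0.04]
predicted = [0.04, 0.03, 0.06, 0.05]
print("MAE:", mae(observed, predicted))
print("R2:", r_squared(observed, predicted))
```

Note that a constant prediction, such as the naïve model's, has zero variance, so the squared-correlation form of R² cannot be computed for it.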
Upon closer inspection of the feature importances, we see that the MARS model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.
Future predictions
Lastly, we will predict the next ten years in the stock market and compare the predictions of the different models. We will also look closer at the best performing single model, XGBoost, by inspecting the composition of the prediction. The current values of the features are mostly obtained from the sources listed in the first section, but also from Trading Economics and multpl.
The MARS model is the most pessimistic, with a return prediction that is quite strongly negative. The model should however not be trusted too much, since it uses only one variable and does not behave well on the test data. The XGBoost model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the MARS model.
Let's then look at the XGBoost model more closely by using the xgboostExplainer library. The resulting plot is a waterfall chart which shows the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points. This is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is slightly positive at two percentage points, but the currently high P/E ratio reduces the prediction back to the same level. The rest of the features have a very small effect on the prediction.
The benefit of predicting the returns of a single stock market is mostly limited to the fact that you can adjust your expectations for the future. However, predicting the returns of multiple stock markets and investing in the ones with the highest return predictions is most likely a very profitable strategy. Klement (2012) has shown that the CAPE ratio alone does quite a good job of predicting the returns of different stock markets. Adding more sensible variables to the model is likely to make it more stable and perhaps better at predicting the outcome.
Be sure to follow me on Twitter for updates about new blog posts like this! The R code used in the analysis can be found here. To leave a comment for the author, please follow the link and comment on their blog: Data based investing.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. This posting includes an audio/video/photo media file: Download Now
von Bertalanffy Growth Plots I
Posted: 30 Dec 2019 10:00 PM PST
[This article was first published on fishR Blog, and kindly contributed to R-bloggers].
Introduction
I am continuing to learn [...]
Here I demonstrate how to produce such plots with lengths and ages of Lake Erie Walleye (Sander vitreus) captured during October-November, 2003-2014. These data are available in my [...] The workflow below requires understanding the minimum and maximum observed ages.
Fitting a von Bertalanffy Growth Function
Methods for fitting a von Bertalanffy growth function (VBGF) are detailed in my Introductory Fisheries Analyses with R book and in Chapter 12 of the Age and Growth of Fishes: Principles and Techniques book. Briefly, a function for the typical VBGF is constructed with [...] Reasonable starting values for the optimization algorithm may be obtained with [...] The [...] The parameter estimates are extracted from the saved [...] Bootstrapped confidence intervals for the parameter estimates are computed by giving the saved [...]
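For orientation, the typical VBGF expresses mean length at age t as L(t) = Linf * (1 - exp(-K * (t - t0))), where Linf is the asymptotic mean length, K the growth coefficient and t0 the theoretical age at zero length. The Python sketch below only evaluates that formula and is not the author's R code; the parameter values are invented for illustration and are not estimates for the Lake Erie Walleye data.

```python
from math import exp


def vbgf(age, linf, k, t0):
    """Typical von Bertalanffy growth function: mean length at a given age."""
    return linf * (1.0 - exp(-k * (age - t0)))


# Invented parameter values, for illustration only.
LINF, K, T0 = 650.0, 0.3, -1.0

for age in range(0, 12, 2):
    print(age, round(vbgf(age, LINF, K, T0), 1))
```

The curve rises steeply at young ages and flattens towards Linf, which is why reasonable starting values matter when the parameters are estimated by nonlinear optimization.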
Preparing Predicted Values for Plotting
Predicted lengths-at-age from the fitted VBGF are needed to plot the fitted VBGF curve. The [...] What is needed, however, is the predicted mean length at each age for each bootstrap sample, so that bootstrapped confidence intervals for each mean length-at-age can be derived. To do this with [...] Predicted mean lengths-at-age, with bootstrapped confidence intervals, can then be constructed by giving [...] The vector of ages, the predicted mean lengths-at-age (from [...] For my purposes below, I also want predicted mean lengths only for observed ages. To make the code below cleaner, a new data.frame restricted to the observed ages is made here.
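The idea of deriving a confidence interval for mean length at each age from the bootstrap samples can be sketched as follows. This is a Python illustration under stated assumptions, not the author's R workflow: the bootstrap parameter sets are invented (a real analysis would have many more), the intervals are assumed to be simple percentile intervals, and the helper names are hypothetical.

```python
from math import exp


def vbgf(age, linf, k, t0):
    """Typical von Bertalanffy growth function: mean length at a given age."""
    return linf * (1.0 - exp(-k * (age - t0)))


def percentile(sorted_values, p):
    """Percentile by linear interpolation, with p between 0 and 1."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def length_at_age_ci(boot_params, ages, level=0.95):
    """Percentile interval of predicted mean length at each age.

    Every bootstrap parameter set yields one predicted length per age;
    the interval is taken across those predictions.
    """
    alpha = (1 - level) / 2
    rows = []
    for age in ages:
        preds = sorted(vbgf(age, *params) for params in boot_params)
        rows.append((age, percentile(preds, alpha), percentile(preds, 1 - alpha)))
    return rows


# Invented (Linf, K, t0) sets standing in for bootstrap results.
boot_params = [
    (640.0, 0.30, -1.0),
    (655.0, 0.28, -1.1),
    (648.0, 0.31, -0.9),
    (660.0, 0.27, -1.2),
    (645.0, 0.29, -1.0),
]

ci = length_at_age_ci(boot_params, ages=range(0, 12))
for age, lower, upper in ci:
    print(age, round(lower, 1), round(upper, 1))
```

The lower and upper values per age are what a plotting step would need to draw the confidence band around the fitted curve.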
Constructing the Plot
A [...] The plot begins with a polygon that encases the lower and upper confidence interval values for mean length at each age. This polygon is constructed with [...] Observed lengths and ages in the [...] The fitted curve over the entire range of ages used above (i.e., using [...] The fitted curve for just the observed range of ages (i.e., using [...] The y- and x-axes are labelled ( [...] Finally, the classic black-and-white theme (primarily to remove the gray background) was used ( [...]
BONUS – Equation on Plot
Below is an undocumented bonus for how to put the equation of the best-fit VBGM on the plot. This is hacky so I would not expect it to be very general (e.g., it likely will not work across facets).
Final Thoughts
This post is likely not news to those of you that are familiar with [...]
To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.
RStudio Blogs 2019
Posted: 30 Dec 2019 04:00 PM PST
[This article was first published on R Views, and kindly contributed to R-bloggers].
If you are lucky enough to have some extra time for discretionary reading during the holiday season, you may find it interesting (and rewarding) to sample some of the nearly two hundred posts written across the various RStudio blogs.
R Views
R Views, our blog devoted to the R Community and the R Language, published over sixty posts in 2019. Many of these were contributed by guest authors from the R Community who volunteered to share some outstanding work. Among my favorites are the multi-part posts that explored data science modeling issues in some detail. These include Roland Stevenson's three-part series on Multiple Hypothesis Testing and A/B Testing, the four-part series on Analyzing the HIV pandemic by Andrie de Vries and Armand Bester, and Jonathan Regenstein's two-part series on Tech Dividends.
RStudio Blog
The RStudio blog is the place to go for official information on RStudio. It includes posts on open-source and commercial products, events, and company news. Just scanning the summary paragraphs will give you a good overview of what went on at RStudio this past year. Among my favorite posts for the year is Lou Bajuk's take on the complementary roles of R and Python, R vs. Python: What's the best language for Data Science?
TensorFlow for R Blog
The TensorFlow for R Blog provides "nuts and bolts" reading on building TensorFlow models that ought to be on the list of every data scientist working in R. The posts cover an amazingly wide range of cutting-edge topics. For example, see Sigrid Keydana's recent posts Differential Privacy with TensorFlow, and Getting started with Keras from R – the 2020 edition.
Tidyverse Blog
The Tidyverse Blog offers insight into Tidyverse packages and capabilities at all levels. Scan the summaries like you would a bookshelf in your favorite technical bookstore, and pick out something new like Davis Vaughan's exposition of the new hardhat package, which provides tools for developing new modeling packages, or take a deep dive into task queues with Gábor Csárdi's Multi Process Task Queue in 100 Lines of R Code.
Ursa Labs Blog
Ursa Labs, a project for which we have great hope, is devoted to open source data science and cross-language software and is sponsored by RStudio along with several other organizations. Wes McKinney's post [...]
Happy Reading!
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Introduction to Data Science in R, Free for 3 days
Posted: 30 Dec 2019 10:30 AM PST
[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].
To celebrate the new year and the recent release of Practical Data Science with R 2nd Edition, we are offering a free coupon for our video course "Introduction to Data Science." The following URL and code should get you permanent free access to the video course, if used between now and January 1st 2020:
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
Can Genealogical data be tidy?
Posted: 29 Dec 2019 04:00 PM PST
[This article was first published on R on R-house, and kindly contributed to R-bloggers].
In this post, I'll be exploring how genealogical data stored in the de facto standard format, GEDCOM, could be made tidy, and arguing that this is not really ideal.
About 6 years ago, long before I got involved with Data Science and when R was just the 18th letter of the alphabet, I started researching my family history. It was really interesting, hugely rewarding, and I rapidly found myself inundated with various pieces of information – a lot of it conflicting – from various sources. Desperate to organise it all, I discovered the Genealogical Data Communication (GEDCOM) format. I used this format to record all I had found and used some special freeware to generate family tree diagrams in PDF format.
Fast forward to today. I now find myself in a situation where I'm keen to dig out my old GEDCOM file and see what R can do with it! I searched on GitHub for repos that manipulate GEDCOM files in R, and perhaps the most promising was one by Peter Prevos, who had written a short article describing the format of the file and its limitations. I highly recommend you give it a read.
For all its faults, the GEDCOM data format has been the standard for decades, so a fundamental constraint here is that I'm not going to try to invent a whole new format; I'm just going to try to deal with the standard we have. Files contain data on more than one type of observational unit, including individuals, families, and data sources. It's inappropriate to try to fit all of that in one big dataframe, so I'll just be focusing on individuals in this post. Peter has not only written some code to read GEDCOM files, but also code to do some simple analysis and generate some visualisations using the [...]
One of the strengths of the GEDCOM format is the ability to record several possible values of an individual's attribute.
For example, if one source tells you an ancestor was born in 1900, and another tells you they were born in 1901, you don't have to choose one as correct and dismiss the other – you can record both and capture the uncertainty – which is an absolutely crucial capability of any genealogical data format. If we were to try to capture these possible values using the dataframe format, one might imagine having a row for every combination of possible values, e.g.
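A hypothetical illustration of such a row-per-combination table is sketched below in Python. Only the two birth years come from the example in the text; the individual's identifier, the other fields and all of their values are invented, and the function name is an assumption made for this sketch.

```python
from itertools import product

# Candidate values for one individual. Only the birth years are from the
# text; the remaining fields and values are invented for illustration.
candidates = {
    "birth_year": [1900, 1901],
    "birth_place": ["Leeds", "York"],
    "occupation": ["miner", "farmer", "clerk"],
}


def combination_rows(individual, candidates):
    """Return one row (a dict) per combination of candidate values."""
    fields = list(candidates)
    rows = []
    for combo in product(*(candidates[field] for field in fields)):
        row = {"id": individual}
        row.update(zip(fields, combo))
        rows.append(row)
    return rows


rows = combination_rows("I1", candidates)
for row in rows:
    print(row)
print(len(rows), "rows for a single individual")
```

With two, two and three candidate values the table already needs 2 x 2 x 3 = 12 rows for one person, and every additional uncertain field multiplies that count again.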
Unfortunately, this has two drawbacks. First, you could feasibly end up with hundreds of rows for a single individual as the different possibilities for dozens of fields multiply up – with only one row being 'correct' – resulting in a lot of unnecessary data duplication. You could employ nested list columns to get around this, but this would make the dataframe complex to deal with and difficult to share with non-R users. It also wouldn't solve the second issue: being able to record the data source for each conflicting piece of data. These limitations rapidly lead you down a path of considering an 'ultra-tidy' dataframe instead, where each row records a possible value for an individual attribute and a source can be recorded for each, e.g.
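A hypothetical version of such an 'ultra-tidy' layout is sketched below in Python, with one row per possible value and a source attached to each. Only the two birth years come from the text; the identifier, the other attributes, the values, the source names and the helper function are all invented for illustration.

```python
# One row per possible value: (individual, attribute, value, source).
# Only the two birth years are from the text; everything else is invented.
records = [
    ("I1", "birth_year", "1900", "census record"),
    ("I1", "birth_year", "1901", "baptism register"),
    ("I1", "birth_place", "Leeds", "census record"),
    ("I1", "birth_place", "York", "family letter"),
    ("I1", "occupation", "miner", "census record"),
    ("I1", "occupation", "farmer", "marriage certificate"),
    ("I1", "occupation", "clerk", "family letter"),
]


def possible_values(records, individual, attribute):
    """Return the (value, source) pairs recorded for one attribute."""
    return [
        (value, source)
        for ind, attr, value, source in records
        if ind == individual and attr == attribute
    ]


birth_years = possible_values(records, "I1", "birth_year")
print(birth_years)
print(len(records), "rows in total")
```

In this layout the number of rows grows with the sum, not the product, of the candidate values per attribute, and each conflicting value keeps its own source.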
This is a lot better, especially considering you could add a 'notes' column (which is one of the tags in a GEDCOM file) that you could attach to any data value.
Unfortunately, uncertainty isn't the only reason why a field would have more than one value. Fields like occupation and address could have several values, as an individual may have had several over their lifetime. So, we might consider adding further fields to the above, capturing the instants or periods of time for which the value applies.
Now we encounter a real problem. There is a very good reason why the GEDCOM data structure is nested in nature: to handle things like name and address. The NAME field may contain the individual's full name, but child fields may decompose this into given name (GIVN) and surname (SURN), as well as other child fields not found in the parent NAME field, such as nicknames (NICK). Similarly, the address field has child fields for city, state, and country. I have considered having something like three attribute columns (for 3 levels of nesting), but we lose the benefit of having one row per attribute, and it feels like a fudge too far.
I've therefore abandoned my intention of converting my GEDCOM files to tidy dataframes and have looked for alternatives. I know Peter has begun exploring network data structures and I can certainly see why. I have since discovered an open source genealogy project called Gramps which seems to rely on XML data structures. Sounds promising. I intend to try installing this and seeing how it fares with converting my existing GEDCOM files.
To leave a comment for the author, please follow the link and comment on their blog: R on R-house.