[R-bloggers] Le Monde puzzle [#1157] (and 7 more aRticles)
- Le Monde puzzle [#1157]
- How to Switch from Excel to R Shiny: First Steps
- Evaluating American Funds Portfolio
- Visualization of COVID-19 Cases in Arkansas
- RStudio v1.4 Preview: Visual Markdown Editing
- New Polished Feature – User Roles
- Time Series Forecasting: KNN vs. ARIMA
- How to Convert Continuous variables into Categorical by Creating Bins
Posted: 30 Sep 2020 09:20 AM PDT
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.] The weekly puzzle from Le Monde is an empty (?) challenge:
Indeed, if the sums are equal, then the sum of their sums is even, meaning the sum of the ten integers is even. Any choice of these integers such that the sum is even is a sure win for Aputsiaq. End of the lame game (if I understood its wording correctly!). If some integers can be left out of the groups, then both solutions seem possible: using the R code P=1; M=1e3; while (P
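As an illustration, assuming the game amounts to splitting ten chosen integers into two groups of five with equal sums, a brute-force check of whether such a split exists might look like this (this is not the original code from the post, and the group-of-five assumption is mine):

```r
# Hedged sketch: check whether ten integers can be split into two groups of
# five with equal sums. If the total is odd, no equal split can exist, which
# matches the parity argument made above.
can_split <- function(x) {
  total <- sum(x)
  if (total %% 2 != 0) return(FALSE)
  any(combn(x, 5, sum) == total / 2)   # try every possible group of five
}

set.seed(1)
x <- sample(1:50, 10)   # ten arbitrary distinct integers
can_split(x)
```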
How to Switch from Excel to R Shiny: First Steps Posted: 30 Sep 2020 02:15 AM PDT
[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers.]
tl;dr: If you're still using Excel or Google Sheets for business, you might already know that Excel is obsolete for many business use-cases. But how do you switch from Excel to a better alternative like R Shiny? It's easier to get started than you might think, especially if you're already an Excel power user. This article will walk you through a sample migration of an analytics tool built with Excel and Google Sheets to a dashboard built with R Shiny. We'll show you how to prepare the data for migration, how to create a simple R Shiny app, how to view and filter tables in R Shiny, how to modify a table using SQL, and also how to add interactive filters using the ShinyWidgets library.
This article shouldn't take too much of your time to read – perhaps no more than 15 minutes. The final Shiny dashboard has only 45 lines of well-formatted code, so it won't be too demanding to code along. You can download the dataset here.
Excel is a Bad Choice for Businesses – Here's Why
Excel has been an excellent tool for decades due to its relative intuitiveness and the WYSIWYG (What You See Is What You Get) user interface. Still, running an important business process in an Excel workbook in 2020 is a big mistake, as there are a variety of more stable and sophisticated tools readily available. Excel is prone to human errors. Everyone with access to an Excel document can edit all of its features, which can lead to unintentional changes or the removal of crucial functions. As a result, you can easily end up with a broken spreadsheet whenever a user accidentally (or intentionally) changes a parameter. This is a serious issue because it is currently not possible to fully version control Excel. This means that you cannot effectively keep track of changes in an Excel spreadsheet and revert to a previous version if something breaks. To put it simply – it is not safe to keep a company's critical information in Excel workbooks, especially when it comes to business analytics. Excel can still be useful as an alternative to SQL-based solutions for creating simple 'databases', as it is very intuitive and anyone can contribute. However, in the long run, Excel is just too fragile to handle complex business analyses.
Alternatives to Excel: Python and R
To effectively plan your next business move, you need access to features that Excel doesn't offer, such as integration with machine learning models. You also might need a scalable way of connecting to external data sources via APIs. Due to the variety of analytical approaches that a modern business needs to draw proper conclusions, it is crucial to use a more flexible and reliable tool. Therefore, it is necessary to move as many analytical operations as possible away from Excel. There are plenty of better-equipped tools for data analysis and data visualization, and they should not be too difficult to master for someone already proficient in Excel. Python and R are two of the most popular beginner-friendly programming languages. Many non-IT people find R easier and more intuitive to use than Python. This is good news because R and R Shiny alone can cover the majority of business operations stored in Excel and they really can open the door to modern, ground-breaking data analysis.
Migration from Excel to R: Getting Started
Every migration starts with proper data preparation. This step might take some time if a company has a lot of Excel workbooks. The most important thing is to separate the original data from the analyzed data, create tables, and save them in CSV format. If you're just starting – start small. Pick one or two CSVs that move the needle. In almost any use case, you don't have to start with a complete dataset. Remember:
It would make sense to switch from CSV to SQL in the future, but using CSV is not a dealbreaker in the beginning. Moreover, you've likely performed some useful and effective analytical operations within Excel. Don't get rid of them just yet. By all means, analyze/verify these operations one more time, and describe their logic in detail (to recreate it later in R).
Sample Case: Job Hunt Analysis
Last year, one of our consultants was searching for a full-time job, and they started tracking their application processes to get a more accurate picture of their career prospects – primarily to find out which companies and industries found their profile interesting. Information about all applications sent in 2019 and 2020 was stored in two Excel sheets (original data) and one Google Sheet (analysis). Let's take a quick preview of this data. Original data: Analysis: As you can see in the screen recording above, the dashboard works fine. However, there are some issues we encountered while using and maintaining it:
Because of the reasons above, it makes sense to migrate this dashboard to R Shiny and see if these problems can be eliminated.
Data Preparation
In this case, data preparation was fairly easy. Two sheets were merged into one, some values were replaced (YES -> 1, NO -> empty cell) using Find and Replace, and the data was saved to a CSV file: The next logical step is describing the features we need to migrate from Excel to R Shiny. The following table summarizes the steps pretty well:
Loading a Dataset
To start the migration process, we downloaded RStudio (a free development environment for R from our partner RStudio, PBC), found the CSV file we wanted to use, and imported it into RStudio: After a successful import, the file appeared in the Global Environment. R had no problem recognizing the CSV table.
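The same import can also be done in code rather than through RStudio's import dialog. Here is a minimal sketch, assuming the prepared file is saved as job_hunt.csv (a hypothetical file name) and using the object name j_h that appears later in the post:

```r
# Read the prepared CSV into R; adjust the path/file name to wherever the
# export from Excel/Google Sheets was saved.
j_h <- read.csv("job_hunt.csv", stringsAsFactors = FALSE)

str(j_h)    # check that the column types were recognized correctly
head(j_h)   # quick look at the first rows
```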
Creating Your First R Shiny Dashboard
Let's start simple with something that remotely resembles the original dashboard. The main goal is to make a simple app that displays the source data and filters it by Job Category. It's important to understand the two main components of an R Shiny app – the UI (User Interface) and the server. The UI is the graphic layout of an app – everything the user sees on a webpage. The server is the backend of the application. The app is served by the computer that runs R as a page that can be viewed in a web browser.
Note: To share the R Shiny app with others, you either need to send them a copy of the script or host this page via an external web server. To start, let's use the most basic Shiny app template:

library(shiny)
ui <- fluidPage()
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

Defining Input and Output
Input is everything the user can interact with on a website. To name a few:
Each input must have an inputId (a local name, e.g. 'value') and a label (a description that will be displayed in the app, e.g. 'Select value'). In addition, depending on the type of input, you can provide additional parameters that will specify/limit the actions a user can perform. For more on defining input and output, and other aspects of Shiny, read this tutorial by RStudio. In the first draft of the app, let's create a reactive select box from which the user can choose any job category that appears in the dataset. Therefore, besides defining an inputId and a label, we need a list of choices for the dropdown list (choices = TableName$ColumnName):

selectInput('jobcategory', 'Select a category', choices = j_h$JOB_CATEGORY)

The output is declared in the UI with an output function and an outputId – in our case, tableOutput('jobhuntData').
Like many basic Shiny apps, our draft Shiny app is quite ugly by default. Let's fix this with some layout elements: titlePanel(), sidebarLayout(), sidebarPanel(), and mainPanel(). At this point, after adding all elements to the fluidPage() function, our code looks like this:

library(shiny)

ui <- fluidPage(
  titlePanel('JOB HUNT RESULTS'),
  sidebarLayout(
    sidebarPanel(
      selectInput('jobcategory', 'Select a category', choices = j_h$JOB_CATEGORY)
    ),
    mainPanel(
      tableOutput('jobhuntData')
    )
  )
)

server <- function(input, output) {}

shinyApp(ui = ui, server = server)

We can see the filter, but there is no table yet. This is because R Shiny does not know what kind of table we want to generate. Let's introduce the server logic to address this.
How to Use Shiny Server
To build the first draft of the app, we need to create a source for the tableOutput() function by using a render function. Render functions (e.g. renderImage() to render an image, renderPlot() to render a plot/graph, renderText() to render text, etc.) turn an R object into HTML and place it in a Shiny webpage. Below you can see how we assigned the outputId ("jobhuntData") to a function that renders the desired output – in our case, renderTable() to render a table. Inside this function, we specified the data that we want to see in the table. Note that input$jobcategory refers to the input function from the UI, and it is always equal to the current value of the input (the value selected by the user).

library(shiny)

ui <- fluidPage(
  titlePanel('JOB HUNT RESULTS'),
  sidebarLayout(
    sidebarPanel(
      selectInput('jobcategory', 'Select a category', choices = j_h2$JOB_CATEGORY)
    ),
    mainPanel(
      tableOutput('jobhuntData')
    )
  )
)

server <- function(input, output) {
  output$jobhuntData <- renderTable({
    jobcategoryFilter <- subset(j_h2, j_h2$JOB_CATEGORY == input$jobcategory)
  })
}

shinyApp(ui = ui, server = server)

The current version of the app does not look amazing, but we can see that the correct data is shown, and the server generates the proper output according to the input provided by the user:
Migration – SQL and ShinyWidgets
Now that we know how to create a basic dashboard in R Shiny, we are going to migrate the other features from our original dashboard. First and foremost, we had to not only create filters for all columns but also aggregate/group the data by YEAR and COUNTRY. There are several ways to modify the dataset in R, but we decided to do it using an SQL SELECT statement. SQL is another topic on its own, but we recommend that you learn the basics of SQL if you work with data on a daily (or even weekly) basis. This is one of the SQL statements we used to create an aggregated view in Google Sheets: Below is the logic that we applied in R using the sqldf library. It enables us to see how many phone screenings, interviews, and offers we had each year in every country:

library(sqldf)

aggregated_data = sqldf('SELECT YEAR, COUNTRY, JOB_CATEGORY,
                         COUNT(PHONE_SCREENING) AS PHONE_SCREENING,
                         COUNT(INTERVIEW) AS INTERVIEW,
                         COUNT(OFFER) AS OFFER
                         FROM j_h2
                         GROUP BY JOB_CATEGORY, YEAR, COUNTRY
                         ORDER BY JOB_CATEGORY')

This is what the new table "aggregated_data" looks like: Adding multiple conditional filters can be a very difficult task, but the ShinyWidgets library offers a perfect solution: the selectizeGroup module. Having imported ShinyWidgets, we replaced selectInput() with selectizeGroupUI() and added one more function – callModule(). This way we have eliminated the possibility of choosing a combination that does not exist.
Below you can see the entire solution:

library(sqldf)
library(shiny)
library(shinyWidgets)

aggregated_data = sqldf("SELECT YEAR, COUNTRY, JOB_CATEGORY,
                         COUNT(PHONE_SCREENING) AS PHONE_SCREENING,
                         COUNT(INTERVIEW) AS INTERVIEW,
                         COUNT(OFFER) AS OFFER
                         FROM j_h2
                         GROUP BY JOB_CATEGORY, YEAR, COUNTRY
                         ORDER BY JOB_CATEGORY")

shinyApp(
  ui = fluidPage(
    titlePanel("JOB HUNT RESULTS"),
    sidebarPanel(
      selectizeGroupUI(
        id = "fancy_filters",
        inline = FALSE,
        params = list(
          YEAR = list(inputId = "YEAR", title = "Year", placeholder = 'All'),
          COUNTRY = list(inputId = "COUNTRY", title = "Country", placeholder = 'All'),
          JOB_CATEGORY = list(inputId = "JOB_CATEGORY", title = "Job category", placeholder = 'All'),
          PHONE_SCREENING = list(inputId = "PHONE_SCREENING", title = "Number of positive replies", placeholder = 'All'),
          INTERVIEW = list(inputId = "INTERVIEW", title = "Number of interview invitations", placeholder = 'All'),
          OFFER = list(inputId = "OFFER", title = "Number of offers", placeholder = 'All')
        )
      )
    ),
    mainPanel(
      tableOutput("jobhuntData")
    )
  ),
  server = function(input, output, session) {
    res_mod <- callModule(
      module = selectizeGroupServer,
      id = "fancy_filters",
      data = aggregated_data,
      vars = c("YEAR", "COUNTRY", "JOB_CATEGORY", "PHONE_SCREENING", "INTERVIEW", "OFFER")
    )
    output$jobhuntData <- renderTable({
      res_mod()
    })
  })

Conclusion
Working with a new tool like R Shiny can be intimidating at first, but in some ways it can be even easier to learn and understand than Excel or Google Sheets. It is more flexible in terms of adding new features or modifying existing ones. Because we replaced four tables with one, the dashboard not only looks better than our Excel and Google Sheets tool – it is also much easier to use. Moreover, we managed to create an app where the user is in complete control of the displayed data but does not have access to the backend. This means we do not need to worry about non-technical users making accidental changes to the source code or breaking the app. We can also apply version control and store the source code of the app on services like GitHub in a way that allows us to safely revert to previous versions. This way, anyone we want to share the code with can download it and make contributions in a controlled environment.
Learn More
This article was originally written by Zuzanna Danowska with further edits from Appsilon team members Marcin Dubel and Dario Radečić. The article How to Switch from Excel to R Shiny: First Steps comes from Appsilon Data Science | End to End Data Science Solutions.
Evaluating American Funds Portfolio Posted: 29 Sep 2020 02:30 PM PDT
[This article was first published on business-science.io, and kindly contributed to R-bloggers.]
Introduction
Active funds have done poorly over the last ten years, and in most cases, struggled to justify their fees. A growing list of commentators appropriately advocate for index funds, although they sometimes go a little beyond what we believe to be fairly representing the facts. The inspiration for this article is this post by the Asset Builder blog site, American Funds Says, "We Can Beat Index Funds", scrutinizing claims by the fund group. Asset Builder asserts that "Even without this commission, the S&P 500 beat the aggregate returns of these ("American") funds over the past 1-, 3-, 5-, 10- and 15-year periods". In the post, there is a supporting chart showing a group of American Funds ("AF") compared to the Vanguard Total Market ("TMI") index. This analysis struck us as being in conflict with our own experience as actual holders of a core portfolio of eight AF over the last 20 years, so this post will be about exploring this data. In this article, we will download the weekly closing prices of the relevant AF and the most comparable Vanguard funds, re-construct our portfolio and estimate the corresponding weighting of different asset classes for each, replicate a relevant benchmark portfolio of Vanguard index funds, and explore their relative performance histories over the period to try to square the two perspectives. We will also consider the possibility that AF's declining out-performance versus our customized benchmark over the last 15 years may have to do with growing fee differentials with index alternatives. As usual, Redwall would like to avoid committing to any particular viewpoint other than to follow the data and see where it leads. If we have made any mistakes in our assumptions or the data used, we welcome polite commentary to set us straight. We have no relationship with the AF, and for the most part are sympathetic to those who say that index funds may be the best choice for most investors. All the code is available on Github for anybody to replicate. Also, to be clear, Redwall is not an investment adviser and is making no investment recommendations.
Set Up of AF Portfolio
During the 2000 bear market, Redwall put substantial research into its investment strategy, and concluded that the AF had a competitive advantage over other mutual fund groups. Capital Group, the operator of the AF, was founded at the beginning of the Great Depression in 1932. Capital had a large group of experienced managers sitting in different locations around the world, with varied perspectives, owning a heavy component of their own funds, with each investing in a concentrated portfolio of their own highest-conviction ideas. Managers had a strong incentive to think long-term instead of for the next quarter. If the style of one manager of a fund was out of sync with the current flavor of the market, others might pick up the pace. The cost of research could be leveraged over a much larger asset base than most mutual funds while still keeping running costs at a manageable level. Being one of the largest managers, analysts and managers would always have access to the best information and advice. Convinced that AF were a solid set-it-and-forget-it portfolio, investments were made with monthly dollar-cost averaging without paying loads, mostly between 2001 and 2004.
Description of American Funds Held
The AF don't fit well into the traditional Morningstar investment categories. By and large, their portfolios are many times larger than those of other active funds, and they mostly stick to the largest of the large-capitalization global stocks. Washington Mutual mostly owns US mega-cap value stocks and holds no cash, while Amcap often moves down the market capitalization spectrum a bit with growth stocks, and will hold a substantial amount of cash. Capital Income Builder has a mix of US and overseas stocks which pay high dividends with room to grow. Income Fund of America is similar to Capital Income Builder, but has a more US-oriented mix and takes more credit risk. Capital World Growth and Income is like Washington Mutual in its stock selection, but will hold a small amount of credit at times when it makes more sense than the equity. New Perspective owns the largest multinational companies, domiciled in the US and around the world, that have acquired the competency to expand across borders.
New Geography of Investing
It was probably operating New Perspective, set up to invest in companies having a majority of revenues coming from outside of their country of domicile, which led AF to discover a new way of looking at its portfolios. In the New Geography of Investing campaign launched in 2016, they do an excellent job of explaining the concept that a portfolio shouldn't be constrained by company domicile, a central pillar of the Morningstar ratings platform. In addition to the country of domicile, AF now disclose the aggregated geographic mix of revenues of all of their portfolios on their website, and explain clearly that they don't prioritize fitting their portfolios into Morningstar regional boxes at the expense of finding the best investments. Because of this, a single index benchmark may be less applicable to AF funds than to some others.
Doublecheck Asset Builder Values
We believe that Asset Builder were referring to no-load AF in their table, but we are not sure. It has been possible to buy American Funds F-1 class shares load-free since 2016 (with a 3 bps higher annual expense ratio), so there is no reason for anyone who doesn't want to pay the up-front sales charge for advice to pay one. As shown below, we calculate that Asset Builder's ending value is 3-4% too high for the Vanguard fund, but also too low for 4 out of the 5 AF without loads. For the most part, their assertion that AF's funds lose to the S&P still holds up, even with these adjustments. If taxes were taken into account, it would widen the performance advantage of TMI. Still, this is a strange pattern (tilting the calculation in favor of TMI and against AF), and it makes us a little suspicious of Asset Builder. The assertion also doesn't take risk into account. As we will discuss below, the AF funds are all less volatile than the market over the period.
Customized Vanguard Benchmark Index Portfolio
There is nothing wrong with Asset Builder's choice of the Vanguard Total Market Index (TMI) as a comp for the US funds, but our portfolio also includes several non-US and balanced funds. As shown below, we will be comparing our portfolio to a benchmark that is 54.5% the S&P index. The S&P has an average market capitalization almost twice as large as the Total Market Index, and we believe it is more comparable to the typical holdings of the AF. We are also including 24.5% of our benchmark in non-US stocks based on our estimated weightings shown in the matrix below.
AF also run with a higher amount of cash than index funds, as can be seen in our estimated 7.35% weighting in VFISX below. Cash reserves are a drag on performance during bull markets, so this has likely been weighing on AF in recent years. During the 2000 tech crash, extra cash gave AF room to maneuver, and as we show below, helped them achieve ~30% out-performance through the bear market. Our benchmark is more granular, and we believe a fairer comparison than the TMI for our portfolio, but in the end it is still only an estimate. Weightings over time have not been as static as we have assumed, and we have chosen one set of weightings for the entire 20-year period. A future analysis may look at ways of flexing our weightings matrix over time.
Download Raw Weekly Mutual Fund Price Data with Quantmod
In the course of writing this blog, Redwall has frequently expressed amazement that so many analyses, not possible previously, are now enabled so quickly with a few lines of code. Using the quantmod package, downloading the raw weekly prices of all of the funds is one of those tasks.
Preprocess Data into Weekly Log Returns for Analysis
Our data list contains 14 xts (time series) objects with dates and prices of each fund over the period.
AF Steadily Outperforming our Customized Benchmark
The chart below gives a much better "apples-to-apples" benchmark for comparison to our portfolio than the Vanguard Total Market Index would have. It is true that the mainly US-oriented AF that we hold may not have outperformed as much as the non-US-heavy portfolios. But our portfolio is global and, as can be seen here, in aggregate it has been outperforming steadily except for a few relatively short periods. We can see three periods of either under-performance or treading of water relative to the benchmarks at the tail end of the previous two bull markets, followed by subsequent out-performance.
Money Difference of AF vs Index Benchmarks
The annual active premium of the AF portfolio over the whole period has been about 1.8% per annum, but as we will discuss below, the fund group's premium may be compressing. If we choose the starting point to be the beginning of 2003, it falls to 1.02%. Over the full period, as shown below, a dollar invested in 1997 would be worth $4.47 for the AF portfolio (in blue), while the benchmark (in orange) would yield $3.03 (a considerable reward for hiring AF, even ignoring likely greater tax inefficiency). If we move the start to 2002 (around when we built our portfolio), the difference falls to $3.16 versus $2.66.
Mutual Fund Grading Ready for Overhaul
Morningstar came up with the idea of mutual fund Star Ratings in 1985 to compare funds across broadly defined categories. They took it a step further and created investment style and regional boxes in 1992, which all made sense at the time. Just like other report cards though, investors began to try to game the system by moving funds among categories, launching and merging funds when advantageous, and creating incentives for managers to chase quarterly or calendar-year returns. It doesn't seem to make a lot of sense now to make decisions about manager skill over any particular year or group of years when it is possible to break a fund into weekly performance and build new benchmarks, all in a matter of a day or two, as we have done in this analysis. It is easily possible to extract all periods to see how persistently or not a fund has out-performed.
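The full code for these steps is on GitHub (linked above). As a rough sketch of the kind of workflow described here – weekly prices via quantmod, weekly log returns, and then the share of weeks in which one return series beats another – something along the following lines would work. The tickers and date range below are placeholders rather than the exact funds and period used in the post (VFISX is the only ticker mentioned by name above):

```r
library(quantmod)

# Placeholder tickers -- swap in the actual AF and Vanguard funds of interest.
symbols <- c("AWSHX",   # example American Funds equity fund
             "VTSMX",   # example Vanguard index fund used as the benchmark
             "VFISX")   # Vanguard Short-Term Treasury, the cash proxy mentioned above

# Download adjusted prices from Yahoo Finance
prices <- lapply(symbols, function(s) {
  Ad(getSymbols(s, src = "yahoo", from = "2000-01-01", auto.assign = FALSE))
})
names(prices) <- symbols

# Convert each price series to weekly log returns
weekly_returns <- lapply(prices, weeklyReturn, type = "log")

# Share of weeks in which the fund beat the benchmark
both <- na.omit(merge(weekly_returns[["AWSHX"]], weekly_returns[["VTSMX"]]))
mean(both[, 1] > both[, 2])
```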
American Funds itself did an analysis along these lines last year, The Select Investment Scorecard, but unfortunately hasn't updated it or made the data available for others to reproduce, though from a quick glance at the methodology, it seemed robust. It is hard to understand why Morningstar wouldn't want to improve its measurement process along these lines.
Looking at Number of Weeks with Outperformance
We took all of our 1196 weeks and calculated the percentage of weeks by quarter in which our AF portfolio outperformed the benchmark. We can see that the ratio of weeks outperforming is greater than 0.5 in almost all periods, though it broke below briefly during 2007 and again last week. The confidence bars are wide, so it is hard to conclude definitively that the ratio has been above 0.5 since 2006-7. After looking at this chart for a while, the downward trend since 2005 certainly struck us.
Modeling Fee Reductions in Line with Index Fund Benchmarks
In 2005, the cost of many of the index funds we used in the comparison exceeded 30 bps, and today the best-in-class index funds are at or below 10 bps. We might have to study it more, but it seems like there was a bigger reduction in overseas and bond index funds than for the S&P, which was already low by 2010. Meanwhile, AF haven't lowered their expense ratios meaningfully in 20 years. That means that their managers would have to generate that much higher gross returns just to maintain the same active return. If we model in a 1 bp fee reduction per year, or 23 bps over the full period, the out-performance trajectory improves noticeably, though we are still not sure it is greater than 50%.
Conclusion
This has been a quick analysis to become accustomed to the data and the tools used here.
Author: David Lucy, Founder of Redwall Analytics
Visualization of COVID-19 Cases in Arkansas Posted: 29 Sep 2020 11:40 AM PDT
[This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers.] Throughout the COVID-19 pandemic, the main sources of information for case numbers in the State of Arkansas have been daily press conferences by the Governor's Office (until recently, when they moved to weekly) and the website arkansascovid.com. I haven't been particularly impressed with the visualizations used by either source. Today I'm sharing some code that I have been using throughout the pandemic to keep track of how Arkansas is doing. We'll use several libraries, the purpose of which is indicated in the comments:

library(tidyverse)
library(lubridate) # Date wrangling
library(gganimate) # GIF production
library(tidycensus) # Population estimates
library(transformr) # used by gganimate
library(ggthemes) # map themes
library(viridis) # Heatmap color palette
library(scales) # Pretty axis labels
library(zoo) # rollapply

knitr::opts_chunk$set( message = F, echo = T, include = T )
options( scipen = 10 ) # print full numbers, not scientific notation

We'll use the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which is maintained at Github. The data can be read in a single line, although we'll reorganize the case counts into a long format for ease of further wrangling. Here's a snippet of the table:

covid_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
covid_cases <- pivot_longer(covid_cases, 12:length(covid_cases), names_to = "date", values_to = "cases") %>%
  mutate(date = lubridate::as_date(date, format = "%m/%d/%y")) %>%
  filter(Province_State == 'Arkansas') %>%
  arrange(date, Combined_Key)
tail(covid_cases %>% select(Combined_Key, date, cases))
## # A tibble: 6 x 3
##   Combined_Key date cases

Because we'll be doing per-capita calculations, we need to load population estimates. Fortunately, the tidycensus package provides a convenient method of obtaining that information. Here's a snapshot of the population data:

population <- tidycensus::get_estimates(geography = "county", "population") %>%
  mutate(GEOID = as.integer(GEOID)) %>%
  pivot_wider( names_from = variable, values_from = value ) %>%
  filter(grepl("Arkansas", NAME))
head(population)
## # A tibble: 6 x 4
##   NAME GEOID POP DENSITY

The Governor's Office makes several design choices designed to spin statistics so that it looks like Arkansas is doing a good job of managing the crisis (as of the writing of this post, the statistics suggest otherwise). For example, the bar chart of rolling cases often splits out prison cases and community spread cases so the overall trend is obscured. Further, the use of bar charts rather than line charts also makes it harder to visualize the trend of new cases.
We'll use a trendline of overall cases without spin:

ark_covid_cases <- covid_cases %>% filter(`Province_State` == 'Arkansas')

p <- ark_covid_cases %>%
  filter(cases > 0) %>%
  group_by(Province_State, date) %>%
  mutate(cases = sum(cases)) %>%
  ggplot(aes(x = date, y = cases)) +
  geom_line() +
  scale_x_date(breaks = scales::pretty_breaks()) +
  scale_y_continuous(labels = unit_format(unit = "k", sep = "", big.mark = ",", scale = 1/1000)) +
  labs(
    title = "Total COVID-19 cases in Arkansas",
    x = "", y = "",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  )
ggsave(filename = "images/ark_covid_total_cases.png", plot = p, height = 3, width = 5.25)
p

This trendline shows no real signs of leveling off. As we'll see later on, the number of new cases isn't going down. The Governor's Office and the website arkansascovid.com both use arbitrarily selected population metrics to depict per-capita cases (typically 10,000 residents). We'll use a different per-capita metric that is reasonably close to the median county size in the state. As such, for many counties, the per-capita number will be reasonably close to the actual population of the county. That number can be calculated from the state's population metrics:

per_capita <- population %>%
  filter(grepl("Arkansas", NAME)) %>%
  summarize(median = median(POP)) %>% # Get median county population
  unlist()
per_capita
## median
##  18188

Instead of using the actual median, we'll round it to the nearest 5,000 residents:

per_capita <- plyr::round_any(per_capita, 5e3) # Round population to nearest 5,000
per_capita
## median
##  20000

Now that we have the population figure we want to use for the per-capita calculations, we will perform those using the lag function to calculate the new cases per day, and then use the rollapply function to smooth the number of daily cases over a sliding 1-week (7-day) window. The results look like this:

roll_ark_cases <- ark_covid_cases %>%
  arrange(date) %>%
  group_by(UID) %>%
  mutate(prev_count = lag(cases)) %>%
  mutate(prev_count = ifelse(is.na(prev_count), 0, prev_count)) %>%
  mutate(new_cases = cases - prev_count) %>%
  mutate(roll_cases = round(zoo::rollapply(new_cases, 7, mean, fill = 0, align = "right", na.rm = T))) %>%
  ungroup() %>%
  select(-prev_count) %>%
  left_join( population %>% select(-NAME), by = c("FIPS" = "GEOID") ) %>%
  mutate(
    cases_capita = round(cases / POP * per_capita), # cases per per_capita residents
    new_capita = round(new_cases / POP * per_capita), # new cases per per_capita residents
    roll_capita = round(roll_cases / POP * per_capita) # rolling new cases per per_capita residents
  )
tail(roll_ark_cases %>% select(date, Admin2, POP, cases, new_cases, roll_cases, roll_capita))
## # A tibble: 6 x 7
##   date Admin2 POP cases new_cases roll_cases roll_capita

We can summarize those results in order to get a total number of rolling cases for the entire state, which looks like this:

roll_agg_ark_cases <- roll_ark_cases %>%
  group_by(date) %>%
  summarize(roll_cases = sum(roll_cases))
tail(roll_agg_ark_cases)
## # A tibble: 6 x 2
##   date roll_cases

We can then plot the aggregate number of rolling cases over time.
We'll show a couple of different time points relevant to the spread of the coronavirus, including the Governor's mask/social distancing mandate and the reopening of public schools:

p <- roll_agg_ark_cases %>%
  ggplot(aes(date, roll_cases)) +
  geom_line() +
  geom_vline(xintercept = as.Date("2020-08-24"), color = "gray10", linetype = "longdash") +
  annotate(geom = "text", label = "School\nstarts", x = as.Date("2020-08-05"), y = 200, color = "gray10") +
  annotate(geom = "segment", y = 290, yend = 400, x = as.Date("2020-08-05"), xend = as.Date("2020-08-24")) +
  geom_vline(xintercept = as.Date("2020-07-16"), color = "gray10", linetype = "longdash") +
  annotate(geom = "text", label = "Mask\nmandate", x = as.Date("2020-06-21"), y = 100, color = "gray10") +
  annotate(geom = "segment", y = 190, yend = 300, x = as.Date("2020-06-21"), xend = as.Date("2020-07-16")) +
  geom_smooth(span = 1/5) +
  labs(
    title = "7-Day Rolling Average of New COVID-19 Cases in Arkansas",
    x = "", y = "",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  ) +
  theme( title = element_text(size = 10) )
ggsave(filename = "images/ark_covid_rolling_cases.png", plot = p, height = 3, width = 5.25)
p

From this plot, it appears that the mask mandate may have had a positive effect in leveling off the number of new COVID-19 cases. Conversely, it appears that the reopening of schools may have led to a rapid increase in the number of new cases. Of course, the rate of virus transmission has a multitude of causes, and the correlation here doesn't necessarily imply causation. The website arkansascovid.com contains better visualizations than what the Governor's Office uses, but the default Tableau color scheme doesn't do a very good job of showing hotspots. Counties with a higher number of cases are depicted in dark blue (a color associated with cold), while counties with fewer cases are shown in pale green (a color without a heat association). In addition, there aren't visualizations that show changes at the county level over time.
So, we'll use a county-level visualization that shows the number of rolling new cases over time with a color scheme that intuitively shows hot spots:

# Start when 7-day rolling cases in state > 0
first_date <- min({ roll_ark_cases %>%
    group_by(date) %>%
    summarize(roll_cases = sum(roll_cases)) %>%
    ungroup() %>%
    filter(roll_cases > 0) %>%
    select(date) }$date)

temp <- roll_ark_cases %>%
  filter(date >= first_date) %>%
  mutate(roll_capita = ifelse(roll_capita <= 0, 1, roll_capita)) %>% # log10 scale plot
  mutate(roll_cases = ifelse(roll_cases <= 0, 1, roll_cases)) # log10 scale plot

# Prefer tigris projection for state map
temp_sf <- tigris::counties(cb = T, resolution = "20m") %>%
  mutate(GEOID = as.numeric(GEOID)) %>%
  inner_join(temp %>% select(FIPS, roll_cases, roll_capita, date), by = c("GEOID" = "FIPS")) %>%
  select(GEOID, roll_cases, roll_capita, date, geometry)

# tidycensus projection is skewed for state map
# data("county_laea")
# data("state_laea")
# temp_sf <- county_laea %>%
#   mutate(GEOID = as.numeric(GEOID)) %>%
#   inner_join(temp, by = c("GEOID" = "FIPS"))

days <- NROW(unique(temp$date))

p <- ggplot(temp_sf) +
  geom_sf(aes(fill = roll_capita), size = 0.25) +
  scale_fill_viridis(
    name = "7-day rolling cases: ",
    trans = "log10",
    option = "plasma",
  ) +
  ggthemes::theme_map() +
  theme(legend.position = "bottom", legend.justification = "center") +
  labs(
    title = paste0("Arkansas 7-day rolling average of new COVID cases per ",
                   scales::comma(per_capita), " residents"),
    subtitle = "Date: {frame_time}",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  ) +
  transition_time(date)

Sys.time()
anim <- animate(
  p,
  nframes = days + 10 + 30, fps = 5,
  start_pause = 10, end_pause = 30,
  res = 96, width = 600, height = 600, units = "px"
)
Sys.time()
anim_save("images/ark_covid_rolling_cases_plasma.gif", animation = anim)
# anim

There are a couple of design choices here that are worth explaining. First, we're animating the graphic over time, which shows where hotspots occur during the course of the pandemic. Second, we're using the plasma color palette from the viridis package. This palette goes from indigo on the low end to a hot yellow on the high end, so it intuitively shows hotspots. Third, we're using a log scale for the number of new cases – the idea here is that jumps of an order of magnitude or so are depicted in different colors (i.e., indigo, purple, red, orange, yellow) along the plasma palette. If we use a standard numerical scale for the number of new cases, jumps from 1-20 or so get washed out due to the large size of the worst outbreaks.
Conclusion
I hope you found my alternate visualizations for COVID-19 in Arkansas useful. The charts are set to update nightly, so these data should be current throughout the pandemic. If you have suggestions for improvements or notice that the figures aren't updating, please comment! Thanks for reading.
RStudio v1.4 Preview: Visual Markdown Editing Posted: 29 Sep 2020 11:00 AM PDT
[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.] Today we're excited to announce availability of our first Preview Release for RStudio 1.4, a major new release which includes the following new features:
You can try out these new features now in the RStudio v1.4 Preview Release. Over the next few weeks we'll be blogging about each of these new features in turn.
Visual Markdown Editing
R Markdown users frequently tell us that they'd like to see more of their content changes in real time as they write, both to reduce the time required by the edit/preview cycle and to improve their flow of composition by having a clearer view of what they've already written. To switch into visual mode for a markdown document, use the button with the compass icon at the top-right of the editor toolbar: With visual mode, we've tried to create a WYSIWYM editor for people that love markdown. You can also configure visual mode to write markdown using one sentence per line, which makes working with markdown files on GitHub much easier (enabling line-based comments for sentences and making diffs more local to the actual text that has changed). Anything you can express in pandoc markdown (including tables, footnotes, attributes, etc.) can be edited in visual mode.
Embedded Code
R, Python, SQL and other code chunks can be edited using the standard RStudio source editor. Chunk output is displayed inline (you can switch to show the output in the console instead using the Options toolbar button, accessible via the gear icon), and all of the customary commands from source mode for executing multiple chunks, clearing chunk output, etc. are available.
Tables
You can insert a table using the Table menu. Note that if you select multiple rows or columns, the Insert or Delete command will behave accordingly. Try editing a table in visual mode, then see what it looks like in source mode: all of the table columns will be perfectly aligned (with cell text wrapped as required).
Citations
Visual mode uses the standard Pandoc markdown representation for citations (e.g. [@citation]).
Use the toolbar button or the Cmd+Shift+F8 keyboard shortcut to show the Insert Citation dialog: If you insert citations from Zotero, DOI look-up, or a search, they are automatically added to your document bibliography. You can also insert citations directly using markdown syntax (e.g. [@cite]).
Equations
LaTeX equations are authored using standard Pandoc markdown syntax (the editor will automatically recognize the syntax and treat the equation as math). As shown above, when you select an equation with the keyboard or mouse, you can edit the equation's LaTeX.
Images
You can insert images using either the Insert -> Image command (Ctrl+Shift+I keyboard shortcut) or by dragging and dropping images from the local filesystem. Select an image to re-size it in place (automatically preserving its aspect ratio if you wish):
Cross References
The bookdown package includes markdown extensions for cross-references and part headers. Bookdown cross-references enable you to easily link to figures, equations, and even arbitrary labels within a document. Cross-references are largely the same in visual mode, minus a bit of the leading syntax. As shown above, when entering a cross-reference you can search across all cross-references in your project to easily find the right reference ID. Similar to hyperlinks, you can also navigate to the location of a cross-reference by clicking the popup link that appears when it's selected: You can also navigate directly to any cross-reference using IDE global search: See the bookdown documentation for more information on cross-references.
Footnotes
You can include footnotes using the Insert -> Footnote command (or the Cmd+Shift+F7 keyboard shortcut).
Emojis
To insert an emoji, you can use either the Insert menu or the requisite markdown shortcut plus auto-complete:
For markdown formats that support text representations of emojis (e.g. GitHub Flavored Markdown), the text version of the emoji is written to the markdown source.
LaTeX and HTML
You can include raw LaTeX commands or HTML tags when authoring in visual mode. The above examples utilize inline LaTeX and HTML.
Learning More
See the Visual Markdown Editing documentation to learn more about using visual mode. You can try out the visual editor by installing the RStudio 1.4 Preview Release.
New Polished Feature – User Roles Posted: 29 Sep 2020 11:00 AM PDT
[This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers.]
User roles
Under the hood, user roles are just strings that you define in your polished.tech dashboard. For example, you can make a "super_user" role. You can then assign the "super_user" role to one or more users of your Shiny app. As another example, let's say you have a table of data stored in your database that only users with a certain role should be able to edit. You may create as many roles as you need. The following is the step-by-step process for how to create roles in the polished dashboard and add them to your users:
As noted earlier, you can access your users' roles in the user object provided by polished at:

session$userData$user()$roles

Here is a Shiny app using the polished "editor" role that we created above: This is a simple feature, but we have found it helps keep our apps consistent and well organized. If you want to check out roles and other new features today, sign up for an account at polished.tech. And make sure to install the newly released version of polished from CRAN:

install.packages("polished")

Please reach out if you have questions or feedback!
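As a minimal sketch of how such a role check might be used inside a Shiny server – relying only on the session$userData$user()$roles accessor described above – something like the following would work. The "editor" role is the one created earlier, while the UI elements and data are purely illustrative, and a real app would still need the usual polished configuration:

```r
library(shiny)

# Sketch only: a real polished app would wrap the ui/server with polished's
# secure_ui()/secure_server() and be configured with your polished.tech API key.
ui <- fluidPage(
  uiOutput("editor_controls"),
  tableOutput("preview")
)

server <- function(input, output, session) {
  # Roles assigned in the polished.tech dashboard, e.g. the "editor" role above
  user_roles <- reactive(session$userData$user()$roles)

  # Only render the editing controls for users who hold the "editor" role
  output$editor_controls <- renderUI({
    if ("editor" %in% user_roles()) {
      actionButton("edit_table", "Edit table")
    }
  })

  # Illustrative data only
  output$preview <- renderTable(head(iris))
}

shinyApp(ui, server)
```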
Time Series Forecasting: KNN vs. ARIMA Posted: 29 Sep 2020 05:47 AM PDT
[This article was first published on DataGeeek, and kindly contributed to R-bloggers.] It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data are often exposed to serial correlation. In this article, we will compare k-nearest-neighbor (KNN) regression, which is a supervised machine learning method, with a more classical stochastic process, the autoregressive integrated moving average (ARIMA). We will use the monthly prices of refined gold futures (XAUTRY) for one gram in Turkish lira traded on BIST (Istanbul Stock Exchange) for forecasting. We created the data frame starting from 2013. You can download the relevant Excel file from here.

#building the time series data
library(readxl)
df_xautry <- read_excel("xau_try.xlsx")
xautry_ts <- ts(df_xautry$price, start = c(2013,1), frequency = 12)

KNN Regression
We are going to use the tsfknn package, which can be used to forecast time series in the R programming language. The KNN regression process consists of instance, features, and targets components. Below is an example to understand the components and the process.

library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 6, lags = 1:12, k = 3)
autoplot(pred, highlight = "neighbors", faceting = TRUE)

The lags parameter indicates the lagged values of the time series data. The lagged values are used as features or explanatory variables. In this example, because our time series data is monthly, we set the parameter to 1:12. The last 12 observations of the data build the instance, which is shown by purple points on the graph. This instance is used as a reference vector to find the features that are the closest vectors to that instance. The relevant distance metric is calculated by the Euclidean formula shown below, where q denotes the instance and f^i indicates the feature vectors, which are ranked in order by the distance metric:

d(q, f^i) = sqrt( (q_1 - f^i_1)^2 + (q_2 - f^i_2)^2 + ... + (q_12 - f^i_12)^2 )

The k parameter determines the number of closest feature vectors, which are called the k nearest neighbors. The nearest_neighbors function shows the instance, the k nearest neighbors, and the targets.

nearest_neighbors(pred)
#$instance
#Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#272.79 277.55 272.91 291.12 306.76 322.53 345.28 382.02 384.06 389.36 448.28
# Lag 1
#462.59
#$nneighbors
#  Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#1 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12 306.76
#2 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12
#3 223.97 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91
#   Lag 1     H1     H2     H3     H4     H5     H6
#1 322.53 345.28 382.02 384.06 389.36 448.28 462.59
#2 306.76 322.53 345.28 382.02 384.06 389.36 448.28
#3 291.12 306.76 322.53 345.28 382.02 384.06 389.36

Targets are the time-series values that come right after the nearest neighbors, and their number is the value of the h parameter. The targets of the nearest neighbors are averaged to forecast the future h periods. As you can see from the above plot, features or targets might overlap the instance. This is because the time series data has no seasonality and is in a specific uptrend. The process we have described so far is called the MIMO (multiple-input-multiple-output) strategy, which is the default forecasting strategy used with KNN.
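To make the distance-and-average mechanics concrete, here is a simplified illustration of what happens for k = 3 – this is not the tsfknn internals, just the logic described above, reusing the xautry_ts series built earlier:

```r
# Simplified illustration of KNN forecasting with the MIMO strategy:
# compare the instance with every historical feature vector, keep the k
# closest ones, and average their targets to produce the h forecasts.
x <- as.numeric(xautry_ts)
lags <- 12; h <- 6; k <- 3

instance <- tail(x, lags)                 # the last 12 observations

# All historical (feature vector, target vector) pairs
n <- length(x)
starts <- 1:(n - lags - h + 1)
features <- t(sapply(starts, function(i) x[i:(i + lags - 1)]))
targets  <- t(sapply(starts, function(i) x[(i + lags):(i + lags + h - 1)]))

# Euclidean distance of every feature vector to the instance
d <- sqrt(rowSums(sweep(features, 2, instance)^2))

nearest <- order(d)[1:k]       # indices of the k nearest neighbors
colMeans(targets[nearest, ])   # MIMO: average their targets -> h-step forecast
```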
Decomposing and analyzing the time series data
Before we mention the model, we first analyze the time series data to see whether there is seasonality. The decomposition analysis is used to calculate the strength of the seasonality, as described below:

#Seasonality and trend measurements
library(fpp2)
fit <- stl(xautry_ts, s.window = "periodic", t.window = 13, robust = TRUE)
seasonality <- fit %>% seasonal()
trend <- fit %>% trendcycle()
remain <- fit %>% remainder()

#Trend
1 - var(remain)/var(trend + remain)
#[1] 0.990609

#Seasonality
1 - var(remain)/var(seasonality + remain)
#[1] 0.2624522

The stl function is a time series decomposition method. STL is short for seasonal and trend decomposition using loess; loess is a method for estimating nonlinear relationships. The t.window (trend window) is the number of consecutive observations to be used for estimating the trend and should be an odd number. The s.window (seasonal window) is the number of consecutive years used to estimate each value in the seasonal component; in this example it is set to 'periodic' so that the seasonal component is the same for all years. The robust parameter is set to TRUE, which means that outliers won't affect the estimates of the trend and seasonal components. When we examine the results from the above code chunk, we see a strong trend (0.99) and weak seasonality (0.26); any value less than 0.4 is accepted as a negligible seasonal effect. Because of that, we will prefer the non-seasonal ARIMA model.
Non-seasonal ARIMA
This model combines differencing with autoregression and a moving average. Let's explain each part of the model.
Differencing: First of all, we have to explain stationary data. If the data doesn't contain an information pattern like trend or seasonality – in other words, if it is white noise – that data is stationary. A white noise time series has no autocorrelation at all. Differencing is a simple arithmetic operation that takes the difference between two consecutive observations to make the data stationary. The equation below shows the first difference, that is, differencing at lag 1:

y'_t = y_t - y_{t-1}

Sometimes the first difference is not enough to obtain stationary data; hence, we might have to difference the time series data one more time (second-order differencing).
In autoregressive models, our target variable is a linear combination of its own lagged values. This means the explanatory variables of the target variable are past values of that target variable. The AR(p) notation denotes the autoregressive model of order p, and e_t denotes the white noise:

y_t = c + phi_1*y_{t-1} + phi_2*y_{t-2} + ... + phi_p*y_{t-p} + e_t

Moving average models, unlike autoregressive models, use past error (white noise) values as predictor variables. The MA(q) notation denotes the moving average model of order q. If we integrate differencing with autoregression and the moving average model, we obtain a non-seasonal ARIMA model, which is short for autoregressive integrated moving average:

y'_t = c + phi_1*y'_{t-1} + ... + phi_p*y'_{t-p} + theta_1*e_{t-1} + ... + theta_q*e_{t-q} + e_t

Here y'_t is the differenced data, and we must remember it may have been differenced once or twice (first- or second-order). The explanatory variables are both lagged values of y'_t and past forecast errors. This is denoted as ARIMA(p,d,q), where p is the order of the autoregressive part, d the degree of first differencing, and q the order of the moving average part.
Modeling with non-seasonal ARIMA
Before we model the data, we first split the data into training and test sets to calculate accuracy for the ARIMA model.
#Splitting time series into training and test data
test <- window(xautry_ts, start = c(2019,3))
train <- window(xautry_ts, end = c(2019,2))

#ARIMA modeling
library(fpp2)
fit_arima <- auto.arima(train, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
fit_arima
#Series: train
#ARIMA(0,1,2) with drift
#Coefficients:
#          ma1      ma2   drift
#      -0.1539  -0.2407  1.8378
#s.e.   0.1129   0.1063  0.6554
#sigma^2 estimated as 86.5: log likelihood=-264.93
#AIC=537.85 AICc=538.44 BIC=547.01

As seen in the above code chunk, auto.arima has selected an ARIMA(0,1,2) model with drift for the training data.
Modeling with KNN

#Modeling and forecasting
library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 18, lags = 1:12, k = 3)

#Forecasting plotting for KNN
autoplot(pred, highlight = "neighbors", faceting = TRUE)

Forecasting and accuracy comparison between the models

#ARIMA accuracy
f_arima <- fit_arima %>% forecast(h = 18) %>% accuracy(test)
f_arima[, c("RMSE","MAE","MAPE")]
#                  RMSE       MAE      MAPE
#Training set  9.045488  5.529203  4.283023
#Test set     94.788638 74.322505 20.878096

For forecasting accuracy, we take the results of the test set shown above.

#Forecasting plot for ARIMA
fit_arima %>% forecast(h = 18) %>% autoplot() + autolayer(test)

#KNN Accuracy
ro <- rolling_origin(pred, h = 18, rolling = FALSE)
ro$global_accu
#     RMSE       MAE      MAPE
#137.12465 129.77352  40.22795

The rolling_origin function is used to evaluate accuracy based on a rolling origin. The rolling parameter should be set to FALSE, which makes the last 18 observations the test set and the remaining observations the training set, just like we did for the ARIMA modeling before. The test set would not be a constant vector if we had set the rolling parameter to its default value of TRUE. Below is an example for h = 6 with the rolling parameter set to TRUE. You can see that the test set dynamically changes from 6 observations down to 1, so the test sets eventually build a matrix, not a constant vector.

#Accuracy plot for KNN
plot(ro)

When we compare the results of accuracy measurements like RMSE or MAPE, we can easily see that the ARIMA model is much better than the KNN model for our non-seasonal time series data.
How to Convert Continuous variables into Categorical by Creating Bins Posted: 29 Sep 2020 05:29 AM PDT
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.] A very common task in data processing is the transformation of numeric variables (continuous, discrete, etc.) to categorical ones by creating bins. For example, it is quite often the case that we want to convert a continuous variable into a small number of groups. We will consider a random variable from the Poisson distribution with parameter λ = 20.

library(dplyr)

# Generate 1000 observations from the Poisson distribution
# with lambda equal to 20
df <- data.frame(MyContinuous = rpois(1000, 20))

# get the histogram
hist(df$MyContinuous)

Create specific Bins
Let's say that you want to create the following bins:
We can easily do that using the cut function:

df <- df %>% mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))
head(df, 10)

Let's have a look at the counts of each bin.

df %>% group_by(MySpecificBins) %>% count()

Notice that you can also define your own labels within the cut function (see the example at the end of this post).
Create Bins based on Quantiles
Let's say that you want each bin to have the same number of observations, like for example 4 bins of an equal number of observations, i.e. 25% each. We can easily do it as follows:

numbers_of_bins <- 4
df <- df %>% mutate(MyQuantileBins = cut(MyContinuous,
                                         breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1/numbers_of_bins))),
                                         include.lowest = TRUE))
head(df, 10)

We can check the counts of each bin:

df %>% group_by(MyQuantileBins) %>% count()

Notice that if you want to split your continuous variable into bins of equal width rather than equal counts, that is also straightforward, as shown in the example below.
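As a final illustration (not from the original post – the label names and the number of bins below are arbitrary), here is how equal-width bins and custom labels look with cut:

```r
# Equal-width bins: passing a single integer to `breaks` splits the observed
# range of the variable into that many intervals of equal width.
df <- df %>% mutate(MyEqualWidthBins = cut(MyContinuous, breaks = 4))

# Custom labels for hand-picked break points (the label names are just examples)
df <- df %>% mutate(MyLabelledBins = cut(MyContinuous,
                                         breaks = c(-Inf, 15, 25, Inf),
                                         labels = c("low", "medium", "high")))

df %>% group_by(MyLabelledBins) %>% count()
```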