[R-bloggers] Le Monde puzzle [#1157] (and 7 more aRticles)
- Le Monde puzzle [#1157]
- How to Switch from Excel to R Shiny: First Steps
- Evaluating American Funds Portfolio
- Visualization of COVID-19 Cases in Arkansas
- RStudio v1.4 Preview: Visual Markdown Editing
- New Polished Feature – User Roles
- Time Series Forecasting: KNN vs. ARIMA
- How to Convert Continuous variables into Categorical by Creating Bins
Posted: 30 Sep 2020 09:20 AM PDT
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.] The weekly puzzle from Le Monde is an empty (?) challenge:
Indeed, if the sums are equal, then the sum of their sums is even, meaning the sum of the ten integers is even. Any choice of these integers such that the sum is even is a sure win for Aputsiaq. End of the lame game (if I understood its wording correctly!). If some integers can be left out of the groups, then both solutions seem possible: using the R code P=1; M=1e3; while (P
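As an illustration, assuming the game amounts to splitting ten chosen integers into two groups of five with equal sums, a brute-force check of whether such a split exists might look like this (this is not the original code from the post, and the group-of-five assumption is mine):

```r
# Hedged sketch: check whether ten integers can be split into two groups of
# five with equal sums. If the total is odd, no equal split can exist, which
# matches the parity argument made above.
can_split <- function(x) {
  total <- sum(x)
  if (total %% 2 != 0) return(FALSE)
  any(combn(x, 5, sum) == total / 2)   # try every possible group of five
}

set.seed(1)
x <- sample(1:50, 10)   # ten arbitrary distinct integers
can_split(x)
```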
How to Switch from Excel to R Shiny: First Steps Posted: 30 Sep 2020 02:15 AM PDT
[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers.]
tl;dr: If you're still using Excel or Google Sheets for business, you might already know that Excel is obsolete for many business use-cases. But how do you switch from Excel to a better alternative like R Shiny? It's easier to get started than you might think, especially if you're already an Excel power user. This article will walk you through a sample migration of an analytics tool built with Excel and Google Sheets to a dashboard built with R Shiny. We'll show you how to prepare the data for migration, how to create a simple R Shiny app, how to view and filter tables in R Shiny, how to modify a table using SQL, and also how to add interactive filters using the ShinyWidgets library.
This article shouldn't take too much of your time to read – perhaps no more than 15 minutes. The final Shiny dashboard has only 45 lines of well-formatted code, so it won't be too demanding to code along. You can download the dataset here.
Excel is a Bad Choice for Businesses – Here's Why
Excel has been an excellent tool for decades due to its relative intuitiveness and the WYSIWYG (What You See Is What You Get) user interface. Still, running an important business process in an Excel workbook in 2020 is a big mistake, as there are a variety of more stable and sophisticated tools readily available. Excel is prone to human errors. Everyone with access to an Excel document can edit all of its features, which can lead to unintentional changes or the removal of crucial functions. As a result, you can easily end up with a broken spreadsheet whenever a user accidentally (or intentionally) changes a parameter. This is a serious issue because it is currently not possible to fully version control Excel. This means that you cannot effectively keep track of changes in an Excel spreadsheet and revert to a previous version if something breaks. To put it simply – it is not safe to keep a company's critical information in Excel workbooks, especially when it comes to business analytics. Excel can still be useful as an alternative to SQL-based solutions for creating simple 'databases', as it is very intuitive and anyone can contribute. However, in the long run, Excel is just too fragile to handle complex business analyses.
Alternatives to Excel: Python and R
To effectively plan your next business move, you need access to features that Excel doesn't offer, such as integration with machine learning models. You also might need a scalable way of connecting to external data sources via APIs. Due to the variety of analytical approaches that a modern business needs to draw proper conclusions, it is crucial to use a more flexible and reliable tool. Therefore, it is necessary to move as many analytical operations as possible away from Excel. There are plenty of better-equipped tools for data analysis and data visualization, and they should not be too difficult to master for someone already proficient in Excel. Python and R are two of the most popular beginner-friendly programming languages. Many non-IT people find R easier and more intuitive to use than Python. This is good news because R and R Shiny alone can cover the majority of business operations stored in Excel and they really can open the door to modern, ground-breaking data analysis.
Migration from Excel to R: Getting Started
Every migration starts with proper data preparation. This step might take some time if a company has a lot of Excel workbooks. The most important thing is to separate the original data from the analyzed data, create tables, and save them in CSV format. If you're just starting – start small. Pick one or two CSVs that move the needle. In almost any use case, you don't have to start with a complete dataset. Remember:
It would make sense to switch from CSV to SQL in the future, but using CSV is not a dealbreaker in the beginning. Moreover, you've likely performed some useful and effective analytical operations within Excel. Don't get rid of them just yet. By all means, analyze/verify these operations one more time, and describe their logic in detail (to recreate it later in R).
Sample Case: Job Hunt Analysis
Last year, one of our consultants was searching for a full-time job, and they started tracking their application processes to get a more accurate picture of their career prospects – primarily to find out which companies and industries found their profile interesting. Information about all applications sent in 2019 and 2020 was stored in two Excel sheets (original data) and one Google Sheet (analysis). Let's take a quick preview of this data. Original data: Analysis: As you can see in the screen recording above, the dashboard works fine. However, there are some issues we encountered while using and maintaining it:
Because of the reasons above, it makes sense to migrate this dashboard to R Shiny and see if these problems can be eliminated.
Data Preparation
In this case, data preparation was fairly easy. Two sheets were merged into one, some values were replaced (YES -> 1, NO -> empty cell) using Find and Replace, and the data was saved to a CSV file: The next logical step is describing the features we need to migrate from Excel to R Shiny. The following table summarizes the steps pretty well:
Loading a Dataset
To start the migration process, we downloaded RStudio (a free development environment for R from our partner RStudio, PBC), found the CSV file we wanted to use, and imported it into RStudio: After a successful import, the file appeared in the Global Environment. R had no problem recognizing the CSV table.
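The same import can also be done in code rather than through RStudio's import dialog. Here is a minimal sketch, assuming the prepared file is saved as job_hunt.csv (a hypothetical file name) and using the object name j_h that appears later in the post:

```r
# Read the prepared CSV into R; adjust the path/file name to wherever the
# export from Excel/Google Sheets was saved.
j_h <- read.csv("job_hunt.csv", stringsAsFactors = FALSE)

str(j_h)    # check that the column types were recognized correctly
head(j_h)   # quick look at the first rows
```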
Creating Your First R Shiny Dashboard
Let's start simple with something that remotely resembles the original dashboard. The main goal is to make a simple app that displays the source data and filters it by Job Category. It's important to understand the two main components of an R Shiny app – the UI (User Interface) and the server. The UI is the graphic layout of an app – everything the user sees on a webpage. The server is the backend of the application. The app is served by the computer that runs R as a page that can be viewed in a web browser.
Note: To share the R Shiny app with others, you either need to send them a copy of the script or host this page via an external web server. To start, let's use the most basic Shiny app template:

library(shiny)
ui <- fluidPage()
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

Defining Input and Output
Input is everything the user can interact with on a website. To name a few:
Each input must have an inputId (a local name, e.g. 'value') and a label (a description that will be displayed in the app, e.g. 'Select value'). In addition, depending on the type of input, you can provide additional parameters that will specify/limit the actions a user can perform. For more on defining input and output, and other aspects of Shiny, read this tutorial by RStudio. In the first draft of the app, let's create a reactive select box from which the user can choose any job category that appears in the dataset. Therefore, besides defining an inputId and a label, we need a list of choices for the dropdown list (choices = TableName$ColumnName):

selectInput('jobcategory', 'Select a category', choices = j_h$JOB_CATEGORY)

The output is declared in the UI with an output function and an outputId – in our case, tableOutput('jobhuntData').
Like many basic Shiny apps, our draft Shiny app is quite ugly by default. Let's fix this with some layout elements: titlePanel(), sidebarLayout(), sidebarPanel(), and mainPanel(). At this point, after adding all elements to the fluidPage() function, our code looks like this:

library(shiny)

ui <- fluidPage(
  titlePanel('JOB HUNT RESULTS'),
  sidebarLayout(
    sidebarPanel(
      selectInput('jobcategory', 'Select a category', choices = j_h$JOB_CATEGORY)
    ),
    mainPanel(
      tableOutput('jobhuntData')
    )
  )
)

server <- function(input, output) {}

shinyApp(ui = ui, server = server)

We can see the filter, but there is no table yet. This is because R Shiny does not know what kind of table we want to generate. Let's introduce the server logic to address this.
How to Use Shiny Server
To build the first draft of the app, we need to create a source for the tableOutput() function by using a render function. Render functions (e.g. renderImage() to render an image, renderPlot() to render a plot/graph, renderText() to render text, etc.) turn an R object into HTML and place it in a Shiny webpage. Below you can see how we assigned the outputId ("jobhuntData") to a function that renders the desired output – in our case, renderTable() to render a table. Inside this function, we specified the data that we want to see in the table. Note that input$jobcategory refers to the input function from the UI, and it is always equal to the current value of the input (the value selected by the user).

library(shiny)

ui <- fluidPage(
  titlePanel('JOB HUNT RESULTS'),
  sidebarLayout(
    sidebarPanel(
      selectInput('jobcategory', 'Select a category', choices = j_h2$JOB_CATEGORY)
    ),
    mainPanel(
      tableOutput('jobhuntData')
    )
  )
)

server <- function(input, output) {
  output$jobhuntData <- renderTable({
    jobcategoryFilter <- subset(j_h2, j_h2$JOB_CATEGORY == input$jobcategory)
  })
}

shinyApp(ui = ui, server = server)

The current version of the app does not look amazing, but we can see that the correct data is shown, and the server generates the proper output according to the input provided by the user:
Migration – SQL and ShinyWidgets
Now that we know how to create a basic dashboard in R Shiny, we are going to migrate the other features from our original dashboard. First and foremost, we had to not only create filters for all columns but also aggregate/group the data by YEAR and COUNTRY. There are several ways to modify the dataset in R, but we decided to do it using an SQL SELECT statement. SQL is another topic on its own, but we recommend that you learn the basics of SQL if you work with data on a daily (or even weekly) basis. This is one of the SQL statements we used to create an aggregated view in Google Sheets: Below is the logic that we applied in R using the sqldf library. It enables us to see how many phone screenings, interviews, and offers we had each year in every country:

library(sqldf)

aggregated_data = sqldf('SELECT YEAR, COUNTRY, JOB_CATEGORY,
                         COUNT(PHONE_SCREENING) AS PHONE_SCREENING,
                         COUNT(INTERVIEW) AS INTERVIEW,
                         COUNT(OFFER) AS OFFER
                         FROM j_h2
                         GROUP BY JOB_CATEGORY, YEAR, COUNTRY
                         ORDER BY JOB_CATEGORY')

This is what the new table "aggregated_data" looks like: Adding multiple conditional filters can be a very difficult task, but the ShinyWidgets library offers a perfect solution: the selectizeGroup module. Having imported ShinyWidgets, we replaced selectInput() with selectizeGroupUI() and added one more function – callModule(). This way we have eliminated the possibility of choosing a combination that does not exist.
Below you can see the entire solution:

library(sqldf)
library(shiny)
library(shinyWidgets)

aggregated_data = sqldf("SELECT YEAR, COUNTRY, JOB_CATEGORY,
                         COUNT(PHONE_SCREENING) AS PHONE_SCREENING,
                         COUNT(INTERVIEW) AS INTERVIEW,
                         COUNT(OFFER) AS OFFER
                         FROM j_h2
                         GROUP BY JOB_CATEGORY, YEAR, COUNTRY
                         ORDER BY JOB_CATEGORY")

shinyApp(
  ui = fluidPage(
    titlePanel("JOB HUNT RESULTS"),
    sidebarPanel(
      selectizeGroupUI(
        id = "fancy_filters",
        inline = FALSE,
        params = list(
          YEAR = list(inputId = "YEAR", title = "Year", placeholder = 'All'),
          COUNTRY = list(inputId = "COUNTRY", title = "Country", placeholder = 'All'),
          JOB_CATEGORY = list(inputId = "JOB_CATEGORY", title = "Job category", placeholder = 'All'),
          PHONE_SCREENING = list(inputId = "PHONE_SCREENING", title = "Number of positive replies", placeholder = 'All'),
          INTERVIEW = list(inputId = "INTERVIEW", title = "Number of interview invitations", placeholder = 'All'),
          OFFER = list(inputId = "OFFER", title = "Number of offers", placeholder = 'All')
        )
      )
    ),
    mainPanel(
      tableOutput("jobhuntData")
    )
  ),
  server = function(input, output, session) {
    res_mod <- callModule(
      module = selectizeGroupServer,
      id = "fancy_filters",
      data = aggregated_data,
      vars = c("YEAR", "COUNTRY", "JOB_CATEGORY", "PHONE_SCREENING", "INTERVIEW", "OFFER")
    )
    output$jobhuntData <- renderTable({
      res_mod()
    })
  })

Conclusion
Working with a new tool like R Shiny can be intimidating at first, but in some ways it can be even easier to learn and understand than Excel or Google Sheets. It is more flexible in terms of adding new features or modifying existing ones. Because we replaced four tables with one, the dashboard not only looks better than our Excel and Google Sheets tool – it is also much easier to use. Moreover, we managed to create an app where the user is in complete control of the displayed data but does not have access to the backend. This means we do not need to worry about non-technical users making accidental changes to the source code or breaking the app. We can also apply version control and store the source code of the app on services like GitHub in a way that allows us to safely revert to previous versions. This way, anyone we want to share the code with can download it and make contributions in a controlled environment.
Learn More
This article was originally written by Zuzanna Danowska with further edits from Appsilon team members Marcin Dubel and Dario Radečić. The article How to Switch from Excel to R Shiny: First Steps comes from Appsilon Data Science | End to End Data Science Solutions.
Evaluating American Funds Portfolio Posted: 29 Sep 2020 02:30 PM PDT
[This article was first published on business-science.io, and kindly contributed to R-bloggers.]
Introduction
Active funds have done poorly over the last ten years, and in most cases, struggled to justify their fees. A growing list of commentators appropriately advocate for index funds, although they sometimes go a little beyond what we believe to be fairly representing the facts. The inspiration for this article is this post by the Asset Builder blog site, American Funds Says, "We Can Beat Index Funds", scrutinizing claims by the fund group. Asset Builder asserts that "Even without this commission, the S&P 500 beat the aggregate returns of these ("American") funds over the past 1-, 3-, 5-, 10- and 15-year periods". In the post, there is a supporting chart showing a group of American Funds ("AF") compared to the Vanguard Total Market ("TMI") index. This analysis struck us as being in conflict with our own experience as actual holders of a core portfolio of eight AF over the last 20 years, so this post will be about exploring this data. In this article, we will download the weekly closing prices of the relevant AF and the most comparable Vanguard funds, re-construct our portfolio and estimate the corresponding weighting of different asset classes for each, replicate a relevant benchmark portfolio of Vanguard index funds, and explore their relative performance histories over the period to try to square the two perspectives. We will also consider the possibility that AF's declining out-performance versus our customized benchmark over the last 15 years may have to do with growing fee differentials with index alternatives. As usual, Redwall would like to avoid committing to any particular viewpoint other than to follow the data and see where it leads. If we have made any mistakes in our assumptions or the data used, we welcome polite commentary to set us straight. We have no relationship with the AF, and for the most part are sympathetic to those who say that index funds may be the best choice for most investors. All the code is available on Github for anybody to replicate. Also, to be clear, Redwall is not an investment adviser and is making no investment recommendations.
Set Up of AF Portfolio
During the 2000 bear market, Redwall put substantial research into its investment strategy, and concluded that the AF had a competitive advantage over other mutual fund groups. Capital Group, the operator of the AF, was founded at the beginning of the Great Depression in 1932. Capital had a large group of experienced managers sitting in different locations around the world, with varied perspectives, owning a heavy component of their own funds, with each investing in a concentrated portfolio of their own highest-conviction ideas. Managers had a strong incentive to think long-term instead of for the next quarter. If the style of one manager of a fund was out of sync with the current flavor of the market, others might pick up the pace. The cost of research could be leveraged over a much larger asset base than most mutual funds while still keeping running costs at a manageable level. Being one of the largest managers, analysts and managers would always have access to the best information and advice. Convinced that AF were a solid set-it-and-forget-it portfolio, investments were made with monthly dollar-cost averaging without paying loads, mostly between 2001 and 2004.
Description of American Funds Held
The AF don't fit well into the traditional Morningstar investment categories. By and large, their portfolios are many times larger than those of other active funds, and they mostly stick to the largest of the large-capitalization global stocks. Washington Mutual mostly owns US mega-cap value stocks and holds no cash, while Amcap often moves down the market capitalization spectrum a bit with growth stocks, and will hold a substantial amount of cash. Capital Income Builder has a mix of US and overseas stocks which pay high dividends with room to grow. Income Fund of America is similar to Capital Income Builder, but has a more US-oriented mix and takes more credit risk. Capital World Growth and Income is like Washington Mutual in its stock selection, but will hold a small amount of credit at times when it makes more sense than the equity. New Perspective owns the largest multinational companies, domiciled in the US and around the world, that have acquired the competency to expand across borders.
New Geography of Investing
It was probably operating New Perspective, set up to invest in companies having a majority of revenues coming from outside of their country of domicile, which led AF to discover a new way of looking at its portfolios. In the New Geography of Investing campaign launched in 2016, they do an excellent job of explaining the concept that a portfolio shouldn't be constrained by company domicile, a central pillar of the Morningstar ratings platform. In addition to the country of domicile, AF now disclose the aggregated geographic mix of revenues of all of their portfolios on their website, and explain clearly that they don't prioritize fitting their portfolios into Morningstar regional boxes at the expense of finding the best investments. Because of this, a single index benchmark may be less applicable to AF funds than to some others.
Doublecheck Asset Builder Values
We believe that Asset Builder were referring to no-load AF in their table, but we are not sure. It has been possible to buy American Funds F-1 class shares load-free since 2016 (with a 3 bps higher annual expense ratio), so there is no reason for anyone who doesn't want to pay the up-front sales charge for advice to pay one. As shown below, we calculate that Asset Builder's ending value is 3-4% too high for the Vanguard fund, but also too low for 4 out of the 5 AF without loads. For the most part, their assertion that AF's funds lose to the S&P still holds up, even with these adjustments. If taxes were taken into account, it would widen the performance advantage of TMI. Still, this is a strange pattern (tilting the calculation in favor of TMI and against AF), and it makes us a little suspicious of Asset Builder. The assertion also doesn't take risk into account. As we will discuss below, the AF funds are all less volatile than the market over the period.
Customized Vanguard Benchmark Index Portfolio
There is nothing wrong with Asset Builder's choice of the Vanguard Total Market Index (TMI) as a comp for the US funds, but our portfolio also includes several non-US and balanced funds. As shown below, we will be comparing our portfolio to a benchmark that is 54.5% the S&P index. The S&P has an average market capitalization almost twice as large as the Total Market Index, and we believe it is more comparable to the typical holdings of the AF. We are also including 24.5% of our benchmark in non-US stocks based on our estimated weightings shown in the matrix below.
AF also run with a higher amount of cash than index funds, as can be seen in our estimated 7.35% weighting in VFISX below. Cash reserves are a drag on performance during bull markets, so this has likely been weighing on AF in recent years. During the 2000 tech crash, extra cash gave AF room to maneuver, and as we show below, helped them achieve ~30% out-performance through the bear market. Our benchmark is more granular, and we believe a fairer comparison than the TMI for our portfolio, but in the end it is still only an estimate. Weightings over time have not been as static as we have assumed, and we have chosen one set of weightings for the entire 20-year period. A future analysis may look at ways of flexing our weightings matrix over time.
Download Raw Weekly Mutual Fund Price Data with Quantmod
In the course of writing this blog, Redwall has frequently expressed amazement that so many analyses, not possible previously, are now enabled so quickly with a few lines of code. Using the quantmod package, downloading the raw weekly prices of all of the funds is one of those tasks.
Preprocess Data into Weekly Log Returns for Analysis
Our data list contains 14 xts (time series) objects with dates and prices of each fund over the period.
AF Steadily Outperforming our Customized Benchmark
The chart below gives a much better "apples-to-apples" benchmark for comparison to our portfolio than the Vanguard Total Market Index would have. It is true that the mainly US-oriented AF that we hold may not have outperformed as much as the non-US-heavy portfolios. But our portfolio is global and, as can be seen here, in aggregate it has been outperforming steadily except for a few relatively short periods. We can see three periods of either under-performance or treading of water relative to the benchmarks at the tail end of the previous two bull markets, followed by subsequent out-performance.
Money Difference of AF vs Index Benchmarks
The annual active premium of the AF portfolio over the whole period has been about 1.8% per annum, but as we will discuss below, the fund group's premium may be compressing. If we choose the starting point to be the beginning of 2003, it falls to 1.02%. Over the full period, as shown below, a dollar invested in 1997 would be worth $4.47 for the AF portfolio (in blue), while the benchmark (in orange) would yield $3.03 (a considerable reward for hiring AF, even ignoring likely greater tax inefficiency). If we move the start to 2002 (around when we built our portfolio), the difference falls to $3.16 versus $2.66.
Mutual Fund Grading Ready for Overhaul
Morningstar came up with the idea of mutual fund Star Ratings in 1985 to compare funds across broadly defined categories. They took it a step further and created investment style and regional boxes in 1992, which all made sense at the time. Just like other report cards though, investors began to try to game the system by moving funds among categories, launching and merging funds when advantageous, and creating incentives for managers to chase quarterly or calendar-year returns. It doesn't seem to make a lot of sense now to make decisions about manager skill over any particular year or group of years when it is possible to break a fund into weekly performance and build new benchmarks, all in a matter of a day or two, as we have done in this analysis. It is easily possible to extract all periods to see how persistently or not a fund has out-performed.
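The full code for these steps is on GitHub (linked above). As a rough sketch of the kind of workflow described here – weekly prices via quantmod, weekly log returns, and then the share of weeks in which one return series beats another – something along the following lines would work. The tickers and date range below are placeholders rather than the exact funds and period used in the post (VFISX is the only ticker mentioned by name above):

```r
library(quantmod)

# Placeholder tickers -- swap in the actual AF and Vanguard funds of interest.
symbols <- c("AWSHX",   # example American Funds equity fund
             "VTSMX",   # example Vanguard index fund used as the benchmark
             "VFISX")   # Vanguard Short-Term Treasury, the cash proxy mentioned above

# Download adjusted prices from Yahoo Finance
prices <- lapply(symbols, function(s) {
  Ad(getSymbols(s, src = "yahoo", from = "2000-01-01", auto.assign = FALSE))
})
names(prices) <- symbols

# Convert each price series to weekly log returns
weekly_returns <- lapply(prices, weeklyReturn, type = "log")

# Share of weeks in which the fund beat the benchmark
both <- na.omit(merge(weekly_returns[["AWSHX"]], weekly_returns[["VTSMX"]]))
mean(both[, 1] > both[, 2])
```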
American Funds itself did an analysis along these lines last year, The Select Investment Scorecard, but unfortunately hasn't updated it or made the data available for others to reproduce, though from a quick glance at the methodology, it seemed robust. It is hard to understand why Morningstar wouldn't want to improve its measurement process along these lines.
Looking at Number of Weeks with Outperformance
We took all of our 1196 weeks and calculated the percentage of weeks by quarter in which our AF portfolio outperformed the benchmark. We can see that the ratio of weeks outperforming is greater than 0.5 in almost all periods, though it broke below briefly during 2007 and again last week. The confidence bars are wide, so it is hard to conclude definitively that the ratio has been above 0.5 since 2006-7. After looking at this chart for a while, the downward trend since 2005 certainly struck us.
Modeling Fee Reductions in Line with Index Fund Benchmarks
In 2005, the cost of many of the index funds we used in the comparison exceeded 30 bps, and today the best-in-class index funds are at or below 10 bps. We might have to study it more, but it seems like there was a bigger reduction in overseas and bond index funds than for the S&P, which was already low by 2010. Meanwhile, AF haven't lowered their expense ratios meaningfully in 20 years. That means that their managers would have to generate that much higher gross returns just to maintain the same active return. If we model in a 1 bp fee reduction per year, or 23 bps over the full period, the out-performance trajectory improves noticeably, though we are still not sure it is greater than 50%.
Conclusion
This has been a quick analysis to become accustomed to the data and the tools used here.
Author: David Lucy, Founder of Redwall Analytics
Visualization of COVID-19 Cases in Arkansas Posted: 29 Sep 2020 11:40 AM PDT
[This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers.] Throughout the COVID-19 pandemic, the main sources of information for case numbers in the State of Arkansas have been daily press conferences by the Governor's Office (until recently, when they moved to weekly) and the website arkansascovid.com. I haven't been particularly impressed with the visualizations used by either source. Today I'm sharing some code that I have been using throughout the pandemic to keep track of how Arkansas is doing. We'll use several libraries, the purpose of which is indicated in the comments:

library(tidyverse)
library(lubridate) # Date wrangling
library(gganimate) # GIF production
library(tidycensus) # Population estimates
library(transformr) # used by gganimate
library(ggthemes) # map themes
library(viridis) # Heatmap color palette
library(scales) # Pretty axis labels
library(zoo) # rollapply

knitr::opts_chunk$set( message = F, echo = T, include = T )
options( scipen = 10 ) # print full numbers, not scientific notation

We'll use the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which is maintained at Github. The data can be read in a single line, although we'll reorganize the case counts into a long format for ease of further wrangling. Here's a snippet of the table:

covid_cases <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
covid_cases <- pivot_longer(covid_cases, 12:length(covid_cases), names_to = "date", values_to = "cases") %>%
  mutate(date = lubridate::as_date(date, format = "%m/%d/%y")) %>%
  filter(Province_State == 'Arkansas') %>%
  arrange(date, Combined_Key)
tail(covid_cases %>% select(Combined_Key, date, cases))
## # A tibble: 6 x 3
##   Combined_Key date cases

Because we'll be doing per-capita calculations, we need to load population estimates. Fortunately, the tidycensus package provides a convenient method of obtaining that information. Here's a snapshot of the population data:

population <- tidycensus::get_estimates(geography = "county", "population") %>%
  mutate(GEOID = as.integer(GEOID)) %>%
  pivot_wider( names_from = variable, values_from = value ) %>%
  filter(grepl("Arkansas", NAME))
head(population)
## # A tibble: 6 x 4
##   NAME GEOID POP DENSITY

The Governor's Office makes several design choices designed to spin statistics so that it looks like Arkansas is doing a good job of managing the crisis (as of the writing of this post, the statistics suggest otherwise). For example, the bar chart of rolling cases often splits out prison cases and community spread cases so the overall trend is obscured. Further, the use of bar charts rather than line charts also makes it harder to visualize the trend of new cases.
We'll use a trendline of overall cases without spin:

ark_covid_cases <- covid_cases %>% filter(`Province_State` == 'Arkansas')

p <- ark_covid_cases %>%
  filter(cases > 0) %>%
  group_by(Province_State, date) %>%
  mutate(cases = sum(cases)) %>%
  ggplot(aes(x = date, y = cases)) +
  geom_line() +
  scale_x_date(breaks = scales::pretty_breaks()) +
  scale_y_continuous(labels = unit_format(unit = "k", sep = "", big.mark = ",", scale = 1/1000)) +
  labs(
    title = "Total COVID-19 cases in Arkansas",
    x = "", y = "",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  )
ggsave(filename = "images/ark_covid_total_cases.png", plot = p, height = 3, width = 5.25)
p

This trendline shows no real signs of leveling off. As we'll see later on, the number of new cases isn't going down. The Governor's Office and the website arkansascovid.com both use arbitrarily selected population metrics to depict per-capita cases (typically 10,000 residents). We'll use a different per-capita metric that is reasonably close to the median county size in the state. As such, for many counties, the per-capita number will be reasonably close to the actual population of the county. That number can be calculated from the state's population metrics:

per_capita <- population %>%
  filter(grepl("Arkansas", NAME)) %>%
  summarize(median = median(POP)) %>% # Get median county population
  unlist()
per_capita
## median
##  18188

Instead of using the actual median, we'll round it to the nearest 5,000 residents:

per_capita <- plyr::round_any(per_capita, 5e3) # Round population to nearest 5,000
per_capita
## median
##  20000

Now that we have the population figure we want to use for the per-capita calculations, we will perform those using the lag function to calculate the new cases per day, and then use the rollapply function to smooth the number of daily cases over a sliding 1-week (7-day) window. The results look like this:

roll_ark_cases <- ark_covid_cases %>%
  arrange(date) %>%
  group_by(UID) %>%
  mutate(prev_count = lag(cases)) %>%
  mutate(prev_count = ifelse(is.na(prev_count), 0, prev_count)) %>%
  mutate(new_cases = cases - prev_count) %>%
  mutate(roll_cases = round(zoo::rollapply(new_cases, 7, mean, fill = 0, align = "right", na.rm = T))) %>%
  ungroup() %>%
  select(-prev_count) %>%
  left_join( population %>% select(-NAME), by = c("FIPS" = "GEOID") ) %>%
  mutate(
    cases_capita = round(cases / POP * per_capita), # cases per per_capita residents
    new_capita = round(new_cases / POP * per_capita), # new cases per per_capita residents
    roll_capita = round(roll_cases / POP * per_capita) # rolling new cases per per_capita residents
  )
tail(roll_ark_cases %>% select(date, Admin2, POP, cases, new_cases, roll_cases, roll_capita))
## # A tibble: 6 x 7
##   date Admin2 POP cases new_cases roll_cases roll_capita

We can summarize those results in order to get a total number of rolling cases for the entire state, which looks like this:

roll_agg_ark_cases <- roll_ark_cases %>%
  group_by(date) %>%
  summarize(roll_cases = sum(roll_cases))
tail(roll_agg_ark_cases)
## # A tibble: 6 x 2
##   date roll_cases

We can then plot the aggregate number of rolling cases over time.
We'll show a couple of different time points relevant to the spread of the coronavirus, including the Governor's mask/social distancing mandate and the reopening of public schools:

p <- roll_agg_ark_cases %>%
  ggplot(aes(date, roll_cases)) +
  geom_line() +
  geom_vline(xintercept = as.Date("2020-08-24"), color = "gray10", linetype = "longdash") +
  annotate(geom = "text", label = "School\nstarts", x = as.Date("2020-08-05"), y = 200, color = "gray10") +
  annotate(geom = "segment", y = 290, yend = 400, x = as.Date("2020-08-05"), xend = as.Date("2020-08-24")) +
  geom_vline(xintercept = as.Date("2020-07-16"), color = "gray10", linetype = "longdash") +
  annotate(geom = "text", label = "Mask\nmandate", x = as.Date("2020-06-21"), y = 100, color = "gray10") +
  annotate(geom = "segment", y = 190, yend = 300, x = as.Date("2020-06-21"), xend = as.Date("2020-07-16")) +
  geom_smooth(span = 1/5) +
  labs(
    title = "7-Day Rolling Average of New COVID-19 Cases in Arkansas",
    x = "", y = "",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  ) +
  theme( title = element_text(size = 10) )
ggsave(filename = "images/ark_covid_rolling_cases.png", plot = p, height = 3, width = 5.25)
p

From this plot, it appears that the mask mandate may have had a positive effect in leveling off the number of new COVID-19 cases. Conversely, it appears that the reopening of schools may have led to a rapid increase in the number of new cases. Of course, the rate of virus transmission has a multitude of causes, and the correlation here doesn't necessarily imply causation. The website arkansascovid.com contains better visualizations than what the Governor's Office uses, but the default Tableau color scheme doesn't do a very good job of showing hotspots. Counties with a higher number of cases are depicted in dark blue (a color associated with cold), while counties with fewer cases are shown in pale green (a color without a heat association). In addition, there aren't visualizations that show changes at the county level over time.
So, we'll use a county-level visualization that shows the number of rolling new cases over time with a color scheme that intuitively shows hot spots:

# Start when 7-day rolling cases in state > 0
first_date <- min({ roll_ark_cases %>%
    group_by(date) %>%
    summarize(roll_cases = sum(roll_cases)) %>%
    ungroup() %>%
    filter(roll_cases > 0) %>%
    select(date) }$date)

temp <- roll_ark_cases %>%
  filter(date >= first_date) %>%
  mutate(roll_capita = ifelse(roll_capita <= 0, 1, roll_capita)) %>% # log10 scale plot
  mutate(roll_cases = ifelse(roll_cases <= 0, 1, roll_cases)) # log10 scale plot

# Prefer tigris projection for state map
temp_sf <- tigris::counties(cb = T, resolution = "20m") %>%
  mutate(GEOID = as.numeric(GEOID)) %>%
  inner_join(temp %>% select(FIPS, roll_cases, roll_capita, date), by = c("GEOID" = "FIPS")) %>%
  select(GEOID, roll_cases, roll_capita, date, geometry)

# tidycensus projection is skewed for state map
# data("county_laea")
# data("state_laea")
# temp_sf <- county_laea %>%
#   mutate(GEOID = as.numeric(GEOID)) %>%
#   inner_join(temp, by = c("GEOID" = "FIPS"))

days <- NROW(unique(temp$date))

p <- ggplot(temp_sf) +
  geom_sf(aes(fill = roll_capita), size = 0.25) +
  scale_fill_viridis(
    name = "7-day rolling cases: ",
    trans = "log10",
    option = "plasma",
  ) +
  ggthemes::theme_map() +
  theme(legend.position = "bottom", legend.justification = "center") +
  labs(
    title = paste0("Arkansas 7-day rolling average of new COVID cases per ",
                   scales::comma(per_capita), " residents"),
    subtitle = "Date: {frame_time}",
    caption = paste0("Image generated: ", Sys.time(), "\n",
                     "Data source: https://github.com/CSSEGISandData/COVID-19", "\n",
                     "COVID-19 Data Repository by CSSE at Johns Hopkins University")
  ) +
  transition_time(date)

Sys.time()
anim <- animate(
  p,
  nframes = days + 10 + 30, fps = 5,
  start_pause = 10, end_pause = 30,
  res = 96, width = 600, height = 600, units = "px"
)
Sys.time()
anim_save("images/ark_covid_rolling_cases_plasma.gif", animation = anim)
# anim

There are a couple of design choices here that are worth explaining. First, we're animating the graphic over time, which shows where hotspots occur during the course of the pandemic. Second, we're using the plasma color palette from the viridis package. This palette goes from indigo on the low end to a hot yellow on the high end, so it intuitively shows hotspots. Third, we're using a log scale for the number of new cases – the idea here is that jumps of an order of magnitude or so are depicted in different colors (i.e., indigo, purple, red, orange, yellow) along the plasma palette. If we use a standard numerical scale for the number of new cases, jumps from 1-20 or so get washed out due to the large size of the worst outbreaks.
Conclusion
I hope you found my alternate visualizations for COVID-19 in Arkansas useful. The charts are set to update nightly, so these data should be current throughout the pandemic. If you have suggestions for improvements or notice that the figures aren't updating, please comment! Thanks for reading.
RStudio v1.4 Preview: Visual Markdown Editing Posted: 29 Sep 2020 11:00 AM PDT
[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.] Today we're excited to announce availability of our first Preview Release for RStudio 1.4, a major new release which includes the following new features:
You can try out these new features now in the RStudio v1.4 Preview Release. Over the next few weeks we'll be blogging about each of these new features in turn.
Visual Markdown Editing
R Markdown users frequently tell us that they'd like to see more of their content changes in real time as they write, both to reduce the time required by the edit/preview cycle and to improve their flow of composition by having a clearer view of what they've already written. To switch into visual mode for a markdown document, use the button with the compass icon at the top-right of the editor toolbar: With visual mode, we've tried to create a WYSIWYM editor for people that love markdown. You can also configure visual mode to write markdown using one sentence per line, which makes working with markdown files on GitHub much easier (enabling line-based comments for sentences and making diffs more local to the actual text that has changed). Anything you can express in pandoc markdown (including tables, footnotes, attributes, etc.) can be edited in visual mode.
Embedded Code
R, Python, SQL and other code chunks can be edited using the standard RStudio source editor. Chunk output is displayed inline (you can switch to show the output in the console instead using the Options toolbar button, accessible via the gear icon), and all of the customary commands from source mode for executing multiple chunks, clearing chunk output, etc. are available.
Tables
You can insert a table using the Table menu. Note that if you select multiple rows or columns, the Insert or Delete command will behave accordingly. Try editing a table in visual mode, then see what it looks like in source mode: all of the table columns will be perfectly aligned (with cell text wrapped as required).
Citations
Visual mode uses the standard Pandoc markdown representation for citations (e.g. [@citation]).
Use the toolbar button or the Cmd+Shift+F8 keyboard shortcut to show the Insert Citation dialog: If you insert citations from Zotero, DOI look-up, or a search, they are automatically added to your document bibliography. You can also insert citations directly using markdown syntax (e.g. [@cite]).
Equations
LaTeX equations are authored using standard Pandoc markdown syntax (the editor will automatically recognize the syntax and treat the equation as math). As shown above, when you select an equation with the keyboard or mouse, you can edit the equation's LaTeX.
Images
You can insert images using either the Insert -> Image command (Ctrl+Shift+I keyboard shortcut) or by dragging and dropping images from the local filesystem. Select an image to re-size it in place (automatically preserving its aspect ratio if you wish):
Cross References
The bookdown package includes markdown extensions for cross-references and part headers. Bookdown cross-references enable you to easily link to figures, equations, and even arbitrary labels within a document. Cross-references are largely the same in visual mode, minus a bit of the leading syntax. As shown above, when entering a cross-reference you can search across all cross-references in your project to easily find the right reference ID. Similar to hyperlinks, you can also navigate to the location of a cross-reference by clicking the popup link that appears when it's selected: You can also navigate directly to any cross-reference using IDE global search: See the bookdown documentation for more information on cross-references.
Footnotes
You can include footnotes using the Insert -> Footnote command (or the Cmd+Shift+F7 keyboard shortcut).
Emojis
To insert an emoji, you can use either the Insert menu or the requisite markdown shortcut plus auto-complete:
For markdown formats that support text representations of emojis (e.g. GitHub Flavored Markdown), the text version of the emoji is written to the markdown source.
LaTeX and HTML
You can include raw LaTeX commands or HTML tags when authoring in visual mode. The above examples utilize inline LaTeX and HTML.
Learning More
See the Visual Markdown Editing documentation to learn more about using visual mode. You can try out the visual editor by installing the RStudio 1.4 Preview Release.
New Polished Feature – User Roles Posted: 29 Sep 2020 11:00 AM PDT
[This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers.]
User roles
Under the hood, user roles are just strings that you define in your polished.tech dashboard. For example, you can make a "super_user" role. You can then assign the "super_user" role to one or more users of your Shiny app. As another example, let's say you have a table of data stored in your database that only users with a certain role should be able to edit. You may create as many roles as you need. The following is the step-by-step process for how to create roles in the polished dashboard and add them to your users:
As noted earlier, you can access your users' roles in the user object provided by polished at:

session$userData$user()$roles

Here is a Shiny app using the polished "editor" role that we created above: This is a simple feature, but we have found it helps keep our apps consistent and well organized. If you want to check out roles and other new features today, sign up for an account at polished.tech. And make sure to install the newly released version of polished from CRAN:

install.packages("polished")

Please reach out if you have questions or feedback!
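As a minimal sketch of how such a role check might be used inside a Shiny server – relying only on the session$userData$user()$roles accessor described above – something like the following would work. The "editor" role is the one created earlier, while the UI elements and data are purely illustrative, and a real app would still need the usual polished configuration:

```r
library(shiny)

# Sketch only: a real polished app would wrap the ui/server with polished's
# secure_ui()/secure_server() and be configured with your polished.tech API key.
ui <- fluidPage(
  uiOutput("editor_controls"),
  tableOutput("preview")
)

server <- function(input, output, session) {
  # Roles assigned in the polished.tech dashboard, e.g. the "editor" role above
  user_roles <- reactive(session$userData$user()$roles)

  # Only render the editing controls for users who hold the "editor" role
  output$editor_controls <- renderUI({
    if ("editor" %in% user_roles()) {
      actionButton("edit_table", "Edit table")
    }
  })

  # Illustrative data only
  output$preview <- renderTable(head(iris))
}

shinyApp(ui, server)
```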
Time Series Forecasting: KNN vs. ARIMA Posted: 29 Sep 2020 05:47 AM PDT
[This article was first published on DataGeeek, and kindly contributed to R-bloggers.] It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data are often exposed to serial correlation. In this article, we will compare k-nearest-neighbor (KNN) regression, which is a supervised machine learning method, with a more classical stochastic process, the autoregressive integrated moving average (ARIMA). We will use the monthly prices of refined gold futures (XAUTRY) for one gram in Turkish lira traded on BIST (Istanbul Stock Exchange) for forecasting. We created the data frame starting from 2013. You can download the relevant Excel file from here.

#building the time series data
library(readxl)
df_xautry <- read_excel("xau_try.xlsx")
xautry_ts <- ts(df_xautry$price, start = c(2013,1), frequency = 12)

KNN Regression
We are going to use the tsfknn package, which can be used to forecast time series in the R programming language. The KNN regression process consists of instance, features, and targets components. Below is an example to understand the components and the process.

library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 6, lags = 1:12, k = 3)
autoplot(pred, highlight = "neighbors", faceting = TRUE)

The lags parameter indicates the lagged values of the time series data. The lagged values are used as features or explanatory variables. In this example, because our time series data is monthly, we set the parameter to 1:12. The last 12 observations of the data build the instance, which is shown by purple points on the graph. This instance is used as a reference vector to find the features that are the closest vectors to that instance. The relevant distance metric is calculated by the Euclidean formula shown below, where q denotes the instance and f^i indicates the feature vectors, which are ranked in order by the distance metric:

d(q, f^i) = sqrt( (q_1 - f^i_1)^2 + (q_2 - f^i_2)^2 + ... + (q_12 - f^i_12)^2 )

The k parameter determines the number of closest feature vectors, which are called the k nearest neighbors. The nearest_neighbors function shows the instance, the k nearest neighbors, and the targets.

nearest_neighbors(pred)
#$instance
#Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#272.79 277.55 272.91 291.12 306.76 322.53 345.28 382.02 384.06 389.36 448.28
# Lag 1
#462.59
#$nneighbors
#  Lag 12 Lag 11 Lag 10  Lag 9  Lag 8  Lag 7  Lag 6  Lag 5  Lag 4  Lag 3  Lag 2
#1 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12 306.76
#2 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91 291.12
#3 223.97 225.74 240.87 245.78 248.24 260.94 258.68 288.16 272.79 277.55 272.91
#   Lag 1     H1     H2     H3     H4     H5     H6
#1 322.53 345.28 382.02 384.06 389.36 448.28 462.59
#2 306.76 322.53 345.28 382.02 384.06 389.36 448.28
#3 291.12 306.76 322.53 345.28 382.02 384.06 389.36

Targets are the time-series values that come right after the nearest neighbors, and their number is the value of the h parameter. The targets of the nearest neighbors are averaged to forecast the future h periods. As you can see from the above plot, features or targets might overlap the instance. This is because the time series data has no seasonality and is in a specific uptrend. The process we have described so far is called the MIMO (multiple-input-multiple-output) strategy, which is the default forecasting strategy used with KNN.
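To make the distance-and-average mechanics concrete, here is a simplified illustration of what happens for k = 3 – this is not the tsfknn internals, just the logic described above, reusing the xautry_ts series built earlier:

```r
# Simplified illustration of KNN forecasting with the MIMO strategy:
# compare the instance with every historical feature vector, keep the k
# closest ones, and average their targets to produce the h forecasts.
x <- as.numeric(xautry_ts)
lags <- 12; h <- 6; k <- 3

instance <- tail(x, lags)                 # the last 12 observations

# All historical (feature vector, target vector) pairs
n <- length(x)
starts <- 1:(n - lags - h + 1)
features <- t(sapply(starts, function(i) x[i:(i + lags - 1)]))
targets  <- t(sapply(starts, function(i) x[(i + lags):(i + lags + h - 1)]))

# Euclidean distance of every feature vector to the instance
d <- sqrt(rowSums(sweep(features, 2, instance)^2))

nearest <- order(d)[1:k]       # indices of the k nearest neighbors
colMeans(targets[nearest, ])   # MIMO: average their targets -> h-step forecast
```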
Decomposing and analyzing the time series data
Before we mention the model, we first analyze the time series data to see whether there is seasonality. The decomposition analysis is used to calculate the strength of the seasonality, as described below:

#Seasonality and trend measurements
library(fpp2)
fit <- stl(xautry_ts, s.window = "periodic", t.window = 13, robust = TRUE)
seasonality <- fit %>% seasonal()
trend <- fit %>% trendcycle()
remain <- fit %>% remainder()

#Trend
1 - var(remain)/var(trend + remain)
#[1] 0.990609

#Seasonality
1 - var(remain)/var(seasonality + remain)
#[1] 0.2624522

The stl function is a time series decomposition method. STL is short for seasonal and trend decomposition using loess; loess is a method for estimating nonlinear relationships. The t.window (trend window) is the number of consecutive observations to be used for estimating the trend and should be an odd number. The s.window (seasonal window) is the number of consecutive years used to estimate each value in the seasonal component; in this example it is set to 'periodic' so that the seasonal component is the same for all years. The robust parameter is set to TRUE, which means that outliers won't affect the estimates of the trend and seasonal components. When we examine the results from the above code chunk, we see a strong trend (0.99) and weak seasonality (0.26); any value less than 0.4 is accepted as a negligible seasonal effect. Because of that, we will prefer the non-seasonal ARIMA model.
Non-seasonal ARIMA
This model combines differencing with autoregression and a moving average. Let's explain each part of the model.
Differencing: First of all, we have to explain stationary data. If the data doesn't contain an information pattern like trend or seasonality – in other words, if it is white noise – that data is stationary. A white noise time series has no autocorrelation at all. Differencing is a simple arithmetic operation that takes the difference between two consecutive observations to make the data stationary. The equation below shows the first difference, that is, differencing at lag 1:

y'_t = y_t - y_{t-1}

Sometimes the first difference is not enough to obtain stationary data; hence, we might have to difference the time series data one more time (second-order differencing).
In autoregressive models, our target variable is a linear combination of its own lagged values. This means the explanatory variables of the target variable are past values of that target variable. The AR(p) notation denotes the autoregressive model of order p, and e_t denotes the white noise:

y_t = c + phi_1*y_{t-1} + phi_2*y_{t-2} + ... + phi_p*y_{t-p} + e_t

Moving average models, unlike autoregressive models, use past error (white noise) values as predictor variables. The MA(q) notation denotes the moving average model of order q. If we integrate differencing with autoregression and the moving average model, we obtain a non-seasonal ARIMA model, which is short for autoregressive integrated moving average:

y'_t = c + phi_1*y'_{t-1} + ... + phi_p*y'_{t-p} + theta_1*e_{t-1} + ... + theta_q*e_{t-q} + e_t

Here y'_t is the differenced data, and we must remember it may have been differenced once or twice (first- or second-order). The explanatory variables are both lagged values of y'_t and past forecast errors. This is denoted as ARIMA(p,d,q), where p is the order of the autoregressive part, d the degree of first differencing, and q the order of the moving average part.
Modeling with non-seasonal ARIMA
Before we model the data, we first split the data into training and test sets to calculate accuracy for the ARIMA model.
#Splitting time series into training and test data
test <- window(xautry_ts, start = c(2019,3))
train <- window(xautry_ts, end = c(2019,2))

#ARIMA modeling
library(fpp2)
fit_arima <- auto.arima(train, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
fit_arima
#Series: train
#ARIMA(0,1,2) with drift
#Coefficients:
#          ma1      ma2   drift
#      -0.1539  -0.2407  1.8378
#s.e.   0.1129   0.1063  0.6554
#sigma^2 estimated as 86.5: log likelihood=-264.93
#AIC=537.85 AICc=538.44 BIC=547.01

As seen in the above code chunk, auto.arima has selected an ARIMA(0,1,2) model with drift for the training data.
Modeling with KNN

#Modeling and forecasting
library(tsfknn)
pred <- knn_forecasting(xautry_ts, h = 18, lags = 1:12, k = 3)

#Forecasting plotting for KNN
autoplot(pred, highlight = "neighbors", faceting = TRUE)

Forecasting and accuracy comparison between the models

#ARIMA accuracy
f_arima <- fit_arima %>% forecast(h = 18) %>% accuracy(test)
f_arima[, c("RMSE","MAE","MAPE")]
#                  RMSE       MAE      MAPE
#Training set  9.045488  5.529203  4.283023
#Test set     94.788638 74.322505 20.878096

For forecasting accuracy, we take the results of the test set shown above.

#Forecasting plot for ARIMA
fit_arima %>% forecast(h = 18) %>% autoplot() + autolayer(test)

#KNN Accuracy
ro <- rolling_origin(pred, h = 18, rolling = FALSE)
ro$global_accu
#     RMSE       MAE      MAPE
#137.12465 129.77352  40.22795

The rolling_origin function is used to evaluate accuracy based on a rolling origin. The rolling parameter should be set to FALSE, which makes the last 18 observations the test set and the remaining observations the training set, just like we did for the ARIMA modeling before. The test set would not be a constant vector if we had set the rolling parameter to its default value of TRUE. Below is an example for h = 6 with the rolling parameter set to TRUE. You can see that the test set dynamically changes from 6 observations down to 1, so the test sets eventually build a matrix, not a constant vector.

#Accuracy plot for KNN
plot(ro)

When we compare the results of accuracy measurements like RMSE or MAPE, we can easily see that the ARIMA model is much better than the KNN model for our non-seasonal time series data.
How to Convert Continuous variables into Categorical by Creating Bins Posted: 29 Sep 2020 05:29 AM PDT
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.] A very common task in data processing is the transformation of numeric variables (continuous, discrete, etc.) to categorical ones by creating bins. For example, it is quite often the case that we want to convert a continuous variable into a small number of groups. We will consider a random variable from the Poisson distribution with parameter λ = 20.

library(dplyr)

# Generate 1000 observations from the Poisson distribution
# with lambda equal to 20
df <- data.frame(MyContinuous = rpois(1000, 20))

# get the histogram
hist(df$MyContinuous)

Create specific Bins
Let's say that you want to create the following bins:
We can easily do that using the cut function:

df <- df %>% mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))
head(df, 10)

Let's have a look at the counts of each bin.

df %>% group_by(MySpecificBins) %>% count()

Notice that you can also define your own labels within the cut function (see the example at the end of this post).
Create Bins based on Quantiles
Let's say that you want each bin to have the same number of observations, like for example 4 bins of an equal number of observations, i.e. 25% each. We can easily do it as follows:

numbers_of_bins <- 4
df <- df %>% mutate(MyQuantileBins = cut(MyContinuous,
                                         breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1/numbers_of_bins))),
                                         include.lowest = TRUE))
head(df, 10)

We can check the counts of each bin:

df %>% group_by(MyQuantileBins) %>% count()

Notice that if you want to split your continuous variable into bins of equal width rather than equal counts, that is also straightforward, as shown in the example below.
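As a final illustration (not from the original post – the label names and the number of bins below are arbitrary), here is how equal-width bins and custom labels look with cut:

```r
# Equal-width bins: passing a single integer to `breaks` splits the observed
# range of the variable into that many intervals of equal width.
df <- df %>% mutate(MyEqualWidthBins = cut(MyContinuous, breaks = 4))

# Custom labels for hand-picked break points (the label names are just examples)
df <- df %>% mutate(MyLabelledBins = cut(MyContinuous,
                                         breaks = c(-Inf, 15, 25, Inf),
                                         labels = c("low", "medium", "high")))

df %>% group_by(MyLabelledBins) %>% count()
```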