[R-bloggers] Deep (learning) like Jacques Cousteau – Part 7 – Matrices (and 8 more aRticles)


Deep (learning) like Jacques Cousteau – Part 7 – Matrices

Posted: 30 Jun 2019 02:45 PM PDT

(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)

(TL;DR: matrices are rectangular arrays of numbers.)

LaTeX and MathJax warning for those viewing my feed: please view
directly on website!

I wanted to avoid a picture related to the film franchise!

Me

Last time, we learnt about dot products. We will now finally start talking about matrices.

Today's topic: Matrices – a primer

We know what row vectors and column vectors are from our previous posts.

What would it look like if we were to stack a bunch of row vectors on
top of each other? Similarly, what would it look like if we were to put a
bunch of column vectors next to each other? Let's experiment!

Let's define some row vectors:
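
For concreteness, let's take the same three vectors that we'll build in R later in this post:

$$
\boldsymbol{a} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}, \quad
\boldsymbol{b} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}, \quad
\boldsymbol{c} = \begin{bmatrix} 7 & 8 & 9 \end{bmatrix}
$$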

Let's stack them on top of each other:
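
With those example values, the stack looks like this:

$$
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
$$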

This rectangular array of numbers is an example of a matrix!

Let's repeat the exercise with the same vectors represented as column
vectors:

Putting these column vectors next to each other gives us this:
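
With the same example values, we get:

$$
\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}
$$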

Looks like another matrix to me!

In fact, the row vectors and column vectors we saw in previous posts live double lives. A row vector goes by another name: "row matrix". A column vector goes by another name, too: "column matrix".

Sneaky!

A little bit of notation

As before, we will be following the notation used in Goodfellow, Ian, et al. We will depict matrices using uppercase, bold font:

$$\boldsymbol{A}$$
Describing the size of a matrix

From the above, we can guess that matrices can be described in terms of
rows and columns. How can we describe the number of rows and the
number of columns
we have in our matrices?

Let's go back to our previous examples. They both happen to have three
rows
and three columns. So we have ourselves two 3 x 3 ("three by
three") matrices
. The number of rows and the number of columns
describe the "size" of our matrix.

Let's generalise this. By convention, we say we have $m$ rows and $n$ columns in our matrix. Let's use our matrix $\boldsymbol{A}$ as an example. We haven't defined what elements it contains. In fact, all we have said is that $\boldsymbol{A}$ is a matrix. Whatever the contents of $\boldsymbol{A}$, it will have $m$ rows and $n$ columns. So $\boldsymbol{A}$ will be of size $m \times n$.

Keep it real

Let's take this a step further. From previous posts, we know we are working with real numbers, so let's assume that our matrix $\boldsymbol{A}$ of size $m \times n$ will contain real numbers. That is, it is a real matrix. How can we depict this using some compact notation?

Remember when we touched on our vectors belonging to some abstract set of all real-valued vectors of similar dimensions when describing the dimensions of a vector? We can say something similar for our real-valued matrix of size $m \times n$. We can describe this set of all real-valued matrices like this:

$$\mathbb{R}^{m \times n}$$

So to show that our matrix $\boldsymbol{A}$ belongs to the set of all $m \times n$ matrices that are made up of real numbers, we simply say:

$$\boldsymbol{A} \in \mathbb{R}^{m \times n}$$

Going back to our examples above:

  • Our real-valued row vectors (a.k.a. row matrices) are members of the set $\mathbb{R}^{1 \times n}$. That is, they are members of the set of all possible matrices made up of one row and $n$ columns of real numbers.
  • Our real-valued column vectors (a.k.a. column matrices) are members of the set $\mathbb{R}^{m \times 1}$. That is, they are members of the set of all possible matrices made up of $m$ rows and one column of real numbers.
  • Our real-valued square matrices are members of the set $\mathbb{R}^{m \times n}$ where $m = n$. That is, they are members of the set of all possible real matrices that have the same number of rows as they have columns.

Not too bad, right?

How can we create matrices in R?

Stack those vectors

One way is to simply rbind and cbind vectors. Let's define 3 vectors
and rbind them (i.e. stack them row-wise):

a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(7, 8, 9)

A <- rbind(a, b, c)

print(A)

##   [,1] [,2] [,3]
## a    1    2    3
## b    4    5    6
## c    7    8    9

We have ourselves a matrix!

print(class(A))  
## [1] "matrix"  

Let's now stack them column-wise by using cbind on our vectors:

A <- cbind(a, b, c)

print(A)

##      a b c
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

It's a matrix!

print(class(A))  
## [1] "matrix"  

Use the matrix function

The "usual" way to create a matrix would be to pass some vector or list
into the matrix function and specify its dimensions. For example, if we
want to use the values passed into the matrix function to create a 3×3 matrix by using the values in a column-wise manner, we can do this:

matrix(1:9, nrow=3, ncol=3)  
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

If instead we want to use the numbers in a row-wise manner, then we
can use the byrow argument and set it to TRUE:

matrix(1:9, nrow=3, ncol=3, byrow=TRUE)  
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

The data types of the elements in the vector or list you pass into the
matrix function will be coerced to a single data type following the
usual coercion rules. For example, if we use a vector containing a
numeric type and an integer type, let's see what happens:

our_vec <- c(1.0, 1L)
A <- matrix(our_vec, nrow=2, ncol=1)

print(A)

##      [,1]
## [1,]    1
## [2,]    1

From previous posts, we know that the values in our vector have already
been coerced by c before the matrix function sees them! So,
naturally, we have a matrix of numeric types:

sapply(A, class)  
## [1] "numeric" "numeric"  

We can even have a character matrix. Not sure why we'd want to create
one…but we can!

x <- c("hello", "goodbye")  matrix(x)  
##      [,1]       ## [1,] "hello"    ## [2,] "goodbye"  

Elements are recycled

Recycling occurs just as it did in our earlier posts on vector addition. For example, if we try to create a matrix with four elements (2 rows times 2 columns) from a vector with only two elements, this is what we get:

matrix(c(1, 2), nrow=2, ncol=2)  
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2

If the number of elements in our vector is not a multiple or submultiple
of the number of rows or columns, we get warned of this fact:

matrix(c(1, 2, 3), nrow=2, ncol=1)  
## Warning in matrix(c(1, 2, 3), nrow = 2, ncol = 1): data length [3] is not a
## sub-multiple or multiple of the number of rows [2]

##      [,1]
## [1,]    1
## [2,]    2

Done!

Conclusion

Next time, we'll talk about matrix addition and introduce some notation we can use to access the individual elements contained within our matrices!

Justin

To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random | R.


Update: Finding Economic Articles With Data

Posted: 30 Jun 2019 01:00 PM PDT

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

An earlier post from February describes a Shiny app that allows you to search among currently more than 4000 economic articles that have an accessible data and code supplement. Finally, I managed to configure an nginx reverse proxy server, and now you can also access the app under a proper https link here:

https://ejd.econ.mathematik.uni-ulm.de

(I was very positively surprised by how easy it was to change http to https using certbot.) Some colleagues told me that they could not access the app under the originally posted link:

http://econ.mathematik.uni-ulm.de:3200/ejd/

I am not sure about the exact reason, but perhaps some security settings don't allow access to websites on a non-standard port like 3200. Hopefully the new link helps.

Since my initial post, the number of articles has grown, and I have included new journals such as Econometrica and AER Insights.

The main data for my app can be downloaded as a zipped SQLite database from my server. Let us do some analysis.

library(RSQLite)
library(dbmisc)
library(dplyr)
db = dbConnect(RSQLite::SQLite(),"articles.sqlite") %>%
  set.db.schemas(
    schema.file=system.file("schema/articles.yaml",
    package="EconJournalData")
  )

articles = dbGet(db,"article")
fs = dbGet(db,"files_summary")

Let us look, grouped by journal, at the share of articles whose code supplement contains R files:

fs %>%
  left_join(select(articles, id, journ), by="id") %>%
  group_by(journ) %>%
  mutate(num_art = n_distinct(id)) %>%
  filter(file_type=="r") %>%
  summarize(
    num_art = first(num_art),
    num_with_r = n(),
    share_with_r=round((num_with_r / first(num_art))*100,2)
  ) %>%
  arrange(desc(share_with_r))
journ num_art num_with_r share_with_r
ecta 109 17 15.6
jep 113 11 9.73
restud 216 12 5.56
aejmic 114 5 4.39
aer 1453 46 3.17
aejpol 378 11 2.91
aejapp 385 11 2.86
aejmac 282 8 2.84
jeea 115 2 1.74
restat 733 6 0.82

We see that there is quite some variation in the share of articles with R code, ranging from 15.6% in Econometrica (ecta) down to only 0.82% in the Review of Economics and Statistics (restat). (The statistics exclude all articles that don't have a code supplement, or whose supplement's file types I did not analyse, e.g. because it is too large or the ZIP files are nested too deeply.)

Overall, we still have a clear dominance of Stata in economics:

# Number of articles with analysed data & code supplement
n_art = n_distinct(fs$id)

# Count articles by file types and compute shares
fs %>% group_by(file_type) %>%
  summarize(
    count = n(),
    share=round((count / n_art)*100,2)
  ) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m")) %>%
  arrange(desc(share))
file_type count share
do 2834 70.18
m 979 24.24
r 129 3.19
py 42 1.04
jl 2 0.05

Roughly 70% of the articles have Stata do files, almost a quarter have Matlab m files, and only slightly more than 3% have R files.

In the meantime, I have also added a log file to the app that anonymously stores data about which articles have been clicked on. The code below shows the 20 most-clicked articles so far:

dat = read.csv("article_click.csv")

dat %>%
  group_by(article) %>%
  summarize(count=n()) %>%
  na.omit %>%
  arrange(desc(count)) %>%
  print(n=20)

## # A tibble: 699 x 2
##    article                                                           count
##    <chr>                                                             <int>
##  1 A Macroeconomic Model of Price Swings in the Housing Market          27
##  2 Job Polarization and Jobless Recoveries                              20
##  3 Tax Evasion and Inequality                                           19
##  4 Public Debt and Low Interest Rates                                   16
##  5 An Empirical Model of Tax Convexity and Self-Employment              13
##  6 Alcohol and Self-Control: A Field Experiment in India                11
##  7 Drug Innovations and Welfare Measures Computed from Market Deman~    11
##  8 Food Deserts and the Causes of Nutritional Inequality                11
##  9 Some Causal Effects of an Industrial Policy                          11
## 10 Costs  Demand  and Producer Price Changes                            10
## 11 Breaking Bad: Mechanisms of Social Influence and the Path to Cri~     9
## 12 Government Involvement in the Corporate Governance of Banks           8
## 13 Performance in Mixed-sex and Single-sex Tournaments: What We Can~     8
## 14 Disease and Gender Gaps in Human Capital Investment: Evidence fr~     7
## 15 Housing Constraints and Spatial Misallocation                         7
## 16 Inherited Control and Firm Performance                                7
## 17 Labor Supply and the Value of Non-work Time: Experimental Estima~     7
## 18 Pricing in the Market for Anticancer Drugs                            7
## 19 The Arrival of Fast Internet and Employment in Africa                 7
## 20 The Economic Benefits of Pharmaceutical Innovations: The Case of~     7
## # ... with 679 more rows

For a nice thumbnail in R-bloggers let us finish with a screenshot of the app:


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


Parallel R: Socket or Fork

Posted: 29 Jun 2019 11:34 PM PDT

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In the R parallel package, there are two implementations of parallelism, namely fork and socket, each with its own pros and cons.

For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the major limitation is that the fork doesn't work on the Windows system.

On the other hand, the socket works on all operating systems. Each thread runs separately without sharing objects or variables, which can only be passed from the master process explicitly. As a result, it runs slower due to the communication overhead.

Below is an example showing the performance difference between the fork and the socket. A self-defined filter function runs in parallel and extracts three rows out of 336,776 that meet the criteria. As shown, the fork runs roughly 40% faster than the socket.

df <- read.csv("data/nycflights")

ex <- expression(carrier == "UA" & origin == "EWR" & day == 1 & is.na(arr_time))
# SELECT 3 ROWS OUT OF 336,776
#        year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time ...
# 56866  2013    11   1       NA        NA       NA        NA      UA            252    EWR  IAH       NA ...
# 84148  2013    12   1       NA        NA       NA        NA      UA            643    EWR  ORD       NA ...
# 251405 2013     7   1       NA        NA       NA        NA      UA            394    EWR  ORD       NA ...

parFilter <- function(df, ex, type) {
  cn <- parallel::detectCores() - 1
  cl <- parallel::makeCluster(cn, type = type)
  ### DIVIDE THE DATAFRAME BASED ON # OF CORES
  sp <- parallel::parLapply(cl, parallel::clusterSplit(cl, seq(nrow(df))),
                            function(c_) df[c_,])
  ### PASS THE OBJECT FROM MASTER PROCESS TO EACH NODE
  parallel::clusterExport(cl, "ex")
  ### EXTRACT ROW INDEX ON EACH NODE
  id <- Reduce(c, parallel::parLapply(cl, sp,
                                      function(s_) with(s_, eval(ex))))
  parallel::stopCluster(cl)
  return(df[which(id),])
}

rbenchmark::benchmark(replications = 10, order = "elapsed", relative = "elapsed",
                      columns = c("test", "replications", "elapsed", "relative"),
  "  FORK" = parFilter(df, ex, "FORK"),
  "SOCKET" = parFilter(df, ex, "PSOCK")
)
#     test replications elapsed relative
# 1   FORK           10  59.396    1.000
# 2 SOCKET           10  83.856    1.412

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Medal Breakdown at Comrades Marathon (2019)

Posted: 29 Jun 2019 07:30 PM PDT

(This article was first published on R on datawookie, and kindly contributed to R-bloggers)

A quick breakdown of the medal distribution at the 2019 edition of the Comrades Marathon.

This is what the medal categories correspond to:

  • Gold — first 10 men and women
  • Wally Hayward (men) — 11th position to sub-6:00
  • Isavel Roche-Kelly (women) — 11th position to sub-7:30
  • Silver (men) — 6:00 to sub-7:30
  • Bill Rowan — 7:30 to sub-9:00
  • Robert Mtshali — 9:00 to sub-10:00
  • Bronze — 10:00 to sub-11:00 and
  • Vic Clapham — 11:00 to sub-12:00.

Analysis and visualisation using R.
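
For a flavour of what such an analysis might look like, here is a minimal sketch that bins a handful of hypothetical finishing times (in hours) into the time-based medal bands listed above. The position-based medals (Gold, Wally Hayward, Isavel Roche-Kelly) also depend on finishing position, so they are ignored in this sketch, and the men's time bands are used.

# Hypothetical finishing times in hours -- placeholders, not actual 2019 results
finish_hours <- c(5.8, 7.2, 8.9, 9.5, 10.7, 11.9)

# Bin times into the time-based medal bands from the list above
medals <- cut(finish_hours,
              breaks = c(0, 6, 7.5, 9, 10, 11, 12),
              labels = c("Sub-6:00", "Silver", "Bill Rowan",
                         "Robert Mtshali", "Bronze", "Vic Clapham"),
              right = FALSE)

table(medals)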

To leave a comment for the author, please follow the link and comment on their blog: R on datawookie.


Coping with varying `gcc` versions and capabilities in R packages

Posted: 29 Jun 2019 05:00 PM PDT

(This article was first published on R on roryverse, and kindly contributed to R-bloggers)

The problem

I have a package called strex which is for string manipulation. In this package, I want to take advantage of the regex capabilities of C++11. The reason for this is that in strex, I find myself needing to do a calculation like

x <- list(c("1,000", "2,000,000"),            c("1", "50", "3,455"))  lapply(x, function(x) as.numeric(stringr::str_replace_all(x, ",", "")))  #> [[1]]  #> [1] 1e+03 2e+06  #>   #> [[2]]  #> [1]    1   50 3455

An lapply like this can be done faster in C++11, so I'd like to have that speedup in my package. The problem is that this requires the regex capabilities of C++11, which are only supported in gcc >= 4.9. Many people are using an older gcc; for example, the still-popular Ubuntu 14.04 ships gcc 4.8. If these people tried to install a strex that relied on C++11 regex, they'd get a compile error.

The hope

I wanted to provide the faster option to those with a capable gcc and the slower lapply option (which isn't painfully slow, just a bit slower) to those with an old gcc. This should all happen inside a seamless install.packages() call; the user needn't be bored by all of this.

The solution

Figuring out which gcc version the user has

The configure step of package installation needed to do different things depending on the gcc version. Kevin Ushey's configure package (https://github.com/kevinushey/configure) allows you to use R to configure R packages (normally you have to use shell commands). This was a saviour. To get the gcc version, I used the processx package (so I had to add it to Imports in DESCRIPTION) to execute the shell command gcc -v.

gcc_version <- function() {
  out <- tryCatch(processx::run("gcc", "-v", stderr_to_stdout = TRUE),
                  error = function(cnd) list(stdout = ""))
  out <- stringr::str_match(out$stdout, "gcc version (\\d+(?:\\.\\d+)*)")[1, 2]
  if (!is.na(out)) out <- numeric_version(out)
  out
}

This returns the gcc version if gcc is installed and NA otherwise. Then, the statement !is.na(gcc_version()) && gcc_version() < "4.9" returns TRUE if the user's gcc does not support C++11 regex and FALSE otherwise.

Dealing with the gcc version

I decided that the default code in the package would be for people with an up-to-date gcc, and that the configure step would make alterations to the code for people with an old gcc. Hence, for people with an old gcc, configure needed to remove all of the C++ code that required C++11 regex and then replace the body of the R function which .Call()ed that (now removed) C++ code with R code that performed the same function. It took a long time (many days) and a lot of testing on Travis, but this was the right strategy, and now strex is installing beautifully with new and old gccs.
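
To make the shape of that configure-time decision concrete, a rough sketch of the branching (not the actual strex configuration code) might look like this:

# Rough sketch only -- the real configuration steps live in strex's configure files
needs_fallback <- !is.na(gcc_version()) && gcc_version() < "4.9"

if (needs_fallback) {
  # 1. remove the C++ functions that rely on C++11 regex from the sources
  # 2. replace the body of the R function that .Call()ed them with an
  #    equivalent (slower) lapply()-based implementation
}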

The code

There's a little too much code to walk through the steps in this blog post (and the steps are specific to this package), but if you're curious as to how this was done, first familiarize yourself with Kevin Ushey's amazing configure package and then read the configuration steps in strex at https://github.com/rorynolan/strex. This includes useful functions like file_replace_R_fun() to change the body of an R function in a file, file_remove_C_fun() to remove a C/C++ function from a file and file_remove_matching_lines() to remove certain lines from a file.

Conclusion

This post is intended to give people an idea of how to deal with this type of problem. If you are struggling with this problem, feel free to contact me; I'm happy to share the limited knowledge that I have.

To leave a comment for the author, please follow the link and comment on their blog: R on roryverse.


Spurious correlations and random walks

Posted: 29 Jun 2019 03:00 AM PDT

(This article was first published on Fabian Dablander, and kindly contributed to R-bloggers)

The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.1 In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: completely independent processes can be correlated substantially.

AR(1) processes and random walks

Moods, stockmarkets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = [1, \ldots T]$ indexes discrete time. We write an AR(1) process as:

$$ Y_t = \phi Y_{t-1} + \epsilon_t $$

where $\phi$ gives the correlation with the previous observation, and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a random walk. We can simulate from these with a few lines of R.
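
A minimal simulation sketch (assuming standard normal errors with $\sigma = 1$ and $y_1 = 0$; the full code is in the original post) might look like this:

# Simulate an AR(1) process; phi = 1 gives a random walk
simulate_ar1 <- function(n, phi = 1, sigma = 1) {
  y <- numeric(n)
  for (t in 2:n) {
    y[t] <- phi * y[t - 1] + rnorm(1, mean = 0, sd = sigma)
  }
  y
}

set.seed(1)
rw <- simulate_ar1(100, phi = 1)    # random walk
ar <- simulate_ar1(100, phi = 0.5)  # stationary AR(1)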

  1. There are, of course, many more↩

To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.


Curly-Curly, the successor of Bang-Bang

Posted: 28 Jun 2019 05:00 PM PDT

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Writing functions that take data frame columns as arguments is a problem that most R users have been
confronted with at some point. There are different ways to tackle this issue, and this blog post will
focus on the solution provided by the latest release of the {rlang} package. You can read the
announcement here, which explains really
well what was wrong with the old syntax, and how the new syntax works now.

I have written about the problem of writing functions that use data frame columns as arguments
three years ago
and two years ago too.
Last year, I wrote a
blog post that showed how to map a list of functions to a list of datasets with a list of columns
as arguments that used the !!quo(column_name) syntax (the !! is pronounced bang-bang).
Now, there is a new sheriff in town, {{}}, introduced in {rlang} version 0.4.0 that makes
things even easier. The suggested pronunciation of {{}} is curly-curly, but there is no
consensus yet.

First, let's load the {tidyverse}:

library(tidyverse)

Let's suppose that I need to write a function that takes a data frame, as well as a column from
this data frame as arguments:

how_many_na <- function(dataframe, column_name){
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}

Let's try this function out on the starwars data:

data(starwars)

head(starwars)

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##
## 1 Luke…    172    77 blond      fair       blue            19   male
## 2 C-3PO    167    75            gold       yellow         112
## 3 R2-D2     96    32            white, bl… red             33
## 4 Dart…    202   136 none       white      yellow          41.9 male
## 5 Leia…    150    49 brown      light      brown           19   female
## 6 Owen…    178   120 brown, gr… light      blue            52   male
## # … with 5 more variables: homeworld , species , films ,
## #   vehicles , starships

As you can see, there are missing values in the hair_color column. Let's try to count how many
missing values are in this column:

how_many_na(starwars, hair_color)
Error: object 'hair_color' not found

R cannot find the hair_color column, and yet it is in the data! Well, this is actually exactly
the issue. The issue is that the column is inside the dataframe, but when calling the function
with hair_color as the second argument, R is looking for a variable called hair_color that
does not exist. What about trying with "hair_color"?

how_many_na(starwars, "hair_color")
## # A tibble: 1 x 1
##       n
##   <int>
## 1     0

Now we get something, but something wrong! filter() is now testing whether the literal string "hair_color" is NA, which it never is, so we get a count of 0.

One way to solve this issue, is to not use the filter() function, and instead rely on base R:

how_many_na_base <- function(dataframe, column_name){
  na_index <- is.na(dataframe[, column_name])
  nrow(dataframe[na_index, column_name])
}

how_many_na_base(starwars, "hair_color")
## [1] 5

This works, but not using the {tidyverse} at all is not an option, at least for me. For instance,
the next function, which uses a grouping variable, would be difficult to implement without the
{tidyverse}:

summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by(grouping_var) %>%
    summarise(mean(column_name, na.rm = TRUE))
}

Calling this function results in the following error message:

Error: Column `grouping_var` is unknown

Before the release of {rlang} 0.4.0, this was the solution:

summarise_groups <- function(dataframe, grouping_var, column_name){

  grouping_var <- enquo(grouping_var)
  column_name <- enquo(column_name)
  mean_name <- paste0("mean_", quo_name(column_name))

  dataframe %>%
    group_by(!!grouping_var) %>%
    summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE))
}

The core of the function remained very similar to the version from before, but now one has to use the enquo() and !! syntax. While not overly difficult to use, it is cumbersome.

Now this can be simplified using the new {{}} syntax:

summarise_groups <- function(dataframe, grouping_var, column_name){

  dataframe %>%
    group_by({{grouping_var}}) %>%
    summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE))
}
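
As a quick sanity check, a call like the following (a usage sketch; output omitted) should now give the mean height per species:

# Hypothetical example call on the starwars data
summarise_groups(starwars, species, height)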

Much easier and cleaner! You still have to use the := operator instead of = for the column name, however. Also, from my understanding, if you want to modify the column name, for instance in this case to return mean_height instead of height, you have to keep using the enquo() and !! syntax.

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


Revisited: Forecasting Last Christmas Search Volume

Posted: 28 Jun 2019 03:00 AM PDT

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

It is June and nearly half of the year is over, marking the midpoint between Christmas 2018 and Christmas 2019. Last autumn, I published a blog post about predicting Wham's "Last Christmas" search volume using Google Trends data with different types of neural network architectures. Of course, now I want to know how good the predictions were compared to the actual search volumes.

The following table shows the predicted values by the different network architectures, the true search volume data in the relevant time region from November 2018 until January 2019, as well as the relative prediction error in brackets:

month MLP CNN LSTM actual
2018-11 0.166 (0.21) 0.194 (0.078) 0.215 (0.023) 0.21
2018-12 0.858 (0.057) 0.882 (0.031) 0.817 (0.102) 0.91
2019-01 0.035 (0.153) 0.034 (0.149) 0.035 (0.153) 0.03

There's no clear winner in this game. For the month November, the LSTM model performs best with a relative error of only 2.3%. However, in the „main" month December, the LSTM drops in accuracy in favor of the 1-dimensional CNN with 3.1% error and the MLP with 5.7% error. Compared to November and December, January exhibits higher prediction errors >10% regardless of the architecture.

To bring a little more data science flavor into this post, I've created a short R script that presents the results in a cool „heatmap" style.

library(dplyr)
library(ggplot2)

# Define data frame for plotting
df_plot <- data.frame(MONTH = rep(c("2018-11", "2018-12", "2019-01"), 3),
                      MODEL = c(rep("MLP", 3), rep("CNN", 3), rep("LSTM", 3)),
                      PREDICTION = c(0.166, 0.858, 0.035, 0.194, 0.882, 0.034, 0.215, 0.817, 0.035),
                      ACTUAL = rep(c(0.21, 0.91, 0.03), 3))

# Do plot
df_plot %>%
  mutate(MAPE = round(abs(ACTUAL - PREDICTION) / ACTUAL, 3)) %>%
  ggplot(data = ., aes(x = MONTH, y = MODEL)) +
  geom_tile(aes(fill = MAPE)) +
  scale_fill_gradientn(colors = c('navyblue', 'darkmagenta', 'darkorange1')) +
  geom_text(aes(label = MAPE), color = "white") +
  theme_minimal()
prediction heat map

This year, I will (of course) redo the experiment using the newly acquired data. I am curious to find out if the prediction improves. In the meantime, you can sign up to our mailing list, bringing you the best data science, machine learning and AI reads and treats directly into your mailbox!

About the author

Sebastian Heinz

I am the founder and CEO of STATWORX. I enjoy writing about machine learning and AI, especially about neural networks and deep learning. In my spare time, I love to cook, eat and drink as well as traveling the world.

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

The post Revisited: Forecasting Last Christmas Search Volume first appeared on STATWORX.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.


Vignette: Write & Read Multiple Excel files with purrr

Posted: 27 Jun 2019 05:00 PM PDT

(This article was first published on Musings on R, and kindly contributed to R-bloggers)

Introduction

This post will show you how to write and read a list of data tables to and from Excel with purrr, the functional programming package 📦 from tidyverse. In this example I will also use the packages readxl and writexl for reading and writing in Excel files, and cover methods for both XLSX and CSV (not strictly Excel, but might as well!) files.

Whilst the internet certainly has no shortage of R tutorials on how to read and write Excel files (see this Stack Overflow thread for example), I think a purrr approach still isn't as well-known or well-documented. I find this approach to be very clean and readable, and certainly more "tidyverse-consistent" than other approaches which rely on lapply() or for loops. My choice of packages 📦 for reading and writing Excel files are readxl and writexl, where the advantage is that neither of them requires external dependencies.

For reading and writing CSV files, I personally have switched back and forth between readr and data.table, depending on whether I have a need to do a particular analysis in data.table (see this discussion on why I sometimes use it in favour of dplyr). Where applicable in this post, I will point out in places where you can use alternatives from data.table for fast reading/writing.

For documentation/demonstration purposes, I'll make the package references (indicated by ::) explicit in the functions below, but it's advisable to remove them in "real life" to avoid code that is overly verbose.


Getting Started

The key functions used in this vignette come from three packages: purrr, readxl, and writexl.

library(tidyverse)
library(readxl)
library(writexl)

Since purrr is part of core tidyverse, we can simply run library(tidyverse). This is also convenient as we'll also use various functions such as group_split() from dplyr and the %>% operator from magrittr in the example.

Note that although readxl is part of tidyverse, you'll still need to load it explicitly as it's not a "core" tidyverse package.


Writing multiple Excel files

Let us start off with the iris dataset that is pre-loaded with R. If you're not one of us sad people who almost know this dataset by heart, here's what it looks like:

iris %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The first thing that we want to do is to create multiple datasets, which we can do so by splitting iris. I'll do this by running group_split() on the Species column, so that each species of iris has its own dataset. This will return a list of three data frames, one for each unique value in Species: setosa, versicolor, and virginica. I'll assign these three data frames to a list object called list_of_dfs:

# Split: one data frame per Species
iris %>%
  dplyr::group_split(Species) -> list_of_dfs

list_of_dfs

## [[1]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # ... with 40 more rows
##
## [[2]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          7           3.2          4.7         1.4 versicolor
##  2          6.4         3.2          4.5         1.5 versicolor
##  3          6.9         3.1          4.9         1.5 versicolor
##  4          5.5         2.3          4           1.3 versicolor
##  5          6.5         2.8          4.6         1.5 versicolor
##  6          5.7         2.8          4.5         1.3 versicolor
##  7          6.3         3.3          4.7         1.6 versicolor
##  8          4.9         2.4          3.3         1   versicolor
##  9          6.6         2.9          4.6         1.3 versicolor
## 10          5.2         2.7          3.9         1.4 versicolor
## # ... with 40 more rows
##
## [[3]]
## # A tibble: 50 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          6.3         3.3          6           2.5 virginica
##  2          5.8         2.7          5.1         1.9 virginica
##  3          7.1         3            5.9         2.1 virginica
##  4          6.3         2.9          5.6         1.8 virginica
##  5          6.5         3            5.8         2.2 virginica
##  6          7.6         3            6.6         2.1 virginica
##  7          4.9         2.5          4.5         1.7 virginica
##  8          7.3         2.9          6.3         1.8 virginica
##  9          6.7         2.5          5.8         1.8 virginica
## 10          7.2         3.6          6.1         2.5 virginica
## # ... with 40 more rows

I'll also use purrr::map() to take the character values (setosa, versicolor, and virginica) from the Species column itself for assigning names to the list. map() transforms an input by applying a function to each element of the input, and then returns a vector the same length as the input. In this immediate example, the input is the list_of_dfs and we apply the function dplyr::pull() to extract the Species variable from each data frame. We then repeat this approach to convert Species into character type with as.character() and take out a single value with unique():

# Use the value from the "Species" column to provide a name for the list members
list_of_dfs %>%
  purrr::map(~pull(.,Species)) %>% # Pull out Species variable
  purrr::map(~as.character(.)) %>% # Convert factor to character
  purrr::map(~unique(.)) -> names(list_of_dfs) # Set this as names for list members

names(list_of_dfs)
## [1] "setosa"     "versicolor" "virginica"

These names will be useful for exporting the data frames into Excel, as they will effectively be our Excel sheet names. You can always manually hard-code the sheet names, but the above approach allows you to do the entire thing dynamically if you need to.

Having set the sheet names, I can then pipe the list of data frames directly into write_xlsx(), where the Excel file name and path is specified in the same path argument:

list_of_dfs %>%
  writexl::write_xlsx(path = "../datasets/test-excel/test-excel.xlsx")

Writing multiple CSV files

Exporting the list of data frames into multiple CSV files will take a few more lines of code, but it is still relatively straightforward. There are three main steps:

  1. Define a function that tells R what the names for each CSV file should be, which I've called output_csv() below. The data argument will take in a data frame whilst the names argument will take in a character string that will form part of the file name for the individual CSV file.

  2. Create a named list where the names match the arguments of the function you've just defined (data and names), and should contain the objects that you would like to pass through to the function for the respective arguments. In this case, list_of_dfs will provide the three data frames, and names(list_of_dfs) will provide the names of those three data frames. This is necessary for running pmap(), which in my view is basically a super-powered version of map() that lets you iterate over multiple inputs simultaneously.

  3. pmap() will then iterate through the two sets of inputs through output_csv() (the inputs are used as arguments), which then writes the three CSV files with the file names you want. For the "writing" function, you could either use write_csv() from readr (part of tidyverse) or fwrite() from data.table, depending on your workflow / style.

# Step 1
# Define a function for exporting csv with the desired file names and into the right path
output_csv <- function(data, names){

    folder_path <- "../datasets/test-excel/"

    write_csv(data, paste0(folder_path, "test-excel-", names, ".csv"))
  }

# Step 2
list(data = list_of_dfs,
     names = names(list_of_dfs)) %>%

  # Step 3
  purrr::pmap(output_csv)

The outcome of the above code is shown below. My directory now contains one Excel file with three Worksheets (sheet names are "setosa", "versicolor", and "virginica"), and three separate CSV files for each data slice:


Reading multiple Excel / CSV files

For reading files in, you'll need to decide on how you want them to be read in. The options are:

  1. Read all the datasets directly into the Global Environment as individual data frames with a "separate existence" and separate names.
  2. Read all the datasets into a single list, where each data frame is a member of that list.

The first option is best if you are unlikely to run similar operations on all the data frames at the same time. You may for instance want to do this if the data sets that you are reading in are structurally different from each other, and that you are planning to manipulate them separately.

The second option will be best if you are likely to manipulate all the data frames at the same time, where for instance you may run map() on the list of data frames with drop_na() as an argument to remove missing values from all of the data frames at once (see the sketch below). The benefit of reading your multiple data sets into a list is that you will have a much cleaner workspace (Global Environment). However, there is a minor and almost negligible inconvenience in accessing individual data frames, as you will need to go into the list and pick out the right member (e.g. doing something like list_of_dfs[[3]]).
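
For instance, assuming the data frames are sitting in a list called list_of_dfs, dropping missing values from every data frame at once is a one-liner (a sketch of the idea mentioned above):

# Apply the same cleaning step to every data frame in the list
list_of_dfs %>%
  map(drop_na)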

Method 1A: Read all sheets in Excel into Global Environment

So let's begin! This method will read all the sheets within a specified Excel file and load them into the Global Environment, using variable names of your own choice. For simplicity, I will use the original Excel sheet names as the variable names.

The first thing to do is to specify the file path to the Excel file:

wb_source <- "../datasets/test-excel/test-excel.xlsx"

You can then run readxl::excel_sheets() to extract the sheet names in that Excel file, and save it as a character type vector.

# Extract the sheet names as a character string vector
wb_sheets <- readxl::excel_sheets(wb_source)

print(wb_sheets)
## [1] "setosa"     "versicolor" "virginica"

The next step is to iterate through the sheet names (saved in wb_sheets) using map(), and within each iteration use assign() (base) and read_xlsx() (from readxl) to load each individual sheet into the Global Environment, giving each one a variable name. Here's the code:

# Load everything into the Global Environment
wb_sheets %>%
  purrr::map(function(sheet){ # iterate through each sheet name
    assign(x = sheet,
           value = readxl::read_xlsx(path = wb_source, sheet = sheet),
           envir = .GlobalEnv)
  })

This is what my work space looks like:

Note that map() always returns a list, but in this case we do not need a list returned and only require the "side effects", i.e. the objects being read in to be assigned to the Global Environment. If you prefer you can use lapply() instead of map(), which for this purpose doesn't make a big practical difference.

Also, assign() allows you to assign a value to a name in an environment, where we've specified the following as arguments:

  • x: sheet as the variable name
  • value: The actual data from the sheet we read in. Here, we use readxl::read_xlsx() for reading in specific sheets from the Excel file, where you simply specify the file path and the sheet name as the arguments.
  • envir: .GlobalEnv as the environment

Method 1B: Read all CSV files in directory into Global Environment

The method for reading CSV files from a directory is slightly different, as you'll need to find a way to identify or create a character vector of the names of all the files that you want to load into R. To do this, we'll use list.files(), which produces a character vector of the names of files or directories in the named directory:

file_path <- "../datasets/test-excel/"

file_path %>% list.files()

## [1] "test-excel-setosa.csv"     "test-excel-versicolor.csv"
## [3] "test-excel-virginica.csv"  "test-excel.xlsx"

We only want CSV files in this instance, so we'll want to do a bit of string manipulation (using str_detect() from stringr – again, from tidyverse) to get only the names that end with the extension ".csv". Let's pipe this along:

file_path %>%
  list.files() %>%
  .[str_detect(., ".csv")] -> csv_file_names

csv_file_names

## [1] "test-excel-setosa.csv"     "test-excel-versicolor.csv"
## [3] "test-excel-virginica.csv"

The next part is similar to what we've done earlier, using map(). Note that apart from replacing the value argument with read_csv() (or you can use fread() to return a data.table object rather than a tibble), I also removed the file extension in the x argument so that the variable names would not contain the actual characters ".csv":

# Load everything into the Global Environment
csv_file_names %>%
  purrr::map(function(file_name){ # iterate through each file name
    assign(x = str_remove(file_name, ".csv"), # Remove file extension ".csv"
           value = read_csv(paste0(file_path, file_name)),
           envir = .GlobalEnv)
  })

Method 2A: Read all sheets in Excel into a list

Reading sheets into a list is actually easier than reading them into the Global Environment, as map() returns a list and you won't have to use assign() or specify a variable name. Recall that wb_source holds the path of the Excel file, and wb_sheets is a character vector of all the sheet names in the Excel file:

# Read each sheet into a single list
wb_sheets %>%
  purrr::map(function(sheet){ # iterate through each sheet name
    readxl::read_xlsx(path = wb_source, sheet = sheet)
  }) -> df_list_read # Assign to a list

You can then use map() again to run operations across all members of the list, and even chain operations within it:

df_list_read %>%
  map(~select(., Petal.Length, Species) %>%
        head())

## [[1]]
## # A tibble: 6 x 2
##   Petal.Length Species
##          <dbl> <fct>
## 1          1.4 setosa
## 2          1.4 setosa
## 3          1.3 setosa
## 4          1.5 setosa
## 5          1.4 setosa
## 6          1.7 setosa
##
## [[2]]
## # A tibble: 6 x 2
##   Petal.Length Species
##          <dbl> <fct>
## 1          4.7 versicolor
## 2          4.5 versicolor
## 3          4.9 versicolor
## 4          4   versicolor
## 5          4.6 versicolor
## 6          4.5 versicolor
##
## [[3]]
## # A tibble: 6 x 2
##   Petal.Length Species
##          <dbl> <fct>
## 1          6   virginica
## 2          5.1 virginica
## 3          5.9 virginica
## 4          5.6 virginica
## 5          5.8 virginica
## 6          6.6 virginica

Method 2B: Read all CSV files in directory into a list

At this point you've probably gathered how you can adapt the code to read CSV files into a list, but let's cover this for comprehensiveness. No assign() needed, and only run read_csv() within the map() function, iterating through the file names:

# Read each CSV file into a single list
csv_file_names %>%
  purrr::map(function(file_name){ # iterate through each file name

    read_csv(paste0(file_path, file_name))

  }) -> df_list_read2 # Assign to a list

Thank you for reading! 😄

Hopefully this is a helpful tutorial for an iterative approach to writing and reading Excel files. If you like what you read or if you have any suggestions / thoughts about the subject, do leave a comment in the Disqus fields in the blog and let me know!

To leave a comment for the author, please follow the link and comment on their blog: Musings on R.

