(TL;DR: matrices are rectangular arrays of numbers.)
LaTeX and MathJax warning for those viewing my feed: please view directly on website!
I wanted to avoid a picture related to the film franchise!
Last time, we learnt about dot products. We will now finally start talking about matrices.
Today's topic: Matrices – a primer
We know what row vectors and column vectors are from our previous posts.
What would it look like if we were to stack a bunch of row vectors on top of each other? Similarly, what would it look like if we were to put a bunch of column vectors next to each other? Let's experiment!
Let's define some row vectors:
Let's stack them on top of each other:
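The displays themselves are missing from this extract; as a reconstruction (the actual numbers are assumptions, chosen to match the R example later in the post), stacking three row vectors $\boldsymbol{a} = \begin{bmatrix}1 & 2 & 3\end{bmatrix}$, $\boldsymbol{b} = \begin{bmatrix}4 & 5 & 6\end{bmatrix}$ and $\boldsymbol{c} = \begin{bmatrix}7 & 8 & 9\end{bmatrix}$ would give:

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$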
This rectangular array of numbers is an example of a matrix!
Let's repeat the exercise with the same vectors represented as column vectors:
Putting these column vectors next to each other gives us this:
Looks like another matrix to me!
In fact, our row vectors and column vectors we saw in previous posts live double lives. A row vector goes by another name: "row matrix". A column vector goes by another name, too: "column matrix".
Sneaky!
A little bit of notation
As before, we will be following the notation used in Goodfellow, Ian, et al. We will depict matrices using uppercase, bold font: $\boldsymbol{A}$
Describing the size of a matrix
From the above, we can guess that matrices can be described in terms of rows and columns. How can we describe the number of rows and the number of columns we have in our matrices?
Let's go back to our previous examples. They both happen to have three rows and three columns. So we have ourselves two 3 x 3 ("three by three") matrices. The number of rows and the number of columns describe the "size" of our matrix.
Let's generalise this. By convention, we say that our matrix has $m$ rows and $n$ columns. Let's use a matrix $\boldsymbol{A}$ as an example. We haven't defined what elements it contains. In fact, all we have said is that $\boldsymbol{A}$ is a matrix. Whatever the contents of $\boldsymbol{A}$, it will have $m$ rows and $n$ columns. So $\boldsymbol{A}$ will be of size $m \times n$.
Keep it real
Let's take this a step further. From previous posts, we have been working with real numbers, so let's assume that our matrix $\boldsymbol{A}$ of size $m \times n$ will contain real numbers. That is, it is a real matrix. How can we depict this using some compact notation?
Remember when we touched on our vectors belonging to some abstract set of all real-valued vectors of similar dimensions when describing the dimensions of a vector? We can say something similar for our real-valued matrix of size $m \times n$. We can describe this set of all real-valued $m \times n$ matrices like this: $\mathbb{R}^{m \times n}$
So to show that our matrix $\boldsymbol{A}$ belongs to the set of all $m \times n$ matrices that are made up of real numbers, we simply say: $\boldsymbol{A} \in \mathbb{R}^{m \times n}$
Going back to our examples above:
Our real-valued row vectors (a.k.a. row matrices) are members of the set $\mathbb{R}^{1 \times n}$. That is, they are members of the set of all possible matrices made up of one row and $n$ columns of real numbers.
Our real-valued column vectors (a.k.a. column matrices) are members of the set $\mathbb{R}^{m \times 1}$. That is, they are members of the set of all possible matrices made up of $m$ rows and one column of real numbers.
Our real-valued square matrices are members of the set $\mathbb{R}^{m \times n}$ where $m = n$. That is, they are members of the set of all possible real matrices that have the same number of rows as they have columns.
Not too bad, right?
How can we create matrices in R?
Stack those vectors
One way is to simply rbind and cbind vectors. Let's define 3 vectors and rbind them (i.e. stack them row-wise):
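The code itself was lost in this extract; a minimal sketch that reproduces the output below (the vector values are assumptions, matching the row labels and entries shown) is:

```r
# Define three vectors and stack them row-wise into a matrix
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(7, 8, 9)

A <- rbind(a, b, c)
print(A)
```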
##   [,1] [,2] [,3]
## a    1    2    3
## b    4    5    6
## c    7    8    9
We have ourselves a matrix!
print(class(A))
## [1] "matrix"
Let's now stack them column-wise by using cbind on our vectors:
A <- cbind(a, b, c)
print(A)
##      a b c
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
It's a matrix!
print(class(A))
## [1] "matrix"
Use the matrix function
The "usual" way to create a matrix would be to pass some vector or list into the matrix function and specify its dimensions. For example, if we want to use the values passed into the matrix function to create a 3×3 matrix by using the values in a column-wise manner, we can do this:
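For example (a sketch; the original values aren't shown in this extract), filling a 3 × 3 matrix column-wise with the numbers 1 to 9:

```r
# matrix() fills column-wise by default (byrow = FALSE)
A <- matrix(1:9, nrow = 3, ncol = 3)
print(A)
```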
The data types of the elements in the vector or list you pass into the matrix function will be coerced to a single data type following the usual coercion rules. For example, let's see what happens if we use a vector containing a numeric type and an integer type:
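The example vector isn't shown in this extract; a hypothetical one consistent with the two-element sapply output below is:

```r
# One double and one integer: c() coerces both to numeric before matrix() sees them
A <- matrix(c(1.5, 2L), nrow = 1)
print(A)
```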
From previous posts, we know that the values in our vector have already been coerced by c before the matrix function sees them! So, naturally, we have a matrix of numeric types:
sapply(A,class)
## [1] "numeric" "numeric"
We can even have a character matrix. Not sure why we'd want to create one…but we can!
x <- c("hello", "goodbye")
matrix(x)
##      [,1]
## [1,] "hello"
## [2,] "goodbye"
Elements are recycled
Recycling occurs here just as it did in our posts on vector addition. For example, if we try to create a matrix with four elements (2 rows times 2 columns) from a vector with only two elements, this is what we get:
matrix(c(1, 2), nrow = 2, ncol = 2)
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
If the number of elements in our vector is not a multiple or submultiple of the number of rows or columns, we get warned of this fact:
matrix(c(1, 2, 3), nrow = 2, ncol = 1)
## Warning in matrix(c(1, 2, 3), nrow = 2, ncol = 1): data length [3] is not a
## sub-multiple or multiple of the number of rows [2]
##      [,1]
## [1,]    1
## [2,]    2
Done!
Conclusion
Next time, we'll talk about matrix addition and introduce some notation we can use to access the individual elements contained within our matrices!
Justin
To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random | R.
An earlier post from February describes a Shiny app that allows you to search among currently more than 4000 economic articles that have an accessible data and code supplement. Finally, I managed to configure an nginx reverse proxy server, and now you can also access the app under a proper https link here:
(I was very positively surprised by how easy it was to change http to https using certbot.) Some colleagues told me that they could not access the app under the originally posted link:
I am not sure about the exact reason, but perhaps some security settings don't allow access to websites on a non-standard port like 3200. Hopefully the new link helps.
Since my initial post, the number of articles has grown, and I included new journals like Econometrica or AER Insights.
The main data for my app can be downloaded as a zipped SQLite database from my server. Let us do some analysis.
We see that there is quite some variation in the share of articles with R code, ranging from 15.6% in Econometrica (ecta) down to only 0.82% in the Review of Economics and Statistics (restat). (The statistics exclude all articles that don't have a code supplement, or whose supplement's file types I did not analyse, e.g. because it is too large or the ZIP files are nested too deeply.)
Overall, we still have a clear dominance of Stata in economics:
# Number of articles with analysed data & code supplement
n_art = n_distinct(fs$id)

# Count articles by file types and compute shares
fs %>%
  group_by(file_type) %>%
  summarize(count = n(), share = round((count / n_art) * 100, 2)) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(desc(share))
file_type   count   share
do           2834   70.18
m             979   24.24
r             129    3.19
py             42    1.04
jl              2    0.05
Roughly 70% of the articles have Stata do files, almost a quarter have Matlab m files, and only slightly above 3% have R files.
Meanwhile, I have also added a log file to the app that anonymously stores data about which articles have been clicked on. The code below shows the 20 most clicked-on articles so far:
## # A tibble: 699 x 2
##    article                                                           count
##  1 A Macroeconomic Model of Price Swings in the Housing Market          27
##  2 Job Polarization and Jobless Recoveries                              20
##  3 Tax Evasion and Inequality                                           19
##  4 Public Debt and Low Interest Rates                                   16
##  5 An Empirical Model of Tax Convexity and Self-Employment              13
##  6 Alcohol and Self-Control: A Field Experiment in India                11
##  7 Drug Innovations and Welfare Measures Computed from Market Deman~    11
##  8 Food Deserts and the Causes of Nutritional Inequality                11
##  9 Some Causal Effects of an Industrial Policy                          11
## 10 Costs Demand and Producer Price Changes                              10
## 11 Breaking Bad: Mechanisms of Social Influence and the Path to Cri~     9
## 12 Government Involvement in the Corporate Governance of Banks           8
## 13 Performance in Mixed-sex and Single-sex Tournaments: What We Can~     8
## 14 Disease and Gender Gaps in Human Capital Investment: Evidence fr~     7
## 15 Housing Constraints and Spatial Misallocation                         7
## 16 Inherited Control and Firm Performance                                7
## 17 Labor Supply and the Value of Non-work Time: Experimental Estima~     7
## 18 Pricing in the Market for Anticancer Drugs                            7
## 19 The Arrival of Fast Internet and Employment in Africa                 7
## 20 The Economic Benefits of Pharmaceutical Innovations: The Case of~     7
## # ... with 679 more rows
For a nice thumbnail in R-bloggers let us finish with a screenshot of the app:
To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.
In the R parallel package, there are two implementations of parallelism, namely fork and socket, each with its own pros and cons.
For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the major limitation is that the fork doesn't work on the Windows system.
On the other hand, the socket works on all operating systems. Each thread runs separately without sharing objects or variables, which can only be passed from the master process explicitly. As a result, it runs slower due to the communication overhead.
Below is an example showing the performance difference between the fork and the socket. A self-defined filter function runs in parallel and extracts the three rows out of 336,776 that meet the criteria. As shown, the fork runs roughly 40% faster than the socket.
df <- read.csv("data/nycflights")
ex <- expression(carrier == "UA" & origin == "EWR" & day == 1 & is.na(arr_time))

# SELECT 3 ROWS OUT OF 336,776
#        year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time ...
# 56866  2013    11   1       NA        NA       NA        NA      UA             252    EWR  IAH       NA ...
# 84148  2013    12   1       NA        NA       NA        NA      UA             643    EWR  ORD       NA ...
# 251405 2013     7   1       NA        NA       NA        NA      UA             394    EWR  ORD       NA ...

parFilter <- function(df, ex, type) {
  cn <- parallel::detectCores() - 1
  cl <- parallel::makeCluster(cn, type = type)
  ### DIVIDE THE DATAFRAME BASED ON # OF CORES
  sp <- parallel::parLapply(cl, parallel::clusterSplit(cl, seq(nrow(df))),
                            function(c_) df[c_, ])
  ### PASS THE OBJECT FROM MASTER PROCESS TO EACH NODE
  parallel::clusterExport(cl, "ex")
  ### EXTRACT ROW INDEX ON EACH NODE
  id <- Reduce(c, parallel::parLapply(cl, sp, function(s_) with(s_, eval(ex))))
  parallel::stopCluster(cl)
  return(df[which(id), ])
}

rbenchmark::benchmark(replications = 10, order = "elapsed", relative = "elapsed",
  columns = c("test", "replications", "elapsed", "relative"),
  " FORK"  = parFilter(df, ex, "FORK"),
  "SOCKET" = parFilter(df, ex, "PSOCK")
)
#     test replications elapsed relative
# 1   FORK           10  59.396    1.000
# 2 SOCKET           10  83.856    1.412
I have a package called strex which is for string manipulation. In this package, I want to take advantage of the regex capabilities of C++11. The reason for this is that in strex, I find myself needing to do a calculation like
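The snippet itself isn't reproduced in this extract; a hypothetical example of the kind of regex-heavy lapply calculation I have in mind (the pattern and input strings are purely illustrative) is:

```r
# Extract the first run of digits from each string (illustrative only)
strings <- c("abc1def", "abc22def", "no digits here")
lapply(strings, function(s) regmatches(s, regexpr("[0-9]+", s)))
```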
An lapply like this can be done faster in C++11, so I'd like to have that speedup in my package. The problem is that this requires the regex capabilities of C++11, which are only supported in gcc >= 4.9. Many people are using an older gcc; e.g. the still-popular Ubuntu 14.04 ships with gcc 4.8. If these people tried to install a strex that relied on C++11 regex, they'd get a compile error.
The hope
I wanted to provide the faster option to those with a capable gcc and the slower lapply option (which isn't painfully slow, just a bit slower) to those with an old gcc. This should all happen inside a seamless install.packages() call; the user needn't be bored by all of this.
The solution
Figuring out which gcc version the user has
The configure step of package installation needed to do different things depending on the gcc version. Kevin Ushey's configure package (https://github.com/kevinushey/configure) allows you to use R to configure R packages (normally you have to use shell commands). This was a saviour. To get the gcc version, I used the processx package (so I had to add it to Imports in DESCRIPTION) to execute the shell command gcc -v.
gcc_version <- function() {
  out <- tryCatch(processx::run("gcc", "-v", stderr_to_stdout = TRUE),
                  error = function(cnd) list(stdout = ""))
  out <- stringr::str_match(out$stdout, "gcc version (\\d+(?:\\.\\d+)*)")[1, 2]
  if (!is.na(out)) out <- numeric_version(out)
  out
}
This returns the gcc version if gcc is installed and NA otherwise. Then, the statement !is.na(gcc_version()) && gcc_version() < "4.9" returns TRUE if the user's gcc does not support C++11 regex and FALSE otherwise.
Dealing with the gcc version
I decided that the default code in the package would be for people with an up-to-date gcc, and that the configure step would make alterations to the code for people with an old gcc. Hence, for people with an old gcc, configure needed to remove all of the C++ code that required C++ regex, and then replace the body of the R function that .Call()ed that (now removed) C++ code with R code performing the same task. It took a long time (many days) and a lot of testing on Travis, but this was the right strategy, and strex now installs beautifully with both new and old gccs.
The code
There's a little too much code to walk through the steps in this blog post (and the steps are specific to this package), but if you're curious as to how this was done, first familiarize yourself with Kevin Ushey's amazing configure package and then read the configuration steps in strex at https://github.com/rorynolan/strex. This includes useful functions like file_replace_R_fun() to change the body of an R function in a file, file_remove_C_fun() to remove a C/C++ function from a file and file_remove_matching_lines() to remove certain lines from a file.
Conclusion
This post is intended to give people an idea of how to deal with this type of problem. If you are struggling with this problem, feel free to contact me; I'm happy to share the limited knowledge that I have.
To leave a comment for the author, please follow the link and comment on their blog: R on roryverse.
The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.¹ In this blog post, I discuss a more subtle case of spurious correlation, one that is not of a causal but of a statistical nature: completely independent processes can be substantially correlated.
AR(1) processes and random walks
Moods, stock markets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = [1, \ldots, T]$ indexes discrete time. We write an AR(1) process as:

$$Y_t = \phi Y_{t-1} + \epsilon_t \enspace ,$$

where $\phi$ gives the correlation with the previous observation, and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a random walk. We can simulate from these using the following code:
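The post's simulation code is not reproduced in this extract; a minimal sketch under the definitions above (the function name and arguments are my own) would be:

```r
# Simulate an AR(1) process: y_t = phi * y_{t-1} + eps_t, eps_t ~ N(0, sigma^2)
# For phi = 1 this gives a random walk
simulate_ar <- function(n, phi, sigma = 1) {
  y <- numeric(n)  # y[1] starts at 0
  for (t in 2:n) {
    y[t] <- phi * y[t - 1] + rnorm(1, mean = 0, sd = sigma)
  }
  y
}

set.seed(1)
y_ar <- simulate_ar(100, phi = 0.5)  # stationary AR(1)
y_rw <- simulate_ar(100, phi = 1)    # random walk
```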
Writing functions that take data frame columns as arguments is a problem that most R users have been confronted with at some point. There are different ways to tackle this issue, and this blog post will focus on the solution provided by the latest release of the {rlang} package. You can read the announcement here, which explains really well what was wrong with the old syntax, and how the new syntax works now.
I have written about the problem of writing functions that use data frame columns as arguments three years ago, and two years ago too. Last year, I wrote a blog post showing how to map a list of functions to a list of datasets, with a list of columns as arguments, using the !!quo(column_name) syntax (!! is pronounced bang-bang). Now there is a new sheriff in town, {{}}, introduced in {rlang} version 0.4.0, that makes things even easier. The suggested pronunciation of {{}} is curly-curly, but there is no consensus yet.
First, let's load the {tidyverse}:
library(tidyverse)
Let's suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments:
## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
## 1 Luke…    172    77 blond      fair       blue            19   male
## 2 C-3PO    167    75            gold       yellow         112
## 3 R2-D2     96    32            white, bl… red             33
## 4 Dart…    202   136 none       white      yellow          41.9 male
## 5 Leia…    150    49 brown      light      brown           19   female
## 6 Owen…    178   120 brown, gr… light      blue            52   male
## # … with 5 more variables: homeworld, species, films,
## #   vehicles, starships
As you can see, there are missing values in the hair_color column. Let's try to count how many missing values are in this column:
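The definition itself is missing from this extract; a sketch of the kind of naive first attempt that produces the error below (the {tidyverse} is loaded above, and the name how_many_na matches the calls that follow) is:

```r
# Naive attempt: refer to the column directly inside filter()
how_many_na <- function(dataframe, column_name) {
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}
```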
how_many_na(starwars, hair_color)
Error: object 'hair_color' not found
R cannot find the hair_color column, and yet it is in the data! Well, this is exactly the issue: the column is inside the data frame, but when we call the function with hair_color as the second argument, R looks for a variable called hair_color that does not exist. What about trying with "hair_color"?
how_many_na(starwars, "hair_color")
## # A tibble: 1 x 1
##       n
## 1     0
Now we get something, but something wrong!
One way to solve this issue is to not use the filter() function, and instead rely on base R:
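The base R version isn't included in this extract; a sketch, assuming the column name is passed as a character string, might be:

```r
# Base R: index the column by name with [[ and count the missing values
how_many_na_base <- function(dataframe, column_name) {
  sum(is.na(dataframe[[column_name]]))
}

# e.g. how_many_na_base(starwars, "hair_color")
```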
This works, but not using the {tidyverse} at all is not an option, at least for me. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}:
The core of the function remains very similar to the version from before, but now one has to use the enquo()–!! syntax. While not overly difficult to use, it is cumbersome.
Now this can be simplified using the new {{}} syntax:
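The function isn't reproduced in this extract; a sketch of a grouped summary written with {{}} (the function and column names are my own, consistent with the := remark below) could look like:

```r
# Group by one column and summarise another, both passed unquoted;
# note := instead of = on the left-hand side of the summary
summarise_mean <- function(data, group_col, summary_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise({{ summary_col }} := mean({{ summary_col }}, na.rm = TRUE))
}

# e.g. summarise_mean(starwars, gender, height)
```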
Much easier and cleaner! You still have to use the := operator instead of = for the column name, however. Also, from my understanding, if you want to modify the column names, for instance to return "mean_height" instead of height in this case, you have to keep using the enquo()–!! syntax.
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
It is June and nearly half of the year is over, marking the middle between Christmas 2018 and 2019. Last autumn, I published a blog post about predicting Wham's "Last Christmas" search volume using Google Trends data with different types of neural network architectures. Of course, now I want to know how good the predictions were compared to the actual search volumes.
The following table shows the predicted values by the different network architectures, the true search volume data in the relevant time region from November 2018 until January 2019, as well as the relative prediction error in brackets:
month     MLP            CNN            LSTM           actual
2018-11   0.166 (0.21)   0.194 (0.078)  0.215 (0.023)  0.21
2018-12   0.858 (0.057)  0.882 (0.031)  0.817 (0.102)  0.91
2019-01   0.035 (0.153)  0.034 (0.149)  0.035 (0.153)  0.03
There's no clear winner in this game. For November, the LSTM model performs best with a relative error of only 2.3%. However, in the "main" month of December, the LSTM drops in accuracy in favor of the 1-dimensional CNN with 3.1% error and the MLP with 5.7% error. Compared to November and December, January exhibits higher prediction errors of more than 10%, regardless of the architecture.
To bring a little more data science flavor into this post, I've created a short R script that presents the results in a cool "heatmap" style.
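The script isn't included in this extract; a minimal sketch of such a heatmap with {ggplot2} (the data frame is re-entered by hand from the table above) could be:

```r
library(ggplot2)

# Relative prediction errors from the table above
results <- expand.grid(month = c("2018-11", "2018-12", "2019-01"),
                       model = c("MLP", "CNN", "LSTM"))
results$error <- c(0.210, 0.057, 0.153,   # MLP
                   0.078, 0.031, 0.149,   # CNN
                   0.023, 0.102, 0.153)   # LSTM

ggplot(results, aes(x = model, y = month, fill = error)) +
  geom_tile(color = "white") +
  geom_text(aes(label = error)) +
  scale_fill_gradient(low = "white", high = "firebrick")
```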
This year, I will (of course) redo the experiment using the newly acquired data. I am curious to find out if the prediction improves. In the meantime, you can sign up to our mailing list, bringing you the best data science, machine learning and AI reads and treats directly into your mailbox!
About the author
Sebastian Heinz
I am the founder and CEO of STATWORX. I enjoy writing about machine learning and AI, especially about neural networks and deep learning. In my spare time, I love to cook, eat and drink as well as traveling the world.
ABOUT US
STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.
This post will show you how to write and read a list of data tables to and from Excel with purrr, the functional programming package from tidyverse. In this example I will also use the packages readxl and writexl for reading and writing in Excel files, and cover methods for both XLSX and CSV (not strictly Excel, but might as well!) files.
Whilst the internet certainly has no shortage of R tutorials on how to read and write Excel files (see this Stack Overflow thread for example), I think a purrr approach still isn't as well-known or well-documented. I find this approach to be very clean and readable, and certainly more "tidyverse-consistent" than approaches which rely on lapply() or for loops. My packages of choice for reading and writing Excel files are readxl and writexl, where the advantage is that neither of them requires external dependencies.
For reading and writing CSV files, I personally have switched back and forth between readr and data.table, depending on whether I have a need to do a particular analysis in data.table (see this discussion on why I sometimes use it in favour of dplyr). Where applicable in this post, I will point out in places where you can use alternatives from data.table for fast reading/writing.
For documentation/demonstration purposes, I'll make the package references (indicated by ::) explicit in the functions below, but it's advisable to remove them in "real life" to avoid code that is overly verbose.
Getting Started
The key functions used in this vignette come from three packages: purrr, readxl, and writexl.
Since purrr is part of core tidyverse, we can simply run library(tidyverse). This is convenient as we'll also use various functions such as group_split() from dplyr and the %>% operator from magrittr in the examples.
Note that although readxl is part of tidyverse, you'll still need to load it explicitly as it's not a "core" tidyverse package.
Writing multiple Excel files
Let us start off with the iris dataset that is pre-loaded with R. If you're not one of us sad people who almost know this dataset by heart, here's what it looks like:
The first thing that we want to do is to create multiple datasets, which we can do so by splitting iris. I'll do this by running group_split() on the Species column, so that each species of iris has its own dataset. This will return a list of three data frames, one for each unique value in Species: setosa, versicolor, and virginica. I'll assign these three data frames to a list object called list_of_dfs:
# Split: one data frame per Species
iris %>%
  dplyr::group_split(Species) -> list_of_dfs

list_of_dfs
I'll also use purrr::map() to take the character values (setosa, versicolor, and virginica) from the Species column itself for assigning names to the list. map() transforms an input by applying a function to each element of the input, and then returns a vector the same length as the input. In this immediate example, the input is the list_of_dfs and we apply the function dplyr::pull() to extract the Species variable from each data frame. We then repeat this approach to convert Species into character type with as.character() and take out a single value with unique():
# Use the value from the "Species" column to provide a name for the list members
list_of_dfs %>%
  purrr::map(~pull(., Species)) %>%  # Pull out the Species variable
  purrr::map(~as.character(.)) %>%   # Convert factor to character
  purrr::map(~unique(.)) -> names(list_of_dfs)  # Set this as names for list members

names(list_of_dfs)
## [1] "setosa" "versicolor" "virginica"
These names will be useful for exporting the data frames into Excel, as they will effectively be our Excel sheet names. You can always manually hard-code the sheet names, but the above approach allows you to do the entire thing dynamically if you need to.
Having set the sheet names, I can then pipe the list of data frames directly into write_xlsx(), where the Excel file name and path is specified in the same path argument:
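The call itself isn't shown in this extract; a sketch (the file path is an assumption, matching the folder used for the CSV exports below) would be:

```r
# Write the whole list to one Excel file; each list member becomes a
# worksheet named after the list names we set above
list_of_dfs %>%
  writexl::write_xlsx(path = "../datasets/test-excel/test-excel.xlsx")
```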
Exporting the list of data frames into multiple CSV files will take a few more lines of code, but still relatively straightforward. There are three main steps:
1. Define a function that tells R what the names for each CSV file should be, which I've called output_csv() below. The data argument will take in a data frame, whilst the names argument will take in a character string that will form part of the file name for the individual CSV file.
2. Create a named list where the names match the arguments of the function you've just defined (data and names), and which contains the objects that you would like to pass through to the function for the respective arguments. In this case, list_of_dfs will provide the three data frames, and names(list_of_dfs) will provide the names of those three data frames. This is necessary for running pmap(), which in my view is basically a super-powered version of map() that lets you iterate over multiple inputs simultaneously.
3. pmap() will then iterate through the two sets of inputs through output_csv() (the inputs are used as arguments), which then writes the three CSV files with the file names you want. For the "writing" function, you could either use write_csv() from readr (part of tidyverse) or fwrite() from data.table, depending on your workflow / style.
# Step 1
# Define a function for exporting csv with the desired file names and into the right path
output_csv <- function(data, names) {
  folder_path <- "../datasets/test-excel/"
  write_csv(data, paste0(folder_path, "test-excel-", names, ".csv"))
}

# Step 2
list(data = list_of_dfs,
     names = names(list_of_dfs)) %>%
  # Step 3
  purrr::pmap(output_csv)
The outcome of the above code is shown below. My directory now contains one Excel file with three Worksheets (sheet names are "setosa", "versicolor", and "virginica"), and three separate CSV files for each data slice:
Reading multiple Excel / CSV files
For reading files in, you'll need to decide on how you want them to be read in. The options are:
1. Read all the datasets directly into the Global Environment as individual data frames with a "separate existence" and separate names.
2. Read all the datasets into a single list, where each data frame is a member of that list.
The first option is best if you are unlikely to run similar operations on all the data frames at the same time. You may for instance want to do this if the data sets that you are reading in are structurally different from each other, and that you are planning to manipulate them separately.
The second option will be best if you are likely to manipulate all the data frames at the same time, where for instance you may run map() with drop_na() as an argument over the list of data frames to remove missing values from all of them at once. The benefit of reading your multiple data sets into a list is that you will have a much cleaner workspace (Global Environment). However, there is a minor and almost negligible inconvenience in accessing individual data frames, as you will need to go into the list and pick out the right member (e.g. doing something like list_of_dfs[[3]]).
Method 1A: Read all sheets in Excel into Global Environment
So let's begin! This method will read all the sheets within a specified Excel file and load them into the Global Environment, using variable names of your own choice. For simplicity, I will use the original Excel sheet names as the variable names.
The first thing to do is to specify the file path to the Excel file:
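The assignment is missing from this extract; given the folder used earlier, it would be something like:

```r
# Path to the Excel workbook written earlier (adjust to your own directory)
wb_source <- "../datasets/test-excel/test-excel.xlsx"
```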
You can then run readxl::excel_sheets() to extract the sheet names in that Excel file, and save it as a character type vector.
# Extract the sheet names as a character string vector
wb_sheets <- readxl::excel_sheets(wb_source)

print(wb_sheets)
## [1] "setosa" "versicolor" "virginica"
The next step is to iterate through the sheet names (saved in wb_sheets) using map(), and within each iteration use assign() (base) and read_xlsx() (from readxl) to load each individual sheet into the Global Environment, giving each one a variable name. Here's the code:
# Load everything into the Global Environment
wb_sheets %>%
  purrr::map(function(sheet) { # iterate through each sheet name
    assign(x = sheet,
           value = readxl::read_xlsx(path = wb_source, sheet = sheet),
           envir = .GlobalEnv)
  })
This is what my work space looks like:
Note that map() always returns a list, but in this case we do not need a list returned and only require the "side effects", i.e. the objects being read in to be assigned to the Global Environment. If you prefer you can use lapply() instead of map(), which for this purpose doesn't make a big practical difference.
Also, assign() allows you to assign a value to a name in an environment, where we've specified the following as arguments:
x: sheet as the variable name
value: The actual data from the sheet we read in. Here, we use readxl::read_xlsx() for reading in specific sheets from the Excel file, where you simply specify the file path and the sheet name as the arguments.
envir: .GlobalEnv as the environment
Method 1B: Read all CSV files in directory into Global Environment
The method for reading CSV files from a directory is slightly different, as you'll need to find a way to identify or create a character vector of names of all the files that you want to load into R. To do this, we'll use list.files(), which produces a character vector of the names of files or directories in the named directory:
We only want CSV files in this instance, so we'll want to do a bit of string manipulation (using str_detect() from stringr – again, from tidyverse) to get only the names that end with the extension ".csv". Let's pipe this along:
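The snippet isn't reproduced in this extract; a sketch that builds the vector of CSV file names (file_path is an assumption, matching the export folder above) is:

```r
library(magrittr) # for %>% (already loaded if you ran library(tidyverse))
library(stringr)

# Folder containing the exported files
file_path <- "../datasets/test-excel/"

# List everything in the folder, then keep only names ending in ".csv"
csv_file_names <- file_path %>%
  list.files() %>%
  .[str_detect(., "\\.csv$")]
```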
The next part is similar to what we've done earlier, using map(). Note that apart from replacing the value argument with read_csv() (or you can use fread() to return a data.table object rather than a tibble), I also removed the file extension in the x argument so that the variable names would not contain the actual characters ".csv":
# Load everything into the Global Environment
csv_file_names %>%
  purrr::map(function(file_name) { # iterate through each file name
    assign(x = str_remove(file_name, ".csv"), # Remove file extension ".csv"
           value = read_csv(paste0(file_path, file_name)),
           envir = .GlobalEnv)
  })
Method 2A: Read all sheets in Excel into a list
Reading sheets into a list is actually easier than to read it into the Global Environment, as map() returns a list and you won't have to use assign() or specify a variable name. Recall that wb_source holds the path of the Excel file, and wb_sheets is a character vector of all the sheet names in the Excel file:
# Read each sheet and collect the results into a list
wb_sheets %>%
  purrr::map(function(sheet) { # iterate through each sheet name
    readxl::read_xlsx(path = wb_source, sheet = sheet)
  }) -> df_list_read # Assign to a list
You can then use map() again to run operations across all members of the list, and even chain operations within it:
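For instance, a sketch of chaining operations across all list members (the particular summary here is my own illustration):

```r
# For every sheet in the list: drop rows with missing values,
# then compute the mean sepal length
df_list_read %>%
  purrr::map(~ .x %>%
               drop_na() %>%
               summarise(mean_sepal_length = mean(Sepal.Length)))
```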
Method 2B: Read all CSV files in directory into a list
At this point you've probably gathered how you can adapt the code to read CSV files into a list, but let's cover it for completeness. No assign() needed; just run read_csv() within the map() function, iterating through the file names:
# Read each CSV file and collect the results into a list
csv_file_names %>%
  purrr::map(function(file_name) { # iterate through each file name
    read_csv(paste0(file_path, file_name))
  }) -> df_list_read2 # Assign to a list
Thank you for reading!
Hopefully this is a helpful tutorial for an iterative approach to writing and reading Excel files. If you like what you read or if you have any suggestions / thoughts about the subject, do leave a comment in the Disqus fields in the blog and let me know!
To leave a comment for the author, please follow the link and comment on their blog: Musings on R.