[R-bloggers] Interacting with AWS from R (and 4 more aRticles)


Interacting with AWS from R

Posted: 30 Jun 2018 06:30 AM PDT

(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

Getting set up

If there is one realisation in life, it is that you will never have enough CPU or RAM available for your analytics. Luckily for us, cloud computing is becoming cheaper each year. One of the more established providers of cloud services is AWS. If you don't know yet, they provide a free, yes free, option. Their t2.micro instance is a 1 CPU, 1GB RAM machine, which doesn't sound like much, but I am running an RStudio and Docker instance on one of these for a small project.

The management console has the following interface:

So, how cool would it be if you could start up one of these instances from R? Well, the cloudyr project makes R a lot better at interacting with cloud-based computing infrastructure. With this in mind, I have been playing with the aws.ec2 package, a simple client for the Amazon Web Services ('AWS') Elastic Compute Cloud ('EC2') API. There is some irritating setup that has to be done, so if you want to use this package, you need to follow the instructions on the GitHub page to create the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION variables in your environment. But once you have figured out this step, the fun starts.

I always enjoy getting the development version of a package, so I am going to install the package straight from GitHub:

devtools::install_github("cloudyr/aws.ec2")  

Next we are going to use an Amazon Machine Image (AMI), which is a pre-built image that already contains all the necessary installations such as R and RStudio. You can also build your own AMI, and I suggest you do so if you are comfortable with the Linux CLI.

Release the beast

library(aws.ec2)

# Describe the AMI (from: http://www.louisaslett.com/RStudio_AMI/)
aws.signature::locate_credentials()
image <- "ami-3b0c205e"
describe_images(image)

In the code snippet above you will notice I call aws.signature::locate_credentials(). I use this function to confirm my credentials. You will need to populate your own credentials after creating a user profile on the IAM management console and generating an ACCESS_KEY for use of the API. My preferred method of supplying the credentials is to add the information to the environment using usethis::edit_r_environ().

Here is my (fake) .Renviron:

AWS_ACCESS_KEY_ID=F8D6E9131F0E0CE508126
AWS_SECRET_ACCESS_KEY=AAK53148eb87db04754+f1f2c8b8cae222a2
AWS_DEFAULT_REGION=us-east-2
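
After restarting R, a quick sanity check (my addition, not part of the original post) confirms that the session can see the new variables:

# Returns TRUE for each variable that is set and non-empty
Sys.getenv(c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")) != ""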

Now we are almost ready to test out the package and its functions, but first, I recommend you source a handy function I wrote that helps tidy the outputs of selected aws.ec2 functions.

source("https://bit.ly/2KnkdzV")  

I found the list object returned from functions such as describe_images(), describe_instances() and instance_status() very verbose and difficult to work with. The tidy_describe() function cleans up the outputs and returns only the most important information. It also implements a pretty_print option, which cats the output to the screen as a table for a quick overview of the information contained in the object. A rough sketch of the idea behind it appears below.
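
The sourced file is the canonical version; the following is only a minimal sketch of the idea (the function name and the flattening approach are my own guesses, not the actual implementation):

# Hypothetical sketch, not the sourced tidy_describe() implementation
tidy_describe_sketch <- function(x, pretty_print = TRUE) {
  # flatten the nested list returned by aws.ec2 into name/value pairs
  flat <- unlist(x, recursive = TRUE, use.names = TRUE)
  out  <- tibble::tibble(field = names(flat), value = unname(flat))
  if (pretty_print) {
    cat("--------------------------------------\n")
    cat("               Summary\n")
    cat("--------------------------------------\n")
    cat(paste(out$field, ":", out$value), sep = "\n")
    invisible(out)
  } else {
    out
  }
}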

Let's use this function to see the output from describe_images() as a pretty_print. Print the aws_describe object without this handy function at your own peril.

image <- "ami-3b0c205e"
aws_describe <- describe_images(image)
aws_describe %>% tidy_describe(.)
--------------------------------------
               Summary
--------------------------------------
imageId : ami-3b0c205e
imageOwnerId : 732690581533
creationDate : 2017-10-17T09:28:45.000Z
name : RStudio-1.1.383_R-3.4.2_Julia-0.6.0_CUDA-8_cuDNN-6_ubuntu-16.04-LTS-64bit
description : Ready to run RStudio + Julia/Python server for statistical computation (www.louisaslett.com). Connect to instance public DNS in web brower (standard port 80), username rstudio and password rstudio

To return as tibble: pretty_print = FALSE

Once we have confirmed that we are happy with the image, we need to save the subnet information as well as the security group information.

s <- describe_subnets()
g <- describe_sgroups()

Now that you have specified those two things, you have all the pieces needed to spin up the machine of your choice. To see what machines are available, visit the instance type webpage. Warning: choosing big machines with lots of CPU and a ton of RAM can be addictive. Winners know when to stop.

In this example I spin up a t2.micro instance, which is part of the free tier that Amazon provides.

# Launch the instance using appropriate settings
i <- run_instances(image = image,
                   type = "t2.micro", # <- change this to something like x1e.32xlarge ($26.688 p/h) if you're feeling adventurous
                   subnet = s[[1]],
                   sgroup = g[[1]])

Once I have executed the code above, I can check on the instance using instance_status() to see if the machine is ready, or describe_instances() to get the meta information on the machine, such as its IP address. Again, I use the custom tidy_describe():

aws_instance <- describe_instances(i)
aws_instance %>% tidy_describe()
--------------------------------------
               Summary
--------------------------------------
ownerId : 748485365675
instanceId : i-007fd9116488691fe
imageId : ami-3b0c205e
instanceType : t2.micro
launchTime : 2018-06-30T13:15:50.000Z
availabilityZone : us-east-2b
privateIpAddress : 172.31.16.198
ipAddress : 18.222.174.186
coreCount : 1
threadsPerCore : 1

To return as tibble: pretty_print = FALSE
aws_status <- instance_status(i)
aws_status %>% tidy_describe()
--------------------------------------
               Summary
--------------------------------------
instanceId : i-007fd9116488691fe
availabilityZone : us-east-2b
code : 16
name : running

To return as tibble: pretty_print = FALSE

The final bit of code (which is VERY important when running large instances) stops the instance and confirms that it has been terminated:

# Stop and terminate the instances
stop_instances(i[[1]])
terminate_instances(i[[1]])
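
As a quick follow-up check (my addition, not in the original post), querying the status again should show the state moving from running to shutting-down and finally terminated:

# Re-check the status after terminating; state should no longer be "running"
aws_status <- instance_status(i)
aws_status %>% tidy_describe()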

Final comments

Working with AWS instances for a while now has really been a game changer in the way I conduct and approach any analytical project. Having the capability to switch on large machines on demand and quickly run any of my analytical scripts has opened up new opportunities for what I can do as a consultant with a very limited hardware budget. Besides, where would I keep a 96-core, 500GB-RAM machine once I had scraped together enough cash to actually build one?

To leave a comment for the author, please follow the link and comment on their blog: Digital Age Economist on Digital Age Economist.


RcppArmadillo 0.8.600.0.0

Posted: 29 Jun 2018 07:11 PM PDT

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)


A new RcppArmadillo release 0.8.600.0.0, based on the new Armadillo release 8.600.0 from this week, just arrived on CRAN.

It follows our (and Conrad's) bi-monthly release schedule. We have made interim and release candidate versions available via the GitHub repo (and as usual thoroughly tested them) but this is the real release cycle. A matching Debian release will be prepared in due course.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 479 other packages on CRAN.

A high-level summary of changes follows (which omits the two rc releases leading up to 8.600.0). Conrad did his usual impressive load of upstream changes, but we are also grateful for the RcppArmadillo fixes added by Keith O'Hara and Santiago Olivella.

Changes in RcppArmadillo version 0.8.600.0.0 (2018-06-28)

  • Upgraded to Armadillo release 8.600.0 (Sabretooth Rugrat)

    • added hess() for Hessenberg decomposition (see the sketch after this changelog)

    • added .row(), .rows(), .col(), .cols() to subcube views

    • expanded .shed_rows() and .shed_cols() to handle cubes

    • expanded .insert_rows() and .insert_cols() to handle cubes

    • expanded subcube views to allow non-contiguous access to slices

    • improved tuning of sparse matrix element access operators

    • faster handling of tridiagonal matrices by solve()

    • faster multiplication of matrices with differing element types when using OpenMP

Changes in RcppArmadillo version 0.8.500.1.1 (2018-05-17) [GH only]

  • Upgraded to Armadillo release 8.500.1 (Caffeine Raider)

    • bug fix for banded matrices

  • Added slam to Suggests: as it is used in two unit test functions [CRAN requests]

  • The RcppArmadillo.package.skeleton() function now works with example_code=FALSE when pkgKitten is present (Santiago Olivella in #231 fixing #229)

  • The LAPACK tests now cover band matrix solvers (Keith O'Hara in #230).
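
As a quick illustration of the new hess() (my example, not from the release notes; hess_decomp is a made-up name), the decomposition can be reached from R with an inline function:

# Sketch: calling Armadillo's hess() from R via inline C++
library(Rcpp)
cppFunction(depends = "RcppArmadillo", code = '
  Rcpp::List hess_decomp(const arma::mat& X) {
    arma::mat U, H;
    arma::hess(U, H, X);   // X = U * H * U.t(), with H upper Hessenberg
    return Rcpp::List::create(Rcpp::Named("U") = U,
                              Rcpp::Named("H") = H);
  }
')

m <- matrix(rnorm(16), 4, 4)
res <- hess_decomp(m)
max(abs(res$U %*% res$H %*% t(res$U) - m))  # ~0: reconstruction holds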

Courtesy of CRANberries, there is a diffstat report relative to the previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Punctuation in literature

Posted: 29 Jun 2018 05:00 PM PDT

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

This morning I was scrolling through Twitter and noticed Alberto Cairo share this lovely data visualization piece by Adam J. Calhoun about the varying prevalence of punctuation in literature. I thought, "I want to do that!" It also offers me the opportunity to chat about a few of the new options available for tokenizing in tidytext via updates to the tokenizers package.

Adam's original piece explores how punctuation is used in nine novels, including my favorite Pride and Prejudice.
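
For a taste of what those new tokenizer options enable (a minimal sketch of my own, not code from either piece), unnest_tokens() can now keep punctuation as tokens via strip_punct = FALSE:

library(dplyr)
library(tidytext)

d <- tibble::tibble(text = "I want to do that! It offers an opportunity, doesn't it?")

# strip_punct = FALSE is passed through to tokenizers::tokenize_words()
d %>%
  unnest_tokens(token, text, token = "words", strip_punct = FALSE) %>%
  filter(token %in% c(",", ".", "!", "?", ";", ":"))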

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


Global Migration, animated with R

Posted: 29 Jun 2018 02:30 PM PDT

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The animation below, by Shanghai University professor Guy Abel, shows migration within and between regions of the world from 1960 to 2015. The data and the methodology behind the chart are described in this paper. The curved bars around the outside represent the peak migrant flows for each region; globally, migration peaked during the 2005-2010 period and then declined in 2010-2015, the latest period for which data are available.

[Animation: global migration chord diagram, 1960-2015]

This animated chord chart was created entirely using the R language. The chord plot showing the flows between regions was created using the circlize package; the tweenr package created the smooth transitions between time periods, and the magick package created the animated GIF you see above. You can find a tutorial on making this animation, including the complete R code, at the link below.

Guy Abel: Animated Directional Chord Diagrams (via Cal Carrie)
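
As a rough sketch of that pipeline (my own toy example with fake data, not Guy Abel's code), circlize draws each frame and magick assembles the GIF:

library(circlize)
library(magick)

set.seed(42)
regions <- c("Africa", "Asia", "Europe", "Americas")

# Draw one chord diagram per period into an in-memory graphics device
frames <- image_graph(width = 400, height = 400, res = 96)
for (year in c(1990, 2000, 2010)) {
  m <- matrix(sample(1:20, 16, replace = TRUE), 4, 4,
              dimnames = list(regions, regions))  # fake flow matrix
  chordDiagram(m)
  title(main = year)
}
dev.off()

# Stitch the frames into an animated GIF
image_write(image_animate(frames, fps = 1), "migration.gif")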

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Benchmarking a SSD drive in reading and writing files with R

Posted: 29 Jun 2018 12:00 AM PDT

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I recently bought a new computer for home and it came with two drives,
one HDD and one SSD. The latter is used for the OS and the former for
all of my files. Of all the computers I have had, both at home and at
work, this is definitely the fastest. While some of the merit is due to
the newer CPU and RAM, the SSD drive can make all the difference in file
operations.

My research usually deals with large files from financial markets. Being
efficient in reading those files is key to my productivity. Given that,
I was very curious to understand how much speed I would gain by
reading/writing files on my SSD drive instead of the HDD. For that,
I wrote a simple function that times a particular operation. The
function takes as input the number of rows in the data (1..Inf), the
type of function used to save the file (rds, csv, fst) and the
type of drive (HDD or SSD). See next.

bench.fct <- function(N = 2500000, type.file = 'rds-base', type.hd = 'HDD') {
  # Function for timing read and write operations
  #
  # INPUT: N - Number of rows in the dataframe to be written and read
  #        type.file - format of output file (rds-base, rds-readr, fst, csv-readr, csv-base)
  #        type.hd - where to save (HDD or SSD)
  #
  # OUTPUT: A dataframe with results
  require(tidyverse)
  require(fst)

  my.df <- data_frame(x = runif(N),
                      char.vec = sample(letters, size = N,
                                        replace = TRUE))

  path.file <- switch(type.hd,
                      'SSD' = '~',
                      'HDD' = '/mnt/HDD/')

  my.file <- file.path(path.file,
                       switch(type.file,
                              'rds-base'  = 'temp_rds.rds',
                              'rds-readr' = 'temp_rds.rds',
                              'fst'       = 'temp_fst.fst',
                              'csv-readr' = 'temp_csv.csv',
                              'csv-base'  = 'temp_csv.csv'))

  if (type.file == 'rds-base') {
    time.write <- system.time(saveRDS(my.df, my.file, compress = FALSE))
    time.read  <- system.time(readRDS(my.file))
  } else if (type.file == 'rds-readr') {
    time.write <- system.time(write_rds(x = my.df, path = my.file, compress = 'none'))
    time.read  <- system.time(read_rds(path = my.file))
  } else if (type.file == 'fst') {
    time.write <- system.time(write.fst(x = my.df, path = my.file))
    time.read  <- system.time(read_fst(my.file))
  } else if (type.file == 'csv-readr') {
    time.write <- system.time(write_csv(x = my.df, path = my.file))
    time.read  <- system.time(read_csv(file = my.file,
                                       col_types = cols(x = col_double(),
                                                        char.vec = col_character())))
  } else if (type.file == 'csv-base') {
    time.write <- system.time(write.csv(x = my.df, file = my.file))
    time.read  <- system.time(read.csv(file = my.file))
  }

  # clean up
  file.remove(my.file)

  # save output (elapsed times, in seconds)
  df.out <- data_frame(type.file = type.file,
                       type.hd = type.hd,
                       N = N,
                       type.time = c('write', 'read'),
                       times = c(time.write[3], time.read[3]))

  return(df.out)
}

Now that we have the function, it's time to use it for all combinations
of number of rows, file format and drive type:

library(purrr)

df.grid <- expand.grid(N = seq(1, 500000, by = 50000),
                       type.file = c('rds-readr', 'rds-base', 'fst', 'csv-readr', 'csv-base'),
                       type.hd = c('HDD', 'SSD'), stringsAsFactors = FALSE)

l.out <- pmap(list(N = df.grid$N,
                   type.file = df.grid$type.file,
                   type.hd = df.grid$type.hd), .f = bench.fct)

df.res <- do.call(what = bind_rows, args = l.out)

Let's check the result in a nice plot:

library(ggplot2)

p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) +
  geom_line() + facet_grid(type.file ~ type.time)

print(p)

As you can see, the csv-base format is messing with the y axis. Let's
remove it for better visualization:

library(ggplot2)

p <- ggplot(filter(df.res, !(type.file %in% c('csv-base'))),
            aes(x = N, y = times, linetype = type.hd)) +
  geom_line() + facet_grid(type.file ~ type.time)

print(p)

When it comes to the file format, we learn:

  • By far, the fst format is the best. It takes less time to read
    and write than the others. However, it is probably unfair to compare
    it to csv and rds, as fst uses many of the 16 cores of my
    computer (see the note after this list).

  • readr is a great package for writing and reading csv files.
    You can see a large time difference compared with the base
    functions. This is likely due to readr's use of low-level functions
    to write and read text files.

  • When using the rds format, the base functions do not differ much
    from the readr functions.
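
A side note of mine (not from the original post): fst's parallelism is tunable, so a fairer single-threaded comparison could pin it to one core before benchmarking:

library(fst)

threads_fst(1)  # restrict fst to a single thread
threads_fst()   # query the current number of threads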

As for the effect of using an SSD, it is clear that it DOES NOT affect
the time of reading and writing. The differences between using the HDD
and the SSD look like noise. Seeking to provide a more robust analysis,
let's formally test this hypothesis using a simple t-test for the means:

tab <- df.res %>%
  group_by(type.file, type.time) %>%
  summarise(mean.HDD = mean(times[type.hd == 'HDD']),
            mean.SSD = mean(times[type.hd == 'SSD']),
            p.value = t.test(times[type.hd == 'SSD'],
                             times[type.hd == 'HDD'])$p.value)

print(tab)

## # A tibble: 10 x 5
## # Groups:   type.file [?]
##    type.file type.time mean.HDD mean.SSD p.value
##    <chr>     <chr>        <dbl>    <dbl>   <dbl>
##  1 csv-base  read       0.554    0.463    0.605
##  2 csv-base  write      0.405    0.405    0.997
##  3 csv-readr read       0.142    0.126    0.687
##  4 csv-readr write      0.0711   0.0706   0.982
##  5 fst       read       0.015    0.0084   0.0584
##  6 fst       write      0.00900  0.00910  0.964
##  7 rds-base  read       0.0321   0.0303   0.848
##  8 rds-base  write      0.0253   0.025    0.969
##  9 rds-readr read       0.0323   0.0304   0.845
## 10 rds-readr write      0.0251   0.0247   0.957

As we can see, the null hypothesis of equal means easily fails to be
rejected for almost all file types and operations at the 10% level. The
exception is the fst format in the read operation. In other words,
statistically, it makes no difference in time whether you use an SSD or
an HDD to read or write files in these formats.

I am very surprised by this result. Regardless of the file format, I
expected a large difference, as SSD drives are much faster within an
OS. Am I missing something? Is this due to the OS being on the SSD?
What do you guys think?

To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin.

