[R-bloggers] Building a Shiny App as a Package (and 14 more aRticles)

Building a Shiny App as a Package

Posted: 30 Apr 2019 07:24 AM PDT

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

Shiny App as a Package

In a previous post, I've introduced the {golem} package, which is an opinionated framework for building production-ready Shiny Applications. This framework starts by creating a package skeleton waiting to be filled.

But, in a world where Shiny Applications are mostly created as a series of files, why bother with a package? This is the question I'll be focusing on in this blog post.

What's in a Shiny App?

OK, so let's ask the question the other way round. Think about your last Shiny app, created either as a single file (app.R) or as a two-file app (ui.R and server.R). You've got these files, and you put them into a folder.

So, let's have a review of what you'll need next for a robust application.

First, metadata. In other words, the name of the app, the version number (which is crucial to any serious, production-level project), what the application does, who to contact if something goes wrong.

Then, you need to find a way to handle the dependencies. Because you know, when you want to push your app into production, you can't have this conversation with IT:

IT: Hey, I tried to 'source("app.R")' but I've got an error.

R-dev: What's the error?

IT: It says "could not find package 'shiny'".

R-dev: Ah yes, you need to install {shiny}. Try to run 'install.packages("shiny")'.

IT: Ok nice. What else?

R-dev: Let me think, try also 'install.packages("DT")'… good? Now try 'install.packages("ggplot2")', and …

[…]

IT: Ok, now I source the 'app.R', right?

R-dev: Sure!

IT: Ok so it says 'could not find function runApp()'

R-dev: Ah, you've got to do library(shiny) at the beginning of your script. And library(purrr), and library(jsonlite)*.

* Which will lead to a namespace conflict on the flatten() function that can cause you some debugging headaches. So, hey, it would be cool if we could have a Shiny app that only imports specific functions from a package, right?

So yes, dependencies matter. You need to handle them, and handle them correctly.

Now, let's say you're building a big app. Something with thousands of lines of code. Handling a one-file or two-file Shiny app with that many lines is just a nightmare. So, what to do? Let's split everything into smaller files that we can call!

And finally, we want our app to live long and prosper, which means we need to document it: each small piece of code should have a comment explaining what these specific lines do. The other thing we need for our application to be successful in the long run is tests, so that we are sure we're not introducing any regression.

Oh, and it would be nice if people could get a tar.gz, install it on their computer, and have access to a local copy of the app!

OK, so let's sum up: we want to build an app. This app needs to have metadata and to handle dependencies correctly, which is what you get from the DESCRIPTION + NAMESPACE files of a package. Even more practical is the fact that you can do "selective namespace extraction" inside a package, i.e. you can say "I want this function from this package". Also, a big app needs to be split up into smaller .R files, which is the way a package is organized. And I don't need to emphasize how documentation is a vital part of any package, so we have solved this question too. So is the testing toolkit. And of course, the "install everywhere" wish comes to life when a Shiny App is in a package.
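To illustrate the "selective namespace extraction" point, here is a minimal sketch (not code generated by {golem}; the helper function is hypothetical): dependencies are declared once in the DESCRIPTION Imports field, and a roxygen2 tag turns into a selective importFrom() directive in the NAMESPACE.

# R/utils_flatten.R -- hypothetical helper, for illustration only.
# DESCRIPTION contains `Imports: purrr, jsonlite, shiny`; the roxygen tag
# below generates `importFrom(purrr, flatten)` in NAMESPACE, so only purrr's
# flatten() is imported and the clash with jsonlite::flatten() is avoided.
#' @importFrom purrr flatten
flatten_results <- function(x) {
  flatten(x)
}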

The other plus side of Shiny as a Package

Testing

Nothing should go to production without being tested. Nothing. Testing production apps is a wide question, and I'll just stick to tests inside a Package here.

Frameworks for package testing are robust and widely documented. So you don't have to put in any extra effort here: just use a canonical testing framework like {testthat}. Learning how to use it is not the subject of this post, so feel free to refer to the documentation. See also Chapter 5 of "Building a package that lasts".
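For instance, a plain {testthat} unit test for one of the app's backend functions might look like this (a sketch; clean_data() is a hypothetical function of the app package, not something {golem} provides):

# tests/testthat/test-business-logic.R
library(testthat)

test_that("clean_data() drops rows with missing values", {
  # clean_data() is a hypothetical backend function of the app package
  res <- clean_data(data.frame(x = c(1, NA, 3)))
  expect_equal(nrow(res), 2)
  expect_false(anyNA(res$x))
})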

What should you test?

  • First of all, as we've said before, the app should be split between the UI part and the backend (or 'business logic') part. These backend functions are supposed to run without any interactive context, just as plain old functions. So for these ones, you can do classical tests. As they are backend functions (so specific to a project), {golem} can't provide any helpers for that.
  • For the UI part, remember that any UI function is designed to render an HTML element. So you can save a UI object as an HTML file, and then compare it to that file with golem::expect_html_equal().
library(shiny)
ui <- tagList(h1("Hello world!"))
htmltools::save_html(ui, "ui.html")
golem::expect_html_equal(ui, "ui.html")
# Changes
ui <- tagList(h2("Hello world!"))
golem::expect_html_equal(ui, "ui.html")

This can for example be useful if you need to test a module. A UI module function returns an HTML tag list, so once your modules are set you can save them and use them inside tests.

my_mod_ui <- function(id){
  ns <- NS(id)
  tagList(
    selectInput(ns("this"), "that", choices = LETTERS[1:4])
  )
}
my_mod_ui_test <- tempfile(fileext = ".html")
htmltools::save_html(my_mod_ui("test"), my_mod_ui_test)
# Some time later, and of course saved in the test folder,
# not as a temp file
golem::expect_html_equal(my_mod_ui("test"), my_mod_ui_test)

{golem} also provides two functions, expect_shinytag() and expect_shinytaglist(), that test whether an object is of class "shiny.tag" or "shiny.tag.list".

  • Testing package launch: when you run golem::use_recommended_tests(), you'll find a test built on top of {processx} that checks whether the application can be launched. Here's a short description of what happens:
# Standard testthat things
context("launch")

library(processx)

testthat::test_that(
  "app launches", {
    # We're creating a new process that runs the app
    x <- process$new(
      "R",
      c(
        "-e",
        # As we are in the tests/testthat dir, we're moving
        # two steps back before launching the whole package
        # and we try to launch the app
        "setwd('../../'); pkgload::load_all(); run_app()"
      )
    )
    # We leave some time for the app to launch
    # Configure this according to your need
    Sys.sleep(5)
    # We check that the app is alive
    expect_true(x$is_alive())
    # We kill it
    x$kill()
  }
)

Note: this specific configuration may fail on continuous integration platforms such as GitLab CI or Travis. A workaround is, inside your CI YAML, to first install the package with remotes::install_local(), and then replace the setwd(...); run_app() command with myuberapp::run_app().

For example:

  • in .gitlab-ci.yml:
test:
  stage: test
  script:
    - echo "Running tests"
    - R -e 'remotes::install_local()'
    - R -e 'devtools::check()'
  • in test-golem.R:
testthat::test_that(
  "app launches", {
    x <- process$new(
      "R",
      c(
        "-e",
        "myuberapp::run_app()"
      )
    )
    Sys.sleep(5)
    expect_true(x$is_alive())
    x$kill()
  }
)

Documenting

Documenting packages is a natural thing for any R developer. Any exported function should have its own documentation, hence you are "forced" to document any user-facing function.

Also, building a Shiny App as a package allows you to write standard R documentation:

  • A README at the root of your package
  • Vignettes that explain how to use your app
  • A {pkgdown} website that can serve as an external documentation site for your application.
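For example, the exported entry point of the app would carry standard roxygen2 documentation. A minimal sketch (the body is simplified; the run_app() that {golem} generates looks a bit different):

#' Run the Shiny Application
#'
#' Starts the application contained in the package.
#'
#' @export
#' @importFrom shiny shinyApp
run_app <- function() {
  shinyApp(ui = app_ui(), server = app_server)
}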

Deploy

Local deployment

As your Shiny App is a standard package, it can be built as a tar.gz, sent to your colleagues, friends, and family, and even to CRAN. It can also be installed in any R package repository. Then, if you've built your app with {golem}, you'll just have to do:

library(myuberapp)
run_app()

to launch your app.

RStudio Connect & Shiny Server

Both of these platforms expect an app file configuration, i.e. an app.R file or ui.R / server.R files. So how can we integrate this "Shiny App as a Package" approach into Connect or Shiny Server?

  • Using an internal package manager like RStudio Package Manager, where the app package is installed; then you simply have to create an app.R containing the small piece of code from the previous section.
  • Uploading the package folder to the server. In that scenario, you use the package folder as the app package, and upload the whole thing. Then, write an app.R that does:
pkgload::load_all()
shiny::shinyApp(ui = app_ui(), server = app_server)

And of course, don't forget to list this file in the .Rbuildignore!

This is the file you'll get if you run golem::add_rconnect_file().
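For the .Rbuildignore step mentioned above, {usethis} offers a one-liner (a small usage note, not specific to {golem}):

# adds a regular expression matching app.R to .Rbuildignore
usethis::use_build_ignore("app.R")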

Docker containers

In order to dockerize your app, simply install the package like any other package, and use as the CMD: R -e 'options(shiny.port = 80, shiny.host = "0.0.0.0"); myuberapp::run_app()'. Of course, change the port to the one you need.

You'll get the Dockerfile you need with golem::add_dockerfile().

Resources

The post Building a Shiny App as a Package appeared first on (en) The R Task Force.

To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.


Machine Learning explained for statistics folk

Posted: 30 Apr 2019 06:58 AM PDT

(This article was first published on R – Dataviz – Stats – Bayes, and kindly contributed to R-bloggers)

I'm running a one-day workshop called "From Statistics To Machine Learning" in central London on 28 October, for anyone who learnt some statistics and wants to find out about machine learning methods.

I guess you might feel frustrated. There's a lot of interest and investment in machine learning, but it's hard to know where to start learning if you have a stats background. Writing is either aimed squarely at experts in computer science, with a lot of weird jargon that looks almost familiar, but isn't quite (features? loss functions? out-of-bag?), or is written for gullible idiots with more money than sense.

This workshop is a mixture of about 25% talks, 25% demonstration with R, and 50% small-group activities where you will consider scenarios detailing realistic data analysis needs for public sector, private sector and NGOs. Your task will be to discuss possible machine learning approaches and to propose an analysis plan, detailing pros and cons and risks to the organisation from data problems and unethical consequences. You'll also have time to practice communicating the analysis plan in plain English to decision-makers. With only 18 seats, it'll be a small enough session for everyone to learn from each other and really get involved. This is not one where you sit at the back and catch up on your emails!

I've built this out of training I've done for clients who wanted machine learning and AI demystified. The difference here is that you're expected to have some statistical know-how, so we can start talking about the strengths and weaknesses of machine learning in a rigorous way that is actually useful to you in your work straight away.

Here are some topics we'll cover:

  • Terminology and jargon
  • Supervised and unsupervised learning
  • Ensembles, bagging and boosting
  • Neural networks, image data and adversarial thinking
  • AI and ethical concerns
  • Reinforcement and imitation learning
  • Big data's challenges, opportunities and hype
  • Speed and memory efficiency
  • Concepts of model building such as cross-validation and feature engineering
  • Options for software, outsourcing and software-as-a-service
  • Data science workplaces combining statistical expertise with machine learning: what makes them happy and healthy

You can bring a laptop to try out some of the examples in R, but this is not essential.

It's £110 all inclusive, which is about as cheap as I can make it with a central London venue. I'm mindful that a lot of people interested in this might be students or academics or public sector analysts, and I know you don't exactly have big training budgets to dip into. Having said that, you can haggle with me if you're a self-funding student or out of work or whatever.

Venue is TBC but I'll confirm soon. If I get my way it'll be one where you get breakfast (yes!), bottomless coffee and tea (yes!!) and a genuinely nice lunch (yes!!!), all included in the cost. Then, we can repair to the craft beer company afterwards and compare laptop stickers.

Book 'em quick here! I've had a lot of interest in this and it might go fast.

To leave a comment for the author, please follow the link and comment on their blog: R – Dataviz – Stats – Bayes.


Spatial data and maps conference – FOSS4G

Posted: 30 Apr 2019 03:36 AM PDT

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I'm helping organise a conference on (geo)spatial open source software – FOSS4G. We're hosting it in the great city of Edinburgh, Scotland in September 2019.

Abstract submissions: https://uk.osgeo.org/foss4guk2019/talks_workshops.html

We're very interested in hearing your tales of R, QGIS, Python, GRASS, PostGIS, etc.!

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.


The Rich didn’t earn their Wealth, they just got Lucky

Posted: 30 Apr 2019 03:00 AM PDT

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)


Tomorrow, on the First of May, many countries celebrate the so-called International Workers' Day (or Labour Day): time to talk about the unequal distribution of wealth again!

A few months ago I posted a piece with the title "If wealth had anything to do with intelligence…" where I argued that ability, e.g. intelligence, as an input has nothing to do with wealth as an output. It drew a lot of criticism (as expected), most of it unfounded in my opinion, but one piece merits some discussion: the fact that the intelligence quotient (IQ) is normally distributed by construction. The argument goes that intelligence per se may be a distribution with fat tails too, but by the way the IQ is constructed the metric is transformed into a well-formed Gaussian distribution. To a degree this is certainly true, yet I would still argue that the distributions of intelligence and of all other human abilities are far better behaved than the extremely unequal distribution of wealth. I wrote in a comment:

There are many aspects in your comment that are certainly true. Obviously there are huge problems in measuring "true" mental abilities, which is the exact reason why people came up with a somewhat artificial "intelligence quotient" with all its shortcomings.

What would be interesting to see is (and I don't know if you perhaps have a source about this) what the outcome of an intelligence test would look like without the "quotient" part, i.e. without subsequently normalizing the results.

I guess the relationship wouldn't be strictly linear but it wouldn't be as extreme as the wealth distribution either.

What I think is true in any case, independent of the distributions, is when you rank all people by intelligence and by wealth respectively you wouldn't really see any stable connection – and that spirit was the intention of my post in the first place and I still stand by it, although some of the technicalities are obviously debatable.

So, if you have a source, Dear Reader, you are more than welcome to share it in the comments – I am always eager to learn!

I ended my post with:

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

In this post I will argue that luck is a good candidate, so read on…

In 2014 there was a special issue of the renowned magazine Science titled "The science of inequality". In one of the articles (Cho, A.: "Physicists say it's simple") the following thought experiment is being proposed:

Suppose you randomly divide $500 million in income among 10,000 people. There's only one way to give everyone an equal, $50,000 share. So if you're doling out earnings randomly, equality is extremely unlikely. But there are countless ways to give a few people a lot of cash and many people a little or nothing. In fact, given all the ways you could divvy out income, most of them produce an exponential distribution of income.

So, the basic idea is to randomly throw 9,999 darts at a scale ranging from zero to 500 million and study the resulting distribution of intervals:

library(MASS)

w <- 5e8 # wealth
p <- 1e4 # no. of people

set.seed(123)
d <- diff(c(0, sort(runif(p-1, max = w)), w)) # wealth distribution
h <- hist(d, col = "red", main = "Exponential decline", freq = FALSE,
          breaks = 45, xlim = c(0, quantile(d, 0.99)))

fit <- fitdistr(d, "exponential")
curve(dexp(x, rate = fit$estimate), col = "black", type = "p", pch = 16, add = TRUE)

The resulting distribution fits an exponential distribution very well. You can read some interesting discussions concerning this result on CrossValidated StackExchange: How can I analytically prove that randomly dividing an amount results in an exponential distribution (of e.g. income and wealth)?
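As a quick plausibility check (not part of the original post): the mean gap is w/p = $50,000 by construction, so the fitted exponential rate should come out close to 1/50,000 = 2e-05:

mean(d)          # exactly w/p = 50000 by construction
fit$estimate     # rate of the fitted exponential, close to 1/50000 = 2e-05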

Just to give you an idea of how unfair this distribution is: the richest six persons have more wealth than the poorest ten percent together:

sum(sort(d)[9994:10000]) - sum(sort(d)[0:1000])
## [1] 183670.8

If you think that this is ridiculous just look at the real global wealth distribution: here it is not six but three persons who own more than the poorest ten percent!

Now, what does that mean? Well, equality seems to be the exception and (extreme) inequality the rule. The intervals were found randomly, no interval had any special skills, just luck – and the result is (extreme) inequality – as in the real world!

If you can reproduce the wealth distribution of a society stochastically, the implication could be that it wasn't so much the extraordinary skills of the rich that made them rich; they just got lucky.

Some rich people are decent enough to admit this. In his impressive essay "Why Poverty Is Like a Disease" Christian H. Cooper, a hillbilly turned investment banker writes:

So how did I get out? By chance.

It's easy to attach a post-facto narrative of talent and hard work to my story, because that's what we're fed by everything from Hollywood to political stump speeches. But it's the wrong story. My escape was made up of a series of incredibly unlikely events, none of which I had real control over.

[…]

I am the exception that proves the rule—but that rule is that escape from poverty is a matter of chance, and not a matter of merit.

A consequence would be that you cannot really learn much from the rich. So throw away all of your self-help books on how to become successful. I will end with a cartoon that brings home this message, on a closely related concept, the so-called survivorship bias (which is also important to keep in mind when backtesting trading strategies in quantitative finance, the topic of an upcoming post… so stay tuned!):

Source: xkcd.com/1827

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


Google Next 2019 – observations from the Mango viewing deck

Posted: 30 Apr 2019 02:01 AM PDT

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

The first thing that hit me about this year's Google Next was the scale. Thousands of people wandering around Moscone Centre in San Fran with their name badges proudly displayed really brought home that this is a platform of which its users are proud. I was told by several people that the conference has doubled in size year on year, which, given there were 35,000 people at this year's event, may well prove a logistical nightmare next year.

I was really struck by the massive enthusiasm of the Google team, the way in which the company's leadership is aware of Google's market position and how keen they seem to be to take pragmatic and strategic decisions based on where they are, versus where they might like to be. The opening up to other cloud platforms via the Anthos announcement seems a good way for Google to position itself as the honest broker in the field – they have identified legacy apps and codebases as difficult to turn over, something with which I think many organisations will feel comfortable.

There were rafts of customer testimonials and whilst many of them did not seem to contain much in the way of ROI details, the mere fact that Google could call these C-level Fortune 500 companies onto the stage speaks towards a clarity of intent and purpose.

The nature of many of these types of events is fairly broad, and considering that Mango is a relatively niche player, it is sometimes difficult to find the areas and talks that may resonate with our posture. That was true of these sessions, but not entirely surprising. The widescale abuse of terms like AI and Machine Learning carries on apace, and we at Mango need to find ways to gently persuade people that when they think of AI, they actually mean Machine Learning, and when they talk about Machine Learning they might well be talking about, well, models. The current push in the industry is to be able to add these complex components at the click of a button, by an untrained analyst who can then roll them into production without proper validation or testing. These are dangerous situations and reminded me of the importance of doing some of the hard yards first, i.e. creating an analytic culture and community to ensure that the "right" analytics are used with the "right" data. It's clear however that the opening up of cloud platforms is creating an arena in which advanced analytics will play a crucial role, and this presents massive opportunities for Mango in working with Google and our customers.

We've loved being back in San Francisco, and it's always lovely to be around passionate and energetic advocates. Hopefully London Next later in the year will be equally energetic and fun.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


Zooming in on maps with sf and ggplot2

Posted: 30 Apr 2019 01:58 AM PDT

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

When working with geo-spatial data in R, I usually use the sf package for manipulating spatial data as Simple Features objects and ggplot2 with geom_sf for visualizing these data. One thing that comes up regularly is "zooming in" on a certain region of interest, i.e. displaying a certain map detail. There are several ways to do so. Three common options are:

  • selecting only certain areas of interest from the spatial dataset (e.g. only certain countries / continent(s) / etc.)
  • cropping the geometries in the spatial dataset using st_crop()
  • restricting the display window via coord_sf()

I will show the advantages and disadvantages of these options and especially focus on how to zoom in on a certain point of interest at a specific "zoom level". We will see how to calculate the coordinates of the display window or "bounding box" around this zoom point.

Loading the required libraries and data

We'll need to load the following packages:

library(ggplot2)
library(sf)
library(rnaturalearth)

Using ne_countries() from the rnaturalearth package, we can get a spatial dataset with the borders of all countries in the world. We load this data as a "simple feature collection" by specifying returnclass = 'sf'. The result is a spatial dataset that can be processed with tools from the sf package.

worldmap <- ne_countries(scale = 'medium', type = 'map_units',
                         returnclass = 'sf')
# have a look at these two columns only
head(worldmap[c('name', 'continent')])
Simple feature collection with 6 features and 2 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -180 ymin: -18.28799 xmax: 180 ymax: 83.23324
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
                       name     continent                       geometry
0                      Fiji       Oceania MULTIPOLYGON (((180 -16.067...
1                  Tanzania        Africa MULTIPOLYGON (((33.90371 -0...
2                 W. Sahara        Africa MULTIPOLYGON (((-8.66559 27...
3                    Canada North America MULTIPOLYGON (((-122.84 49,...
4  United States of America North America MULTIPOLYGON (((-122.84 49,...
5                Kazakhstan          Asia MULTIPOLYGON (((87.35997 49...

We see that we have a spatial dataset consisting of several attributes (only name and continent are shown above) for each spatial feature which is stored in the geometry column. The header above the table displays further information such as the coordinate reference system (CRS) which will be important later. For now, it is important to know that the coordinates in the plot are longitude / latitude coordinates in degrees in a CRS called WGS84. The longitude reaches from -180° (180° west of the meridian) to +180° (180° east of the meridian). The latitude reaches from -90° (90° south of the equator) to +90° (90° north of the equator).
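If you want to inspect the CRS directly (a small aside, not from the original post), sf provides st_crs(); for this dataset it reports WGS84 / EPSG:4326, matching the header above:

st_crs(worldmap)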

Let's plot this data to confirm what we have:

ggplot() + geom_sf(data = worldmap) + theme_bw()  

World map

Selecting certain areas of interest from a spatial dataset

This is quite straightforward: we can plot a subset of a larger spatial dataset, and ggplot2 automatically sets the display window accordingly, so that it only includes the selected geometries. Let's select and plot France only:

france <- worldmap[worldmap$name == 'France',]
ggplot() + geom_sf(data = france) + theme_bw()

France

Let's try to select and plot all countries from Europe:

europe <- worldmap[worldmap$continent == 'Europe',]
ggplot() + geom_sf(data = europe) + theme_bw()

Europe by subsetting

Well, this is probably not the section that you want to display when you make a plot of Europe. It includes the whole of Russia (since its continent attribute is set to "Europe"), which makes the plot span the whole northern globe. At the same time, nearby countries such as Turkey or countries from northern Africa are missing, which you might want to include to show "context" around your region of interest.

Cropping geometries in the spatial dataset or restricting the display window

To fix this, you may crop the geometries in your spatial dataset, as if you took a knife and sliced through the shapes on the map to take out a rectangular region of interest. The alternative would be to restrict the display window of your plot, which is like taking the map and folding it so that you only see the region of interest. The rest of the map still exists, you just don't see it.

So both approaches are quite similar, only that the first actually changes the data by cutting the geometries, whereas the latter only "zooms" to a detail but keeps the underlying data untouched.

Let's try out the cropping method first, which can be done with st_crop(). It works by passing the spatial dataset (the whole worldmap dataset in our case) and specifying a rectangular region which we'd like to cut. Our coordinates are specified as longitude / latitude in degrees, as mentioned initially. We set the minimum longitude (i.e. the left border of the crop area) as xmin and the maximum longitude (i.e. the right border) as xmax. The minimum / maximum latitude (i.e. the lower and upper borders) are set analogously as ymin and ymax. For a cropping area of Europe, we can set the longitude range to be in [-20, 45] and the latitude range to be in [30, 73]:

europe_cropped <- st_crop(worldmap, xmin = -20, xmax = 45,
                          ymin = 30, ymax = 73)
ggplot() + geom_sf(data = europe_cropped) + theme_bw()

Europe cropped

You can see that the shapes were actually cropped at the specified ranges when you look at the gaps near the plot margins. If you want to remove these gaps, you can add + coord_sf(expand = FALSE) to the plot.
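For example (a one-line variation of the plot above):

ggplot() + geom_sf(data = europe_cropped) + coord_sf(expand = FALSE) + theme_bw()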

We can also see that we now have clipped shapes when we project them to a different CRS. We can see that in the default projection that is used by ggplot2, the grid of longitudes and latitudes, which is called graticule, is made of perpendicular lines (displayed as light gray lines). When we use a Mollweide projection, which is a "pseudo-cylindrical" projection, we see how the clipped shapes are warped along the meridians, which do not run in parallel in this projection but converge to the polar ends of the prime meridian:
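The projected plot shown below can be produced along these lines (a sketch, using the same coord_sf(crs = ...) mechanism that appears later in this post):

ggplot() + geom_sf(data = europe_cropped) +
    coord_sf(crs = st_crs('+proj=moll')) +
    theme_bw()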

Europe cropped in Mollweide projection

Restricting the display window for plotting achieves a very similar result, only that we don't need to modify the spatial dataset. We can use coord_sf() to specify a range of longitudes (xlim vector with minimum and maximum) and latitudes (ylim vector) to display, similar to what we did for st_crop():

ggplot() + geom_sf(data = worldmap) +
    coord_sf(xlim = c(-20, 45), ylim = c(30, 73), expand = FALSE) +
    theme_bw()

Europe when restricting the display window

You see that the result is very similar, but we use the whole worldmap dataset without cropping it. Notice that I used expand = FALSE which removes the padding around the limits of the display window (this is what produced the "gaps" at the plot margins before).

Now what if we wanted to use the Mollweide projection again as target projection for this section? Let's try:

ggplot() + geom_sf(data = worldmap) +
    coord_sf(crs = st_crs('+proj=moll'),
             xlim = c(-20, 45), ylim = c(30, 73),
             expand = FALSE) +
    theme_bw()

Failure when changing the projection

What happened here? The problem is that we specified xlim and ylim as WGS84 coordinates (longitude / latitude in degrees) but coord_sf needs these coordinates to be in the same CRS as the target projection, which is the Mollweide projection in our case. Mollweide uses meters as unit, so we actually specified a range of -20m to 45m on the x-scale and 30m to 73m on the y-scale, which is a really tiny patch near the center of projection.

So in order to fix that, we'd need to transform the coordinates of the display window to the target CRS first. First, let's get a visual overview on what we want to do:

Overview of display window of Europe in Mollweide projection

What we see is the whole worldmap dataset projected to our target CRS (standard Mollweide projection). There are two points, A = (-20°, 30°) and B = (45°, 73°), which define our display window, highlighted with red lines. The display window is a trapezoid because the originally rectangular display window defined by longitude / latitude coordinates is warped along the meridians, just as our cropped shape of Europe before. When you have a look at the axes, you'll see that there are no degrees displayed here anymore. Instead, the graticule shows the target CRS, which has the unit meters (this is why the numbers on the axes are so large). Our corner points of the display window A and B must be converted to this CRS, too, and we can do that as follows:

At first, we set the target CRS and transform the whole worldmap dataset to that CRS.

target_crs <- '+proj=moll'
worldmap_trans <- st_transform(worldmap, crs = target_crs)

Next, we specify the display window in WGS84 coordinates as longitude / latitude degrees. We only specify the bottom left and top right corners (points A and B). The CRS is set to WGS84 by using the EPSG code 4326.

disp_win_wgs84 <- st_sfc(st_point(c(-20, 30)), st_point(c(45, 73)),
                         crs = 4326)
disp_win_wgs84
Geometry set for 2 features
geometry type:  POINT
dimension:      XY
bbox:           xmin: -20 ymin: 30 xmax: 45 ymax: 73
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
POINT (-20 30)
POINT (45 73)

These coordinates can be transformed to our target CRS via st_transform.

disp_win_trans <- st_transform(disp_win_wgs84, crs = target_crs)
disp_win_trans
Geometry set for 2 features
geometry type:  POINT
dimension:      XY
bbox:           xmin: -1833617 ymin: 3643854 xmax: 2065900 ymax: 8018072
epsg (SRID):    NA
proj4string:    +proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
POINT (-1833617 3643854)
POINT (2065900 8018072)

We can extract the coordinates from these points and pass them as limits for the x and y scale respectively. Note also that I set datum = target_crs which makes sure that the graticule is displayed in the target CRS. Otherwise ggplot2 by default displays a WGS84 graticule.

disp_win_coord <- st_coordinates(disp_win_trans)

ggplot() + geom_sf(data = worldmap_trans) +
    coord_sf(xlim = disp_win_coord[,'X'], ylim = disp_win_coord[,'Y'],
             datum = target_crs, expand = FALSE) +
    theme_bw()

Display window of Europe in Mollweide projection

This works, but if you look closely, you'll notice that the display window is slightly different than the one specified in WGS84 coordinates before (the red trapezoid). For example, Turkey is completely missing. This is because the coordinates describe the rectangle between A and B in Mollweide projection, and not the red trapezoid that comes from the original projection.

Calculating the display window for given a center point and zoom level

Instead of specifying a rectangular region of interest, I found it more comfortable to set a center point of interest and specify a "zoom level" which specifies how much area around this spot should be displayed.

I oriented myself on how zoom levels are defined for OpenStreetMap: A level of zero shows the whole world. A level of 1 shows a quarter of the world map. A level of two shows 1/16th of the world map and so on. This goes up to a zoom level of 19. So if z is the zoom level, we see a region that covers 1/4^z of the world. This means the range on the longitude axis that is displayed is 360°/2^z (because the longitude spans the full horizontal circle around the world from -180° to +180°) and the range on the latitude axis is 180°/2^z (latitude spans the half vertical circle from -90° to +90°).

Let's set a center point zoom_to, a zoom_level and calculate the ranges for longitude and latitude (lon_span and lat_span).

zoom_to <- c(13.38, 52.52)  # Berlin

zoom_level <- 3

lon_span <- 360 / 2^zoom_level
lat_span <- 180 / 2^zoom_level

Now we can calculate the longitude and latitude ranges for the display window by subtracting/adding half of the above ranges from/to the zoom center coordinates respectively.

lon_bounds <- c(zoom_to[1] - lon_span / 2, zoom_to[1] + lon_span / 2)
lat_bounds <- c(zoom_to[2] - lat_span / 2, zoom_to[2] + lat_span / 2)

We can now plot the map with the specified display window for zoom level 3, highlighting the zoom center as a red dot:

ggplot() + geom_sf(data = worldmap) +
    geom_sf(data = st_sfc(st_point(zoom_to), crs = 4326),
            color = 'red') +
    coord_sf(xlim = lon_bounds, ylim = lat_bounds) + theme_bw()

Zoom at level 3 on Berlin

So this works quite well when using the default projection and WGS84 coordinates. How about another projection? Let's put Kamchatka at the center of our map, set a zoom level, and use a Lambert azimuthal equal-area projection (LAEA) centered at our zoom point:

zoom_to <- c(159.21, 56.41)  # ~ center of Kamchatka
zoom_level <- 2.5

# Lambert azimuthal equal-area projection around center of interest
target_crs <- sprintf('+proj=laea +lon_0=%f +lat_0=%f',
                      zoom_to[1], zoom_to[2])

When we worked with WGS84 coordinates before, we calculated the displayed range of longitude and latitude at a given zoom level z in degrees by dividing the full circle for the longitude (360°) by 2^z and the half circle (180°) for the latitude in the same way. A LAEA projection uses meters as its unit instead of degrees, so instead of dividing a circle of 360°, we divide the full circumference of the Earth in meters, which is about C = 40,075,016m, by 2^z to get the display range for the x-axis (corresponding to the longitude range before). As before, we use the half circumference (assuming that the Earth is a perfect sphere, which she is not) for the displayed range on the y-axis, so we get C / 2^(z+1).

C <- 40075016.686   # ~ circumference of Earth in meters
x_span <- C / 2^zoom_level
y_span <- C / 2^(zoom_level+1)

We can now calculate the lower left and upper right corners of our display window by subtracting/adding half of x_span and y_span from/to the zoom center coordinate, as before. But of course, the zoom center has to be converted to the target CRS, too. We put the zoom point at the center of the projection, so in this special case we already know that it is at (0m, 0m) in the target CRS. Anyway, I will provide a general calculation, because when you use a standardized CRS with an EPSG code, your zoom point probably won't be the center of the projection:

zoom_to_xy <- st_transform(st_sfc(st_point(zoom_to), crs = 4326),
                           crs = target_crs)
zoom_to_xy
Geometry set for 1 feature
geometry type:  POINT
dimension:      XY
bbox:           xmin: 1.570701e-09 ymin: 7.070809e-10 xmax: 1.570701e-09 ymax: 7.070809e-10
epsg (SRID):    NA
proj4string:    +proj=laea +lat_0=56.41 +lon_0=159.21 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
POINT (1.570701e-09 7.070809e-10)

Rounding errors aside, this confirms that the zoom point is at the center of projection. We can now calculate the display window coordinates as stated before¹:

disp_window <- st_sfc(
    st_point(st_coordinates(zoom_to_xy - c(x_span / 2, y_span / 2))),
    st_point(st_coordinates(zoom_to_xy + c(x_span / 2, y_span / 2))),
    crs = target_crs
)

And now passing the display window coordinates to coord_sf, we can plot the result:

ggplot() + geom_sf(data = worldmap) +
    geom_sf(data = zoom_to_xy, color = 'red') +
    coord_sf(xlim = st_coordinates(disp_window)[,'X'],
             ylim = st_coordinates(disp_window)[,'Y'],
             crs = target_crs, datum = target_crs) +
    theme_bw()

Zoom at level 2.5 on Kamchatka

You can try this out with different zoom levels, zoom center points and projections. Note that some combinations of zoom levels and projections will cause projection errors, since some projections are only specified within certain bounds and not for the whole world. I put the full R script in this gist.

¹ Again, we could simplify this in the special case where the zoom point is the center of projection. Also note that I had to wrap each coordinate in st_point(st_coordinates(...)), which seems redundant but was the only way to get it working. It looks like the sf objects lose the CRS information when applying vector operations to them.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog.


RcppArmadillo 0.9.400.2.0

Posted: 29 Apr 2019 06:59 PM PDT

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

armadillo image

A new RcppArmadillo release based on the very recent Armadillo upstream release arrived on CRAN earlier today, and will get to Debian shortly.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 587 other packages on CRAN.

The (upstream-only again this time) changes are listed below:

  • Upgraded to Armadillo release 9.400.2 (Surrogate Miscreant)

    • faster cov() and cor()

    • added .as_col() and .as_row()

    • expanded .shed_rows() / .shed_cols() / .shed_slices() to remove rows/columns/slices specified in a vector

    • expanded vectorise() to handle sparse matrices

    • expanded element-wise versions of max() and min() to handle sparse matrices

    • optimised handling of sparse matrix expressions: sparse % (sparse +- scalar) and sparse / (sparse +- scalar)

    • expanded eig_sym(), chol(), expmat_sym(), logmat_sympd(), sqrtmat_sympd(), inv_sympd() to print a warning if the given matrix is not symmetric

    • more consistent detection of vector expressions

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Could not Resist

Posted: 29 Apr 2019 05:21 PM PDT

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

[Cover image: Practical Data Science with R, 2nd Edition]

Also, Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019 is now content complete! It is deep into editing and soon into production!

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


RStudio 1.2 Released

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Screenshot RStudio 1.2

We're excited to announce the official release of RStudio 1.2!

What's new in RStudio 1.2?

Over a year in the making, this new release of RStudio includes dozens of new productivity enhancements and capabilities. You'll now find RStudio a more comfortable workbench for working in SQL, Stan, Python, and D3. Testing your R code is easier, too, with integrations for shinytest and testthat. Create, test, and publish APIs in R with Plumber. And get more done with background jobs, which let you run R scripts while you work.

Underpinning it all is a new rendering engine based on modern Web standards, so RStudio Desktop looks sharp on displays large and small, and performs better everywhere – especially if you're using the latest Web technology in your visualizations, Shiny applications, and R Markdown documents. Don't like how it looks now? No problem–just make your own theme.

You can read more about what's new this release in the release notes, or our RStudio 1.2 blog series.

Thank you!

We'd like to thank the open source community for helping us to make this release possible. Many of you used the preview releases for your day to day work and gave us invaluable bug reports, ideas, and feedback. We're grateful for your support–thank you for helping us to continue making RStudio the best workbench for data science!

All products based on RStudio have been updated. You can download the new release, RStudio 1.2.1335, here:

Download RStudio 1.2

Feedback on the new release is always welcome in the community forum. Update and let us know what you think!

Coming Soon: RStudio Pro 1.2 General Availability announcement

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


ViennaR Meetup March – Full Talks Online

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on Quantargo Blog, and kindly contributed to R-bloggers)

The full talks of the ViennaR March Meetup are finally online: A short Introduction to ViennaR, Laura Vana introducing R-Ladies Vienna and Hadley Wickham with a great introduction to tidy(er) data and the new functions pivot_wider() and pivot_longer(). Stay tuned for the next ViennaR Meetups!

Introduction

You can download the slides of the introduction here.

Laura Vana: R-Ladies

Laura introduced the R-Ladies Vienna – a program initiated by the R-Consortium to

achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the R community. R-Ladies' primary focus, therefore, is on supporting the R enthusiasts who identify as an underrepresented minority to achieve their programming potential, by building a collaborative global network of R leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.

from https://rladies.org.

Visit the Meetup site to find out more about upcoming R-Ladies events including workshops covering an R Introduction and Bayesian Statistics.

Hadley Wickham: Tidy(er) Data

Hadley Wickham's talk covered the tidyr package and two new functions, pivot_wide() and pivot_long(), which have finally been renamed to pivot_wider() and pivot_longer() as a result of a survey; see also the posts on Twitter and GitHub. These functions should replace gather() and spread(), whose names seem to be hard to remember for most users coming to tidyr.
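A small sketch of the renamed functions on a toy tibble (the data is made up for illustration):

library(tidyr)

# hypothetical wide table: one column per year
wide <- tibble::tibble(country = c("A", "B"), `2018` = c(1, 3), `2019` = c(2, 4))

# wide -> long
long <- pivot_longer(wide, cols = c(`2018`, `2019`),
                     names_to = "year", values_to = "value")

# and back: long -> wide
pivot_wider(long, names_from = year, values_from = value)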

Hadley Wickham

See you at one of our next Meetup/Course/Get-Together!

Cheers,

Your Quantargo Team

To leave a comment for the author, please follow the link and comment on their blog: Quantargo Blog.


Relaunching the qualtRics package

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

Note: cross-posted with the rOpenSci blog.

rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that changed when I heard the call for a new maintainer for the qualtRics package. "IT'S GO TIME," I thought. 😎

Qualtrics is an online survey and data collection software platform. Qualtrics is used across many domains in both academia and industry for online surveys and research, including by me at my day job as a data scientist at Stack Overflow. While users can manually download survey responses from Qualtrics through a browser, importing this data into R is then cumbersome. The qualtRics R package implements the retrieval of survey data using the Qualtrics API and aims to reduce the pre-processing steps needed in analyzing such surveys. This package has been a huge help to me in my real day-to-day work, and I have been so grateful for the excellent work of the package's original authors, including the previous maintainer Jasper Ginn. This package is currently the only package on CRAN that offers functionality like this for Qualtrics' API, and is included in the official Qualtrics API documentation.

The previous maintainer of this package was no longer using Qualtrics in his work life and needed to step away from maintaining qualtRics. Also, early in 2018, the package had been archived on CRAN for a fairly minor issue: using testthat for tests but not having it included in Suggests. So for me, that meant this project would involve several steps: making improvements I wanted to the package for usability (especially with regards to authentication credentials), bringing the package in line with CRAN policies, and getting the package back on CRAN but as a new maintainer of an archived package.

Qualtrics and the Stack Overflow Developer Survey

The timing of when I did the usability work for this package was either wonderful or terrible, depending on your perspective. I did that work concurrently with the analysis for the 2019 Stack Overflow Developer Survey. This survey is the largest and most comprehensive survey of people who code around the world each year, and this year we had almost 90,000 respondents who answered questions covering everything from their favorite technologies to what music they listen to while coding.

The survey is fielded using Qualtrics and I've used the qualtRics package for the past several years as part of my workflow. This year, I made changes to the package while I was working with the Developer Survey data. I'll be honest; it was… A LOT to juggle at one time. 😳 However, it did allow me to test out improvements and changes iteratively with the real world data I had at hand, and I think I ended up with a better result than I would have otherwise. We use Qualtrics for other surveys at Stack Overflow throughout the year, so I am not just a package maintainer but also a year-round, real-life user of this package.

After the bulk of my analysis work was done for the survey, I turned to getting qualtRics back on CRAN. The previous maintainer, who has been unfailingly helpful throughout this process, sent an email to CRAN with the details involved (his name, my name, the fact of the package's archiving). Then I submitted the package to CRAN, basically as if it were a new package. I got some feedback from CRAN initially: to use more single quotes in the Description around the word "Qualtrics", plus some remarks about how my examples were surrounded by \dontrun{}, as well as a reminder not to have my package write to the user's home filespace. I was somewhat puzzled by the last items, because nothing in this package writes to the user's home filespace, and the examples all call the API and thus need authentication (which is clearly explained in the documentation/examples). I was somewhat worried 😩 but I ended up responding via email and submitting again with the same text in CRAN comments, explaining those points. Fortunately, the package was accepted to CRAN that time!

GOOD IDEA/BAD IDEA

Changing the names of all the functions in a package already being used in the real world is, generally, a bad idea. I did it anyway, everybody. Let me try to convince you that this was the right call.

  • When I took over as package maintainer, the qualtRics package had been off CRAN for about nine months. If ever there is an appropriate time for a revamp in user-facing function names, it's a time like this, when there is a long absence from a centralized repository.
  • The most important user-facing functions in this package previously had names like getSurvey() and getSurveys(), which users (including me) found difficult to distinguish between. These functions called the Qualtrics API, but not the part of the API that Qualtrics has named getSurvey. It was a confusing situation.
  • When this package was first reviewed by rOpenSci, the original maintainer noted that the function names did not conform to rOpenSci standards but he preferred to keep them as is since the package was already on CRAN. Opening up the possibility of this kind of more drastic change now allowed me to bring the function and argument names within rOpenSci standards.

The old versions of the functions currently still work but give a warning, suggesting that the user try out the new versions. If this change causes more than a passing inconvenience to you as a user, I AM SO SORRY. I believe this decision will set up the qualtRics package for improved usability in the long term.
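To get oriented with the renamed interface, here is a minimal usage sketch (the credential values are placeholders, and the argument names are quoted from memory, so they may differ slightly from the released version):

library(qualtRics)

# store API credentials (placeholders, not real values)
qualtrics_api_credentials(api_key = "<YOUR-QUALTRICS-API-KEY>",
                          base_url = "<YOUR-DATACENTER-ID>.qualtrics.com",
                          install = TRUE)

# list available surveys, then download the responses of the first one
surveys <- all_surveys()
mysurvey <- fetch_survey(surveyID = surveys$id[1], verbose = TRUE)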

Explore the new qualtRics!

Another change with this release is a pkgdown site for qualtRics. I used the custom pkgdown template rotemplate created for rOpenSci packages to style the site, and I am so pleased with the results! If you are a new user of the qualtRics package, or someone who wants to poke around and see what's changed and how to get up to speed with the update, I recommend checking out this spiffy new site for all the details. 🚀

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


Relaunching the qualtRics package

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that changed when I heard the call for a new maintainer for the qualtRics package. "IT'S GO TIME," I thought. ๐Ÿ˜Ž

Qualtrics is an online survey and data collection software platform. Qualtrics is used across many domains in both academia and industry for online surveys and research, including by me at my day job as a data scientist at Stack Overflow. While users can manually download survey responses from Qualtrics through a browser, importing this data into R is then cumbersome. The qualtRics R package implements the retrieval of survey data using the Qualtrics API and aims to reduce the pre-processing steps needed in analyzing such surveys. This package has been a huge help to me in my real day-to-day work, and I have been so grateful for the excellent work of the package's original authors, including the previous maintainer Jasper Ginn. This package is currently the only package on CRAN that offers functionality like this for Qualtrics' API, and is included in the official Qualtrics API documentation.

The previous maintainer of this package was no longer using Qualtrics in his work life and needed to step away from maintaining qualtRics. Also, early in 2018, the package had been archived on CRAN for a fairly minor issue: using testthat for tests but not having it included in Suggests. So for me, this project would involve several steps: making the usability improvements I wanted (especially with regard to authentication credentials), bringing the package in line with CRAN policies, and getting the package back on CRAN as the new maintainer of an archived package.

Qualtrics and the Stack Overflow Developer Survey

The timing of when I did the usability work for this package was either wonderful or terrible, depending on your perspective. I did that work concurrently with the analysis for the 2019 Stack Overflow Developer Survey. This survey is the largest and most comprehensive survey of people who code around the world each year, and this year we had almost 90,000 respondents who answered questions covering everything from their favorite technologies to what music they listen to while coding.

The survey is fielded using Qualtrics and I've used the qualtRics package for the past several years as part of my workflow. This year, I made changes to the package while I was working with the Developer Survey data. I'll be honest; it was… A LOT to juggle at one time. 😳 However, it did allow me to test out improvements and changes iteratively with the real world data I had at hand, and I think I ended up with a better result than I would have otherwise. We use Qualtrics for other surveys at Stack Overflow throughout the year, so I am not just a package maintainer but also a year-round, real-life user of this package.

After the bulk of my analysis work for the survey was done, I turned to getting qualtRics back on CRAN. The previous maintainer, who has been unfailingly helpful throughout this process, sent an email to CRAN with the relevant details (his name, my name, the fact of the package's archiving). Then I submitted the package to CRAN, basically as if it were a new package. The initial feedback from CRAN asked me to use more single quotes in the Description around the word "Qualtrics", included some remarks about my examples being wrapped in \dontrun{}, and reminded me not to have my package write to the user's home filespace. I was somewhat puzzled by the last two items, because nothing in this package writes to the user's home filespace, and the examples all call the API and thus need authentication (which is clearly explained in the documentation/examples). I was somewhat worried 😩 but I ended up responding via email and submitting again with the same text in my CRAN comments, explaining those points. Fortunately, the package was accepted to CRAN that time!

GOOD IDEA/BAD IDEA

Changing the names of all the functions in a package already being used in the real world is, generally, a bad idea. I did it anyway, everybody. Let me try to convince you that this was the right call.

  • When I took over as package maintainer, the qualtRics package had been off CRAN for about nine months. If ever there is an appropriate time for a revamp in user-facing function names, it's a time like this, when there is a long absence from a centralized repository.
  • The most important user-facing functions in this package previously had names like getSurvey() and getSurveys(), which users (including me) found difficult to distinguish between. These functions called the Qualtrics API, but not the part of the API that Qualtrics has named getSurvey. It was a confusing situation.
  • When this package was first reviewed by rOpenSci, the original maintainer noted that the function names did not conform to rOpenSci standards but he preferred to keep them as is since the package was already on CRAN. Opening up the possibility of this kind of more drastic change now allowed me to bring the function and argument names within rOpenSci standards.

The old versions of the functions currently still work but give a warning, suggesting that the user try out the new versions. If this change causes more than a passing inconvenience to you as a user, I AM SO SORRY. I believe this decision will set up the qualtRics package for improved usability in the long term.
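For the curious, the mechanics of this kind of soft deprecation are straightforward. The snippet below is only a hypothetical sketch, not the package's actual source; the new-style name fetch_survey() is used purely for illustration.

# Hypothetical sketch of a soft-deprecation wrapper (not the real qualtRics code):
# the old name keeps working but nudges users toward the new function.
getSurvey <- function(...) {
  .Deprecated("fetch_survey", package = "qualtRics")
  fetch_survey(...)
}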

Explore the new qualtRics!

Another change with this release is a pkgdown site for qualtRics. I used the custom pkgdown template rotemplate created for rOpenSci packages to style the site, and I am so pleased with the results! If you are a new user of the qualtRics package, or someone who wants to poke around and see what's changed and how to get up to speed with the update, I recommend checking out this spiffy new site for all the details. 🚀

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


What matters more in a cluster randomized trial: number or size?

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I am involved with a trial of an intervention designed to prevent full-blown opioid use disorder for patients who may have an incipient opioid use problem. Given the nature of the intervention, it was clear the only feasible way to conduct this particular study is to randomize at the physician rather than the patient level.

There was a concern that the number of patients eligible for the study might be limited, so that each physician might only have a handful of patients able to participate, if that many. A question arose: could we make up for this limitation by increasing the number of physicians who participate? That is, what is the trade-off between the number of clusters and cluster size?

This is a classic issue that confronts any cluster randomized trial – made more challenging by the potentially very small cluster sizes. A primary concern of the investigators is having sufficient power to estimate an intervention effect – how would this trade-off impact that? And as a statistician, I have concerns about bias and variance, which could have important implications depending on what you are interested in measuring.

Clustering in a nutshell

This is an immense topic – I won't attempt to point you to the best resources, because there are so many out there. For me, there are two salient features of cluster randomized trials that present key challenges.

First, individuals in a cluster are not providing as much information as we might imagine. If we take the extreme case where the outcome of everyone in a cluster is identical, we learn absolutely nothing by taking an additional subject from that cluster; in fact, all we need is one subject per cluster, because all the variation is across clusters, not within. Of course, that is overly dramatic, but the same principle is in play even when the outcomes of subjects in a cluster are only moderately correlated. The impact of this phenomenon depends on the size of the within-cluster correlation relative to the between-cluster correlation, a relationship traditionally characterized by the intra-class correlation coefficient (ICC), which is the ratio of the between-cluster variation to the total variation.
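To put that in the notation of the model defined below, the ICC is

\[ ICC = \frac{\sigma^2_c}{\sigma^2_c + \sigma^2_\epsilon}\ ,\]

so, for example, a target ICC of 0.10 combined with a within-cluster variance of 0.90 (the values used in the simulations that follow) implies a between-cluster variance of 0.10.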

Second, if there is high variability across clusters, that variability gets propagated to the variance of the estimate of the treatment effect. From study to study (which is what we are conceiving of in a frequentist frame of mind), we are not just sampling individuals from the clusters, we are also changing the sample of clusters that we are selecting from! So much variation going on. Of course, if all clusters are exactly the same (i.e. no variation between clusters), then it doesn't really matter which clusters we are choosing from each time around, and sampling from different clusters adds no variability. But as we relax this assumption of no between-cluster variability, we add overall variability to the process, which gets translated to our parameter estimates.

The cluster size/cluster number trade-off is driven largely by these two issues.

Simulation

I am generating data from a cluster randomized trial that has the following underlying data generating process:

\[ Y_{ij} = 0.35 * R_j + c_j + \epsilon_{ij}\ ,\]
where \(Y_{ij}\) is the outcome for patient \(i\) who is being treated by physician \(j\). \(R_j\) represents the treatment indicator for physician \(j\) (0 for control, 1 for treatment). \(c_j\) is the physician-level random effect that is normally distributed \(N(0, \sigma^2_c)\). \(\epsilon_{ij}\) is the individual-level effect, and \(\epsilon_{ij} \sim N(0, \sigma^2_\epsilon)\). The expected value of \(Y_{ij}\) for patients treated by physicians in the control group is \(0\). And for the patients treated by physicians in the intervention \(E(Y_{ij}) = 0.35\).

Defining the simulation

The entire premise of this post is that we have a target number of study subjects (which in the real world example was set at 480), and the question is should we spread those subjects across a smaller or larger number of clusters? In all the simulations that follow, then, we have fixed the total number of subjects at 480. That means if we have 240 clusters, there will be only 2 in each one; and if we have 10 clusters, there will be 48 patients per cluster.
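A quick back-of-the-envelope check of that arithmetic (not part of the original simulation code):

# with the total fixed at 480 subjects, cluster size is just 480 / (number of clusters)
480 / c(10, 60, 240)

## [1] 48  8  2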

In the first example shown here, we are assuming an ICC = 0.10 and 60 clusters of 8 subjects each:

library(simstudy)
library(data.table)   # used below for as.data.table(), data.table(), and rbindlist()

Var <- iccRE(0.10, varWithin = 0.90, dist = "normal")

defC <- defData(varname = "ceffect", formula = 0, variance = Var,
                dist = "normal", id = "cid")
defC <- defData(defC, "nperc", formula = "8",
                dist = "nonrandom")

defI <- defDataAdd(varname = "y", formula = "ceffect + 0.35 * rx",
                   variance = 0.90)

Generating a single data set and estimating parameters

Based on the data definitions, I can now generate a single data set:

set.seed(711216)

dc <- genData(60, defC)
dc <- trtAssign(dc, 2, grpName = "rx")

dd <- genCluster(dc, "cid", numIndsVar = "nperc", level1ID = "id")
dd <- addColumns(defI, dd)

dd
##      cid rx  ceffect nperc  id     y
##   1:   1  0  0.71732     8   1  2.94
##   2:   1  0  0.71732     8   2  1.23
##   3:   1  0  0.71732     8   3  1.33
##   4:   1  0  0.71732     8   4  1.11
##   5:   1  0  0.71732     8   5  1.15
##  ---
## 476:  60  1 -0.00034     8 476 -0.52
## 477:  60  1 -0.00034     8 477 -1.10
## 478:  60  1 -0.00034     8 478  1.83
## 479:  60  1 -0.00034     8 479 -0.27
## 480:  60  1 -0.00034     8 480 -0.25

We use a linear mixed effect model to estimate the treatment effect and variation across clusters:

library(lmerTest)

lmerfit <- lmer(y ~ rx + (1 | cid), data = dd)

Here are the estimates of the random and fixed effects:

as.data.table(VarCorr(lmerfit))
##         grp        var1 var2  vcov sdcor
## 1:      cid (Intercept)      0.098  0.31
## 2: Residual                  0.797  0.89
coef(summary(lmerfit))
##             Estimate Std. Error df t value Pr(>|t|)
## (Intercept)    0.013      0.081 58    0.16   0.8720
## rx             0.333      0.115 58    2.90   0.0053

And here is the estimated ICC, which happens to be close to the "true" ICC of 0.10 (which is definitely not a sure thing given the relatively small sample size):

library(sjstats)

icc(lmerfit)
## 
## Intraclass Correlation Coefficient for Linear mixed model
## 
## Family : gaussian (identity)
## Formula: y ~ rx + (1 | cid)
## 
##   ICC (cid): 0.1098
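As a rough sanity check, roughly the same number falls out of the variance components estimated above; the small gap relative to the 0.1098 reported by icc() is just rounding of the printed components:

# ICC by hand: between-cluster variance / (between-cluster + residual variance)
round(0.098 / (0.098 + 0.797), 3)

## [1] 0.109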

A deeper look at the variation of estimates

In these simulations, we are primarily interested in investigating the effect of different numbers of clusters and different cluster sizes on power, variation, bias (and mean square error, which is a combined measure of variance and bias). This means replicating many data sets and studying the distribution of the estimates.

To do this, it is helpful to create a function that generates the data:

reps <- function(nclust) {
  
  dc <- genData(nclust, defC)
  dc <- trtAssign(dc, 2, grpName = "rx")
  
  dd <- genCluster(dc, "cid", numIndsVar = "nperc", level1ID = "id")
  dd <- addColumns(defI, dd)
  
  lmerTest::lmer(y ~ rx + (1 | cid), data = dd)
}

And here is a function to check if p-values from model estimates are less than 0.05, which will come in handy later when estimating power:

pval <- function(x) {
  coef(summary(x))["rx", "Pr(>|t|)"] < 0.05
}

Now we can generate 1000 data sets and fit a linear mixed effects model to each one, storing the results in an R list:

library(parallel)

res <- mclapply(1:1000, function(x) reps(60))

Extracting information from all 1000 model fits provides an estimate of power:

mean(sapply(res, function(x) pval(x)))
## [1] 0.82

And here are estimates of bias, variance, and root mean square error of the treatment effect estimates. We can see that, in this case, the estimated treatment effect is not particularly biased:

RX <- sapply(res, function(x) getME(x, "fixef")["rx"])

c(true = 0.35, avg = mean(RX), var = var(RX),
  bias = mean(RX - 0.35), rmse = sqrt(mean((RX - 0.35)^2)))
##   true    avg    var   bias   rmse 
## 0.3500 0.3516 0.0149 0.0016 0.1219
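As a small aside, the three summaries hang together, since (up to rounding) the squared RMSE equals the variance plus the squared bias:

# RMSE recovered from the variance and bias reported just above
round(sqrt(0.0149 + 0.0016^2), 4)

## [1] 0.1221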

And if we are interested in seeing how well we measure the between-cluster variation, we can evaluate that as well. The true variance (used to generate the data) was 0.10, and the average of the estimates was about 0.10, quite close:

RE <- sapply(res, function(x) as.numeric(VarCorr(x)))

c(true = Var, avg = mean(RE), var = var(RE),
  bias = mean(RE - Var), rmse = sqrt(mean((RE - Var)^2)))
##   true    avg    var   bias   rmse 
## 0.1000 0.1005 0.0017 0.0005 0.0412

Replications under different scenarios

Now we are ready to put all of this together for one final experiment to investigate the effects of the ICC and of the cluster number/size trade-off on power, variance, and bias. I generated 2000 data sets for each combination of assumptions about the number of clusters (ranging from 10 to 240) and the ICC (ranging from 0 to 0.15). For each combination, I estimated the variance and bias of the treatment effect parameter estimates and of the between-cluster variance. (I include the code in case anyone needs to do something similar.)

ps <- list()
pn <- 0

nclust <- c(10, 20, 30, 40, 48, 60, 80, 96, 120, 160, 240)
iccs <- c(0, 0.02, 0.05, 0.10, 0.15)

for (s in seq_along(nclust)) {
  for (i in seq_along(iccs)) {
    
    newvar <- iccRE(iccs[i], varWithin = .90, dist = "normal")
    newperc <- 480 / nclust[s]
    
    defC <- updateDef(defC, "ceffect", newvariance = newvar)
    defC <- updateDef(defC, "nperc", newformula = newperc)
    
    res <- mclapply(1:2000, function(x) reps(nclust[s]))
    RX <- sapply(res, function(x) getME(x, "fixef")["rx"])
    RE <- sapply(res, function(x) as.numeric(VarCorr(x)))
    power <- mean(sapply(res, function(x) pval(x)))
    
    pn <- pn + 1
    ps[[pn]] <- data.table(nclust = nclust[s],
                           newperc,
                           icc = iccs[i],
                           newvar,
                           power,
                           biasRX = mean(RX - 0.35),
                           varRX = var(RX),
                           rmseRX = sqrt(mean((RX - 0.35)^2)),
                           avgRE = mean(RE),
                           biasRE = mean(RE - newvar),
                           varRE = var(RE),
                           rmseRE = sqrt(mean((RE - newvar)^2))
    )
  }
}

ps <- rbindlist(ps)

First, we can take a look at the power. Clearly, for lower ICCs, there is little marginal gain after a threshold somewhere between 60 and 80 clusters; with the higher ICCs, a study might benefit, with respect to power, from adding more clusters (and reducing cluster size):

library(ggplot2)   # for ggplot(); not attached by ggthemes
library(ggthemes)
library(scales)

ggplot(data = ps, aes(x = nclust, y = power, group = icc)) +
  geom_smooth(aes(color = factor(icc)), se = FALSE) +
  theme(panel.grid.minor = element_blank()) +
  scale_color_ptol(name = "ICC", labels = number(iccs, accuracy = .01)) +
  scale_x_continuous(name = "number of clusters", breaks = nclust)

Not surprisingly, the same picture emerges (only in reverse) when looking at the variance of the estimate for treatment effect. Variance declines quite dramatically as we increase the number of clusters (again, reducing cluster size) up to about 60 or so, and little gain in precision beyond that:

If we are interested in measuring the variation across clusters (which was \(\sigma^2_c\) in the model), then a very different picture emerges. First, the plot of RMSE (which is \(E[(\hat{\theta} - \theta)^2]^{\frac{1}{2}}\), where \(\theta = \sigma^2_c\)) indicates that, after a certain point, increasing the number of clusters may actually be a bad idea.

The trends of RMSE are mirrored by the variance of \(\hat{\sigma^2_c}\):

I show the bias of the variance estimate, because it highlights the point that it is very difficult to get an unbiased estimate of \(\sigma^2_c\) when the ICC is low, particularly with a large number of clusters with small cluster sizes. This may not be so surprising, because with small cluster sizes it may be more difficult to estimate the within-cluster variance, an important piece of the total variation.

Almost an addendum

I've focused entirely on the direct trade-off between the number of clusters and cluster size, because that was the question raised by the study that motivated this post. However, we may have a fixed number of clusters, and we might want to know if it makes sense to recruit more subjects from each cluster. To get a picture of this, I re-ran the simulations with 60 clusters, evaluating power and variance of the treatment effect estimator at cluster sizes ranging from 5 to 60.

Under the assumptions used here, it also looks like there is a point after which little can be gained by adding subjects to each cluster (at least in terms of both power and precision of the estimate of the treatment effect):

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Compute R2s and other performance indices for all your models!

Posted: 29 Apr 2019 05:00 PM PDT

(This article was first published on R on easystats, and kindly contributed to R-bloggers)

Indices of model performance (i.e., model quality, goodness of fit, predictive accuracy, etc.) are very important, both for model comparison and for model description purposes. However, their computation or extraction for a wide variety of models can be complex.

To address this, please let us introduce the performance package!

performance

We have recently decided to collaborate around the new easystats project, a set of packages designed to make your life easier (currently WIP). This project encompasses several packages, devoted for instance to model access or Bayesian analysis, as well as indices of model performance.

The goal of performance is to provide lightweight tools to assess and check the quality of your model. It includes functions such as R2 for many models (including logistic, mixed and Bayesian models), ICC, and helpers to check convergence, overdispersion or zero-inflation. See the list of functions here.

performance can be installed as follows:

install.packages("performance")  # Install the package
library(performance)             # Load it

Examples

Mixed Models

First, we calculate the r-squared value and intra-class correlation coefficient (ICC) for a mixed model, using r2() and icc(). r2() internally calls the appropriate function for the given model; in the case of mixed models, this is r2_nakagawa(), which computes the marginal and conditional r-squared values, while icc() calculates an adjusted and a conditional ICC, both based on the proposals from Nakagawa et al. 2017. For more details on the computation of the variances, see get_variance().

# Load the lme4 package
library(lme4)

# Fit a mixed model
model <- lmer(Sepal.Width ~ Petal.Length + (1|Species), data = iris)

# compute R2, based on Nakagawa et al. 2017
r2(model)
> # R2 for mixed models
> 
>   Conditional R2: 0.913
>      Marginal R2: 0.216
# compute intra-class correlation coefficient (ICC)
icc(model)
> # Intraclass Correlation Coefficient
> 
>      Adjusted ICC: 0.889
>   Conditional ICC: 0.697

Now let's compute all available indices of performance appropriate for a given model. This can be done via model_performance().

# Compute all performance indices
model_performance(model)
>   AIC BIC R2_conditional R2_marginal ICC_adjusted ICC_conditional RMSE
> 1 107 119           0.91        0.22         0.89             0.7 0.31

Bayesian Mixed Models

For Bayesian mixed models, we have the same features available (r-squared, ICC, …). In this example, we focus on the output from model_performance() only.

# Load the rstanarm package
library(rstanarm)

# Fit a Bayesian mixed model
model <- stan_glmer(Sepal.Width ~ Petal.Length + (1|Species), data = iris)

# Compute performance indices
model_performance(model)
>   ELPD ELPD_SE LOOIC LOOIC_SE R2_Median R2_MAD R2_Mean R2_SD R2_MAP
> 1  -44      10    87       20      0.47  0.045    0.47 0.046   0.47
>   R2_CI_low R2_CI_high R2_marginal_Median R2_marginal_MAD R2_marginal_Mean
> 1       0.4       0.55               0.27           0.045             0.27
>   R2_marginal_SD R2_marginal_MAP R2_marginal_CI_low R2_marginal_CI_high
> 1          0.048            0.29               0.19                0.34
>   R2_LOO_adjusted
> 1            0.45
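The dedicated helpers shown earlier should also work directly on this Bayesian fit; a minimal sketch (output omitted here, since it depends on the posterior draws):

# R2 and ICC for the Bayesian mixed model fitted above
r2(model)
icc(model)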

More details about performance's features are coming soon, stay tuned 😉

Get Involved

There is definitely room for improvement, and some new exciting features are already planned. Feel free to let us know how we could further improve this package!

To conclude, note that easystats is a new project in active development. Thus, do not hesitate to contact us if you want to get involved 🙂

  • Check out our other blog posts here!

To leave a comment for the author, please follow the link and comment on their blog: R on easystats.


Marketing with Machine Learning: Apriori

Posted: 29 Apr 2019 04:05 PM PDT

(This article was first published on R – Detroit Data Lab, and kindly contributed to R-bloggers)

This is part 1 of an ongoing series, introduced in Detroit Data Lab Presents: Marketing with Machine Learning

Introduction

Apriori, from the Latin "a priori," means "from the earlier." As with many of our predictions, we're learning from the past and applying it toward the future.

It's the "Hello World" of marketing with machine learning! The simplest application is growing sales by identifying items that are commonly purchased together as part of a set by a customer, segment, or market. Talented marketers can power a variety of strategies with insights from this, including intelligent inventory management, market entry strategies, synergy identification in mergers & acquisitions, and much more.

The Marketing with Machine Learning Starter Kit: A problem, a data specification, a model, and sample code.

First, the problem we're solving. We all understand Amazon's "customers like you also purchased…" so let's branch out. In the example at the bottom, we're looking to help shape our market strategy around a key client. This tactic would be particularly useful as part of an account-based marketing strategy, where a single client is thought of as a "market of one" for independent analysis.

We'll pretend your business has gone to great lengths to understand where one of your customers sources all of its materials, and has mapped it out accordingly. You want to understand your opportunity for expansion. If this customer purchases service X from other customers, are there other strongly correlated services? You could be missing out on an opportunity to land more business, simply by not knowing enough to ask.

We move on to the data specification. There are two common formats for receiving this data. The long dataset, or "single" dataset, looks like this:

The wide dataset, or the "basket" dataset, looks like this:
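The screenshots of those two layouts don't come through here, so as a rough stand-in, here is a toy sketch of the two shapes (the column names and items are purely illustrative, not from the original data):

# "single" (long) layout: one row per item per transaction
long_single <- data.frame(
  transaction = c(1, 1, 1, 2, 2),
  item        = c("pen", "paper", "keyboard", "computer", "soda")
)

# "basket" (wide) layout: one row per transaction, items spread across columns
wide_basket <- data.frame(
  item1 = c("pen", "computer"),
  item2 = c("paper", "soda"),
  item3 = c("keyboard", NA)
)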

CSVs, as usual, are the typical way these are transferred. Direct access to a database or system is always preferred, since we want to tap into pre-existing data pipelines where possible, so the data is already cleaned and scrubbed of any sensitive information.

Next, the model to use. In this case, we're interested in determining the strength of association between purchasing service X and any other services, so we've selected the apriori algorithm. The apriori algorithm has 3 key terms that provide an understanding of what's going on: support, confidence, and lift.

Support is the count of how often items appear together. For example, if there are 3 purchases:

  1. Pen, paper, keyboard
  2. Computer, soda, pen
  3. Pen, marker, TV, paper

Then we say the support for pen + paper is 2, since the group occurred two times.

Confidence is essentially a measure of how much we believe product Y will follow once we know product X is being purchased. In the example above, we would say that buying a pen => buying paper has a confidence of 66.7%, since there are 3 transactions with a pen, and 2 of them also include paper.

Lift is the trickiest one. It's the ratio of your confidence value to your expected confidence value. It often is considered the importance of a rule. Our lift value for the example above is (2/3) / ((3/3) * (2/3)) or 1. A lift value of 1 is actually uneventful or boring, since what it tells us is that the pen is very popular, and therefore the paper purchase probably has nothing to do with the pen being in the transaction. A lift value of greater than 1 implies some significance in the two items, and warrants further investigation.
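None of the following appears in the original walkthrough, but the three measures are easy to sanity-check with the arules package on the toy pen/paper transactions above:

library(arules)

# the three toy transactions from the example above
trans <- as(list(
  c("pen", "paper", "keyboard"),
  c("computer", "soda", "pen"),
  c("pen", "marker", "tv", "paper")
), "transactions")

# permissive thresholds so that the {pen} => {paper} rule shows up
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.1, minlen = 2))

# expect relative support 0.67 (2 of 3), confidence 0.67, and lift 1
inspect(subset(rules, lhs %in% "pen" & rhs %in% "paper"))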

Finally, the code you can use. To keep the introduction simple, we're using R with only a few packages, and nothing more sophisticated than a simple CSV.

Using the Detroit Business Certification Register from the City of Detroit, we'll interpret it as a consumption map. This dataset, with its NIGP Codes, provides us with a transactional set of all categories a business is in.

## Apriori Demonstration code

# Loading appropriate packages before analysis
library(tidyr)
library(dplyr)
library(stringr)
library(arules)

# Read in the entire Detroit Business Register CSV
dfDetBusiness <- read.csv("Detroit_Business_Certification_Register.csv")

# Prepare a new data frame that puts each NIGP code in its own row, by business
dfCodes <- dfDetBusiness %>%
  mutate(Codes = strsplit(as.character(NIGP.Codes), ",")) %>%
  unnest(Codes) %>%
  select(Business.Name, Codes)

# Some rows appear with a decimal place. Strip out the decimal place.
dfCodes$Codes <- str_remove(dfCodes$Codes, "\\.0")

# Also trim the whitespace from any codes if it exists
dfCodes$Codes <- trimws(dfCodes$Codes)

# Write the data
write.csv(dfCodes, file = "TransactionsAll.csv", row.names = FALSE)

# Import as single with arules
objTransactions <- read.transactions("TransactionsAll.csv", format = "single", skip = 1, cols = c(1, 2))

# Glance at the summary; does it make sense?
summary(objTransactions)

# Relative frequency of top 10 items across all transactions
itemFrequencyPlot(objTransactions, topN = 10, type = "relative")

# Generate the association rules; maximum 4 items in a rule
assoc.rules <- apriori(objTransactions, parameter = list(supp = 0.001, conf = 0.8, maxlen = 4))

# Check out the summary of the object
summary(assoc.rules)

# View the top 25 rules
inspect(assoc.rules[1:25])

By plotting the frequency, reviewing the summary of the association rules, and inspecting a few of the rows, you can begin to form a plan for what actions to take.

In this case, we find Rental Equipment (977) and Amusement & Decorations (037) always go together. Similarly, when Acoustical Tile (010) and Roofing Materials (077) are purchased, so are Elevators & Escalators (295). An odd combination!

Conclusion

With these insights uncovered, a marketer can begin to form a strategy. Do you develop a new product? Exit a specific market? Press on the customer to purchase more in a certain area from us, or a strategic partner? The possibilities are endless, but at least you're armed with knowledge to make the decision!

To leave a comment for the author, please follow the link and comment on their blog: R – Detroit Data Lab.
