[R-bloggers] Which Technology Should I Learn? (and 7 more aRticles)


Which Technology Should I Learn?

Posted: 30 Apr 2020 09:00 AM PDT

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers.]

Knowing where to start can be challenging, but we're here to help. Read on to learn more about where to begin on your data science and analytics journey.

Data science and analytics languages

If you're new to data science and analytics, or your organization is, you'll need to pick a language for analyzing your data, as well as a thoughtful way to make that decision. Read our blog post and tutorial to learn how to choose between the two most popular languages for data science—Python and R—or read on for a brief summary.

Python

Python is one of the world's most popular programming languages. It is production-ready, meaning it has the capacity to be a single tool that integrates with every part of your workflow. So whether you want to build a web application or a machine learning model, Python can get you there!

  • General-purpose programming language (can be used to make anything)
  • Widely considered one of the most accessible programming languages to read and learn
  • The language of choice for cutting edge machine learning and AI applications
  • Commonly used for putting models "in production"
  • Has high ease of deployment and reproducibility

R

R has been used primarily in academics and research, but in recent years, enterprise usage has rapidly expanded. Built specifically for working with data, R provides an intuitive interface to the most advanced statistical methods available today.

  • Built specifically for data analysis and visualization
  • Traditionally used by statisticians and academic researchers
  • The language of choice for cutting edge statistics
  • A vast collection of community-contributed packages
  • Rapid prototyping of data-driven apps and dashboards
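
To give a flavor of that workflow, here is a tiny, illustrative-only example (it uses R's built-in iris data rather than anything from this article): a couple of lines are enough to summarise a variable and draw a scatterplot colored by group.

library(ggplot2)

# quick numerical summary of one column of the built-in iris data
summary(iris$Sepal.Length)

# scatterplot of two measurements, colored by species
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()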

SQL

Much of the world's raw data lives in organized collections of tables called relational databases. Data analysts and data scientists must know how to wrangle and extract data from these databases using SQL.

  • Useful for every organization that stores information in databases
  • One of the most in-demand skills in business
  • Used to access, query, and extract structured data which has been organized into a formatted repository, e.g., a database
  • Its scope includes data query, data manipulation, data definition, and data access control
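
Since R is the focus of this digest, it's worth noting that you can run SQL from R via the DBI package. The following sketch is illustrative only: it creates a throwaway in-memory SQLite database (rather than connecting to a real server) and runs a grouped query against it.

library(DBI)

# toy in-memory database; a real project would connect to an existing server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders",
             data.frame(customer = c("A", "A", "B"),
                        amount   = c(10, 25, 40)))

# total amount per customer, computed by the database engine
dbGetQuery(con, "SELECT customer, SUM(amount) AS total
                 FROM orders
                 GROUP BY customer")

dbDisconnect(con)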

Databases

Data scientists, analysts, and engineers must constantly interact with databases, which can store a vast amount of information in tables without slowing down performance. You can use SQL to query data from databases and model different phenomena in your data and the relationships between them. Find out the differences between the most popular databases in our blog post or read on for a summary.

Microsoft SQL Server

  • Commercial relational database management system (RDBMS), built and maintained by Microsoft
  • Available on Windows and Linux operating systems

PostgreSQL

  • Free and open-source RDBMS, maintained by PostgreSQL Global Development Group and its community
  • Beginner-friendly

Oracle Database

  • The most popular RDBMS, used by 97% of Fortune 100 companies
  • Requires knowledge of PL/SQL, an extension of SQL, to access and query data

Spreadsheets

Spreadsheets are used across the business world to transform mountains of raw data into clear insights by organizing, analyzing, and storing data in tables. Microsoft Excel and Google Sheets are the most popular spreadsheet software, with a flexible structure that allows data to be entered in cells of a table.

Google Sheets

  • Free for users
  • Allows collaboration between users via link sharing and permissions
  • Statistical analysis and visualization must be done manually

Microsoft Excel

  • Requires a paid license
  • Not as favorable as Google Sheets for collaboration
  • Contains built-in functions for statistical analysis and visualization

Business intelligence tools

Business intelligence (BI) tools make data discovery accessible for all skill levels—not just advanced analytics professionals. They are one of the simplest ways to work with data, providing the tools to collect data in one place, gain insight into what will move the needle, forecast outcomes, and much more.

Tableau

Tableau is a data visualization software that is like a supercharged Microsoft Excel. Its user-friendly drag-and-drop functionality makes it simple for anyone to access, analyze and create highly impactful data visualizations.

  • A widely used business intelligence (BI) and analytics software trusted by companies like Amazon, Experian, and Unilever
  • User-friendly drag-and-drop functionality
  • Supports multiple data sources including Microsoft Excel, Oracle, Microsoft SQL, Google Analytics, and SalesForce

Microsoft Power BI

Microsoft Power BI allows users to connect and transform raw data, add calculated columns and measures, create simple visualizations, and combine them to create interactive reports.

  • Web-based tool that provides real-time data access
  • User-friendly drag-and-drop functionality
  • Leverages existing Microsoft systems like Azure, SQL, and Excel

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.


Why R? Webinar – Development pipeline for R production – rZYPAD

Posted: 30 Apr 2020 07:00 AM PDT

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers.]

April 30th (8:00 pm GMT+2) is the date of another webinar on the Why R? Foundation YouTube channel. We will have a talk by Lorenzo Braschi from Roche IT. The title of the meeting is rZYPAD: Development pipeline for R production.

See you on the Webinar!

Details

Next talks

Previous talks

Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R) – Recent changes in R spatial and how to be ready for them. Video

Heidi Seibold, Department of Statistics, University of Munich (in collaboration with the LMU Open Science Center) – Teaching Machine Learning online. Video

Olgun Aydin (PwC Poland) – Introduction to shinyMobile. Video

Achim Zeileis (Universität Innsbruck) – R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video

Stay up to date

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.


Z is for Additional Axes

Posted: 30 Apr 2020 07:00 AM PDT

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]

Here we are at the last post in Blogging A to Z! Today, I want to talk about adding additional axes to your ggplot, using the options for fill or color. While these aren't true z-axes in the geometric sense, I think of them as a third, z, axis.

Some of you may be surprised to learn that fill and color are different, and that you could use one or both in a given plot.

Color refers to the outline of the object (bar, pie chart wedge, etc.), while fill refers to the inside of the object. For scatterplots, the default shape doesn't have a fill, so you'd just use color to change the appearance of those points.
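
To make that distinction concrete, here is a minimal illustration (not part of the original post) using the built-in mtcars data: color draws the outline of each bar, while fill paints its inside.

library(ggplot2)
ggplot(mtcars, aes(factor(cyl))) +
  geom_bar(color = "darkblue", fill = "lightblue")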

Let's recreate the pages read over 2019 chart, but this time, I'll just use fiction books and separate them as either fantasy or other fiction; this divides that dataset pretty evenly in half. Here's how I'd generate the pages read over time separately by those two genre categories.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
##  ggplot2 3.2.1      purrr   0.3.3
## tibble 2.1.3 dplyr 0.8.3
## tidyr 1.0.0 stringr 1.4.0
## readr 1.3.1 forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
col_names = TRUE)
## Parsed with column specification:
## cols(
## Title = col_character(),
## Pages = col_double(),
## date_started = col_character(),
## date_read = col_character(),
## Book.ID = col_double(),
## Author = col_character(),
## AdditionalAuthors = col_character(),
## AverageRating = col_double(),
## OriginalPublicationYear = col_double(),
## read_time = col_double(),
## MyRating = col_double(),
## Gender = col_double(),
## Fiction = col_double(),
## Childrens = col_double(),
## Fantasy = col_double(),
## SciFi = col_double(),
## Mystery = col_double(),
## SelfHelp = col_double()
## )
fantasy <- reads2019 %>%
  filter(Fiction == 1) %>%
  mutate(date_read = as.Date(date_read, format = '%m/%d/%Y'),
         Fantasy = factor(Fantasy, levels = c(0,1),
                          labels = c("Other Fiction",
                                     "Fantasy"))) %>%
  group_by(Fantasy) %>%
  mutate(GenreRead = order_by(date_read, cumsum(Pages))) %>%
  ungroup()

Now I'd just plug that information into my ggplot code, but add a third variable in the aesthetics (aes) for ggplot – color = Fantasy.

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
myplot <- fantasy %>%
  ggplot(aes(date_read, GenreRead, color = Fantasy)) +
  geom_point() +
  xlab("Date") +
  ylab("Pages") +
  scale_x_date(date_labels = "%b",
               date_breaks = "1 month") +
  scale_y_continuous(labels = comma, breaks = seq(0,30000,5000)) +
  labs(color = "Genre of Fiction")

This plot uses the default R color scheme. I could change those colors, using an existing color scheme or defining my own. Let's make a fivethirtyeight-style figure, using their theme for the overall plot and their color scheme for the genre variable.

library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.3
myplot +
scale_color_fivethirtyeight() +
theme_fivethirtyeight()

I can also specify my own colors.

myplot +
scale_color_manual(values = c("#4b0082","#ffd700")) +
theme_minimal()

geom_point offers many point shapes; shapes 21-25 allow you to specify both color and fill, but for the rest, you only use color.
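
For example (assuming the packages loaded above and using mtcars purely as a stand-in), a fillable shape can take both aesthetics:

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, color = "black", fill = "gold", size = 3)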

library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.6.3
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
ggpubr::show_point_shapes()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

Of course, you may have plots where changing fill is best, such as on a bar plot. In my summarize example, I created a stacked bar chart of fiction versus non-fiction with author gender as the fill.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col() +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

Stacking is the default, but I could also have the bars next to each other.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col(position = "dodge") +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

You can also use fill (or color) with the same variable you used for x or y; that is, instead of having it be a third scale, it could add some color and separation to distinguish categories from the x or y variable. This is especially helpful if you have multiple categories being plotted, because it helps break up the wall of bars. If you do this, I'd recommend choosing a color palette with highly complementary colors, rather than highly contrasting ones; you probably also want to drop the legend, though, since the axis will also be labeled.

genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction",
                      "General Fiction",
                      "Mystery",
                      "Science Fiction",
                      "Fantasy",
                      "Fantasy Sci-Fi",
                      "Children's Fiction",
                      "Children's Fantasy"))

genres %>%
  filter(Genre != "Non-Fiction") %>%
  ggplot(aes(reorder(Genre, -Books), Books, fill = Genre)) +
  geom_col() +
  xlab("Genre") +
  scale_x_discrete(labels = function(x){sub("\\s", "\n", x)}) +
  scale_fill_economist() +
  theme(legend.position = "none")

If you only have a couple categories and want to draw a contrast, that's when you can use contrasting shades: for instance, at work, when I plot performance on an item, I use red for incorrect and blue for correct, to maximize the contrast between the two performance levels for whatever data I'm presenting.

I hope you enjoyed this series! There's so much more you can do with tidyverse than what I covered this month. Hopefully this has given you enough to get started and sparked your interest to learn more. Once again, I highly recommend checking out R for Data Science.

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


Expert opinion (again)

Posted: 29 Apr 2020 05:00 PM PDT

[This article was first published on R | Gianluca Baio, and kindly contributed to R-bloggers.]

This is the second video I was mentioning here; it took a while to get out, but it's available now. I think you need to register here and then you can see our panel discussion. Like I said earlier, it was good fun, and I think the actual session we did at ISPOR last year was very well received. It's a shame that we can't build on the momentum at the next R-HTA (which, I think, we're going to have to postpone, given the COVID-19 emergency…).

To leave a comment for the author, please follow the link and comment on their blog: R | Gianluca Baio.


Highlights of Hugo Code Highlighting

Posted: 29 Apr 2020 05:00 PM PDT

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

Thanks to a quite overdue update of Hugo on our build system¹, our website can now harness the full power of Hugo code highlighting for Markdown-based content.
What's code highlighting apart from the reason behind a tongue-twister in this post title?
In this post we shall explain how Hugo's code highlighter, Chroma, helps you prettify your code (i.e. syntax highlighting), and accentuate parts of your code (i.e. line highlighting).

Make your code look pretty

If you notice and appreciate the difference between

a <- c(1:7, NA)  mean(a, na.rm = TRUE)  

and

a <- c(1:7, NA)  mean(a, na.rm = TRUE)  

you might agree with Mara Averick's opinion,

Syntax highlighting! 🖍 Just do it. Life is better when things are colourful.

Syntax highlighting means some elements of code blocks, like functions, operators, comments, etc. get styled differently: they could be colored or in italic.

Now, how do the colors of the second block appear?

First of all, it's a code block with language information, in this case R (note the r after the backticks),

```r
a <- c(1:7, NA)
mean(a, na.rm = TRUE)
```

as opposed to

```
a <- c(1:7, NA)
mean(a, na.rm = TRUE)
```

without language information, that won't get highlighted – although some syntax highlighting tools, not Hugo Chroma, do some guessing.

There are in general two ways in which colors are added to code blocks, client-side syntax highlighting and server-side syntax highlighting.
The latter is what Hugo supports nowadays but let's dive into both for the sake of completeness (or because I'm proud I now get it²).

Client-side syntax highlighting

In this sub-section I'll mostly refer to highlight.js but principles probably apply to other client-side syntax highlighting tools.
The "client-side" part of this phrase is that the html that is served by your website host does not have styling for the code.
In highlight.js case, styling appears after a JS script is loaded and applied.

If we look at a post of Mara Averick's at the time of writing, the html of a block is just

<pre class="r"><code>pal_a <- extract_colours("https://i.imgur.com/FyEALqr.jpg", num_col = 8)
par(mfrow = c(1,2))
pie(rep(1, 8), col = pal_a, main = "Palette based on Archer Poster")
hist(Nile, breaks = 8, col = pal_a, main = "Palette based on Archer Poster")</code></pre>

Now, using Firefox Developer Console,


[Screenshot of blog post with Firefox Developer Console open]

we see colors come from CSS classes starting with "hljs".

And in the head of that page (examined via "View source"), there's

<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>

which is the part loading and applying highlight.js to the page.
Now, how does it know what's for instance a string in R?
If we look at highlight.js highlighter for the R language, authored by Joe Cheng in 2012, it's a bunch of regular expressions, see for instance the definition of a string.

className: 'string',
contains: [hljs.BACKSLASH_ESCAPE],
variants: [
  {begin: '"', end: '"'},
  {begin: "'", end: "'"}
]

When using highlight.js on your website, you might need to specify R as a supplementary language in your config, since some languages are bundled by default whilst others are not.
You could also whip up some code to conditionally load supplementary highlight.js languages.

A big downside of client-side syntax highlighting is loading time:
it appears quite fast if your internet connection isn't poor, but you might have noticed code blocks changing aspect when loading a web page (first not styled, then styled).
Moreover, Hugo now supports, and uses by default, an alternative that we'll describe in the following subsection and take advantage of in this post's second section.

Server-side syntax highlighting

In server-side syntax highlighting, with say Pygments or Chroma (Hugo default), your website html as served already has styling information.

With Chroma, that styling information is either:

  • hard-coded in-line in the html of your pages³, as in the screenshot below;

[Screenshot of blog post with Firefox Developer Console open]

The html source for one of the blocks of the page screenshot above is

<div class="highlight"><pre style="-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">df <span style="color:#666">%>%</span>
  <span style="color:#00f">group_by</span>(g1, g2) <span style="color:#666">%>%</span>
  <span style="color:#00f">summarise</span>(a <span style="color:#666">=</span> <span style="color:#00f">mean</span>(a), b <span style="color:#666">=</span> <span style="color:#00f">mean</span>(b), c <span style="color:#666">=</span> <span style="color:#00f">mean</span>(c), d <span style="color:#666">=</span> <span style="color:#00f">mean</span>(c))
</code></pre></div>

The style used is indicated in the website config and picked from Chroma style gallery.

  • via the use of CSS classes also indicated in html, as is the case of this website.


[Screenshot of blog post with Firefox Developer Console open]

The html of the block seen above is

<div class="highlight"><pre class="chroma"><code class="language-r" data-lang="r"><span class="nf">install.packages</span><span class="p">(</span><span class="s">"parzer"</span><span class="p">,</span> <span class="n">repos</span> <span class="o">=</span> <span class="s">"https://dev.ropensci.org/"</span><span class="p">)</span>
</code></pre></div>

and it goes hand in hand with having styling for different ".chroma" classes in our website CSS.

.chroma .s { color: #a3be8c }  

To have this behaviour, in our website config there's

pygmentsUseClasses=true  

which, confusingly enough, uses the name "Pygments", not Chroma, for historical reasons.
You'd use CSS like we do if none of Chroma's default styles suited you, if you wanted to make sure the style colors respect WCAG color contrast guidelines (see the last section), or if you wanted to add a button switching the CSS applied to the classes, which we did for this note using a dev.to post by Alberto Montalesi⁴.
Click the button below!
It will also let you switch back to light mode.


To generate a stylesheet for a given style, use Hugo's hugo gen chromastyles --style=monokai > syntax.css command.
You can then use the stylesheet as is, or tweak it.

How does Chroma know which parts of the code are of the string class, for instance?
Once again, regular expressions help, in this case in what is called a lexer.
Chroma is inspired by Pygments, and in Pygments docs it is explained that "A lexer splits the source into tokens, fragments of the source that have a token type that determines what the text represents semantically (e.g., keyword, string, or comment)."
In the R lexer, ported from Pygments to Chroma by Chroma maintainer Alec Thomas, for strings we see, for example,

{`\'`, LiteralString, Push("string_squote")},
{`\"`, LiteralString, Push("string_dquote")},
// ... code
"string_squote": {
    {`([^\'\\]|\\.)*\'`, LiteralString, Pop(1)},
},
"string_dquote": {
    {`([^"\\]|\\.)*"`, LiteralString, Pop(1)},
},

Chroma works on Markdown content, so if you use blogdown to generate pages as html, you can only use client-side highlighting, like this tidyverse.org page whose source is html.
By default nowadays Hugo does server-side syntax highlighting but you could choose to turn it off via codeFences = false.

We have now seen how Hugo websites have syntax highlighting, which for Yihui Xie "is only for cosmetic purposes".
Well, Chroma actually also offers one thing more: line numbering and line highlighting!

Emphasize parts of your code

With Chroma, you can apply special options to code blocks defined with fences, i.e. starting with three backticks and language info, and ending with three backticks⁵.

On Chroma options for line highlighting

See how

```r {hl_lines=[1,"4-5"]}
library("dplyr")
df %>%
  mutate(date = lubridate::ymd(date_string)) %>%
  select(- date_string)
str(df)
nrow(df)
```

is rendered below: lines 1 and 4 to 5 are highlighted.

library("dplyr")
df %>%
  mutate(date = lubridate::ymd(date_string)) %>%
  select(- date_string)
str(df)
nrow(df)

There are also options related to line numbering.

```r {hl_lines=[1,"4-5"],linenos=table,linenostart=3}
library("dplyr")
df %>%
  mutate(date = lubridate::ymd(date_string)) %>%
  select(- date_string)
str(df)
nrow(df)
```

gives a code block with line numbered as table (easier for copy-pasting the code without line numbers), starting from number 3.

3  library("dplyr")
4  df %>%
5    mutate(date = lubridate::ymd(date_string)) %>%
6    select(- date_string)
7  str(df)
8  nrow(df)

You can also configure line numbering for your whole website.

The real magic to me is that if you write your code from R Markdown you can

  • apply the options to the source chunk using a knitr hook like the one defined in our archetype;

  • use R code to programmatically produce code blocks between fences, e.g. choosing which lines to highlight.

knitr hook to highlight lines of source code

Our hook is

# knitr hook to use Hugo highlighting options
knitr::knit_hooks$set(
  source = function(x, options) {
    hlopts <- options$hlopts
    paste0(
      "```", "r ",
      if (!is.null(hlopts)) {
        paste0("{",
          glue::glue_collapse(
            glue::glue('{names(hlopts)}={hlopts}'),
            sep = ","
          ), "}"
        )
      },
      "\n", glue::glue_collapse(x, sep = "\n"), "\n```\n"
    )
  }
)

The chunk⁶

```{r name-your-chunks, hlopts=list(linenos="table")}
a <- 1+1
b <- 1+2
c <- 1+3
a + b + c
```

is rendered as

1  a <- 1+1
2  b <- 1+2
3  c <- 1+3
4  a + b + c
[1] 9

PSA! Note that if you're after line highlighting, or function highlighting, for R Markdown documents in general, you should check out Kelly Bodwin's flair package!

Produce line-highlighted code blocks with glue/paste0

What Chroma highlights are code blocks with code fences, which you might as well generate from R Markdown using some string manipulation and the knitr results="asis" chunk option. E.g.

```{r, results="asis"}
script <- c(
  "a <- 1",
  "b <- 2",
  "c <- 3",
  "a + b + c")
cool_lines <- sample(1:4, 2)
cool_lines <- stringr::str_remove(toString(cool_lines), " ")
fences_start <- paste0('```', 'r {hl_lines=[', cool_lines, ']}')
glue::glue_collapse(
  c(fences_start,
    script,
    "```"),
  sep = "\n")
```

will be knit to produce

a <- 1
b <- 2
c <- 3
a + b + c

This is a rather uninteresting toy example since we used randomly drawn line numbers to be highlighted, but you might find use cases for this.
We used such an approach in the recent blog post about Rclean, actually!

Accessibility

Since highlighting syntax and lines changes the color of things, it might make it harder for some people to read your content, so the choice of colors is about a bit more than cosmetics.

Disclaimer: I am not an accessibility expert. Our efforts were focused on contrast only, not differences between say green and red, since these do not endanger legibility of code.

We referred to the contrast criterion of the Web Content Accessibility Guidelines of the World Wide Web Consortium, which states: "The intent of this Success Criterion is to provide enough contrast between text and its background so that it can be read by people with moderately low vision (who do not use contrast-enhancing assistive technology)."

For instance, comments could be lighter or darker than code, but it is crucial to pay attention to the contrast between comments and code background!
Like Max Chadwick, we darkened colors of a default Chroma style, friendly, until it passed on an online tool.
Interestingly, this online tool can only work with a stylesheet: for a website with colors written in-line (Hugo default of pygmentsUseClasses=false), it won't pick up color contrast problems.
We chose friendly as a basis because its background can stand out a bit against white without being a dark theme, which might be bad on a mobile device in direct sunlight.
Comments are moreover in italic which helps distinguish them from other code parts.

Our approach is not as good as having an actual designer pick colors, as Codepen recently did, but it will do for now.
Apart from Max Chadwick's efforts on 10 Pygments styles, we only know of Eric Bailey's a11y dark and light themes as highlighting themes that are advertised as accessible.

A further aspect of contrast when using Chroma is that when highlighting a line, its background will have a different color than normal code.
This color also needs to not endanger the contrast between code and code background, so if your code highlighting is "dark mode", yellow highlighting is probably a bad idea: in this post, for the dark mode, we used the "fruity" Chroma style but with #301934 as background color for the highlighted lines.
It would also be a bad idea to only rely on line highlighting, as opposed to commenting code blocks, since some readers might not be able to differentiate highlighted lines.
Commenting code blocks is probably a good practice in general anyway, explaining what it does instead of just sharing the code like you'd share a gist.

For further reading on accessibility of R Markdown documents, we recommend "Accessible R Markdown Documents" by A. Jonathan R. Godfrey.

Conclusion

In this post we've explained some concepts around code highlighting: both client-side and server-side syntax highlighting; and line highlighting with Chroma.
We've even included a button for switching to dark mode and back as a proof-of-concept.
Being able to properly decorate code might make your content more attractive to your readers, or motivate you to write more documentation, which is great.
Now, how much time to fiddle with code appearance is probably a question of taste.


  1. Our website is deployed via Netlify. ↩

  2. Support for striking text, with ~~blablabla~~ is also quite new in Hugo, thanks to its new Markdown handler Goldmark! ↩

  3. In this case colors are also hard-coded in RSS feeds which means the posts will look better in feed readers. ↩

  4. With color not hard-coded in the html, but as classes, you could imagine folks developing browser extensions to override your highlighting style. ↩

  5. There is also a highlight shortcode which to me is less natural to use in R Markdown or in Markdown as someone used to Markdown. ↩

  6. I never remember how to show code chunks without their being evaluated so I always need to look at the source of Garrick Aden-Buie's blog post about Rmd fragments. ↩

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


Nina and John Speaking at Why R? Webinar Thursday, May 7, 2020

Posted: 29 Apr 2020 02:52 PM PDT

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Nina Zumel and John Mount will be speaking on advanced data preparation for supervised machine learning at the Why R? Webinar Thursday, May 7, 2020.


This is at 8pm in a GMT+2 timezone, which for us is 11AM Pacific Time. Hope to see you there!

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Movie Recommendation With Recommenderlab

Posted: 29 Apr 2020 12:05 AM PDT

[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers.]

Because You Are Interested In Data Science, You Are Interested In This Blog Post

If you love streaming movies and TV series online as much as we do here at STATWORX, you've probably stumbled upon recommendations like "Customers who viewed this item also viewed…" or "Because you have seen …, you like …". Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior. But how do these companies know what their customers like? The answer is collaborative filtering.

In this blog post, I will first explain how collaborative filtering works. Secondly, I'm going to show you how to develop your own small movie recommender with the R package recommenderlab and provide it in a shiny application.

Different Approaches

There are several approaches to give a recommendation. In the user-based collaborative filtering (UBCF), the users are in the focus of the recommendation system. For a new proposal, the similarities between new and existing users are first calculated. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are formed via these users and, if necessary, weighed according to their similarity. Then, the x highest rated products are displayed to the new user as a suggestion.
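
To make those steps concrete, here is a toy base-R sketch of the UBCF idea (illustrative only, with a made-up ratings matrix; it is not the recommenderlab implementation used later in this post).

# users in rows, items in columns, NA = not rated
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 1, NA,
                    1, NA, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))
new_user <- c(5, 5, NA, NA)

# cosine similarity computed on the items both users have rated
cosine_sim <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)
  sum(x[keep] * y[keep]) / (sqrt(sum(x[keep]^2)) * sqrt(sum(y[keep]^2)))
}
sims <- apply(ratings, 1, cosine_sim, y = new_user)

# similarity-weighted average rating per item; suggest the best-scoring
# items the new user has not rated yet
scores <- colSums(ratings * sims, na.rm = TRUE) / colSums((!is.na(ratings)) * sims)
sort(scores[is.na(new_user)], decreasing = TRUE)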

In item-based collaborative filtering (IBCF), however, the focus is on the products. For every pair of products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested.

Those and other collaborative filtering methods are implemented in the recommenderlab package:

  • ALS_realRatingMatrix: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
  • ALS_implicit_realRatingMatrix: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
  • IBCF_realRatingMatrix: Recommender based on item-based collaborative filtering.
  • LIBMF_realRatingMatrix: Matrix factorization with LIBMF via package recosystem.
  • POPULAR_realRatingMatrix: Recommender based on item popularity.
  • RANDOM_realRatingMatrix: Produce random recommendations (real ratings).
  • RERECOMMEND_realRatingMatrix: Re-recommends highly-rated items (real ratings).
  • SVD_realRatingMatrix: Recommender based on SVD approximation with column-mean imputation.
  • SVDF_realRatingMatrix: Recommender based on Funk SVD with gradient descend.
  • UBCF_realRatingMatrix: Recommender based on user-based collaborative filtering.

Developing your own Movie Recommender

Dataset

To create our recommender, we use the data from movielens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9000 films from more than 600 users. The movieId is a unique mapping variable to merge the different datasets.

head(movie_data)
  movieId                              title                                      genres
1       1                   Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2       2                     Jumanji (1995)                  Adventure|Children|Fantasy
3       3            Grumpier Old Men (1995)                              Comedy|Romance
4       4           Waiting to Exhale (1995)                        Comedy|Drama|Romance
5       5 Father of the Bride Part II (1995)                                      Comedy
6       6                        Heat (1995)                       Action|Crime|Thriller

head(ratings_data)
  userId movieId rating timestamp
1      1       1      4 964982703
2      1       3      4 964981247
3      1       6      4 964982224
4      1      47      5 964983815
5      1      50      5 964982931
6      1      70      3 964982400

To better understand the film ratings, we display the number of times each rating value occurs and the average rating per film. We see that in most cases, there is no rating by a given user. Furthermore, the average ratings contain a lot of "smooth" values. These are movies that only have a few individual ratings, and therefore the average score is determined by individual users.

# rating_vector
      0     0.5       1     1.5       2     2.5       3     3.5       4     4.5       5
5830804    1370    2811    1791    7551    5550   20047   13136   26818    8551   13211
[Plot: Average Movie Ratings]

In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings.

[Plot: Average Movie Ratings - filtered]

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.208   3.444   3.748   3.665   3.944   4.429

Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user:

[Plot: Average Movie Ratings - relevant users]

You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.

ratings_movies_norm <- normalize(ratings_movies)

Model Training and Evaluation

To train our recommender and subsequently evaluate it, we carry out a 10-fold cross-validation. We train both an IBCF and a UBCF recommender, each once with cosine similarity and once with the Pearson correlation as the similarity measure. A random recommendation is used as a benchmark. To evaluate how many recommendations can be given, different numbers are tested via the vector n_recommendations.

eval_sets <- evaluationScheme(data = ratings_movies_norm,
                              method = "cross-validation",
                              k = 10,
                              given = 5,
                              goodRating = 0)

models_to_evaluate <- list(
  `IBCF Cosinus` = list(name = "IBCF",
                        param = list(method = "cosine")),
  `IBCF Pearson` = list(name = "IBCF",
                        param = list(method = "pearson")),
  `UBCF Cosinus` = list(name = "UBCF",
                        param = list(method = "cosine")),
  `UBCF Pearson` = list(name = "UBCF",
                        param = list(method = "pearson")),
  `Zufälliger Vorschlag` = list(name = "RANDOM", param = NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))

list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)

We then have the results displayed graphically for analysis.

[Plot: Different models]

We see that the best performing model is built by using UBCF and the Pearson correlation as a similarity measure. The model consistently achieves the highest true positive rate for the various false-positive rates and thus delivers the most relevant recommendations. Furthermore, we want to maximize the recall, which is also guaranteed at every level by the UBCF Pearson model. Since the n most similar users (parameter nn) are used to calculate the recommendations, we will examine the results of the model for different numbers of users.

vector_nn <- c(5, 10, 20, 30, 40)

models_to_evaluate <- lapply(vector_nn, function(nn){
  list(name = "UBCF",
       param = list(method = "pearson", nn = nn))
})
names(models_to_evaluate) <- paste0("UBCF mit ", vector_nn, " Nutzern")

list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)
[Plot: Different users]

Conclusion

Our user-based collaborative filtering model with the Pearson correlation as the similarity measure and the 40 most similar users per recommendation delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny App.
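
For reference, fitting that final model and producing top-N suggestions with recommenderlab looks roughly like the sketch below (my outline, not the exact code behind the app; it reuses the evaluation scheme defined above).

final_model <- Recommender(getData(eval_sets, "train"),
                           method = "UBCF",
                           param = list(method = "pearson", nn = 40))

pred <- predict(final_model,
                newdata = getData(eval_sets, "known"),
                n = 10, type = "topNList")

# top-10 movie suggestions for the first two users of the hold-out set
as(pred, "list")[1:2]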

However, there is no guarantee that the suggested movies will really meet your individual taste. Not only is the underlying dataset relatively small and easily distorted by individual user ratings; the tech giants also use other data such as age, gender, user behavior, etc. for their models.

But what I can say is: Data Scientists who read this blog post also read the other blog posts by STATWORX.

Shiny-App

Here you can find the Shiny App. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good), and press the run button. Please note that the app is hosted on a free account of shinyapps.io, which makes it available for 25 hours per month. If those 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio.

About the author

Andreas Vogl

ABOUT US

STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

The post Movie Recommendation With Recommenderlab first appeared on STATWORX.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX.


Testing for Covid-19 in the U.S.

Posted: 28 Apr 2020 06:22 PM PDT

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

For almost a month, we have been working on a daily basis with colleagues (Romuald, Chi and Mathieu) on modeling the dynamics of the recent pandemic. I learn a lot of things discussing with them, but we keep struggling with the tests. Paul, in Montréal, helped me a little bit, but I think we will still have to do more to get a better understanding. To be honest, we struggle with two very simple questions:

  • how many people are tested on a daily basis?

Recently, I discovered Modelling COVID-19 exit strategies for policy makers in the United Kingdom, which is very close to what we try to do… In that document, two interesting scenarios are discussed: for the first one, "1 million 'reliable' daily tests are deployed" (in the U.K.), and for the second, "5 million 'useless' daily tests are deployed". There are about 65 million inhabitants in the U.K., so we are talking here about 1.5% of the population tested on a daily basis, or 7.69%! It could make sense, but our question was, at some point: is that realistic? Where are we today with testing? In the U.S., https://covidtracking.com/ collects interesting data, on a daily basis, per state.
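
As a quick check of the orders of magnitude quoted above (this snippet is mine, not from the original analysis):

c(reliable = 1e6, useless = 5e6) / 65e6
##   reliable    useless
## 0.01538462 0.07692308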

url = "https://raw.githubusercontent.com/COVID19Tracking/covid-tracking-data/master/data/states_daily_4pm_et.csv"
download.file(url, destfile = "covid.csv")
base = read.csv("covid.csv")

Unfortunately, there is no information about the population. That we can find on Wikipedia. But in that table, the state is given by its full name (and by its symbol in the previous dataset), so we also need to match the two datasets properly,

url = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population"
download.file(url, destfile = "popUS.html")
#pas contaminé 2/3 R=3
library(XML)
tables = readHTMLTable("popUS.html")
T = tables[[1]][3:54, c("V3","V4")]
names(T) = c("state","pop")
url = "https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations"
download.file(url, destfile = "nameUS.html")
tables = readHTMLTable("nameUS.html")
T2 = tables[[1]][13:63, c(1,4)]
names(T2) = c("state","symbol")
T = merge(T, T2)
T$population = as.numeric(gsub(",", "", T$pop, fixed = TRUE))
names(base)[2] = "symbol"
base = merge(base, T[,c("symbol","population")])

Now our dataset is fine… and we can write a function to plot the (cumulative) number of people tested in the U.S. Here, we distinguish between the positive and the negative tests,

drawing = function(st = "NY"){
  sbase = base[base$symbol == st, c("date","positive","negative","population")]
  sbase$DATE = as.Date(as.character(sbase$date), "%Y%m%d")
  sbase = sbase[order(sbase$DATE),]
  par(mfrow = c(1,2))
  plot(sbase$DATE, (sbase$positive + sbase$negative)/sbase$population,
       ylab = "Proportion Test (/population of state)",
       type = "l", xlab = "", col = "blue", lwd = 3)
  lines(sbase$DATE, sbase$positive/sbase$population, col = "red", lwd = 2)
  legend("topleft", c("negative","positive"), lwd = 2, col = c("blue","red"), bty = "n")
  title(st)
  plot(sbase$DATE, sbase$positive/(sbase$positive + sbase$negative),
       ylab = "Ratio of positive tests", ylim = c(0,1),
       type = "l", xlab = "", col = "black", lwd = 3)
  title(st)
}

Let us start with New York

drawing("NY")

As of now, 4% of the entire population got tested… over 6 weeks. The graph on the right is the proportion of people who tested positive… I won't get back to that one here today; I keep it for our work. In New Jersey, we got about 2.5% of the entire population tested, overall,

drawing("NJ")

Let us try a last one, Florida

drawing("FL")

As of today, it is 1.5% of the population, over 6 weeks. Overall, in the U.S., less than 0.1% of the population is tested on a daily basis, which is far from the 1.5% in the U.K. scenarios. Now, here comes the second question,

  • what are we actually testing for?

On that one, my experience in biology is… very limited, and Paul helped me. He mentioned this morning a nice report from a lab at UC Berkeley.

One of my questions was, for instance: if you test positive and you do the test again, can you test negative? Or, in the context of our data, do we test different people? Are some people tested on a regular basis (perhaps every week)? For instance, with antigen tests (Reverse Transcription Quantitative Polymerase Chain Reaction (RT-qPCR), also called molecular or PCR (Polymerase Chain Reaction) tests) we test whether someone is infectious, while with antibody tests (using serological immunoassays that detect viral-specific antibodies, Immunoglobulin M (IgM) and G (IgG), also called serology tests), we test for immunity. Which is rather different…

I have no idea what we have in our database, to be honest… and for the past six weeks, I have seen a lot of databases, and most of the time, I don't know how to interpret them, I don't know what is measured… and it is scary. So, so far, we try to do some maths, to test dynamics by tuning parameters "the best we can" (and not estimate them). But if anyone has good references on testing in the context of Covid-19 (for instance on the specificity and sensitivity of all those tests), I would love to hear about it!

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

