[R-bloggers] The Bachelorette Ep. 1 – Every Rose has its Thorn – Data Analysis in R (and 10 more aRticles)

Link to R-bloggers

The Bachelorette Ep. 1 – Every Rose has its Thorn – Data Analysis in R

Posted: 16 Oct 2020 05:32 AM PDT

[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We all love a good love story. In The Bachelor TV franchise, millions tune in to see who will be the lucky winner of the lead's heart. But what about those who don't make it? What do they have in common? Please check out our dashboard (built with Shiny) at https://stoltzmaniac.shinyapps.io/TheBacheloretteApp/.

Night 1 of The Bachelorette found 23 men continuing on to night 2, while 8 men were sent packing. Based on our highly scientific rating system here at Stoltzman Consulting, we found that Clare has a preferred type. 

Let's dig into the data.

Below, we see that Clare prefers men who score high on the "Bro-y" and "Vanit-y" characteristics, while she is less interested in men who score high on "Right Reason-y" and "Kind Heart-y". The "Drama-y" level does not appear to impact her decision making.

Chart: Preferred Suitor Characteristics (Active vs. Eliminated).

Clare also appears to prefer men who have a stronger social media presence. Those who have not been cut fall on the "tall and slender" side of the chart below, while those who were eliminated fall on the "short and stout" side. Dale Moss had to be excluded from the chart because his follower count was far higher than everyone else's (over 180K); he is still an active participant.

Chart: Pre-Air Instagram Followers (excludes Dale Moss).

We are looking forward to seeing where this season goes. Apparently, it is the most dramatic season yet! Remember to check out our live analytics dashboard at https://stoltzmaniac.shinyapps.io/TheBacheloretteApp/ to build models to predict the winner and see current Twitter and Instagram trends.

Screenshots of the live analytics dashboard.

For those interested, the code is below (data available upon request; just visit the contact us section of this site).

library(tidyverse)

GLOBAL_DATA = get_database_data()
saveRDS(GLOBAL_DATA, 'GLOBAL_DATA.rds')

# Contestant circular bar chart
all_contestant_data = GLOBAL_DATA$contestant_data_raw

suitor_data = all_contestant_data %>%
  filter(!instagram %in% c('tayshiaaa', 'chrisbharrison', 'clarecrawley')) %>%
  mutate(status = as.factor(
    case_when(
      end_episode > latest_episode ~ 'Active',
      TRUE ~ 'Eliminated'
    )
  )) %>%
  select(status, everything())

coord_plot_data = suitor_data %>%
  select(status:`Right Reason-y`) %>%
  group_by(status) %>%
  pivot_longer(`Vanit-y`:`Right Reason-y`, names_to = "characteristic") %>%
  group_by(characteristic, status) %>%
  summarize(avg = mean(value), .groups = 'drop')

coord_plot_data %>%
  ggplot(aes(x = characteristic)) +
  geom_col(aes(y = avg, col = status, fill = status), position = 'dodge') +
  geom_text(aes(y = avg, label = characteristic),
            data = coord_plot_data %>% filter(status == 'Active'),
            size = 4, position = position_stack(vjust = 1.4)) +
  coord_polar() +
  ggtitle('Preferred Suitor Characteristics') +
  theme_minimal() +
  scale_color_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_fill_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  theme(legend.position = c(0.5, 0.95), legend.direction = "horizontal",
        legend.title = element_blank(), axis.title = element_blank(),
        axis.text.y = element_blank(), axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1))

# Contestant instagram
insta_follower_data = GLOBAL_DATA$insta_followers_w_losers %>%
  drop_na() %>%
  mutate(relative_air_date =
           case_when(
             datetime <= '2020-10-12' ~ 'Pre Air',
             TRUE ~ 'Aired')
         ) %>%
  left_join(
    suitor_data,
    by = 'name'
  ) %>%
  drop_na()

insta_follower_data %>%
  filter(relative_air_date == 'Pre Air', follower_count < 1e5) %>%
  ggplot(aes(x = status, y = follower_count, col = status, fill = status)) +
  geom_violin(alpha = 0.) +
  ggtitle('Pre-Air Instagram Followers', subtitle = "*Excludes Dale Moss") +
  theme_minimal() +
  scale_color_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_fill_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_y_continuous(label = scales::comma) +
  theme(legend.position = 'top', legend.direction = "horizontal",
        legend.title = element_blank(), axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

To leave a comment for the author, please follow the link and comment on their blog: Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post The Bachelorette Ep. 1 - Every Rose has its Thorn - Data Analysis in R first appeared on R-bloggers.


Free Workshops for Meetup Groups

Posted: 16 Oct 2020 12:37 AM PDT

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For the last few years we've offered automatic sponsorship for meet-ups and satRday events. However, for obvious COVID-related reasons, most (all?) meet-ups have been getting together virtually, so the need for extra pizza money has diminished.

As with most organisations, we've had to adapt to the new online-first environment, in particular by running primarily online training courses. We've always run some online events, but now we're running in excess of ten days of training per month on topics ranging from R and Python to Git, Shiny and Stan. So far our average course rating is a healthy 4.7 out of 5!

With this newfound experience in online training, and our regret at not getting to interact with useR groups from around the world, we thought that one way we could contribute back to the community would be to offer free training workshops.

Potential topics fall into two groups. The first is introductory sessions, where the required background would be kept to a minimum, e.g.

  • starting out with the tidyverse
  • Rmarkdown
  • basic git
  • building an R package
  • regular expressions

The second is more advanced topics that would require some prerequisites:

  • getting started with Shiny
  • continuous integration
  • advanced git topics, e.g. rebase vs merge
  • anything else?

If you are interested, then please complete this short form and we'll get in touch.


Jumping Rivers are full-service, RStudio-certified partners. Part of our role is to offer support for RStudio Pro products. If you use any RStudio Pro products, feel free to contact us. We may be able to offer free support.

The post Free Workshops for Meetup Groups appeared first on Jumping Rivers.

To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Free Workshops for Meetup Groups first appeared on R-bloggers.


Hack: How to Install and Load Packages Dynamically

Posted: 15 Oct 2020 09:28 PM PDT

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When we share an R script file with someone else, we assume that they have already installed the required R packages. However, this is not always the case, and for that reason I strongly suggest adding this piece of code to every shared R script that requires a package. Let's assume that your code requires the following three packages: "readxl", "dplyr", "multcomp".

The script below checks whether each package is installed; if not, it installs the package and then loads it into R.

mypackages <- c("readxl", "dplyr", "multcomp")

for (p in mypackages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
    library(p, character.only = TRUE)
  }
}

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Hack: How to Install and Load Packages Dynamically first appeared on R-bloggers.


Nuclei segmentation in R with Platypus

Posted: 15 Oct 2020 12:00 PM PDT

[This article was first published on R on Data Science Guts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my previous post I introduced you to my latest project, platypus – an R package for object detection and image segmentation. This time I will go into more detail and show you how to use it on biomedical data.

The data

Today we will work on 2018 Data Science Bowl dataset.
You can download the images and masks directly from the URL or using the Kaggle API:

kaggle competitions download -c data-science-bowl-2018

After downloading the data, unpack it and move it to your preferred destination. For this example we will only be interested in the stage1_train and stage1_test subdirectories, so you can delete the other files if you want.

Before we start, let's investigate a little bit.

library(tidyverse)
library(platypus)
library(abind)
library(here)

# Print current working directory
here()
# [1] "/home/maju116/Desktop/PROJECTS/Moje Projekty/platypus"

# Set directories with the data and models
data_path <- here("examples/data/data-science-bowl-2018/")
models_path <- here("examples/models/")

# Investigate one instance of data (image + masks)
sample_image_path <- here("examples/data/data-science-bowl-2018/stage1_train/00071198d059ba7f5914a526d124d28e6d010c92466da21d4a04cd5413362552/")

list.files(sample_image_path, full.names = TRUE) %>%
  set_names(basename(.)) %>%
  map(~ list.files(.))

# $images
# [1] "00071198d059ba7f5914a526d124d28e6d010c92466da21d4a04cd5413362552.png"
#
# $masks
#  [1] "07a9bf1d7594af2763c86e93f05d22c4d5181353c6d3ab30a345b908ffe5aadc.png"
#  [2] "0e548d0af63ab451616f082eb56bde13eb71f73dfda92a03fbe88ad42ebb4881.png"
#  [3] "0ea1f9e30124e4aef1407af239ff42fd6f5753c09b4c5cac5d08023c328d7f05.png"
#  [4] "0f5a3252d05ecdf453bdd5e6ad5322c454d8ec2d13ef0f0bf45a6f6db45b5639.png"
#  [5] "2c47735510ef91a11fde42b317829cee5fc04d05a797b90008803d7151951d58.png"
#  [6] "4afa39f2a05f9884a5ff030d678c6142379f99a5baaf4f1ba7835a639cb50751.png"
#  [7] "4bc58dbdefb2777392361d8b2d686b1cc14ca310e009b79763af46e853e6c6ac.png"
#  [8] "4e3b49fb14877b63704881a923365b68c1def111c58f23c66daa49fef4b632bf.png"
#  [9] "5522143fa8723b66b1e0b25331047e6ae6eeec664f7c8abeba687e0de0f9060a.png"
# [10] "58656859fb9c13741eda9bc753c3415b78d1135ee852a194944dee88ab70acf4.png"
# [11] "6442251746caac8fc255e6a22b41282ffcfabebadbd240ee0b604808ff9e3383.png"
# [12] "7ff04129f8b6d9aaf47e062eadce8b3fcff8b4a29ec5ad92bca926ac2b7263d2.png"
# [13] "8bbec3052bcec900455e8c7728d03facb46c880334bcc4fb0d1d066dd6c7c5d2.png"
# [14] "9576fe25f4a510f12eecbabfa2e0237b98d8c2622b9e13b9a960e2afe6da844e.png"
# [15] "95deddb72b845b1a1f81a282c86e666045da98344eaa2763d67e2ab80bc2e5c3.png"
# [16] "a1b0cdb21f341af17d86f23596df4f02a6b9c4e0d59a7f74aaf28b9e408a4e4c.png"
# [17] "aa154c70e0d82669e9e492309bd00536d2b0f6eeec1210014bbafbfc554b377c.png"
# [18] "acba6646e8250aab8865cd652dfaa7c56f643267ea2e774aee97dc2342d879d6.png"
# [19] "ae00049dc36a1e5ffafcdeadb44b18a9cd6dfd459ee302ab041337529bd41cf2.png"
# [20] "af4d6ff17fa7b41de146402e12b3bab1f1fe3c1e6f37da81a54e002168b1e7dd.png"
# [21] "b0cbc2c553f9c4ac2191395236f776143fb3a28fb77b81d3d258a2f45361ca89.png"
# [22] "b6fc3b5403de8f393ca368553566eaf03d5c07148539bc6141a486f1d185f677.png"
# [23] "be98de8a7ba7d5d733b1212ae957f37b5b69d0bf350b9a5a25ba4346c29e49f7.png"
# [24] "cb53899ef711bce04b209829c61958abdb50aa759f3f896eb7ed868021c22fb4.png"
# [25] "d5024b272cb39f9ef2753e2f31344f42dd17c0e2311c4927946bc5008d295d2e.png"
# [26] "f6eee5c69f54807923de1ceb1097fc3aa902a6b20d846f111e806988a4269ed0.png"
# [27] "ffae764df84788e8047c0942f55676c9663209f65da943814c6b3aca78d8e7f7.png"

As you can see, each image has its own directory, which has two subdirectories inside:

  • images – contains the original image that will be the input of the neural network
  • masks – contains one or more segmentation masks. A segmentation mask simply tells us which pixel belongs to which class, and this is what we will try to predict.

For the modeling, besides the train and test sets, we will also need a validation set (no one is forcing you, but it's good practice!):

train_path <- here("examples/data/data-science-bowl-2018/stage1_train/")
test_path <- here("examples/data/data-science-bowl-2018/stage1_test/")
validation_path <- here("examples/data/data-science-bowl-2018/stage1_validation/")

if (!dir.exists(validation_path)) {
  dir.create(validation_path)
  # List train images
  train_samples <- list.files(train_path, full.names = TRUE)
  set.seed(1234)
  # Select 10% for validation
  validation_samples <- sample(train_samples, round(0.1 * length(train_samples)))
  validation_samples %>%
    walk(~ system(paste0('mv "', ., '" "', validation_path, '"')))
}

Semantic segmentation with U-Net

Since we now know something about our data, we can move on to the modeling part. We will start by selecting the architecture of the neural network. In the case of semantic segmentation there are a few different choices, like U-Net, Fast-FCN, DeepLab and many more. For the time being, the platypus package gives you access only to the U-Net architecture.

U-Net was originally developed for biomedical data segmentation. As you can see in the picture above, the architecture is very similar to an autoencoder and it looks like the letter U, hence the name. The model is composed of two parts, and each part has some number of convolutional blocks (3 in the image above). The number of blocks will be a hyperparameter in our model.

To build a U-Net model in platypus, use the u_net function. You have to specify:

  • the number of convolutional blocks,
  • the input image height and width – must be of the form 2^N!,
  • whether the input image will be loaded as grayscale or RGB,
  • the number of classes – in our case we have only 2 (background and nuclei),
  • additional CNN arguments, like the number of filters and the dropout rate.
blocks <- 4        # Number of U-Net convolutional blocks
n_class <- 2       # Number of classes
net_h <- 256       # Must be in a form of 2^N
net_w <- 256       # Must be in a form of 2^N
grayscale <- FALSE # Will input image be in grayscale or RGB

DCB2018_u_net <- u_net(
  net_h = net_h,
  net_w = net_w,
  grayscale = grayscale,
  blocks = blocks,
  n_class = n_class,
  filters = 16,
  dropout = 0.1,
  batch_normalization = TRUE,
  kernel_initializer = "he_normal"
)

After that it's time to select a loss and additional metrics. Because semantic segmentation is in essence a classification for each pixel instead of the whole image, you can use categorical cross-entropy as a loss function and accuracy as a metric. Another common choice, available in platypus, is the dice coefficient/loss. You can think of it as an F1 metric for semantic segmentation.
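
For intuition, the dice coefficient of two binary masks is just twice their overlap divided by the sum of their sizes. A toy sketch of the idea (my own illustration, not the platypus implementation, which operates on tensors inside the loss):

# Toy illustration only -- not the platypus implementation.
# Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks.
dice_coefficient <- function(y_true, y_pred) {
  intersection <- sum(y_true * y_pred)
  2 * intersection / (sum(y_true) + sum(y_pred))
}

a <- matrix(c(1, 1, 0, 0), nrow = 2)
b <- matrix(c(1, 0, 0, 0), nrow = 2)
dice_coefficient(a, b)  # 2 * 1 / (2 + 1) = 0.667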

DCB2018_u_net %>%
  compile(
    optimizer = optimizer_adam(lr = 1e-3),
    loss = loss_dice(),
    metrics = metric_dice_coeff()
  )

The next step will be data ingestion. As you remember, we have a separate directory and multiple masks for each image. That's not a problem for platypus! You can ingest data using the segmentation_generator function. The first argument to specify is the directory with all the images and masks. To tell platypus that it has to load images and masks from separate directories for each data sample, specify the argument mode = "nested_dirs". Additionally, you can set the images/masks subdirectory names using the subdirs argument. platypus will automatically merge multiple masks for each image, but we have to tell it how to recognize which pixel belongs to which class. In the segmentation masks each class is recognized by a specific RGB value. In our case we have only black (R = 0, G = 0, B = 0) pixels for the background and white (R = 255, G = 255, B = 255) pixels for the nuclei. To tell platypus how to recognize classes on the segmentation masks, use the colormap argument.

binary_colormap

# [[1]]
# [1] 0 0 0
#
# [[2]]
# [1] 255 255 255

train_DCB2018_generator <- segmentation_generator(
  path = train_path, # directory with images and masks
  mode = "nested_dirs", # Each image with masks in separate folder
  colormap = binary_colormap,
  only_images = FALSE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = TRUE,
  subdirs = c("images", "masks") # Names of subdirs with images and masks
)

# 603 images with corresponding masks detected!
# Set 'steps_per_epoch' to: 19

validation_DCB2018_generator <- segmentation_generator(
  path = validation_path, # directory with images and masks
  mode = "nested_dirs", # Each image with masks in separate folder
  colormap = binary_colormap,
  only_images = FALSE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = TRUE,
  subdirs = c("images", "masks") # Names of subdirs with images and masks
)

# 67 images with corresponding masks detected!
# Set 'steps_per_epoch' to: 3

We can now fit the model.

history <- DCB2018_u_net %>%
  fit_generator(
    train_DCB2018_generator,
    epochs = 20,
    steps_per_epoch = 19,
    validation_data = validation_DCB2018_generator,
    validation_steps = 3,
    callbacks = list(callback_model_checkpoint(
      filepath = file.path(models_path, "DSB2018_w.hdf5"),
      save_best_only = TRUE,
      save_weights_only = TRUE,
      monitor = "dice_coeff",
      mode = "max",
      verbose = 1)
    )
  )

We can then calculate predictions for new images. Our model will return a 4-dimensional array (number of images, height, width, number of classes). Each pixel will have N probabilities, where N is the number of classes. To transform the raw predictions into a segmentation map (by selecting the class with the maximum probability for each pixel) you can use the get_masks function.
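
Conceptually, that transformation is just a per-pixel argmax over the class dimension. A toy sketch (my own illustration, not the actual get_masks code):

# Toy illustration of the per-pixel argmax -- not the actual get_masks() code.
# preds: one 2 x 2 image with predicted probabilities for 2 classes.
preds <- array(c(0.9, 0.2, 0.4, 0.1,   # class 1 probabilities
                 0.1, 0.8, 0.6, 0.9),  # class 2 probabilities
               dim = c(2, 2, 2))

class_map <- apply(preds, c(1, 2), which.max)
class_map
#      [,1] [,2]
# [1,]    1    2
# [2,]    2    2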

test_DCB2018_generator <- segmentation_generator(
  path = test_path,
  mode = "nested_dirs",
  colormap = binary_colormap,
  only_images = TRUE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = FALSE,
  subdirs = c("images", "masks")
)

# 65 images detected!
# Set 'steps_per_epoch' to: 3

test_preds <- predict_generator(DCB2018_u_net, test_DCB2018_generator, 3)
dim(test_preds)
# [1]  65 256 256   2

test_masks <- get_masks(test_preds, binary_colormap)
dim(test_masks[[1]])
# [1] 256 256   3

To visualize the predicted masks together with the original images you can use the plot_masks function.

test_imgs_paths <- create_images_masks_paths(test_path, "nested_dirs", FALSE, c("images", "masks"), ";")$images_paths

plot_masks(
  images_paths = test_imgs_paths[1:4],
  masks = test_masks[1:4],
  labels = c("background", "nuclei"),
  colormap = binary_colormap
)

Predicted nuclei masks plotted over four of the original test images.

To leave a comment for the author, please follow the link and comment on their blog: R on Data Science Guts.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Nuclei segmentation in R with Platypus first appeared on R-bloggers.


Raspberry Pi E-Paper Dashboard with R

Posted: 15 Oct 2020 11:00 AM PDT

[This article was first published on schochastics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For once, this is not a post on an R package for network analysis or random R code.
This post is about a DIY adventure of mine with a little R code on the side.

I have always been intrigued by the Raspberry Pi, but never really engaged with it since I lack
the practical skills (like soldering) for most projects you can find on the internet.
That changed when I found this build of a Raspberry Pi dashboard, where very few handy skills are needed. So I set out to build one myself, and it turned out quite nice.

This is what you need to build such a dashboard yourself:

E-Paper displays are kind of awesome since they can display an image without using any electricity (once it is loaded).
This YouTube video does a very good job in explaining the technicalities.
The downside of the screen is the resolution (800×480, which is still quite high for an E-Paper display!) and the long refresh time (~2-3 seconds for this model but can go up to 15-20 seconds for colored output). So whatever you display on the screen, it better not be something that changes every few seconds (mine changes every 10 minutes).

In the rest of this post, I'll walk through all the steps necessary to set up the display to be used with R code.

The handy stuff

Simply fit the display into the frame and connect it to the Raspberry Pi. That's pretty much it.
Here is the back view of the frame.

(yes, I used rubber bands to attach the Raspberry Pi to the frame…)

Setting up the Pi

  • Get Raspberry Pi OS Lite (Link)
  • Follow the installation guide (Link)

Installing R on the Raspberry Pi

Installing R on the Raspberry Pi turned out to be trickier than I thought. I was only able to install version 3.6 from the
package sources, but I wanted to have version 4.0.2. I tried following the guides posted
here and
here but always ran into the problem that the package r-base-core
was not available for 4.0.

I ended up compiling R from source, which was surprisingly simple. I loosely followed the guide posted here:

wget https://cran.rstudio.com/src/base/R-4/R-4.0.2.tar.gz
tar zxvf R-4.0.2.tar.gz
# uncomment first line in /etc/apt/source.list
sudo apt-get build-dep r-base
./configure --prefix=/opt/R/4.0.2 --enable-R-shlib --with-blas --with-lapack

If you run into the error:
configure: error: PCRE2 library and headers are required, or use --with-pcre1 and PCRE >= 8.32 with UTF-8 support
do

sudo apt-get install libpcre2-dev
sudo make
sudo make install

sudo ln -s /opt/R/4.0.2/bin/R /usr/local/bin/R
sudo ln -s /opt/R/4.0.2/bin/Rscript /usr/local/bin/Rscript

and that's already it. You should now have R ready to run on your Raspberry Pi.
Since we are operating on a headless Raspberry Pi, there is no point in trying to install RStudio.
I found, however, Nvim R to be an awesome headless substitute (if you are happy using vi, that is).

Setting up the E-Paper display

I simply followed the guide posted on the waveshare wiki.

Enable SPI

sudo raspi-config and then choose Interfacing Options -> SPI -> Yes.

Install BCM2835 libraries

Check what the newest version is. At the time of writing, it was 1.68.

wget http://www.airspayce.com/mikem/bcm2835/bcm2835-1.68.tar.gz
tar zxvf bcm2835-1.68.tar.gz
cd bcm2835-1.68/
sudo ./configure
sudo make
sudo make check
sudo make install

Install wiringPi

sudo apt-get install wiringpi

If you are on a Raspberry Pi 4, you'll need to update it

wget https://project-downloads.drogon.net/wiringpi-latest.deb
sudo dpkg -i wiringpi-latest.deb

Communicating with the display from R

The Raspberry Pi can easily communicate with the display via some C or python library.
Luckily, both can be integrated in R. I opted for python, due to the amazing R package reticulate which makes
the communication between R and python a breeze. I'd be interested, though, in packaging the C library as an R package (eventually).

# python3
sudo apt-get update
sudo apt-get install python3-pip
sudo apt-get install python3-pil
sudo apt-get install python3-numpy
sudo pip3 install RPi.GPIO
sudo pip3 install spidev

You'll also need git to download some additional code.

sudo apt-get install git -y

Then, clone the repository of waveshare that includes the needed libraries.

sudo git clone https://github.com/waveshare/e-Paper

The library we will need is in the folder RaspberryPi&JetsonNano/python/lib.

python code for communication

The python code below is used to send the bitmap file screen-output.bmp to the display.

#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
import os
import time

libdir = "/home/pi/e-Paper/RaspberryPi&JetsonNano/python/lib"
if os.path.exists(libdir):
    sys.path.append(libdir)

from waveshare_epd import epd7in5_V2
from PIL import Image, ImageDraw, ImageFont

try:
    epd = epd7in5_V2.EPD()
    epd.init()
    Himage = Image.open("screen-output.bmp")
    epd.display(epd.getbuffer(Himage))
    epd.sleep()
except KeyboardInterrupt:
    epd7in5_V2.epdconfig.module_exit()
    exit()

Note that I am not a python expert, so this code may well be awful (but it works). The code below shows how to call
this script (display.py) from within R.

library(reticulate)

use_python("/usr/bin/python3")
py_run_file("display.py")

Build the dashboard in R

Now that we can communicate with the display from python and know how to use python from within R, we can
think about what we want to display on the dashboard and how we implement it. Remember, all we have is
800×480 pixels in black and white. Apart from this constraint, though, you are free to put whatever you want on the display.
I opted for a standard weather, calendar, and date setup, together with some randomly generated pixel art spaceships (my R implementation of a pixel-sprite-generator). The plotting is done entirely with ggplot2.
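
As a starting point, such a panel only takes a few lines of ggplot2. The sketch below is a hypothetical minimal example (not the code from the repository); it just draws the current time and date on an empty canvas:

library(ggplot2)

# Hypothetical minimal panel: current time and date on an empty canvas.
p <- ggplot() +
  annotate("text", x = 0, y = 0.25,
           label = format(Sys.time(), "%H:%M"), size = 30) +
  annotate("text", x = 0, y = -0.4,
           label = format(Sys.Date(), "%A, %d %B %Y"), size = 8) +
  xlim(-1, 1) + ylim(-1, 1) +
  theme_void()

The resulting object p is then passed to the ggsave()/convert/display pipeline shown at the end of this post.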

To reproduce my dashboard setup, simply clone the repository

git clone https://github.com/schochastics/e-Paper-dashboard.git

To use the code, you need to get an API key from openweathermap.org and set up
the gcalendr package with your Google calendar. The rest should work out of the box (except that it won't; feel free to ask questions in the comments or on Twitter).

Of course you do not have to follow my build. The following code snippet should be sufficient to build your own dashboard from scratch.

library(reticulate)
# load additional libraries, e.g. ggplot2

use_python("/usr/bin/python3")

# build your plot object p

ggsave("raw-output.bmp", p, width = 5, height = 3, dpi = 160)
system("convert -colors 2 +dither -type Bilevel -monochrome raw-output.bmp screen-output.bmp")
py_run_file("display.py")

The parameters set in ggsave() make sure that the output is an 800×480 bmp file (5 in × 160 dpi = 800 px and 3 in × 160 dpi = 480 px), and the line thereafter converts the plot to a true black and white image (which speeds up the rendering on the display). For the second line to work you need to install imagemagick.

sudo apt-get install imagemagick

Here is an example dashboard with my code from github.

To leave a comment for the author, please follow the link and comment on their blog: schochastics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Raspberry Pi E-Paper Dashboard with R first appeared on R-bloggers.


Submitting R package to CRAN

Posted: 15 Oct 2020 11:00 AM PDT

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Disclaimer: I have no affiliation with Microsoft Corp. or Revolution Analytics.

For the n-th time in x years, submitting an R package to CRAN ended up
like comedy. This time for one anecdotal note (a kind of warning),
whereas the previous accepted version of ESGtoolkit has had,
for 5 years, 12 warnings and notes combined. No error, nothing's broken. And I'm not even comparing
it to some other packages' results.


Literally no online repo (PyPi, conda forge, submit a Julia package, etc.) works
this way these days, with a censor sending emails back and forth to a student receiving an examination. The validation process is automated (sure, you can sometimes contact someone). It's not as if all of this was about controlling the calculation of EBITDA. It's just code. And perfection is unattainable. One. Warning. Man. Sighs.

It seems to me like, under the hood and
after years and years of observation and curious, repeated subliminal (and not so subliminal) threats, it isn't just about code on CRAN, and more about a lot of obscure politics and private interests. Even if it was about code, censor, why not submitting a pull request/issue to my GitHub repo directly as everyone does these days, instead of continuously shooting shots for nothing? It also happens that, when you submit something there, your package is kind of confiscated and you can never (ever) take it back. So definitely, and again, think twice (or more) before submitting.

I know exactly what to do to remove the note (xlab and ylab passed to ... in a
plotting function, and causing the "not-error" no visible global function definition
for 'xlab'
). Which wasn't suggested by censor. But that's when I was reminded that in 2020, there are thousands of ways to circumvent reactionaries (you should use GitHub/Gitlab search sometimes, your whole world and certainties will fall apart) on the internet, and of
this gem of a package I once heard about:
miniCRAN.

How does miniCRAN feel to me? Just like when I want
to create a virtual environment in Python with only packages that I'd like to use,
isolated from other packages (well, for actually using R in a virtual environment, I don't know yet). And there's more to it than that, as shown
in this presentation, including the invaluable ability to use a CRAN-like system of R packages behind a firewall, in an enterprise – or wherever – ecosystem. With miniCRAN, I was able to create an online repo just like what I'm used to, with only the set of packages I'd like to use.

And for someone who's been using R for 15 + years (I wonder why it's studied and taught in university though), it was delightful to be able to do this for the first time:

# my mirror of packages created with miniCRAN
repo <- "https://techtonique.github.io/r-techtonique-forge/"

# list of packages currently available in the repo for Linux/macOS and Windows
print(miniCRAN::pkgAvail(repos = repo, type = "source"))
print(miniCRAN::pkgAvail(repos = repo, type = "win.binary"))

# installing `pkg_name` from my mirror using the old faithful `install.packages`
install.packages("pkg_name", repos = repo, type = "source")

# using the package as usual
library(pkg_name)

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Submitting R package to CRAN first appeared on R-bloggers.

The history of autonomous vehicle datasets and 3 open-source Python apps for visualizing them

Posted: 15 Oct 2020 10:26 AM PDT

[This article was first published on R – Modern Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Special thanks to Plotly investor, NVIDIA, for their help in reviewing these open-source Dash applications for autonomous vehicle R&D, and Lyft for initial data visualization development in Plotly.

Author: Xing Han Lu, @xhlulu (originally posted on Medium)

📌 To learn more about how to use Dash for Autonomous Vehicle and AI Applications register for our live webinar with Xing Han Lu, Plotly's Lead AI/ML Engineer

Imagine eating your lunch, streaming your favorite TV shows, or chatting with your friends through video call inside the comfort of your car as it manages to drive you from home to work, to an appointment, or to a completely different city hours away. In fact, it would drive you through rain and snow storms, avoid potholes and identify pedestrians crossing the streets in low-light conditions; all this while ensuring that everyone stays safe. To achieve this, automotive manufacturers will need to achieve what is called "level 5 driving automation" (L5), which means that the "automated driving features will not require you to take over driving" and "can drive everywhere in all conditions", as defined by SAE International.

All the levels of driving automation defined by SAE International [Source]

Achieving this level of driving automation has been the dream of many hardware engineers, software developers, and researchers. Whichever automotive manufacturer capable of achieving L5 first would immediately become ahead of the competition. It could also result in safer roads for everyone by mitigating human error, which causes 94% of serious crashes in the United States. For these reasons, various models, datasets, and visualization libraries have been publicly released over the years to strive towards fully autonomous vehicles (AV). In order to support the development of autonomous driving and help researchers better understand how self-driving cars view the world, we present four Dash apps that cover self-driving car trip playbacks and real-time data reading and visualizations, and video frame detection and annotations.

From KITTI to Motional and Lyft: 7 years of open-access datasets for the AV community

Published in 2012 as a research paper, the KITTI Vision benchmark was one of the first publicly available benchmarks for evaluating computer vision models based on automotive navigation. Through 6 hours of car trips around a rural town in Germany, it collected real-time sensor data: GPS coordinates, video feed, and point clouds (through a laser scanner). Additionally, it offered 3D bounding box annotations, optical flow, stereo, and various other tasks. By building models capable of accurately predicting those annotations, researchers could help autonomous vehicles systems accurately localize cars/pedestrians and predict the distance of objects/roads.

The KITTI benchmark, one of the earliest AV datasets released for research [Source]

In 2019, researchers at Motional released nuScenes, an open-access dataset of over 1000 scenes collected in Singapore and Boston. It collected a total of 1.5M colored images and 400k lidar point clouds in various meteorological conditions (rain and night time). To help navigate this dataset, the authors also released a Python devkit for easily retrieving and reading collected sensor data for a given scene. It also included capabilities for plotting static 2D rendering of the point clouds on each image.

In the same year, Lyft released their Level 5 (L5) perception dataset, which encompasses over 350 scenes and over 1.3M bounding boxes. To explore this new dataset, they created a fork of the nuScenes devkit and added conversion tools, video rendering improvements, and interactive visualizations in Plotly. In order to encourage the community to leverage the dataset for building models capable of detecting objects in the 3D world, they released a competition on Kaggle with a prize pool of $25,000.

Lyft's Perception Dataset contains 30k LIDAR point clouds and 1.3M bounding boxes [Source]

Following the contributions of Motional and Lyft, various automotive companies released their own AV datasets, including Argo, Audi, Berkeley, Ford, Waymo, and many others.

Web-based visualization tools for LIDAR point clouds and bounding boxes

The datasets released by academic and industrial researchers are not only large, but also highly complex and require visualization tools capable of handling their multi-modality. Whereas videos could be effectively displayed frame-by-frame, point clouds and 3D bounding boxes require more sophisticated solutions in order to fully display them. Although you can visualize them in two dimensions using top-down views or by projecting them onto video frames, it's hard to fully understand everything that's happening at a given point in time without being able to interact with the data by moving around and zooming in and out.

One library that can be used to solve this problem is deck.gl. Created at Uber and maintained by the vis.gl team, it offers interactive 3D visualization that can be directly integrated with maps through Mapbox. Among the wide range of layer types, you can find point clouds and polygons, which can be respectively used to display and interact with LIDAR points and 3D bounding box annotations.

A demo of deck.gl's Point Cloud layer, built in JavaScript [Source]

Although the various tools offered by deck.gl are extremely customizable and can be used in various applications, you will still need to implement everything by yourself in order to build a user interface for visualizing AV scenes. In order to alleviate this process and accelerate the development of custom AV dashboards, the same team behind deck.gl built streetscape.gl (also known as Autonomous Visualization System, or AVS), a collection of React components that lets you build a fully custom AV dashboard in a few hundred lines of JavaScript. When using data recorded in the XVIZ format, you can pre-generate a complete scene and load it into the client's browser, enabling a smooth playback of the entire segment.

In addition to visualizing point clouds and bounding boxes, you can also create custom objects such as car meshes, record metrics over time like acceleration and velocity, and offer controllable settings to the end user, enabling more fine-grained control directly in the UI.

An overview of the Streetscape.gl components, all available in React [Source]

Although both libraries are extremely useful for AV research, they require the user to have a certain degree of familiarity with JavaScript, React, Node.js, and webpack. This makes it harder for AV researchers and engineers to build customized solutions without a team of developers specialized in front-end technologies, thus slowing down the development cycle of new AV features, and making those visualization tools harder to integrate with existing AV tools and ML libraries already available in Python. For those reasons, we present 3 Dash template apps that can help with R&D of AV software, systems and tools by abstracting, streamlining and unifying the various open-source AV libraries and datasets.

Dash App #1: Autonomous Visualization System (AVS) in Dash

All of Dash's core components, as well as many popular community-made components are built using React.js, which is the same framework for interfacing with streetscape.gl. Therefore, it was possible to first build a React UI app and then wrap it as a custom Dash component using the Dash component boilerplate. This makes your UIs built with streetscape.gl directly available to Dash apps built in Python, R, or Julia.

For this demo, we built a basic UI inspired by Streetscape.gl's official demo and contains many metrics and controls to help understand the scene.

Dash AVS in action. Check out the source code and the demo.

In this Dash AVS app, you can play an interactive clip containing a scene, which is a segment of a car trip with data collected from LIDAR sensors, cameras, GPS, the car itself, and from human annotators (e.g. the bounding box annotations). At any time, you can stop the clip to:

  • move around by dragging the viewer
  • zoom in or out with the scroll wheel
  • tilt or rotate by holding CTRL and dragging with the mouse

By using Dash's own Core Components or community-built components like Dash Bootstrap, you can also create custom components to give the user more control on what is being visualized. You can directly choose various map styles, dataset & scene URLs, and whether you want to use the basic or advanced version of the UI. Given those inputs, you can simply create a BasicUI or an AdvancedUI component and pass the URL containing the logs as an argument.

This entire app is written in less than 200 lines of Python code, and can be easily deployed and scaled through Dash Enterprise's app manager.

Dash App #2: Visualizing the Lyft Perception dataset with dash-deck

Our second app lets you explore a specific scene from the Lyft Perception dataset. You can navigate between the frames by clicking on one of the buttons or by dragging the slider to set the exact time of the scene. Through dash-deck, the interactive viewer displays the point clouds and bounding boxes annotations at a given frame in a similar way to the first app. You can choose between various map views, LIDAR devices, cameras, and toggle various overlays on the image.

Navigate the Lyft Perception dataset in Dash using dash-deck and the nuScenes-Lyft SDK. Check out the source code and the demo.

The difference between this app and the first app becomes clear once you look into the implementation details. If you are looking for more flexibility in your visualizations, and want to use a fully Python-oriented solution, you might be interested in using pydeck, a Python interface for rendering deck.gl views. Although it requires more tweaking and customization, you can have significantly more control during the development of the app. For example, you can decide which color each point cloud will have, the opacity of the point cloud, the exact shape of a mesh object, the starting position of the camera, and the point of view (which can be orbit, first person or map view).

Another major difference is that Dash AVS requires you to first convert your data into XVIZ before serving or streaming a scene to the user. On the other hand, this app can be easily modified to use any possible data format, as long as you can preprocess the input into the format accepted by pydeck or dash-deck. The reason behind that is that everything is done dynamically and in real-time. In fact, every time the user selects a certain frame, this is what happens:

  1. The Lyft/nuScenes SDK retrieves the current frame and loads the point clouds, the annotations and the images captured by the cameras
  2. Pandas is used to preprocess the point clouds
  3. Pydeck constructs the point clouds and polygon layers
  4. Pyquaternion projects the point clouds and boxes onto the image
  5. Dash Deck and Plotly Express respectively render the deck viewer and the image graph

All six of these libraries are accessible through Python. In fact, the first four libraries were not specifically designed to be used in Dash. They just work out of the box because Dash was designed to seamlessly work with most Python use cases. This means that you could easily modify this app to perform real-time processing such as 3D mesh generation or bounding box predictions using PointNet prior to displaying a frame.

Dash App #3: Object detection and editable annotations for video frames

Our third app lets you replay driving scenes from the Berkeley DeepDrive dataset, which contains 1100 hours of driving videos. The video is augmented with 2D bounding box annotations generated by MobileNet v2 and embedded into the video. In addition to replaying the scene, you can stop the video at any time, run the MobileNet algorithm in real time, and interactively edit or add new bounding boxes at that exact timestamp. Once you are happy with the annotations, you can move on to the next frame or the next scene and download the updated and new annotations as CSV files.

Play a video, pause to detect a frame, and edit or add a bounding box, and save the annotations when you are happy. Check out the source code and the demo.

This app is our favorite example of computer vision algorithms being used alongside human annotators to accelerate the data collection process. Since the app is fully built in Python and only uses built-in features from dash-player, Plotly.py, and TensorFlow Hub, you can easily personalize it to use more sophisticated models and fulfill technical annotation requirements that are virtually only limited by your choice of libraries in Python. For example, you could decide to store all the new annotations inside a SQL database (which are automatically included in Dash Enterprise), export them to a cloud storage using a library like boto3, or generate a snapshot of the working session that you can share with other annotators.

More Dash apps for enhancing self-driving cars visualization

In addition to these three apps, we have built various open-source Dash apps that can be used for supporting the development of self-driving cars:

  • Dash Deck Explorer (demo — code): This explorer lets you try all possible layers that you can build using pydeck and render with dash-deck. Short description and source code are included!
  • Dash DETR (demo — code): This app lets you input a URL for an image and apply real-time object detection using the Detection Transformer (DETR), a neural network model created by Facebook AI Research. The code could be easily reused and integrated into various workflows for 2D object detection, such as detecting pedestrians and vehicles inside images using PyTorch hub.
  • Object Detection Explorer (demo — code): Similar to the video detection app, this app lets you replay videos that were previously annotated with a MobileNet detector. Instead of running the model on a frame, it instead reads logs generated by the detector to derive useful insights, such as the number of pedestrians in the image, and a confidence heatmap of objects that are currently present on the screen.
  • Dash Uber Rides (demo — code): Monitoring and reviewing self-driving car trips will become an important task in order to ensure the reliability of the AV system in various conditions and locations. This app lets you visualize Uber trips across New York City and offer various controls for quickly narrowing down on a specific subset of the trips.
Try out various demos with the Dash Deck Explorer.
Run state-of-the-art object detection models in real time with Dash DETR.

Are you interested in building applications for autonomous driving, or interested in using advanced visualization components in Dash? Reach out to learn more about how you can take part in the development of next-generation autonomous driving visualization tools!

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post The history of autonomous vehicle datasets and 3 open-source Python apps for visualizing them first appeared on R-bloggers.


Fermat’s Riddle

Posted: 15 Oct 2020 09:20 AM PDT

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A Fermat-like riddle from the Riddler (with enough room to code on the margin)

An arbitrary positive integer N is to be written as a difference of the squares of two distinct positive integers. What are the impossible cases, and otherwise can you provide a list of all distinct representations?

Since the problem amounts to finding a>b>0 such that

N=a^2-b^2=(a-b)(a+b)

both (a+b) and (a-b) are products of some of the prime factors in the decomposition of N, and both terms must have the same parity for the average a to be an integer. This eliminates decompositions with a single prime factor 2 (and N=1). For the other cases, the following R code (which I could not deposit on tio.run because of the package R.utils!) returns a list of representations

library(R.utils)
library(numbers)

bitz <- function(i, m) # int2bits
  c(rev(as.binary(i)), rep(0, m))[1:m]

ridl = function(n){
  a = primeFactors(n)
  if((n == 1) | (sum(a == 2) == 1)){
    print("impossible")
  }else{
    m = length(a); g = NULL
    for(i in 1:2^m){
      b = bitz(i, m)
      # split the prime factors into two products d < e of the same parity,
      # giving a = (d + e)/2 and b = (e - d)/2
      if(((d <- prod(a[!!b])) %% 2 == (e <- prod(a[!b])) %% 2) & (d < e))
        g = rbind(g, c((d + e)/2, (e - d)/2))
    }
    unique(g)
  }
}

For instance,

> ridl(1456)
     [,1] [,2]
[1,]  365  363
[2,]  184  180
[3,]   95   87
[4,]   59   45
[5,]   40   12
[6,]   41   15

Checking for the most prolific N, up to 10⁶, I found that N=6720=2⁶·3·5·7 produces 20 different decompositions. And that N=887,040=2⁸·3²·5·7·11 leads to 84 distinct differences of squares.
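
As a sanity check on those counts (my own cross-check, not part of the original riddle code), each representation N = a^2 - b^2 corresponds to a factorisation N = de with d < e and d, e of the same parity, so counting divisor pairs gives the same numbers:

# Count representations by counting factorisations N = d * e with d < e
# and d, e of the same parity (a = (e + d)/2, b = (e - d)/2).
count_reps <- function(n) {
  cand <- 1:floor(sqrt(n))
  d <- cand[n %% cand == 0]   # divisors up to sqrt(n)
  e <- n / d
  sum(d < e & d %% 2 == e %% 2)
}
count_reps(6720)    # 20
count_reps(887040)  # 84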

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Fermat's Riddle first appeared on R-bloggers.

Gold-Mining Week 6 (2020)

Posted: 15 Oct 2020 08:26 AM PDT

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Gold-Mining Week 6 (2020) appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Gold-Mining Week 6 (2020) first appeared on R-bloggers.


Hotelling’s T^2 in Julia, Python, and R

Posted: 14 Oct 2020 11:00 AM PDT

[This article was first published on R on Data & The World, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. This could lead to scenarios where individual t-tests suggest that there is no difference, although looking at all variables jointly will show a difference. When a multivariate test is preferred, the obvious choice is the Hotelling's \(T^2\) test.

Hotelling's test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.

Background

Without delving too far into the math – Wikipedia has a more thorough look at it – if we have two samples \(X\) and \(Y\) pulled from the same normal distribution, then the test statistic will be

$$ t^2 = \frac{n_x n_y}{n_x+n_y} (\bar{X} - \bar{Y})^T \Sigma^{-1} (\bar{X} - \bar{Y}) $$

where \(n_x\) is the number of samples in \(X\) (same for \(n_y\)), \(\bar{X}\) is the vector of the column averages in \(X\) (same for \(\bar{Y}\)), and \(\Sigma\) is the pooled covariance matrix. This follows a \(T^2\) distribution, which isn't among the distributions typically found in statistical software. But it turns out that this can be made to follow an F-distribution by multiplying by the right factor:

$$ \frac{n_x + n_y - p - 1}{p(n_x + n_y - 2)} t^2 \sim F(p, n_x+n_y-p-1) $$

where \(p\) is the number of variables in the data. The F-distribution is much more commonplace, so this should be much easier to deal with.
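
For reference, the pooled covariance \(\Sigma\) in the statistic above is the weighted combination of the two sample covariance matrices; this is exactly the S_pooled quantity computed in each implementation below:

$$ \Sigma = \frac{(n_x - 1) S_X + (n_y - 1) S_Y}{n_x + n_y - 2} $$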

Since computing the statistic and p-value requires nothing beyond some linear algebra and probability distributions, it's not too difficult to implement. Thorough implementations of Hotelling's \(T^2\) aren't hard to find, though from what I've seen it's usually in a package that's not standard or common – Python seems to lack one in its major statistics libraries while Julia and R have them as parts of specific libraries (HypothesisTests and ICSNP respectively). But since these aren't too difficult to code, this post will code the two-sample test manually in all three languages. I checked my results against the ICSNP library's test to ensure that things were working correctly.
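
That cross-check is a one-liner if ICSNP is installed; a quick sketch (it assumes the versicolor and virginica data frames constructed in the R section below):

# Cross-check against the ICSNP implementation; assumes the versicolor and
# virginica data frames created in the R section below.
library(ICSNP)
HotellingsT2(versicolor, virginica)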

The Data

For data, we'll use the sepal measurements in the iris dataset. Looking at the data, there are definite differences between the classes, and while there is an extension of Hotelling's test that can handle more than two samples, we'll just use the Versicolor and Virginica species. From the plot below, there does appear to be some difference, and univariate t-tests on each variable strongly suggest that significant differences exist.

Boxplots for two variables in the iris data.

The reason for only using the sepal measurements and those two species is because some other combinations were producing p-values small enough that Julia rendered them as zero, especially when all four variables were used. So I picked something that produced a p-value that all three languages could properly display.

R Implementation

R probably has the most straightforward implementation, in that all of the tools needed to write this are actually imported upon starting up, so no external dependencies exist. I am using the glue library, but that's just for formatting the output of print(), so it's not necessary for computing results.

library(glue)

TwoSampleT2Test <- function(X, Y){
  nx <- nrow(X)
  ny <- nrow(Y)
  delta <- colMeans(X) - colMeans(Y)
  p <- ncol(X)
  Sx <- cov(X)
  Sy <- cov(Y)
  S_pooled <- ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)
  t_squared <- (nx*ny)/(nx+ny) * t(delta) %*% solve(S_pooled) %*% (delta)
  statistic <- t_squared * (nx+ny-p-1)/(p*(nx+ny-2))
  p_value <- pf(statistic, p, nx+ny-p-1, lower.tail=FALSE)
  print(glue("Test statistic: {statistic}
  Degrees of freedom: {p} and {nx+ny-p-1}
  p-value: {p_value}"))
  return(list(TestStatistic=statistic, p_value=p_value))
}

data(iris)
versicolor <- iris[iris$Species == "versicolor", 1:2]
virginica <- iris[iris$Species == "virginica", 1:2]
TwoSampleT2Test(versicolor, virginica)

## Test statistic: 15.8266009919182
## Degrees of freedom: 2 and 97
## p-value: 1.12597832539986e-06

Julia Implementation

Julia is a little bit trickier. It might be because I haven't gone too deep into it, but Julia seems a little less casual about type conversions than R, so there is a need to be mindful of those. In particular, the matrix multiplication step goes smoothly but returns a 1-by-1 array instead of a single numeric value like R does, so the statistic needs to be extracted from the array. There are also a few packages that need to be installed and imported, though these aren't very unusual ones.

using Distributions
using RDatasets
using Statistics

function TwoSampleT2Test(X, Y)
    nx, p = size(X)
    ny, _ = size(Y)
    δ = mean(X, dims=1) - mean(Y, dims=1)                    # difference between mean vectors
    Sx = cov(X)
    Sy = cov(Y)
    S_pooled = ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)             # pooled covariance matrix
    t_squared = (nx*ny)/(nx+ny) * δ * inv(S_pooled) * transpose(δ)
    statistic = t_squared[1,1] * (nx+ny-p-1)/(p*(nx+ny-2))   # extract the scalar and scale to F
    F = FDist(p, nx+ny-p-1)
    p_value = 1 - cdf(F, statistic)
    println("Test statistic: $(statistic)\nDegrees of freedom: $(p) and $(nx+ny-p-1)\np-value: $(p_value)")
    return [statistic, p_value]
end

iris = dataset("datasets", "iris")
versicolor = convert(Matrix, iris[iris.Species .== "versicolor", 1:2])
virginica = convert(Matrix, iris[iris.Species .== "virginica", 1:2])
TwoSampleT2Test(versicolor, virginica)
## Test statistic: 15.826600991918198
## Degrees of freedom: 2 and 97
## p-value: 1.1259783254669031e-6

The results appear to be the same, although there is a very slight difference in the p-values, on the order of \(10^{-16}\) or so; I don't know whether that's an implementation detail or something else. The way Julia handles its arrays is also a little different, since we take the transpose of δ on the right-hand side of the multiplication instead of the left.
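For a sense of scale, a difference of that size is comparable to double-precision machine epsilon, which suggests (though doesn't prove) that it's simply floating-point rounding rather than a real difference between the methods:

# Machine epsilon for double-precision floats in R
.Machine$double.eps
## [1] 2.220446e-16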

Also, if you're not familiar with Julia, the use of the Unicode character for the Greek letter delta (δ) might be surprising. Julia actually can use Unicode characters for variable names, which is pretty neat (though not generally advisable, I think).

Python Implementation

This one is probably the most convoluted of the three, though even it isn't that bad. We again require a couple of fairly common packages. There are two major differences in the Python code compared to the previous languages:

  1. The dataset isn't loaded as a dataframe, but instead as an object of type sklearn.utils.Bunch, with properties that contain the variables and the target class.
  2. Matrix multiplication isn't as built-in as it is in R and Julia, so instead of a dedicated operator we call np.matmul() twice (NumPy's @ operator would also work here).
import numpy as np
from sklearn import datasets
from scipy.stats import f

def TwoSampleT2Test(X, Y):
    nx, p = X.shape
    ny, _ = Y.shape
    delta = np.mean(X, axis=0) - np.mean(Y, axis=0)        # difference between mean vectors
    Sx = np.cov(X, rowvar=False)
    Sy = np.cov(Y, rowvar=False)
    S_pooled = ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)           # pooled covariance matrix
    t_squared = (nx*ny)/(nx+ny) * np.matmul(np.matmul(delta.transpose(), np.linalg.inv(S_pooled)), delta)
    statistic = t_squared * (nx+ny-p-1)/(p*(nx+ny-2))      # scale T^2 to an F statistic
    F = f(p, nx+ny-p-1)
    p_value = 1 - F.cdf(statistic)
    print(f"Test statistic: {statistic}\nDegrees of freedom: {p} and {nx+ny-p-1}\np-value: {p_value}")
    return statistic, p_value

iris = datasets.load_iris()
versicolor = iris.data[iris.target == 1, :2]
virginica = iris.data[iris.target == 2, :2]
TwoSampleT2Test(versicolor, virginica)
## Test statistic: 15.82660099191812
## Degrees of freedom: 2 and 97
## p-value: 1.1259783253558808e-06

We get a third, slightly different p-value, matching neither of the previous implementations exactly; again, the discrepancy is tiny and I don't know precisely where it comes from.

Conclusion

So writing Hotelling's \(T^2\) test yourself isn't too hard. These functions could be generalized to handle one-sample, paired, and other variants as well if desired, and the code above should be a good starting point for that (a rough sketch of a one-sample version follows below).
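To illustrate what such a generalization might look like, here is a minimal, untested sketch of a one-sample version in R, based on the standard textbook formulas (the scaling factor becomes \((n - p)/(p(n - 1))\) and the reference distribution \(F(p,\ n - p)\)); the function name is made up for this example and simply mirrors the two-sample code above:

# Hypothetical one-sample Hotelling's T^2 test: H0 states that the column
# means of X equal the vector mu0
OneSampleT2Test <- function(X, mu0 = rep(0, ncol(X))){
  n <- nrow(X)
  p <- ncol(X)
  delta <- colMeans(X) - mu0
  S <- cov(X)                                     # sample covariance matrix
  t_squared <- n * t(delta) %*% solve(S) %*% delta
  statistic <- t_squared * (n - p)/(p*(n - 1))    # scale T^2 to an F statistic
  p_value <- pf(statistic, p, n - p, lower.tail = FALSE)
  list(TestStatistic = statistic, p_value = p_value)
}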

To leave a comment for the author, please follow the link and comment on their blog: R on Data & The World.


The post Hotelling's T^2 in Julia, Python, and R first appeared on R-bloggers.


QQ-plots and Box-Whisker plots: where do they come from?

Posted: 14 Oct 2020 11:00 AM PDT

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

For the most curious students

QQ-plots and Box-Whisker plots usually become part of the statistical toolbox for the students attending my course of 'Experimental methods in agriculture'. Most of them learn that the QQ-plot can be used to check the basic assumption of gaussian residuals in linear models, and that the Box-Whisker plot can be used to describe the experimental groups when their size is big enough and we do not want to assume a gaussian distribution. Furthermore, most students learn to use the plot() method on an 'lm' object and the boxplot() function in the base 'graphics' package, and concentrate on the interpretation of the R output. To me, in practical terms, this is enough; however, there are at least a couple of students per year who think that this is not enough and want to know more. What is the math behind those useful plots? Can we draw them by hand?

If I were to give further detail about these two types of graphs, I should give a more detailed explanation of percentiles at the very beginning of my course. I usually present the concept, which is rather easy for students to grasp, and I also give an example of how to calculate the 50th percentile (i.e. the median). So far, so good. But what about the 25th and 75th percentiles, which are needed to build the Box-Whisker plot? As a teacher, I must admit I usually skip this aspect; I resort to showing the use of the quantile() function and the students are happy, apart from the same couple per year, who ask: "what is this function doing in the background?".

Usually, there is no time to satisfy this thirst for knowledge. First of all, I need time to introduce other important concepts for agricultural students; second, giving more detail would carry a high risk of being 'beaten' by all the other, less eager, students. Therefore, I decided to put my explanation here, for the benefit of the most curious of my students.

As an example, we will use 11 observations, randomly drawn from a uniform distribution. First of all, we sort them in increasing order.

set.seed(1234)
y <- sort(runif(11))
y
##  [1] 0.009495756 0.113703411 0.232550506 0.514251141 0.609274733 0.622299405
##  [7] 0.623379442 0.640310605 0.666083758 0.693591292 0.860915384

The P-values of observations

Let's imagine that our eleven observations are percentiles, although we do not know which percentage points they correspond to. Well, we know something, at least: we have an odd number of observations, so the central value (0.622299405) is the median, i.e. the 50th percentile. But what about the other values? In other words, we are looking for the P-values (probability points) associated with each observation.

In order to determine the P-values, we divide the whole scale from 0 to 100% into as many intervals as there are values in our sample, that is eleven intervals. Each interval contains \(1 / 11 \times 100 = 9.09\%\) of the whole scale: the first interval goes from 0% to 9.09%, the second goes from 9.09% to 18.18% and so on, until the 11th interval, that goes from 90.91% to 100%.
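Just to make those boundaries explicit, they can be generated with a one-liner (this snippet is not in the original post; it only spells out the arithmetic above):

# The twelve endpoints of the eleven equally spaced probability intervals
seq(0, 1, length.out = 12)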

Now, each value in the ordered sample is associated with one probability interval. The problem is: where do we put the value within each interval? A possible line of attack, which is the default in R when \(n > 10\), is to put each value in the middle of its interval, as shown in the Figure below.

Consequently, each point corresponds to the following P-values:

(1:11 - 0.5)/11
##  [1] 0.04545455 0.13636364 0.22727273 0.31818182 0.40909091 0.50000000
##  [7] 0.59090909 0.68181818 0.77272727 0.86363636 0.95454545

In general:

\[p_i = \frac{i - 0.5}{n} \]

where \(n\) is the number of values and \(i\) is the index of each value in the sorted vector. In other words, the first value is the 4.55th percentile, the second value is the 13.64th percentile and so on, up to the 11th value, which is the 95.45th percentile.

Unfortunately, the calculation of P-values is not unique and there are other ways to locate the values within each interval. A second possibility is to put each value close to the beginning of its interval. In this case, the corresponding P-values can be calculated as:

(1:11 - 1)/(11 - 1)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

In general:

\[p_i = \frac{i - 1}{n - 1} \]

According to this definition, the first value is the 0th percentile, the second is the 10th percentile and so on.

A third possibility is to put each value at the end of its interval. In this case, each point corresponds to the following P-values:

1:11/11
##  [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
##  [7] 0.63636364 0.72727273 0.81818182 0.90909091 1.00000000

In general, it is:

\[p_i = \frac{i}{n} \]

Although it is not the default in R, this third approach is perhaps the most understandable, as it is most closely related to the definition of percentiles: it is clear that 9.09% of the subjects are equal to or lower than the first one, and that 18.18% of the subjects are equal to or lower than the second one.
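Indeed, this third series of P-values is exactly what the empirical cumulative distribution function returns at each sorted observation; this quick check (not in the original post) reproduces the series 1:11/11 shown above:

# The empirical CDF evaluated at the (sorted) data returns i/n
ecdf(y)(y)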

Several other possibilities exist, but we will not explore them here. The most common function in R to calculate probability points is ppoints(), which gives different results according to the selection of the 'a' argument. In detail, with \(a = 0.5\) (the default when \(n > 10\)) we obtain the first series of P-values (each value placed in the middle of its interval) while, with \(a = 1\), we obtain the second series of P-values (each value placed at the beginning of its interval).

ppoints(y)
##  [1] 0.04545455 0.13636364 0.22727273 0.31818182 0.40909091 0.50000000
##  [7] 0.59090909 0.68181818 0.77272727 0.86363636 0.95454545
ppoints(y, a = 1)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

From ppoints() to the QQ-plot

P-values are used to draw quantile-quantile plots. Let's imagine that we want to know whether the eleven values in the vector 'y' can be regarded as a sample from a gaussian distribution. To this aim, we can standardise them and plot them against the corresponding quantiles of a gaussian distribution, which we derive from the P-values determined above by using the qnorm() function. The output is exactly the same as that produced by a call to the qqnorm() function.

x.coord <- qnorm(ppoints(y))
y.coord <- scale(y, scale = T)
plot(y.coord ~ x.coord, main = "Manual QQ-plot",
     xlab = "Theoretical quantiles", ylab = "Standardised values")
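If you want to convince yourself that the theoretical quantiles really match those of qqnorm(), a quick comparison (not in the original post) should return TRUE:

# qqnorm() uses the same probability points for n = 11
qq <- qqnorm(y, plot.it = FALSE)
all.equal(as.numeric(qq$x), x.coord)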

Percentiles

Percentiles can be calculated by solving the three equations above for \(i\). For example, from the first equation, with simple math, we derive:

\[i = n p_i + 0.5\]

which gives the index of the required percentile. If we look for the 30th percentile, the index is \(11 \times 0.30 + 0.5 = 3.8\), i.e. this percentile lies between the 3rd and the 4th values, which are, respectively, 0.232550506 and 0.514251141. We can find the exact percentile by a sort of 'weighted average' of these two values, using the decimal part of the index (0.8) as the weight:

(1 - 0.8) * y[3] + 0.8 * y[4]
## [1] 0.457911

We can get the same result by using the quantile() function in R and setting the 'type = 5' argument.

quantile(y, 0.3, type = 5)
##      30% 
## 0.457911

If we start from the second equation, we derive:

\[i = p_i \times (n - 1) + 1\]

For the 30th percentile, the index is \(0.3 \times (11 - 1) + 1 = 4\). Thus, the required percentile is exactly the value in 4th position (0.5142511). With R, we can use the same quantile() function, but with the 'type = 7' argument, which is the default.

quantile(y, 0.3)
##       30% 
## 0.5142511

If we take the third equation above, we derive:

\[i = n \times p_i\]

and thus the index for the 30th percentile is \(11 \times 0.3 = 3.3\). We use the same kind of 'weighted average' to retrieve the exact percentile:

(1 - 0.3) * y[3] + 0.3 * y[4]
## [1] 0.3170607

With R:

quantile(y, 0.3, type = 4)
##       30% 
## 0.3170607

If we use the above math to derive the 25th, the 50th and the 75th percentiles, we are ready to draw our boxplot by hand (a short sketch follows below). Please consider that the boxplot() function in base R computes its hinges with fivenum(); for an odd sample size such as ours, these coincide with the default quantiles ('type = 7').
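A rough sketch of that manual check (not in the original post) might look like this; for these eleven values, the middle three numbers from boxplot.stats() should match the three quartiles:

# Quartiles needed to draw the box by hand
quantile(y, c(0.25, 0.5, 0.75))

# What boxplot() actually uses: whisker ends, hinges and median
boxplot.stats(y)$stats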

That's all for today. If you have any comments, please drop me a note at andrea.onofri@unipg.it. Best wishes,

Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.


The post QQ-plots and Box-Whisker plots: where do they come from? first appeared on R-bloggers.

