[R-bloggers] The Bachelorette Ep. 1 – Every Rose has its Thorn – Data Analysis in R (and 10 more aRticles)
- The Bachelorette Ep. 1 – Every Rose has its Thorn – Data Analysis in R
- Free Workshops for Meetup Groups
- Hack: How to Install and Load Packages Dynamically
- Nuclei segmentation in R with Platypus
- Raspberry Pi E-Paper Dashboard with R
- Submitting R package to CRAN
- The history of autonomous vehicle datasets and 3 open-source Python apps for visualizing them
- Fermat’s Riddle
- Gold-Mining Week 6 (2020)
- Hotelling’s T^2 in Julia, Python, and R
- QQ-plots and Box-Whisker plots: where do they come from?
The Bachelorette Ep. 1 – Every Rose has its Thorn – Data Analysis in R Posted: 16 Oct 2020 05:32 AM PDT
[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to R-bloggers]. We all love a good love story. In The Bachelor TV franchise, millions gather to see who will be the lucky winner of the contestant's heart. But what about those who don't make it? What do they have in common? Please check out our dashboard (built with Shiny) at https://stoltzmaniac.shinyapps.io/TheBacheloretteApp/. Night 1 of The Bachelorette saw 23 men continue on to night 2, while 8 men were sent packing. Based on our highly scientific rating system here at Stoltzman Consulting, we found that Clare has a preferred type. Let's dig into the data. Below, we see that Clare prefers men with "Bro-y" and "Vanit-y" characteristics, while she is less interested in men with "Right Reason-y" and "Kind Heart-y" characteristics. The "Drama-y" level does not appear to impact her decision making. Clare also appears to prefer men with a stronger social media presence. Those who have not been cut appear on the "tall and slender" side of the chart below, while those who were eliminated appear on the "short and stout" side. Dale Moss had to be excluded from the chart because his follower count (over 180K) was far higher than everyone else's – he is still an active participant. We are looking forward to seeing where this season goes. Apparently, it is the most dramatic season yet! Remember to check out our live analytics dashboard at https://stoltzmaniac.shinyapps.io/TheBacheloretteApp/ to build models to predict the winner and see current Twitter and Instagram trends.
For those interested, the code is below (the data is available upon request; just visit the contact us section of this site).
library(tidyverse)

GLOBAL_DATA = get_database_data()
saveRDS(GLOBAL_DATA, 'GLOBAL_DATA.rds')

# Contestant circular bar chart
all_contestant_data = GLOBAL_DATA$contestant_data_raw

suitor_data = all_contestant_data %>%
  filter(!instagram %in% c('tayshiaaa', 'chrisbharrison', 'clarecrawley')) %>%
  mutate(status = as.factor(
    case_when(
      end_episode > latest_episode ~ 'Active',
      TRUE ~ 'Eliminated'
    )
  )) %>%
  select(status, everything())

coord_plot_data = suitor_data %>%
  select(status:`Right Reason-y`) %>%
  group_by(status) %>%
  pivot_longer(`Vanit-y`:`Right Reason-y`, names_to = "characteristic") %>%
  group_by(characteristic, status) %>%
  summarize(avg = mean(value), .groups = 'drop')

coord_plot_data %>%
  ggplot(aes(x = characteristic)) +
  geom_col(aes(y = avg, col = status, fill = status), position = 'dodge') +
  geom_text(aes(y = avg, label = characteristic),
            data = coord_plot_data %>% filter(status == 'Active'),
            size = 4, position = position_stack(vjust = 1.4)) +
  coord_polar() +
  ggtitle('Preferred Suitor Characteristics') +
  theme_minimal() +
  scale_color_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_fill_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  theme(legend.position = c(0.5, 0.95),
        legend.direction = "horizontal",
        legend.title = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1))

# Contestant instagram
insta_follower_data = GLOBAL_DATA$insta_followers_w_losers %>%
  drop_na() %>%
  mutate(relative_air_date = case_when(
    datetime <= '2020-10-12' ~ 'Pre Air',
    TRUE ~ 'Aired')
  ) %>%
  left_join(suitor_data, by = 'name') %>%
  drop_na()

insta_follower_data %>%
  filter(relative_air_date == 'Pre Air', follower_count < 1e5) %>%
  ggplot(aes(x = status, y = follower_count, col = status, fill = status)) +
  geom_violin(alpha = 0) +
  ggtitle('Pre-Air Instagram Followers', subtitle = "*Excludes Dale Moss") +
  theme_minimal() +
  scale_color_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_fill_manual("legend", values = c("Active" = "pink", "Eliminated" = "darkgrey")) +
  scale_y_continuous(label = scales::comma) +
  theme(legend.position = 'top',
        legend.direction = "horizontal",
        legend.title = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
Free Workshops for Meetup Groups Posted: 16 Oct 2020 12:37 AM PDT
[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. For the last few years we've offered automatic sponsorship for meet-ups and satRday events. However, for obvious COVID-related reasons, most (all?) meet-ups have been getting together virtually, so the need for extra pizza money has diminished. As with most organisations, we've had to adapt to the new online-first environment, in particular by running primarily online training courses. We've always run some online events, but now we're running in excess of ten days of training per month on topics ranging from R and Python to Git, Shiny and Stan. So far our average course rating is a healthy 4.7 out of 5! With this new-found experience in online training, and our regret at not getting to interact with useR groups from around the world, we thought that one way we could contribute back to the community would be to offer free training workshops. Potential topics fall into two groups. Introductory sessions, where the background would be kept to a minimum, e.g.
Or more advanced topics that would require some prerequisites.
If you are interested, then please complete this short form and we'll get in touch. Jumping Rivers are full-service, RStudio certified partners. Part of our role is to offer support for RStudio Pro products. If you use any RStudio Pro products, feel free to contact us (info@jumpingrivers.com). We may be able to offer free support.
Hack: How to Install and Load Packages Dynamically Posted: 15 Oct 2020 09:28 PM PDT
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. When we share an R script file with someone else, we assume that they have already installed the required R packages. However, this is not always the case, and for that reason I strongly suggest adding this piece of code to every shared R script that requires a package. Let's assume that your code requires the following three packages: "readxl", "dplyr", "multcomp". The script below checks whether each package exists; if not, it installs it and finally loads it into R:

mypackages <- c("readxl", "dplyr", "multcomp")

for (p in mypackages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
    library(p, character.only = TRUE)
  }
}
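As an aside (my addition, not from the original post), the same check can be wrapped in a reusable helper. This is a minimal sketch: the function name install_and_load is my own, and it uses requireNamespace() so the existence check itself doesn't attach the package.

# Hypothetical helper following the same pattern: install whatever is
# missing, then attach everything.
install_and_load <- function(pkgs) {
  missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
  if (length(missing) > 0) install.packages(missing)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

install_and_load(c("readxl", "dplyr", "multcomp"))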
Nuclei segmentation in R with Platypus Posted: 15 Oct 2020 12:00 PM PDT
[This article was first published on R on Data Science Guts, and kindly contributed to R-bloggers]. In my previous post I introduced you to my latest project, platypus – an R package for object detection and image segmentation. This time I will go into more detail and show you how to use it on biomedical data.

The data

Today we will work on the 2018 Data Science Bowl dataset.
After downloading the data, unpack it and move it to your preferred destination. For this example we will only be interested in the stage1_train and stage1_test directories. Before we start, let's investigate a little bit.

library(tidyverse)
library(platypus)
library(abind)
library(here)

# Print current working directory
here()
# [1] "/home/maju116/Desktop/PROJECTS/Moje Projekty/platypus"

# Set directories with the data and models
data_path <- here("examples/data/data-science-bowl-2018/")
models_path <- here("examples/models/")

# Investigate one instance of data (image + masks)
sample_image_path <- here("examples/data/data-science-bowl-2018/stage1_train/00071198d059ba7f5914a526d124d28e6d010c92466da21d4a04cd5413362552/")
list.files(sample_image_path, full.names = TRUE) %>%
  set_names(basename(.)) %>%
  map(~ list.files(.))
# $images
# [1] "00071198d059ba7f5914a526d124d28e6d010c92466da21d4a04cd5413362552.png"
#
# $masks
# [1] "07a9bf1d7594af2763c86e93f05d22c4d5181353c6d3ab30a345b908ffe5aadc.png"
# [2] "0e548d0af63ab451616f082eb56bde13eb71f73dfda92a03fbe88ad42ebb4881.png"
# [3] "0ea1f9e30124e4aef1407af239ff42fd6f5753c09b4c5cac5d08023c328d7f05.png"
# [4] "0f5a3252d05ecdf453bdd5e6ad5322c454d8ec2d13ef0f0bf45a6f6db45b5639.png"
# [5] "2c47735510ef91a11fde42b317829cee5fc04d05a797b90008803d7151951d58.png"
# [6] "4afa39f2a05f9884a5ff030d678c6142379f99a5baaf4f1ba7835a639cb50751.png"
# [7] "4bc58dbdefb2777392361d8b2d686b1cc14ca310e009b79763af46e853e6c6ac.png"
# [8] "4e3b49fb14877b63704881a923365b68c1def111c58f23c66daa49fef4b632bf.png"
# [9] "5522143fa8723b66b1e0b25331047e6ae6eeec664f7c8abeba687e0de0f9060a.png"
# [10] "58656859fb9c13741eda9bc753c3415b78d1135ee852a194944dee88ab70acf4.png"
# [11] "6442251746caac8fc255e6a22b41282ffcfabebadbd240ee0b604808ff9e3383.png"
# [12] "7ff04129f8b6d9aaf47e062eadce8b3fcff8b4a29ec5ad92bca926ac2b7263d2.png"
# [13] "8bbec3052bcec900455e8c7728d03facb46c880334bcc4fb0d1d066dd6c7c5d2.png"
# [14] "9576fe25f4a510f12eecbabfa2e0237b98d8c2622b9e13b9a960e2afe6da844e.png"
# [15] "95deddb72b845b1a1f81a282c86e666045da98344eaa2763d67e2ab80bc2e5c3.png"
# [16] "a1b0cdb21f341af17d86f23596df4f02a6b9c4e0d59a7f74aaf28b9e408a4e4c.png"
# [17] "aa154c70e0d82669e9e492309bd00536d2b0f6eeec1210014bbafbfc554b377c.png"
# [18] "acba6646e8250aab8865cd652dfaa7c56f643267ea2e774aee97dc2342d879d6.png"
# [19] "ae00049dc36a1e5ffafcdeadb44b18a9cd6dfd459ee302ab041337529bd41cf2.png"
# [20] "af4d6ff17fa7b41de146402e12b3bab1f1fe3c1e6f37da81a54e002168b1e7dd.png"
# [21] "b0cbc2c553f9c4ac2191395236f776143fb3a28fb77b81d3d258a2f45361ca89.png"
# [22] "b6fc3b5403de8f393ca368553566eaf03d5c07148539bc6141a486f1d185f677.png"
# [23] "be98de8a7ba7d5d733b1212ae957f37b5b69d0bf350b9a5a25ba4346c29e49f7.png"
# [24] "cb53899ef711bce04b209829c61958abdb50aa759f3f896eb7ed868021c22fb4.png"
# [25] "d5024b272cb39f9ef2753e2f31344f42dd17c0e2311c4927946bc5008d295d2e.png"
# [26] "f6eee5c69f54807923de1ceb1097fc3aa902a6b20d846f111e806988a4269ed0.png"
# [27] "ffae764df84788e8047c0942f55676c9663209f65da943814c6b3aca78d8e7f7.png"

As you can see, each image has its own directory with two subdirectories inside: images, containing the original image, and masks, containing one binary mask per nucleus.
For the modeling, besides the train and test sets, we will also need a validation set (no one is forcing you, but it's good practice!):

train_path <- here("examples/data/data-science-bowl-2018/stage1_train/")
test_path <- here("examples/data/data-science-bowl-2018/stage1_test/")
validation_path <- here("examples/data/data-science-bowl-2018/stage1_validation/")

if (!dir.exists(validation_path)) {
  dir.create(validation_path)
  # List train images
  train_samples <- list.files(train_path, full.names = TRUE)
  set.seed(1234)
  # Select 10% for validation
  validation_samples <- sample(train_samples, round(0.1 * length(train_samples)))
  validation_samples %>% walk(~ system(paste0('mv "', ., '" "', validation_path, '"')))
}

Semantic segmentation with U-Net

Now that we know something about our data, we can move on to the modeling part. We will start by selecting the architecture of the neural network. For semantic segmentation there are a few different choices, like U-Net, Fast-FCN, DeepLab and many more. For the time being, the platypus package gives you access only to the U-Net architecture. U-Net was originally developed for biomedical image segmentation. As you can see in the picture above, the architecture is very similar to an autoencoder and it looks like the letter U, hence the name. The model is composed of two parts, and each part has some number of convolutional blocks (3 in the image above). The number of blocks will be a hyperparameter in our model. To build a U-Net model, use the u_net function:
blocks <- 4 # Number of U-Net convolutional blocks
n_class <- 2 # Number of classes
net_h <- 256 # Must be a power of 2 (2^N)
net_w <- 256 # Must be a power of 2 (2^N)
grayscale <- FALSE # Will input image be in grayscale or RGB

DCB2018_u_net <- u_net(
  net_h = net_h,
  net_w = net_w,
  grayscale = grayscale,
  blocks = blocks,
  n_class = n_class,
  filters = 16,
  dropout = 0.1,
  batch_normalization = TRUE,
  kernel_initializer = "he_normal"
)

After that it's time to select a loss and additional metrics. Because semantic segmentation is, in essence, classification for each pixel instead of the whole image, you can use categorical cross-entropy as a loss function and accuracy as a metric. Another common choice, available in platypus, is the dice coefficient:

DCB2018_u_net %>% compile(
  optimizer = optimizer_adam(lr = 1e-3),
  loss = loss_dice(),
  metrics = metric_dice_coeff()
)

The next step will be data ingestion. As you remember, we have a separate directory and multiple masks for each image. That's not a problem for platypus:

binary_colormap
# [[1]]
# [1] 0 0 0
#
# [[2]]
# [1] 255 255 255

train_DCB2018_generator <- segmentation_generator(
  path = train_path, # directory with images and masks
  mode = "nested_dirs", # Each image with masks in separate folder
  colormap = binary_colormap,
  only_images = FALSE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = TRUE,
  subdirs = c("images", "masks") # Names of subdirs with images and masks
)
# 603 images with corresponding masks detected!
# Set 'steps_per_epoch' to: 19

validation_DCB2018_generator <- segmentation_generator(
  path = validation_path, # directory with images and masks
  mode = "nested_dirs", # Each image with masks in separate folder
  colormap = binary_colormap,
  only_images = FALSE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = TRUE,
  subdirs = c("images", "masks") # Names of subdirs with images and masks
)
# 67 images with corresponding masks detected!
# Set 'steps_per_epoch' to: 3

We can now fit the model.

history <- DCB2018_u_net %>% fit_generator(
  train_DCB2018_generator,
  epochs = 20,
  steps_per_epoch = 19,
  validation_data = validation_DCB2018_generator,
  validation_steps = 3,
  callbacks = list(callback_model_checkpoint(
    filepath = file.path(models_path, "DSB2018_w.hdf5"),
    save_best_only = TRUE,
    save_weights_only = TRUE,
    monitor = "dice_coeff",
    mode = "max",
    verbose = 1)
  )
)

And calculate predictions for the new images. Our model will return a 4-dimensional array (number of images, height, width, number of classes). Each pixel will have N probabilities, where N is the number of classes. To transform the raw predictions into a segmentation map (by selecting the class with the maximum probability for each pixel) you can use the get_masks function:

test_DCB2018_generator <- segmentation_generator(
  path = test_path,
  mode = "nested_dirs",
  colormap = binary_colormap,
  only_images = TRUE,
  net_h = net_h,
  net_w = net_w,
  grayscale = FALSE,
  scale = 1 / 255,
  batch_size = 32,
  shuffle = FALSE,
  subdirs = c("images", "masks")
)
# 65 images detected!
# Set 'steps_per_epoch' to: 3

test_preds <- predict_generator(DCB2018_u_net, test_DCB2018_generator, 3)
dim(test_preds)
# [1] 65 256 256 2

test_masks <- get_masks(test_preds, binary_colormap)
dim(test_masks[[1]])
# [1] 256 256 3

To visualize the predicted masks alongside the original images you can use the plot_masks function:

test_imgs_paths <- create_images_masks_paths(test_path, "nested_dirs", FALSE, c("images", "masks"), ";")$images_paths

plot_masks(
  images_paths = test_imgs_paths[1:4],
  masks = test_masks[1:4],
  labels = c("background", "nuclei"),
  colormap = binary_colormap
)
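For reference (my addition, not from the original post): the Dice coefficient behind loss_dice() and metric_dice_coeff() measures the overlap between a predicted mask \(X\) and a ground-truth mask \(Y\),

\[ Dice(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|} \]

and the corresponding loss is typically \(1 - Dice\), so that maximizing overlap minimizes the loss.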
Raspberry Pi E-Paper Dashboard with R Posted: 15 Oct 2020 11:00 AM PDT
[This article was first published on schochastics, and kindly contributed to R-bloggers]. For once, this is not a post on an R package for network analysis or random R code. I have always been intrigued by the Raspberry Pi, but never really engaged with it. This is what you need to build such a dashboard yourself:

- a Raspberry Pi
- a Waveshare 7.5 inch e-paper display (the python code below uses the epd7in5_V2 driver)
- a picture frame to hold everything together
E-Paper displays are kind of awesome since they can display an image without using any electricity (once it is loaded). In the rest of this post, I'll walk through all the steps necessary to set up the display to be used with R code.

The handy stuff

Simply fit the display into the frame and connect it to the Raspberry Pi. That's pretty much it. (Yes, I used rubber bands to attach the Raspberry Pi to the frame…)

Installing R on the Raspberry Pi

Installing R on the Raspberry Pi turned out to be trickier than I thought. I was only able to install version 3.6 from the default repositories. I ended up compiling R from source, which was surprisingly simple. I loosely followed the guide posted here:

wget https://cran.rstudio.com/src/base/R-4/R-4.0.2.tar.gz
tar zxvf R-4.0.2.tar.gz
# uncomment first line in /etc/apt/sources.list
sudo apt-get build-dep r-base
./configure --prefix=/opt/R/4.0.2 --enable-R-shlib --with-blas --with-lapack

If you run into an error about missing PCRE2 support, install it first:

sudo apt-get install libpcre2-dev

Then build, install, and link the binaries:

sudo make
sudo make install
sudo ln -s /opt/R/4.0.2/bin/R /usr/local/bin/R
sudo ln -s /opt/R/4.0.2/bin/Rscript /usr/local/bin/Rscript

And that's already it. You should now have R ready to run on your Raspberry Pi.

Setting up the E-Paper display

I simply followed the guide posted on the waveshare wiki.

Enable SPI

sudo raspi-config
# Choose Interfacing Options -> SPI -> Yes
Install BCM2835 libraries

Check what the newest version is. At the time of writing, it was 1.68.

wget http://www.airspayce.com/mikem/bcm2835/bcm2835-1.68.tar.gz
tar zxvf bcm2835-1.68.tar.gz
cd bcm2835-1.68/
sudo ./configure
sudo make
sudo make check
sudo make install

Install wiringPi

sudo apt-get install wiringpi

If you are on a Raspberry Pi 4, you'll need to update it:

wget https://project-downloads.drogon.net/wiringpi-latest.deb
sudo dpkg -i wiringpi-latest.deb

Communicating with the display from R

The Raspberry Pi can easily communicate with the display via a C or python library.

# python3
sudo apt-get update
sudo apt-get install python3-pip
sudo apt-get install python3-pil
sudo apt-get install python3-numpy
sudo pip3 install RPi.GPIO
sudo pip3 install spidev

You'll also need git:

sudo apt-get install git -y

Then, clone the repository of waveshare that includes the needed libraries.

sudo git clone https://github.com/waveshare/e-Paper

The library we will need is in the folder RaspberryPi&JetsonNano/python/lib.

python code for communication

The python code below is used to send the bitmap file screen-output.bmp to the display.

#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
import os
import time

libdir = "/home/pi/e-Paper/RaspberryPi&JetsonNano/python/lib"
if os.path.exists(libdir):
    sys.path.append(libdir)

from waveshare_epd import epd7in5_V2
from PIL import Image, ImageDraw, ImageFont

try:
    epd = epd7in5_V2.EPD()
    epd.init()
    Himage = Image.open("screen-output.bmp")
    epd.display(epd.getbuffer(Himage))
    epd.sleep()
except KeyboardInterrupt:
    epd7in5_V2.epdconfig.module_exit()
    exit()

Note that I am not a python expert, so this code may well be awful (but it works). The code below shows how to call the python script from R:

library(reticulate)
use_python("/usr/bin/python3")
py_run_file("display.py")

Build the dashboard in R

Now that we can communicate with the display from python and know how to use python from within R, we can put everything together. To reproduce my dashboard setup, simply clone the repository:

git clone https://github.com/schochastics/e-Paper-dashboard.git

To use the code, you need to get an API key from openweathermap.org and set it up accordingly. Of course you do not have to follow my build. The following code snippet should be sufficient to build your own dashboard from scratch.

library(reticulate)
# load additional libraries, such as e.g. ggplot2
use_python("/usr/bin/python3")
# build your plot object p
ggsave("raw-output.bmp", p, width = 5, height = 3, dpi = 160)
system("convert -colors 2 +dither -type Bilevel -monochrome raw-output.bmp screen-output.bmp")
py_run_file("display.py")

The convert call turns the image into the two-color bitmap the display expects; convert is part of ImageMagick, which you may need to install:

sudo apt-get install imagemagick

Here is an example dashboard with my code from github.
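To make the "build your plot object" placeholder above concrete, here is a minimal sketch of a dashboard panel – my own illustration, not code from the linked repository. It renders the current time and date with ggplot2 and pushes it to the display through the same convert/display.py pipeline as above.

library(ggplot2)
library(reticulate)

# A minimal "dashboard": the current time and date as a single panel.
p <- ggplot() +
  annotate("text", x = 0, y = 0.6,
           label = format(Sys.time(), "%H:%M"), size = 30) +
  annotate("text", x = 0, y = 0.25,
           label = format(Sys.Date(), "%A, %d %B %Y"), size = 8) +
  xlim(-1, 1) + ylim(0, 1) +
  theme_void()

ggsave("raw-output.bmp", p, width = 5, height = 3, dpi = 160)
system("convert -colors 2 +dither -type Bilevel -monochrome raw-output.bmp screen-output.bmp")
use_python("/usr/bin/python3")
py_run_file("display.py")  # the python script shown earlier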
Submitting R package to CRAN Posted: 15 Oct 2020 11:00 AM PDT
[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. Disclaimer: I have no affiliation with Microsoft Corp. or Revolution Analytics. For the n-th time in x years, submitting an R package to CRAN ended up being a frustrating experience. Literally no other online repository (PyPI, conda-forge, the Julia package registry, etc.) works this way. It seems to me like, under the hood, more is going on than the automated checks, and I know exactly what to do to remove the note. How does miniCRAN feel to me? Like exactly what I was looking for. And for someone who's been using R for 15+ years (I wonder why it's studied and taught in university though), it was delightful to be able to do this for the first time:

# my mirror of packages created with miniCRAN
repo <- "https://techtonique.github.io/r-techtonique-forge/"

# list of packages currently available in the repo for Linux/macOS and Windows
print(miniCRAN::pkgAvail(repos = repo, type = "source"))
print(miniCRAN::pkgAvail(repos = repo, type = "win.binary"))

# installing `pkg_name` from my mirror using the old faithful `install.packages`
install.packages("pkg_name", repos = repo, type = "source")

# using the package as usual
library(pkg_name)
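For readers who want to set up such a mirror themselves, here is a minimal sketch of miniCRAN's documented workflow – my own illustration, not code from the post; the package name and local path are placeholders.

# Resolve the dependency tree of the packages you want to mirror,
# then write a CRAN-like repository structure to a local folder.
pkgs <- c("glue")  # placeholder: packages to mirror
pkg_list <- miniCRAN::pkgDep(pkgs, repos = "https://cloud.r-project.org",
                             suggests = FALSE)
dir.create("my-cran-mirror", showWarnings = FALSE)
miniCRAN::makeRepo(pkg_list, path = "my-cran-mirror",
                   repos = "https://cloud.r-project.org",
                   type = c("source", "win.binary"))

Serving that folder over any static web host (e.g. GitHub Pages, as above) is then enough for install.packages(..., repos = ...) to work.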
The history of autonomous vehicle datasets and 3 open-source Python apps for visualizing them Posted: 15 Oct 2020 10:26 AM PDT
[This article was first published on R – Modern Data, and kindly contributed to R-bloggers]. Special thanks to Plotly investor NVIDIA for their help in reviewing these open-source Dash applications for autonomous vehicle R&D, and to Lyft for initial data visualization development in Plotly. Author: Xing Han Lu, @xhlulu (originally posted on Medium)
Imagine eating your lunch, streaming your favorite TV shows, or chatting with your friends through video call inside the comfort of your car as it drives you from home to work, to an appointment, or to a completely different city hours away. In fact, it would drive you through rain and snow storms, avoid potholes and identify pedestrians crossing the streets in low-light conditions; all this while ensuring that everyone stays safe. To achieve this, automotive manufacturers will need to reach what is called "level 5 driving automation" (L5), which means that the "automated driving features will not require you to take over driving" and "can drive everywhere in all conditions", as defined by SAE International. Achieving this level of driving automation has been the dream of many hardware engineers, software developers, and researchers. Whichever automotive manufacturer achieves L5 first would immediately be ahead of the competition. It could also result in safer roads for everyone by mitigating human error, which causes 94% of serious crashes in the United States. For these reasons, various models, datasets, and visualization libraries have been publicly released over the years to strive towards fully autonomous vehicles (AV). In order to support the development of autonomous driving and help researchers better understand how self-driving cars view the world, we present three Dash apps that cover self-driving car trip playbacks, real-time data reading and visualization, and video frame detection and annotation.

From KITTI to Motional and Lyft: 7 years of open-access datasets for the AV community

Published in 2012 as a research paper, the KITTI Vision benchmark was one of the first publicly available benchmarks for evaluating computer vision models based on automotive navigation. Through 6 hours of car trips around a rural town in Germany, it collected real-time sensor data: GPS coordinates, video feed, and point clouds (through a laser scanner). Additionally, it offered 3D bounding box annotations, optical flow, stereo, and various other tasks. By building models capable of accurately predicting those annotations, researchers could help autonomous vehicle systems accurately localize cars and pedestrians and predict the distance of objects and roads. In 2019, researchers at Motional released nuScenes, an open-access dataset of over 1000 scenes collected in Singapore and Boston. It collected a total of 1.5M colored images and 400k lidar point clouds in various meteorological conditions (rain and night time). To help navigate this dataset, the authors also released a Python devkit for easily retrieving and reading collected sensor data for a given scene. It also included capabilities for plotting static 2D renderings of the point clouds on each image. In the same year, Lyft released their Level 5 (L5) perception dataset, which encompasses over 350 scenes and over 1.3M bounding boxes. To explore this new dataset, they created a fork of the nuScenes devkit and added conversion tools, video rendering improvements, and interactive visualizations in Plotly. In order to encourage the community to leverage the dataset for building models capable of detecting objects in the 3D world, they released a competition on Kaggle with a prize pool of $25,000. Following the contributions of Motional and Lyft, various automotive companies released their own AV datasets, including Argo, Audi, Berkeley, Ford, Waymo, and many others.
Web-based visualization tools for LIDAR point clouds and bounding boxes

The datasets released by academic and industrial researchers are not only large, but also highly complex, and they require visualization tools capable of handling their multi-modality. Whereas videos can be effectively displayed frame-by-frame, point clouds and 3D bounding boxes require more sophisticated solutions in order to be fully displayed. Although you can visualize them in two dimensions using top-down views or by projecting them onto video frames, it's hard to fully understand everything that's happening at a given point in time without being able to interact with the data by moving around and zooming in and out. One library that can be used to solve this problem is deck.gl. Created at Uber and maintained by the vis.gl team, it offers interactive 3D visualizations that can be directly integrated with maps through Mapbox. Among the wide range of layer types, you can find point clouds and polygons, which can be respectively used to display and interact with LIDAR points and 3D bounding box annotations. Although the various tools offered by deck.gl are extremely customizable and can be used in various applications, you still need to implement everything yourself in order to build a user interface for visualizing AV scenes. In order to alleviate this process and accelerate the development of custom AV dashboards, the same team behind deck.gl built streetscape.gl (also known as the Autonomous Visualization System, or AVS), a collection of React components that lets you build a fully custom AV dashboard in a few hundred lines of JavaScript. When using data recorded in the XVIZ format, you can pre-generate a complete scene and load it into the client's browser, enabling a smooth playback of the entire segment. In addition to visualizing point clouds and bounding boxes, you can also create custom objects such as car meshes, record metrics over time like acceleration and velocity, and offer controllable settings to the end user, enabling more fine-grained control directly in the UI. Although both libraries are extremely useful for AV research, they require the user to have a certain degree of familiarity with JavaScript, React, Node.js, and webpack. This makes it harder for AV researchers and engineers to build customized solutions without a team of developers specialized in front-end technologies, thus slowing down the development cycle of new AV features and making those visualization tools harder to integrate with existing AV tools and ML libraries already available in Python. For those reasons, we present three Dash template apps that can help with R&D of AV software, systems and tools by abstracting, streamlining and unifying the various open-source AV libraries and datasets.

Dash App #1: Autonomous Visualization System (AVS) in Dash

All of Dash's core components, as well as many popular community-made components, are built using React.js, which is the same framework streetscape.gl interfaces with. Therefore, it was possible to first build a React UI app and then wrap it as a custom Dash component using the Dash component boilerplate. This makes your UIs built with streetscape.gl directly available to Dash apps built in Python, R, or Julia. For this demo, we built a basic UI inspired by streetscape.gl's official demo that contains many metrics and controls to help understand the scene.
In this Dash AVS app, you can play an interactive clip containing a scene, which is a segment of a car trip with data collected from LIDAR sensors, cameras, GPS, the car itself, and from human annotators (e.g. the bounding box annotations). At any time, you can stop the clip and explore the scene in detail.
By using Dash's own Core Components or community-built components like Dash Bootstrap, you can also create custom controls to give the user more control over what is being visualized. You can directly choose various map styles, dataset & scene URLs, and whether you want to use the basic or advanced version of the UI. Given those inputs, you can simply create the viewer and let Dash handle the updates. This entire app is written in less than 200 lines of Python code, and can be easily deployed and scaled through Dash Enterprise's app manager.

Dash App #2: Visualizing the Lyft Perception dataset with dash-deck

Our second app lets you explore a specific scene from the Lyft Perception dataset. You can navigate between the frames by clicking on one of the buttons or by dragging the slider to set the exact time of the scene. Through dash-deck, the interactive viewer displays the point clouds and bounding box annotations at a given frame in a similar way to the first app. You can choose between various map views, LIDAR devices, and cameras, and toggle various overlays on the image. The difference between this app and the first becomes clear once you look into the implementation details. If you are looking for more flexibility in your visualizations, and want a fully Python-oriented solution, you might be interested in pydeck, a Python interface for rendering deck.gl views. Although it requires more tweaking and customization, it gives you significantly more control during the development of the app. For example, you can decide which color each point cloud will have, the opacity of the point cloud, the exact shape of a mesh object, the starting position of the camera, and the point of view (which can be orbit, first person or map view). Another major difference is that Dash AVS requires you to first convert your data into XVIZ before serving or streaming a scene to the user. This app, on the other hand, can be easily modified to use any data format, as long as you can preprocess the input into the format accepted by pydeck or dash-deck. The reason is that everything is done dynamically and in real time: every time the user selects a certain frame, the point clouds and annotations for that frame are retrieved, processed and rendered on the fly.
All of the libraries involved are accessible through Python, and most of them were not specifically designed to be used in Dash; they just work out of the box because Dash was designed to work seamlessly with most Python use cases. This means that you could easily modify this app to perform real-time processing such as 3D mesh generation or bounding box prediction using PointNet prior to displaying a frame.

Dash App #3: Object detection and editable annotations for video frames

Our third app lets you replay driving scenes from the Berkeley DeepDrive dataset, which contains 1100 hours of driving videos. The video is augmented with 2D bounding box annotations generated by MobileNet v2 and embedded into the video. In addition to replaying the scene, you can stop the video at any time, run the MobileNet algorithm in real time, and interactively edit or add new bounding boxes at that exact timestamp. Once you are happy with the annotations, you can move on to the next frame or the next scene and download the updated and new annotations as CSV files. This app is our favorite example of computer vision algorithms being used alongside human annotators to accelerate the data collection process. Since the app is fully built in Python and only uses built-in features from dash-player, Plotly.py, and TensorFlow Hub, you can easily personalize it to use more sophisticated models and fulfill technical annotation requirements that are virtually only limited by your choice of Python libraries. For example, you could decide to store all the new annotations inside a SQL database (which is automatically included in Dash Enterprise), export them to cloud storage using a library like boto3, or generate a snapshot of the working session that you can share with other annotators.

More Dash apps for enhancing self-driving car visualization

In addition to these three apps, we have built various open-source Dash apps that can be used to support the development of self-driving cars.
Are you interested in building applications for autonomous driving, or interested in using advanced visualization components in Dash? Reach out to learn more about how you can take part in the development of next-generation autonomous driving visualization tools!
Fermat’s Riddle Posted: 15 Oct 2020 09:20 AM PDT
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. · A Fermat-like riddle from the Riddler (with enough room to code on the margin)
The problem amounts to finding a > b > 0 such that N = (a+b)(a-b), i.e. such that both (a+b) and (a-b) are products of some of the prime factors in the decomposition of N; both terms must also have the same parity for the average a to be an integer. This eliminates decompositions with a single prime factor 2 (and N=1). For the other cases, the following R code (which I could not deposit on tio.run because of the packages R.utils!) returns a list of solutions:

library(R.utils)
library(numbers)

bitz <- function(i, m) # int2bits
  c(rev(as.binary(i)), rep(0, m))[1:m]

ridl = function(n){
  a = primeFactors(n)
  if((n == 1) | (sum(a == 2) == 1)){
    print("impossible")
  } else {
    m = length(a); g = NULL
    for(i in 1:2^m){
      b = bitz(i, m)
      # the closing condition was truncated in the original post; a plausible
      # completion: keep factor pairs with matching parity and d != e,
      # recording the solution (a, b) = ((d+e)/2, |d-e|/2)
      if(((d <- prod(a[!!b])) %% 2 == (e <- prod(a[!b])) %% 2) & (d != e))
        g = rbind(g, c((d + e)/2, abs(d - e)/2))
    }
    unique(g)
  }
}
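As a quick sanity check of the factor-pairing logic (my own example, not from the post): for \(N = 45 = 3 \times 3 \times 5\), the pair \(d = 5\), \(e = 9\) has matching (odd) parity, giving \(a = (5+9)/2 = 7\) and \(b = (9-5)/2 = 2\), and indeed \(7^2 - 2^2 = 49 - 4 = 45\).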
Gold-Mining Week 6 (2020) Posted: 15 Oct 2020 08:26 AM PDT
[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. The post Gold-Mining Week 6 (2020) appeared first on Fantasy Football Analytics.
Hotelling’s T^2 in Julia, Python, and R Posted: 14 Oct 2020 11:00 AM PDT
[This article was first published on R on Data & The World, and kindly contributed to R-bloggers]. The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. This could lead to scenarios where individual t-tests suggest that there is no difference, although looking at all variables jointly would show one. When a multivariate test is preferred, the obvious choice is Hotelling's \(T^2\) test. Hotelling's test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.

Background

Without delving too far into the math – Wikipedia has a more thorough look at it – if we have two samples \(X\) and \(Y\) pulled from the same normal distribution, then the test statistic will be

$$ t^2 = \frac{n_x n_y}{n_x+n_y} (\bar{X} - \bar{Y})^T \Sigma^{-1} (\bar{X} - \bar{Y}) $$

where \(n_x\) is the number of samples in \(X\) (same for \(n_y\)), \(\bar{X}\) is the vector of the column averages in \(X\) (same for \(\bar{Y}\)), and \(\Sigma\) is the pooled covariance matrix. This follows a \(T^2\) distribution, which isn't among the distributions typically found in statistical software. But it turns out that this can be made to follow an F-distribution by multiplying by the right factor:

$$ \frac{n_x + n_y - p - 1}{p(n_x + n_y - 2)} t^2 \sim F(p, n_x+n_y-p-1) $$

where \(p\) is the number of variables in the data. The F-distribution is much more commonplace, so this should be much easier to deal with. Since computing the statistic and p-value requires nothing beyond some linear algebra and probability distributions, it's not too difficult to implement. Thorough implementations of Hotelling's \(T^2\) aren't hard to find, though from what I've seen it's usually in a package that's not standard or common – Python seems to lack one in its major statistics libraries, while Julia and R have them as parts of specific libraries.

The Data

For data, we'll use the sepal measurements in the iris dataset. Looking at the data, there are definite differences between the classes, and while there is an extension of Hotelling's test that can handle more than two samples, we'll just use the Versicolor and Virginica species. From the plot below, there does appear to be some difference, and univariate t-tests on each variable strongly suggest that significant differences exist. The reason for only using the sepal measurements and those two species is that some other combinations were producing p-values small enough that Julia rendered them as zero, especially when all four variables were used. So I picked something that produced a p-value that all three languages could properly display.

R Implementation

R probably has the most straightforward implementation, in that all of the tools needed to write this are actually imported upon startup, so no external dependencies exist.
I am using the glue library to format the printed output.

library(glue)

TwoSampleT2Test <- function(X, Y){
  nx <- nrow(X)
  ny <- nrow(Y)
  delta <- colMeans(X) - colMeans(Y)
  p <- ncol(X)
  Sx <- cov(X)
  Sy <- cov(Y)
  S_pooled <- ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)
  t_squared <- (nx*ny)/(nx+ny) * t(delta) %*% solve(S_pooled) %*% (delta)
  statistic <- t_squared * (nx+ny-p-1)/(p*(nx+ny-2))
  p_value <- pf(statistic, p, nx+ny-p-1, lower.tail = FALSE)
  print(glue("Test statistic: {statistic}
Degrees of freedom: {p} and {nx+ny-p-1}
p-value: {p_value}"))
  return(list(TestStatistic = statistic, p_value = p_value))
}

data(iris)
versicolor <- iris[iris$Species == "versicolor", 1:2]
virginica <- iris[iris$Species == "virginica", 1:2]
TwoSampleT2Test(versicolor, virginica)
## Test statistic: 15.8266009919182
## Degrees of freedom: 2 and 97
## p-value: 1.12597832539986e-06

Julia Implementation

Julia is a little bit trickier. It might be because I haven't gone too deep into it, but Julia seems a little less casual about type conversions than R, so there is a need to be mindful of those. In particular, the matrix multiplication step goes smoothly but returns a 1-by-1 array instead of a single numeric value like R does, so the statistic needs to be extracted from the array. There are also a few packages that need to be installed and imported, though these aren't very unusual ones.

using Distributions
using RDatasets
using Statistics

function TwoSampleT2Test(X, Y)
    nx, p = size(X)
    ny, _ = size(Y)
    δ = mean(X, dims=1) - mean(Y, dims=1)
    Sx = cov(X)
    Sy = cov(Y)
    S_pooled = ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)
    t_squared = (nx*ny)/(nx+ny) * δ * inv(S_pooled) * transpose(δ)
    statistic = t_squared[1,1] * (nx+ny-p-1)/(p*(nx+ny-2))
    F = FDist(p, nx+ny-p-1)
    p_value = 1 - cdf(F, statistic)
    println("Test statistic: $(statistic)\nDegrees of freedom: $(p) and $(nx+ny-p-1)\np-value: $(p_value)")
    return([statistic, p_value])
end

iris = dataset("datasets", "iris")
versicolor = convert(Matrix, iris[iris.Species .== "versicolor", 1:2])
virginica = convert(Matrix, iris[iris.Species .== "virginica", 1:2])
TwoSampleT2Test(versicolor, virginica)
## Test statistic: 15.826600991918198
## Degrees of freedom: 2 and 97
## p-value: 1.1259783254669031e-6

The results appear to be the same, although there is a very slight difference in the p-values, on the order of \(10^{-16}\) or so. I don't know whether that's an implementation detail or something else. The way Julia handles its arrays is a little different as well, since we take the transpose of δ on the right side of the multiplication instead of the left. Also, if you're not familiar with Julia, the use of the Unicode character for the Greek letter delta (δ) might be surprising. Julia actually can use Unicode characters for variable names, which is pretty neat (though not generally advisable, I think).

Python Implementation

This one is probably the most convoluted of the three, though even it isn't that bad. We again require a couple of fairly common packages. There are two major differences in the Python code compared to the previous languages: numpy's cov treats each row as a variable by default, so it needs rowvar=False, and the matrix multiplications are written out as nested np.matmul calls.
import numpy as np
from sklearn import datasets
from scipy.stats import f

def TwoSampleT2Test(X, Y):
    nx, p = X.shape
    ny, _ = Y.shape
    delta = np.mean(X, axis=0) - np.mean(Y, axis=0)
    Sx = np.cov(X, rowvar=False)
    Sy = np.cov(Y, rowvar=False)
    S_pooled = ((nx-1)*Sx + (ny-1)*Sy)/(nx+ny-2)
    t_squared = (nx*ny)/(nx+ny) * np.matmul(np.matmul(delta.transpose(), np.linalg.inv(S_pooled)), delta)
    statistic = t_squared * (nx+ny-p-1)/(p*(nx+ny-2))
    F = f(p, nx+ny-p-1)
    p_value = 1 - F.cdf(statistic)
    print(f"Test statistic: {statistic}\nDegrees of freedom: {p} and {nx+ny-p-1}\np-value: {p_value}")
    return statistic, p_value

iris = datasets.load_iris()
versicolor = iris.data[iris.target==1, :2]
virginica = iris.data[iris.target==2, :2]
TwoSampleT2Test(versicolor, virginica)
## Test statistic: 15.82660099191812
## Degrees of freedom: 2 and 97
## p-value: 1.1259783253558808e-06

We have a third, slightly different p-value than both previous implementations. Again, I don't really know what's going on.

Conclusion

So writing a Hotelling's \(T^2\) test yourself isn't too hard. These functions could be generalized to handle one-sample, paired, and other variants as well if desired; this should prove to be a good start if so.
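As a cross-check (my own addition, not part of the original post), the hand-rolled R function above can be compared against an off-the-shelf implementation. For example, the ICSNP package exposes a two-sample Hotelling's test whose F-scaled statistic and p-value should match the ones computed above.

# Assumes the ICSNP package is installed; HotellingsT2() defaults to the
# F-distributed version of the test statistic.
library(ICSNP)
data(iris)
versicolor <- as.matrix(iris[iris$Species == "versicolor", 1:2])
virginica  <- as.matrix(iris[iris$Species == "virginica", 1:2])
HotellingsT2(versicolor, virginica)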
QQ-plots and Box-Whisker plots: where do they come from? Posted: 14 Oct 2020 11:00 AM PDT
[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers].

For the most curious students

QQ-plots and Box-Whisker plots usually become part of the statistical toolbox for the students attending my course of 'Experimental methods in agriculture'. Most of them learn that the QQ-plot can be used to check for the basic assumption of gaussian residuals in linear models, and that the Box-Whisker plot can be used to describe the experimental groups when their size is big enough and we do not want to assume a gaussian distribution. Furthermore, most students learn to use the relevant R functions to draw them. If I were to give further detail about these two types of graphs, I should give a more detailed explanation of percentiles at the very beginning of my course. I usually present the concept, which is rather easy to grasp for students, and I also give an example of how to calculate the 50th percentile (i.e. the median). So far, so good. But what about the 25th and 75th percentiles, which are needed to build the Box-Whisker plot? As a teacher, I must admit I usually skip this aspect; I resort to showing the use of the quantile() function. Usually, there is no time to satisfy the above thirst for knowledge. First of all, I need time to introduce other important concepts for agricultural students; second, giving more detail would imply a high risk of being 'beaten' by all the other, less eager, students. Therefore, I decided to put my explanation here, to the benefit of the most curious of my students. As an example, we will use 11 observations, randomly selected from a uniform distribution. First of all, we sort them in increasing order.

set.seed(1234)
y <- sort(runif(11))
y
## [1] 0.009495756 0.113703411 0.232550506 0.514251141 0.609274733 0.622299405
## [7] 0.623379442 0.640310605 0.666083758 0.693591292 0.860915384

The P-values of observations

Let's try to imagine that our eleven observations are percentiles, although we do not know what their percentage points are. Well, we know something, at least; we have an odd number of observations and we know that the central value (0.622299405) is the median, i.e. the 50th percentile. But what about the other values? In other words, we are looking for the P-values associated with each observation (probability points). In order to determine the P-values, we divide the whole scale from 0 to 100% into as many intervals as there are values in our sample, that is eleven intervals. Each interval contains \(1 / 11 \times 100 = 9.09\%\) of the whole scale: the first interval goes from 0% to 9.09%, the second goes from 9.09% to 18.18% and so on, until the 11th interval, which goes from 90.91% to 100%. Now, each value in the ordered sample is associated with one probability interval. The problem is: where do we put the value within each interval? A possible line of attack, which is the default in R with \(n > 10\), is to put the value in the middle of the interval, as shown in the figure below. Consequently, each point corresponds to the following P-values:

(1:11 - 0.5)/11
## [1] 0.04545455 0.13636364 0.22727273 0.31818182 0.40909091 0.50000000
## [7] 0.59090909 0.68181818 0.77272727 0.86363636 0.95454545

In general:

\[p_i = \frac{i - 0.5}{n} \]

where \(n\) is the number of values and \(i\) is the index of each value in the sorted vector.
In other words, the first value is the 4.5th percentile, the second value is the 13.64th percentile and so on, until the 11th value, which is the 95.45th percentile. Unfortunately, the calculation of P-values is not unique, and there are other ways to locate the values within each interval. A second possibility is to put the value close to the beginning of each interval. In this case, the corresponding P-values can be calculated as:

(1:11 - 1)/(11 - 1)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

In general:

\[p_i = \frac{i - 1}{n - 1} \]

According to this definition, the first value is the 0th percentile, the second is the 10th percentile and so on. A third possibility is to put the value at the end of each interval. In this case, each point corresponds to the following P-values:

1:11/11
## [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
## [7] 0.63636364 0.72727273 0.81818182 0.90909091 1.00000000

In general, it is:

\[p_i = \frac{i}{n} \]

Although it is not the default in R, this third approach is, perhaps, the most understandable, as it is more closely related to the definition of percentiles. It is clear that 9.09% of the subjects are equal to or lower than the first one, and that 18.18% of the subjects are equal to or lower than the second one. Several other possibilities exist, but we will not explore them here. However, the most common function in R to calculate probability points is the ppoints() function:

ppoints(y)
## [1] 0.04545455 0.13636364 0.22727273 0.31818182 0.40909091 0.50000000
## [7] 0.59090909 0.68181818 0.77272727 0.86363636 0.95454545

ppoints(y, a = 1)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
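To connect this back to the QQ-plot in the title, here is a minimal sketch (my own addition, not from the original post) of how these probability points turn into a QQ-plot by hand: the empirical quantiles are simply the sorted data, and the theoretical quantiles are obtained by passing the ppoints() probabilities through the inverse of the reference distribution.

set.seed(1234)
y <- sort(rnorm(11))  # a sample to check against a gaussian distribution
p <- ppoints(y)       # probability points, as above
theo <- qnorm(p)      # theoretical quantiles from the reference distribution

plot(theo, y,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)          # points near this line support the gaussian assumption

# qqnorm() uses the same ppoints() machinery under the hood
qqnorm(y); qqline(y)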