[R-bloggers] Which Technology Should I Learn? (and 7 more aRticles) |
- Which Technology Should I Learn?
- Why R? Webinar – Development pipeline for R production – rZYPAD
- Z is for Additional Axes
- Expert opinion (again)
- Highlights of Hugo Code Highlighting
- Nina and John Speaking at Why R? Webinar Thursday, May 7, 2020
- Movie Recommendation With Recommenderlab
- Testing for Covid-19 in the U.S.
Which Technology Should I Learn? Posted: 30 Apr 2020 09:00 AM PDT [This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.) Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.
Knowing where to start can be challenging, but we're here to help. Read on to learn more about where to begin your data science and analytics journey.
Data science and analytics languages
If you're new to data science and analytics, or your organization is, you'll need to pick a language to analyze your data, and a thoughtful way to make that decision. Read our blog post and tutorial to learn how to choose between the two most popular languages for data science, Python and R, or read on for a brief summary.
Python
Python is one of the world's most popular programming languages. It is production-ready, meaning it can serve as a single tool that integrates with every part of your workflow. So whether you want to build a web application or a machine learning model, Python can get you there!
R
R has been used primarily in academia and research, but in recent years enterprise usage has expanded rapidly. Built specifically for working with data, R provides an intuitive interface to the most advanced statistical methods available today.
SQL
Much of the world's raw data lives in organized collections of tables called relational databases. Data analysts and data scientists must know how to wrangle and extract data from these databases using SQL.
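As a brief, hedged illustration of that workflow, here is how one might run SQL from R; the DBI and RSQLite packages and the ratings table are assumptions for this sketch, not part of the original post.

```r
# Querying a relational database from R with SQL.
# Assumes the DBI and RSQLite packages; the table is made up.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ratings",
             data.frame(user  = c(1, 1, 2),
                        movie = c("A", "B", "A"),
                        score = c(4.5, 3.0, 5.0)))

# Wrangle and extract with plain SQL: average score per movie
avg_scores <- dbGetQuery(con, "
  SELECT movie, AVG(score) AS avg_score
  FROM   ratings
  GROUP  BY movie
")
print(avg_scores)

dbDisconnect(con)
```

The same dbGetQuery() call works unchanged against PostgreSQL, SQL Server, or Oracle once the connection object points at the right driver.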
Databases
Data scientists, analysts, and engineers must constantly interact with databases, which can store vast amounts of information in tables without slowing down performance. You can use SQL to query data from databases and to model different phenomena in your data and the relationships between them. Find out the differences between the most popular databases in our blog post, or read on for a summary.
Microsoft SQL Server
PostgreSQL
Oracle Database
Spreadsheets
Spreadsheets are used across the business world to transform mountains of raw data into clear insights by organizing, analyzing, and storing data in tables. Microsoft Excel and Google Sheets are the most popular spreadsheet programs, with a flexible structure that allows data to be entered in the cells of a table.
Google Sheets
Microsoft Excel
Business intelligence tools
Business intelligence (BI) tools make data discovery accessible to all skill levels, not just advanced analytics professionals. They are one of the simplest ways to work with data, providing the tools to collect data in one place, gain insight into what will move the needle, forecast outcomes, and much more.
Tableau
Tableau is data visualization software that works like a supercharged Microsoft Excel. Its user-friendly drag-and-drop functionality makes it simple for anyone to access, analyze, and create highly impactful data visualizations.
Microsoft Power BI
Microsoft Power BI allows users to connect to and transform raw data, add calculated columns and measures, create simple visualizations, and combine them into interactive reports.
To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.
Why R? Webinar – Development pipeline for R production – rZYPAD Posted: 30 Apr 2020 07:00 AM PDT [This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. April 30th (8:00 pm GMT+2) is the next date for a webinar on the Why R? Foundation YouTube channel. We will have a great talk by Lorenzo Braschi from Roche IT. The title of the meeting is rZYPAD: Development pipeline for R production. See you at the webinar! Details
Next talks
Previous talks
- Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R) – Recent changes in R spatial and how to be ready for them. Video
- Heidi Seibold, Department of Statistics, University of Munich (in collaboration with the LMU Open Science Center) – Teaching Machine Learning online. Video
- Olgun Aydin – PwC Poland – Introduction to shinyMobile. Video
- Achim Zeileis, Universität Innsbruck – R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video
Stay up to date
To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.
Z is for Additional Axes Posted: 30 Apr 2020 07:00 AM PDT [This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. Here we are at the last post in Blogging A to Z! Today, I want to talk about adding additional axes to your ggplot, using the options for fill or color. While these aren't true z-axes in the geometric sense, I think of them as a third, z, axis. Some of you may be surprised to learn that fill and color are different, and that you can use one or both in a given plot. Color refers to the outline of the object (bar, pie-chart wedge, etc.), while fill refers to the inside of the object. For scatterplots, the default shape doesn't have a fill, so you'd just use color to change the appearance of those points. Let's recreate the pages read over 2019 chart, but this time I'll use only fiction books and separate them as either fantasy or other fiction; this divides that dataset pretty evenly in half. To generate the pages read over time separately by those two genre categories, I'd start from library(tidyverse), load the data with reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv"), and build the filtered dataset with fantasy <- reads2019 %>% … Now I'd plug that information into my ggplot code, but add a third variable in the aesthetics (aes) for ggplot: color = Fantasy, after loading library(scales) and assigning myplot <- fantasy %>% … This plot uses the default R color scheme. I could change those colors, using an existing color scheme or defining my own. Let's make a FiveThirtyEight-style figure, using their theme for the overall plot and their color scheme for the genre variable, with library(ggthemes) and then myplot + … I can also specify my own colors. geom_point offers many point shapes; shapes 21 to 25 allow you to specify both color and fill, while for the rest you can only use color.
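The code bodies were lost in this digest; the following is a minimal sketch of what the post describes, where the column names (Date, Pages, Fiction, Fantasy) and the recoding step are assumptions, not the author's actual code.

```r
library(tidyverse)
library(scales)

# Assumed layout: one row per book, with the date read, a page count,
# and indicator columns for fiction and fantasy (all names hypothetical)
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv")

fantasy <- reads2019 %>%
  filter(Fiction == 1) %>%
  mutate(Fantasy = if_else(Fantasy == 1, "Fantasy", "Other Fiction"))

# color = Fantasy adds the third, "z", axis to the scatterplot
myplot <- fantasy %>%
  ggplot(aes(x = Date, y = Pages, color = Fantasy)) +
  geom_point() +
  scale_x_date(labels = date_format("%b %Y"))

myplot

# FiveThirtyEight-style theme and colors from ggthemes
library(ggthemes)
myplot +
  theme_fivethirtyeight() +
  scale_color_fivethirtyeight()
```

Swapping color = Fantasy for fill = Fantasy would style the inside of shapes instead of their outline, which matters for bars and for point shapes 21 to 25.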
library(ggpubr) and ggpubr::show_point_shapes() will display the available shapes. Of course, you may have plots where changing fill is best, such as a bar plot. In my summarize example, I created a stacked bar chart of fiction versus non-fiction with author gender as the fill, starting from reads2019 %>% … Stacking is the default, but I could also place the bars next to each other, again starting from reads2019 %>% … You can also use fill (or color) with the same variable you used for x or y; that is, instead of having it be a third scale, it can add some color and separation to distinguish categories of the x or y variable. This is especially helpful if you have multiple categories being plotted, because it helps break up the wall of bars. If you do this, I'd recommend choosing a color palette with highly complementary colors rather than highly contrasting ones; you probably also want to drop the legend, though, since the axis will already be labeled. For that version, I'd build genres <- reads2019 %>% … If you only have a couple of categories and want to draw a contrast, that's when you can use contrasting shades: for instance, at work, when I plot performance on an item, I use red for incorrect and blue for correct to maximize the contrast between the two performance levels for whatever data I'm presenting. I hope you enjoyed this series! There's so much more you can do with the tidyverse than what I covered this month. Hopefully this has given you enough to get started and sparked your interest to learn more. Once again, I highly recommend checking out R for Data Science. To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
Expert opinion (again) Posted: 29 Apr 2020 05:00 PM PDT [This article was first published on R | Gianluca Baio, and kindly contributed to R-bloggers]. This is the second video I was mentioning here; it took a while to get out, but it's available now. I think you need to register here, and then you can see our panel discussion. Like I said earlier, it was good fun, and I think the actual session we did at ISPOR last year was very well received. It's a shame that we can't build on the momentum at the next R-HTA (which, I think, we're going to have to postpone, given the COVID-19 emergency…). To leave a comment for the author, please follow the link and comment on their blog: R | Gianluca Baio.
Highlights of Hugo Code Highlighting Posted: 29 Apr 2020 05:00 PM PDT [This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. Thanks to a quite overdue update of Hugo on our build system1, our website can now harness the full power of Hugo code highlighting for Markdown-based content.
Make your code look pretty
If you notice and appreciate the difference between an unhighlighted code block and a highlighted one, you might agree with Mara Averick's opinion,
Syntax highlighting means some elements of code blocks, like functions, operators, comments, etc., get styled differently: they could be colored or italicized. Now, how do the colors of the second block appear? First of all, it's a code block with language information, in this case R (note the language identifier after the three opening backticks, as opposed to a fence without language information, which won't get highlighted – although some syntax highlighting tools, though not Hugo's Chroma, do some guessing). There are in general two ways in which colors are added to code blocks: client-side syntax highlighting and server-side syntax highlighting.
Client-side syntax highlighting
In this sub-section I'll mostly refer to highlight.js, but the principles probably apply to other client-side syntax highlighting tools. If we look at a post of Mara Averick's at the time of writing, the html of a block as served is just a plain code element, without any styling. Now, using the Firefox Developer Console, we see the colors come from CSS classes starting with "hljs". And in the head of that page (examined via "View source"), there is a script tag that loads and applies highlight.js to the page. When using highlight.js on your website, you might need to specify R as a supplementary language in your config, since some languages are bundled by default whilst others are not. A big downside of client-side syntax highlighting is loading time.
Server-side syntax highlighting
In server-side syntax highlighting, with say Pygments or Chroma (Hugo's default), your website html as served already has styling information. With Chroma, that styling information is either inline style attributes on the highlighted elements, or CSS classes.
The html source for one of the blocks of the page screenshot above uses inline styles. The style used is indicated in the website config and picked from the Chroma style gallery.
The html of the block seen above uses CSS classes instead, and it goes hand in hand with having styling for the different ".chroma" classes in our website CSS. To get this behaviour, our website config sets pygmentsUseClasses, which confusingly enough uses the name "Pygments", not Chroma, for historical reasons.
To generate a stylesheet for a given style, use Hugo's gen chromastyles command. How does Chroma know what parts of the code are of the string class, for instance? It relies on language-specific lexers to tokenize the code. Chroma works on Markdown content, so if you use blogdown to generate pages as html, you can only use client-side highlighting, like this tidyverse.org page whose source is html. We have now seen how Hugo websites get syntax highlighting, which for Yihui Xie "is only for cosmetic purposes".
Emphasize parts of your code
With Chroma, you can apply special options to code blocks defined with fences, i.e. starting with three backticks and language info, and ending with three backticks5. Among the Chroma options is line highlighting: see how a fence carrying an hl_lines option is rendered below, with lines 1 and 4 to 5 highlighted. There are also options related to line numbering. Adding linenos=table with linenostart=3 gives a code block with lines numbered as a table (easier for copy-pasting the code without the line numbers), starting from number 3.
You can also configure line numbering for your whole website. The real magic to me is that if you write your code from R Markdown, you can still get these highlighting options into the rendered fences, via a knitr hook.
knitr hook to highlight lines of source code
Our hook rewrites each chunk's source fence so that it carries the line-highlighting options; the chunk6 is then rendered with the requested lines highlighted.
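The hook itself was lost in this digest; as a rough sketch (the chunk option name hl_lines and the fence attribute syntax are assumptions here, not the rOpenSci authors' actual hook), a knitr source hook along these lines could pass the options through:

```r
library(knitr)

# Sketch of a knitr source hook that forwards Chroma line-highlighting
# options to the Hugo fence. `hl_lines` is a hypothetical chunk option.
knit_hooks$set(source = function(x, options) {
  hl    <- options$hl_lines
  attrs <- if (is.null(hl)) "" else
    sprintf(" {hl_lines=[%s]}", paste(hl, collapse = ","))
  fence <- strrep("`", 3)   # build ``` without nesting fences literally
  paste0(fence, "r", attrs, "\n",
         paste(x, collapse = "\n"),
         "\n", fence)
})
```

A chunk declared with, say, hl_lines = c(1, "4-5") would then emit a fence that Chroma renders with those lines highlighted.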
Produce line-highlighted code blocks with Hugo and knitr!
Nina and John Speaking at Why R? Webinar Thursday, May 7, 2020 Posted: 29 Apr 2020 02:52 PM PDT [This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. Nina Zumel and John Mount will be speaking on advanced data preparation for supervised machine learning at the Why R? Webinar on Thursday, May 7, 2020. The talk is at 8pm in a GMT+2 timezone, which for us is 11am Pacific Time. Hope to see you there! To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
Movie Recommendation With Recommenderlab Posted: 29 Apr 2020 12:05 AM PDT [This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers].
Because You Are Interested In Data Science, You Are Interested In This Blog Post
If you love streaming movies and TV series online as much as we do here at STATWORX, you've probably stumbled upon recommendations like "Customers who viewed this item also viewed…" or "Because you have seen …, you like …". Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior. But how do these companies know what their customers like? The answer is collaborative filtering. In this blog post, I will first explain how collaborative filtering works. Secondly, I'm going to show you how to develop your own small movie recommender with the R package recommenderlab.
Different Approaches
There are several approaches to making a recommendation. In user-based collaborative filtering (UBCF), the users are the focus of the recommendation system. For a new proposal, the similarities between new and existing users are first calculated. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are formed over these users and, if necessary, weighted according to their similarity. Then the x highest-rated products are displayed to the new user as suggestions. In item-based collaborative filtering (IBCF), however, the focus is on the products. For every two products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested.
Those and other collaborative filtering methods are implemented in the recommenderlab package.
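As a hedged sketch of that API (not the post's exact code), UBCF and IBCF recommenders can be trained and queried like this, using the MovieLense sample data that ships with recommenderlab:

```r
# Train UBCF and IBCF recommenders on the bundled MovieLense ratings
library(recommenderlab)
data(MovieLense)

ubcf <- Recommender(MovieLense, method = "UBCF")
ibcf <- Recommender(MovieLense, method = "IBCF")

# Top-5 movie suggestions for the first user
pred <- predict(ubcf, MovieLense[1, ], n = 5)
as(pred, "list")
```

The method argument switches between the approaches described above; similarity measures and neighborhood sizes are passed through the param list of Recommender().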
Developing your own Movie Recommender
Dataset
To create our recommender, we use the data from MovieLens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9,000 films from more than 600 users. The movieId is a unique mapping variable used to merge the different datasets. To better understand the film ratings, we display the number of different ratings and the average rating per film. We see that in most cases there is no rating by a given user. Furthermore, the average ratings contain a lot of "smoothed" values: movies that have only a few individual ratings, so the average score is determined by individual users. In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings. Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user. You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.
Model Training and Evaluation
To train our recommender and subsequently evaluate it, we carry out 10-fold cross-validation. Also, we train both an IBCF and a UBCF recommender, which in turn calculate the similarity measure via cosine similarity and Pearson correlation. A random recommendation is used as a benchmark. To evaluate how many recommendations should be given, different numbers are tested via a vector of values. We then display the results graphically for analysis. We see that the best-performing model is built by using UBCF with the Pearson correlation as the similarity measure. This model consistently achieves the highest true-positive rate for the various false-positive rates and thus delivers the most relevant recommendations.
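The evaluation described above can be sketched with recommenderlab's evaluation tools; the filtering thresholds mirror the text, but the concrete parameters (given, goodRating, the vector of n values) are assumptions for this illustration.

```r
library(recommenderlab)
data(MovieLense)

# Keep movies and users with at least 50 ratings each, as in the text
ratings <- MovieLense[rowCounts(MovieLense) >= 50,
                      colCounts(MovieLense) >= 50]

# 10-fold cross-validation scheme (parameter choices are illustrative)
scheme <- evaluationScheme(ratings, method = "cross-validation",
                           k = 10, given = -5, goodRating = 4)

algorithms <- list(
  "random"       = list(name = "RANDOM"),
  "UBCF_cosine"  = list(name = "UBCF", param = list(method = "cosine")),
  "UBCF_pearson" = list(name = "UBCF", param = list(method = "pearson")),
  "IBCF_cosine"  = list(name = "IBCF", param = list(method = "cosine")),
  "IBCF_pearson" = list(name = "IBCF", param = list(method = "pearson"))
)

# Test different numbers of recommendations per user
results <- evaluate(scheme, algorithms, type = "topNList",
                    n = c(1, 3, 5, 10, 15, 20))
plot(results, annotate = TRUE)   # ROC curves: TPR vs. FPR per model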
Furthermore, we want to maximize the recall, which is also guaranteed at every level by the UBCF Pearson model. Since the recommendation consults the n most similar users, this neighborhood size must also be chosen; a neighborhood of 40 users performed best.
Conclusion
Our user-based collaborative filtering model with the Pearson correlation as similarity measure and a neighborhood of 40 users delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny app. However, there is no guarantee that the suggested movies will really meet your individual taste. Not only is the underlying dataset relatively small and possibly distorted by user ratings, but the tech giants also use other data, such as age, gender, and user behavior, for their models. But what I can say is: data scientists who read this blog post also read the other blog posts by STATWORX.
Shiny App
Here you can find the Shiny app. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good), and press the run button. Please note that the app is hosted on a free shinyapps.io account, which makes it available for 25 hours per month. If the 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio.
ABOUT US
STATWORX
Sign Up Now!
The post Movie Recommendation With Recommenderlab first appeared on STATWORX. To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX.
Testing for Covid-19 in the U.S. Posted: 28 Apr 2020 06:22 PM PDT [This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. For almost a month, on a daily basis, we have been working with colleagues (Romuald, Chi and Mathieu) on modeling the dynamics of the recent pandemic. I learn a lot of things discussing with them, but we keep struggling with the tests. Paul, in Montréal, helped me a little bit, but I think we still have more work to do to get a better understanding. To be honest, we struggle with two very simple questions:
Recently, I discovered Modelling COVID-19 exit strategies for policy makers in the United Kingdom, which is very close to what we try to do… and in that document, two interesting scenarios are discussed: in the first one, "1 million 'reliable' daily tests are deployed" (in the U.K.), and in the second, "5 million 'useless' daily tests are deployed". There are about 65 million inhabitants in the U.K., so we are talking about 1.5% of people tested on a daily basis, or 7.69%! It could make sense, but our question was, at some point: is that realistic? Where are we today with testing? In the U.S., https://covidtracking.com/ collects interesting data, on a daily basis, per state.
Unfortunately, there is no information about the population. That we can find on Wikipedia. But in that table, the state is given by its full name (and by its symbol in the previous dataset), so we also need to match the two datasets properly.
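The matching code did not survive the digest; one hedged way to do it is with R's built-in state.name and state.abb vectors (the data frame below and its rounded population figures are purely illustrative, not the actual data):

```r
# Match full state names (as on Wikipedia) to the two-letter symbols
# used by covidtracking.com, via R's built-in state vectors.
pop <- data.frame(name       = c("New York", "New Jersey", "Florida"),
                  population = c(19.5e6, 8.9e6, 21.5e6))  # illustrative

pop$state <- state.abb[match(pop$name, state.name)]
pop   # now carries the NY, NJ, FL symbols, ready to merge on "state"
```

With both tables sharing the symbol column, a merge() or dplyr left_join() on state attaches populations to the daily testing counts.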
Now our dataset is fine, and we can write a function to plot the cumulated number of people tested in the U.S. Here, we distinguish between the positives and the negatives.
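The plotting function itself was lost here; a sketch of what it might look like follows, where the column names (date, state, positive, negative) follow the covidtracking.com fields but are assumptions in this reconstruction:

```r
# Sketch of a per-state plot of cumulated positive and negative tests
library(tidyverse)

plot_tests <- function(df, which_state) {
  df %>%
    filter(state == which_state) %>%
    arrange(date) %>%
    ggplot(aes(x = date)) +
    geom_line(aes(y = positive, colour = "positive")) +
    geom_line(aes(y = negative, colour = "negative")) +
    labs(x = NULL, y = "cumulated number of people tested",
         colour = NULL,
         title = paste("Covid-19 testing in", which_state))
}

# e.g. plot_tests(us_tests, "NY")
```

Dividing positive + negative by the merged population column gives the share of the population tested, as in the percentages quoted below.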
Let us start with New York.
As of now, 4% of the entire population has been tested… over six weeks. The graph on the right shows the proportion of people who tested positive. I won't get back to that one here today; I keep it for our work. In New Jersey, about 2.5% of the entire population has been tested overall.
Let us try one last state, Florida.
As of today, it is 1.5% of the population, over six weeks. Overall, in the U.S., less than 0.1% of people are tested on a daily basis, which is far from the 1.5% in the U.K. scenarios. Now comes the second question:
On that one, my experience in biology is… very limited, and Paul helped me. He mentioned this morning a nice report from a lab at UC Berkeley. One of my questions was, for instance: if you test positive and you take the test again, can you test negative? Or, in the context of our data, do we test different people? Are some people tested on a regular basis (perhaps every week)? For instance, with molecular tests (Reverse Transcription Quantitative Polymerase Chain Reaction, RT-qPCR, also called PCR tests), we test whether someone is infectious, while with antibody tests (serological immunoassays that detect virus-specific antibodies, Immunoglobulin M (IgM) and G (IgG), also called serology tests), we test for immunity. Which is rather different… I have no idea which of the two we have in our database, to be honest; over the past six weeks, I have seen a lot of databases, and most of the time I don't know how to interpret them, I don't know what is measured… and it is scary. So, so far, we try to do some maths, testing dynamics by tuning parameters "the best we can" (rather than estimating them). But if anyone has good references on testing in the context of Covid-19 (for instance on the specificity and sensitivity of all those tests), I would love to hear about them! To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.