[R-bloggers] rco: Make Your R Code Run Faster Today! (and 7 more aRticles)
- rco: Make Your R Code Run Faster Today!
- 15+ Resources to Get Started with R
- Beginner's guide to Bubble Map with Shiny
- An efficient way to install and load R packages
- another easy Riddler
- Supplement to ‘Nonparametric estimation of the service time distribution in discrete-time queueing networks’
- Do my data follow a normal distribution? A note on the most widely used distribution and how to test for normality in R
- The significance of the region on the salary of engineers in Sweden
rco: Make Your R Code Run Faster Today!
Posted: 30 Jan 2020 10:12 PM PST [This article was first published on R-posts.com, and kindly contributed to R-bloggers.]
The rco package can optimize R code in a variety of ways. It implements common subexpression elimination, constant folding, constant propagation, and dead code elimination, among other relevant code optimization strategies. Currently, rco can be installed as a GitHub package. It can be used as an RStudio addin, through a Shiny GUI, or as plain R functions via optimize_files() and optimize_text(); a short installation and usage sketch follows below. Found an optimization method not currently supported by the rco package? Feel free to contribute! Your contribution is more than welcome. Install rco now and make your R code run faster today!
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
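A minimal sketch of installing and trying the package, assuming the optimize_text() interface named above; the GitHub repository path is an assumption, so check the package's README:

# Sketch only; the repository path "jcrodriguez1989/rco" is an assumption.
# install.packages("remotes")
remotes::install_github("jcrodriguez1989/rco")
library(rco)

code <- "
res <- 8 / 4   # constant folding can turn this into res <- 2
x <- res       # constant propagation can replace x by the value of res
"
optimize_text(code)  # prints the optimized version of the code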
15+ Resources to Get Started with R
Posted: 30 Jan 2020 10:11 PM PST [This article was first published on R-posts.com, and kindly contributed to R-bloggers.]
R is the second most sought-after language in data science behind Python, so gaining mastery of R is a prerequisite to a thriving career in the field. Whether you're an experienced developer or a newbie considering a career move, here are some excellent resources so you can get started with R. [Related Article: Data-Driven Exploration of the R User Community Worldwide]
What is R?
R is a programming language and environment designed for statistical analysis. It's used mainly by data miners and statisticians. It's a free resource and runs on a wide variety of platforms, including UNIX, Windows, and macOS. It has thousands of well-documented extensions and is cross-platform compatible. It's not quite as popular outside the field of data science, but it's one of the best options for deep-dive exploration of datasets or for one-off data insights. Head over to the R site and download a copy of R, so you're ready to get started.
Free R Resources for Beginners
Let's take a look at how a beginner might break into R. It's not quite as friendly as Python, but it's definitely accessible with good resources and practice.
Platforms and documentation. r-bloggers.com: R-bloggers is a collection of blogs written by R experts that covers a wide range of R topics. No matter what you're curious about or have an issue with, R-bloggers has it covered.
Books. R for Data Science: This classic handbook provides resources and documentation. It's available for free on the website, or you can purchase a physical copy from Amazon. Hands-On Programming with R: Garrett Grolemund's classic is a practical, hands-on approach to R programming. It gives you the instruction you need plus practical programming skills to begin with R right from the very beginning.
Courses. Codecademy: Codecademy's mission is to bring development knowledge even to beginners, and its R offerings are no different. While many of the lessons require a membership, it does offer a basic set of courses to get you started. edX.org: edX offers a range of free R courses to get you started, but we recommend starting with Microsoft's Introduction to R for Data Science for a comprehensive overview. The courses cost nothing and are offered asynchronously. Some come with official certification for a fee.
Free R Resources for Developers
If you've already got some development experience under your belt, these resources could be a great way to get started with R by building on your current experience. Even better, they're free.
Platforms and documentation. storybench.com: Storybench is an experiential learning platform designed to provide exercises in digital storytelling. They offer projects in R, most notably "How to Explore Correlations in R." Once you've got the basics, the next logical step is to take on projects for hands-on learning.
Books. R Programming for Data Science: This option is available for free (though you can choose to donate in support of the project). It offers full resources for learning R and understanding key data science principles. If you upgrade the package, the online book comes with a full video suite.
Text Mining with R: Another book available for free on its website, this option offers a targeted approach to text mining, with a full GitHub repository as support. R in Action: Another entirely free resource aimed at business developers. If you've already got an established career in development in the business world, this could be an excellent resource for getting started with R in that setting.
Courses. Coursera: Johns Hopkins's popular partnership with Coursera, "Data Science: Foundations Using R," is a great way for developers to build skills to break into the field of data science. edX + Harvard: the XSeries Program in Data Analysis for Life Sciences is a course series that combines learning R with real-life projects for a full learning experience. You can upgrade to an official learning certificate for a fee or take the individual courses for free.
Paid Resources for Beginners and Beyond
Sometimes, you've got to invest a little in your learning experience. Here are a couple of things that can really jumpstart your R programming skills if you've got some wiggle room in your budget. Getting Started with R: A primer on using R for the biological sciences. It contains valuable information for getting started on statistical analysis using the R programming language. flowingdata.com: FlowingData is a membership site designed to elevate your visualizations. It's another excellent way to get experiential learning with R projects. RStudio: It's not cheap, but if you're serious about making a career in R, you'll want to get it. Save up and invest. They do, however, have a series of free webinars you can peruse.
Bonus! More Blogs and Newsletters
https://blog.revolutionanalytics.com/r/: A blog designed to highlight milestones in data science, covering a range of topics including both R and Python for the multilingual developers out there. https://journal.r-project.org/: An open-access, refereed journal detailing the latest in R programming news and projects. Papers focus on accessibility, and the articles are intended to reach a wide audience. https://morningcupofcoding.com/: Offers a wide range of curated coding articles, including R programming. Check their back issues for articles of interest. opendatascience.com: ODSC's general weekly newsletter provides members with trending topics in the fields of modeling, tools & platforms, and more.
Getting Started with R Programming
[Related Article: Where is Data Science Heading? Watching R's Most Popular Packages May Have the Answer] While both Python and R are invaluable resources for getting started in data science, the statistical capabilities of R are in wide demand. Whether you're new to the world of coding or an experienced developer, R can open doors into the world of data science.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
Beginner's guide to Bubble Map with Shiny
Posted: 30 Jan 2020 10:11 PM PST [This article was first published on R-posts.com, and kindly contributed to R-bloggers.]
Map bubble
We will use the highcharter package to show earthquake magnitude and depth on a map bubble chart. Highcharter is a versatile charting library for building interactive charts; it is one of the easiest to learn and integrates well with Shiny.
About the dataset / Shiny application

library(shiny)
library(highcharter)
library(dplyr)

edata <- read.csv('earthquake.csv') %>%
  rename(lat = latitude, lon = longitude)
wmap <- hcmap()

Using the dplyr rename verb, the latitude and longitude variables are renamed to lat and lon. The column names are important: highcharter expects them for positioning the bubbles.
ui
ui <- fluidPage(
  titlePanel("MapBubble"),  # application title
  sidebarLayout(
    sidebarPanel(
      sliderInput('mag', 'Magnitude more than (Richter scale)',
                  min = 1, max = 6, step = 0.5, value = 0),
      selectInput('bubble', 'Bubble size indicates',
                  choices = c('Magnitude' = 'mag', 'Depth (in km)' = 'depth')),
      sliderInput('bublesize', 'Adjust bubble size',
                  min = 2, max = 10, step = 1, value = 6)
    ),
    # display the map bubble
    mainPanel(
      highchartOutput('eqmap', height = "500px")
    )
  )
)

Server
renderHighchart() is the render function for use in Shiny. We pass the filtered data and the chart type mapbubble to hc_add_series(). Place, time and the z variable are displayed in the tooltip with the "point" format.

server <- function(input, output) {
  data <- reactive(
    edata %>%
      filter(mag >= input$mag) %>%
      rename(z = input$bubble)
  )
  output$eqmap <- renderHighchart(
    wmap %>%
      hc_legend(enabled = FALSE) %>%
      hc_add_series(data = data(), type = "mapbubble", name = "",
                    maxSize = paste0(input$bublesize, '%')) %>%  # bubble size in percent
      hc_tooltip(useHTML = TRUE, headerFormat = '',
                 pointFormat = paste('Location: {point.place} Time: {point.time}',
                                     input$bubble, ': {point.z}')) %>%
      hc_title(text = "Global Seismic Activity") %>%
      hc_subtitle(text = paste('No of obs:', nrow(data()), sep = '')) %>%
      hc_mapNavigation(enabled = TRUE)  # note: no trailing %>% before the closing parenthesis
  )
}

# Run the application
shinyApp(ui = ui, server = server)

The Shiny R file can be found at the GitHub repository.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
An efficient way to install and load R packages
Posted: 30 Jan 2020 04:00 PM PST [This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]
What is an R package and how to use it?
Unlike other programs, only fundamental functionalities come by default with R. You will thus often need to install some "extensions" to perform the analyses you want. These extensions, which are collections of functions and datasets developed and published by R users, are called packages. They extend existing base R functionalities by adding new ones. R is open source, so everyone can write code and publish it as a package, and everyone can install a package and start using the functions or datasets built inside it, all for free. In order to use a package, it needs to be installed on your computer by running install.packages("package_name"), and then loaded with library(package_name).
Inefficient way to install and load R packages
Depending on how long you have been using R, you may use a limited number of packages or, on the contrary, a large number of them. As you use more and more packages, you will soon have (too) many lines of code just for installing and loading them. The code from my PhD thesis, when I started working in R, was a long series of repeated install.packages() and library() calls, one pair per package (only a fraction of the packages were displayed to shorten the code). As you can guess, the code became longer and longer as I needed more and more packages for my analyses. Moreover, I tended to reinstall all packages, as I was working on 4 different computers and could not remember which packages were already installed on which machine. Reinstalling all packages every time I opened my script or R Markdown document was a waste of time.
More efficient way
Then one day a colleague of mine shared some of his code with me. I am glad he did, as he introduced me to a much more efficient way to install and load R packages. He gave me permission to share the tip; the code I now use is sketched below. It is more efficient in several ways: it installs only the packages that are not already installed, so nothing is needlessly reinstalled, and it installs and loads everything from a single vector of package names.
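A minimal sketch of the install-if-missing-then-load pattern described above; the package names in the vector are just examples:

# Sketch of the pattern: one vector of package names drives both
# installation (of missing packages only) and loading.
packages <- c("ggplot2", "dplyr", "tidyr")

# install packages that are not installed yet
installed <- packages %in% rownames(installed.packages())
if (any(!installed)) {
  install.packages(packages[!installed])
}

# load all packages in one call
invisible(lapply(packages, library, character.only = TRUE))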
From that day on, every time I need to use a new package, I simply add it to the vector of required packages (the packages vector in the sketch above) and rerun the chunk; only the missing ones get installed. Thanks for reading. I hope the article helped you to install and load R packages in a more efficient way. See other articles about R. As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here. Get updates every time a new article is published by subscribing to this blog.
To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.
another easy Riddler
Posted: 30 Jan 2020 03:20 PM PST [This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]
A quick riddle from the Riddler.
Since the average gain for Abigail is δ(1 − 0.1ς) + 2(1 − δ)(0.4 + 0.2ς), the riddle sums up as solving the minimax problem

max over δ of min over ς of δ(1 − 0.1ς) + 2(1 − δ)(0.4 + 0.2ς)

(max over δ and min over ς, since Abigail declares δ first). The solution in ς is either 0 or 1, depending on whether δ is smaller or larger than 12/22, which leads to this value as the expected gain. The saddlepoint is hardly visible in the image; a sketch for drawing the gain surface follows below. While ς is either 0 or 1 in the optimal setting, a constant choice of 1 or 0 would modify the optimum for δ, except that Abigail must declare her value of δ!
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.
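A hedged sketch (not the author's original code) of evaluating the gain on a grid, to visualise the saddlepoint and check the minimax value numerically:

# Abigail's expected gain as a function of delta and sigma (sigma plays ς).
gain <- function(delta, sigma) {
  delta * (1 - 0.1 * sigma) + 2 * (1 - delta) * (0.4 + 0.2 * sigma)
}
delta <- seq(0, 1, length.out = 101)
sigma <- seq(0, 1, length.out = 101)
z <- outer(delta, sigma, gain)

image(delta, sigma, z, xlab = "delta", ylab = "sigma")
contour(delta, sigma, z, add = TRUE)  # the saddle shape of the surface

max(apply(z, 1, min))  # numerical check of Abigail's guaranteed expected gain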
Supplement to 'Nonparametric estimation of the service time distribution in discrete-time queueing networks'
Posted: 29 Jan 2020 04:00 PM PST [This article was first published on R on Sastibe's Data Science Blog, and kindly contributed to R-bloggers.]
Great news: a scientific article I have co-authored has been accepted for publication and can now be found online here, or via the DOI 10.1016/j.spa.2020.01.011. Yes, my list of publications has been amended. This article has been through quite a lengthy review process and was the main motivation for another one of my blog posts. That post dates to September 2018, yet I only started working on these simulations in the framework of the second round of peer review… Anyway, the results presented in the paper are based on simulations I calculated using a specifically designed R package, aptly named queueingnetworkR. The package was only ever designed for this singular purpose; for anyone who might be interested in repeating the experiments or using the code in any other fashion, the latest tar-ball can be found here, and the GitLab page is here.
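For reference, a minimal sketch of installing a downloaded source tar-ball like this one (the filename and version below are hypothetical; use the file from the link above):

# Filename and version are placeholders for the downloaded tar-ball.
install.packages("queueingnetworkR_0.1.0.tar.gz", repos = NULL, type = "source")
library(queueingnetworkR)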
To leave a comment for the author, please follow the link and comment on their blog: R on Sastibe's Data Science Blog.
Do my data follow a normal distribution? A note on the most widely used distribution and how to test for normality in R
Posted: 28 Jan 2020 04:00 PM PST [This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]
What is a normal distribution?
The normal distribution is a function that defines how a set of measurements is distributed around the center of these measurements (i.e., the mean). Many natural phenomena in real life can be approximated by a bell-shaped frequency distribution known as the normal distribution or the Gaussian distribution. The normal distribution is a mound-shaped, unimodal and symmetric distribution where most measurements gather around the mean. Moreover, the further a measure deviates from the mean, the lower the probability of it occurring. In this sense, for a given variable, it is common to find values close to the mean, but less and less likely to find values as we move away from the mean. Last but not least, since the normal distribution is symmetric around its mean, extreme values in both tails of the distribution are equally unlikely. For instance, given that adult height follows a normal distribution, most adults are close to the average height, and extremely short adults occur as infrequently as extremely tall adults. In this article, the focus is on understanding the normal distribution, the associated empirical rule, its parameters and how to compute \(Z\) scores to find probabilities under the curve (illustrated with examples). As it is a requirement in some statistical tests, we also show 4 complementary methods to test the normality assumption in R.
Empirical rule
Data possessing an approximately normal distribution have a definite variation, as expressed by the following empirical rule: approximately 68% of the measurements lie within \(\mu \pm \sigma\), approximately 95% lie within \(\mu \pm 2\sigma\), and approximately 99.7% lie within \(\mu \pm 3\sigma\),
where \(\mu\) and \(\sigma\) correspond to the population mean and population standard deviation, respectively. The empirical rule is illustrated by the following 2 examples. Suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have, approximately, a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). It can then be deduced that approximately 68% of the scores are between 58 and 76, that approximately 95% of the scores are between 49 and 85, and that almost all of the scores (99.7%) are between 40 and 94. Thus, knowing the mean and the standard deviation gives us a fairly good picture of the distribution of scores. Now suppose that a single university student is randomly selected from those who took the exam. What is the probability that her score will be between 49 and 85? Based on the empirical rule, we find that 0.95 is a reasonable answer to this probability question. The utility and value of the empirical rule are due to the common occurrence of approximately normal distributions of measurements in nature. For example, IQ, shoe size, height, birth weight, etc. are approximately normally distributed. You will find that approximately 95% of these measurements will be within \(2\sigma\) of their mean (Wackerly, Mendenhall, and Scheaffer 2014).
Parameters
Like many probability distributions, the shape and probabilities of the normal distribution are defined entirely by some parameters. The normal distribution has two parameters: (i) the mean \(\mu\) and (ii) the variance \(\sigma^2\) (i.e., the square of the standard deviation \(\sigma\)). The mean \(\mu\) locates the center of the distribution, that is, the central tendency of the observations, and the variance \(\sigma^2\) defines the width of the distribution, that is, the spread of the observations. The mean \(\mu\) can take on any finite value (i.e., \(-\infty < \mu < \infty\)), whereas the variance \(\sigma^2\) can assume any positive finite value (i.e., \(\sigma^2 > 0\)). The shape of the normal distribution changes based on these two parameters. Since there is an infinite number of combinations of the mean and variance, there is an infinite number of normal distributions, and thus an infinite number of forms. You can see how the shapes of normal distributions vary when the two parameters change (a plotting sketch follows below): when the variance (or the standard deviation) decreases, the observations are closer to the mean; on the contrary, when the variance (or standard deviation) increases, it is more likely that observations will be further away from the mean. A random variable \(X\) which follows a normal distribution with a mean of 430 and a variance of 17 is denoted \(X \sim \mathcal{N}(\mu = 430, \sigma^2 = 17)\). We have seen that, although different normal distributions have different shapes, all normal distributions share common characteristics: they are symmetric around the mean, bell-shaped and unimodal, and they obey the empirical rule described above.
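A minimal sketch of such a comparison plot in base R (the parameter values are purely illustrative):

# How mean and standard deviation change the normal density.
x <- seq(-10, 10, length.out = 500)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", ylab = "Density",
     main = "Normal densities with different parameters")
lines(x, dnorm(x, mean = 0, sd = 2), lty = 2)  # larger sd: flatter and wider
lines(x, dnorm(x, mean = 2, sd = 1), lty = 3)  # larger mean: shifted right
legend("topright", legend = c("N(0, 1)", "N(0, 4)", "N(2, 1)"), lty = 1:3)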
Probabilities and standard normal distribution
Probabilities and quantiles for random variables with normal distributions are easily found in R via the functions pnorm() and qnorm() (with dnorm() and rnorm() for densities and random draws). Although there are infinitely many normal distributions (since there is a normal distribution for every combination of mean and variance), we need only one table to find the probabilities under the normal curve: the standard normal distribution. The standard normal distribution is a special case of the normal distribution where the mean is equal to 0 and the variance is equal to 1. A normal random variable \(X\) can always be transformed to a standard normal random variable \(Z\), a process known as "scaling" or "standardization", by subtracting the mean from the observation and dividing the result by the standard deviation. Formally: \[Z = \frac{X - \mu}{\sigma}\] where \(X\) is the observation, and \(\mu\) and \(\sigma\) are the mean and standard deviation of the population from which the observation was drawn. So the mean of the standard normal distribution is 0 and its variance is 1, denoted \(Z \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\). From this formula, we see that \(Z\), referred to as the standard score or \(Z\) score, shows how far away one specific observation is from the mean of all observations, with the distance expressed in standard deviations. In other words, the \(Z\) score corresponds to the number of standard deviations one observation is away from the mean. A positive \(Z\) score means that the specific observation is above the mean, whereas a negative \(Z\) score means that the specific observation is below the mean. \(Z\) scores are often used to compare an individual to her peers, or more generally, a measurement to its distribution. For instance, suppose a student scored 60 on a statistics exam where the mean score of the class was 40, and 65 on an economics exam where the mean score of the class was 80. Given the "raw" scores, one would say that the student performed better in economics than in statistics. However, taking her peers into consideration, it is clear that the student performed relatively better in statistics than in economics. Computing \(Z\) scores takes the entire distribution into account and gives a better measure of comparison. Let's compute the \(Z\) scores for the two exams, assuming that the scores for both exams follow a normal distribution with the following parameters:
statistics exam: \(X \sim \mathcal{N}(\mu = 40, \sigma^2 = 8^2)\); economics exam: \(X \sim \mathcal{N}(\mu = 80, \sigma^2 = 12.5^2)\) (these standard deviations are the ones implied by the \(Z\) scores below). The \(Z\) scores are then: \[Z_{stat} = \frac{60 - 40}{8} = 2.5 \qquad Z_{econ} = \frac{65 - 80}{12.5} = -1.2\]
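As a quick check in base R, using the parameters assumed above:

# Z scores for the two exams
z_stat <- (60 - 40) / 8     # 2.5
z_econ <- (65 - 80) / 12.5  # -1.2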
On the one hand, the \(Z\) score for the exam in statistics is positive (\(Z_{stat} = 2.5\)), which means that she performed better than average. On the other hand, her score for the exam in economics is negative (\(Z_{econ} = -1.2\)), which means that she performed worse than average. Plotting her grades on a standard normal distribution makes the comparison immediate: although the score in economics is better in absolute terms, the score in statistics is actually relatively better when comparing each score within its own distribution. Furthermore, \(Z\) scores also enable us to compare observations that would otherwise be impossible to compare because, for example, they have different units. Suppose you want to compare a salary in € with a weight in kg. Without standardization, there is no way to conclude whether someone is more extreme in terms of her wage or in terms of her weight. Thanks to \(Z\) scores, we can compare two values that were not comparable to each other in the first place. A final remark regarding the interpretation of a \(Z\) score: a rule of thumb is that an observation with a \(Z\) score between -3 and -2, or between 2 and 3, is considered a rare value; an observation with a \(Z\) score smaller than -3 or larger than 3 is considered an extremely rare value; a value with any other \(Z\) score is considered neither rare nor extremely rare.
Areas under the normal distribution in R and by hand
Now that we have covered the \(Z\) score, we are going to use it to determine the area under the curve of a normal distribution. Note that there are several ways to arrive at the solution in the following exercises; you may use other steps than the ones presented and obtain the same result.
Ex. 1
Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1; find \(P(Z > 1)\), the area above \(z = 1\).
In R: we look for the probability of \(Z\) being larger than 1, so we set the argument lower.tail = FALSE: pnorm(1, lower.tail = FALSE) returns 0.1586553. (A consolidated sketch of the R computations for exercises 1 to 4 follows after exercise 4.)
By hand: the random variable \(Z\) already has a mean of 0 and a standard deviation of 1, so no transformation is required. To find the probability by hand, we refer to a standard normal distribution table whose values correspond to the area under the normal curve above a certain \(z\). Since we are looking precisely at the probability above \(z = 1\) (since we look for \(P(Z > 1)\)), we can simply proceed down the first (\(z\)) column in the table until \(z = 1.0\). The probability is 0.1587. Thus, \(P(Z > 1) = 0.1587\). This matches the value found with R, rounded to 4 digits.
Ex. 2
Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1; find \(P(-1 \le Z \le 1)\).
In R: pnorm(1) - pnorm(-1) returns 0.6826895. Note that the default arguments for the mean and the standard deviation are 0 and 1, respectively.
By hand: for this exercise we proceed by steps, using the symmetry of the distribution:
\[P(-1 \le Z \le 1) = 1 - 2 \cdot P(Z > 1) = 1 - 2 \cdot 0.1587 = 0.6826\] where \(P(Z > 1) = 0.1587\) was found in the previous exercise.
Ex. 3
Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1; find \(P(0 \le Z \le 1.37)\).
In R: pnorm(1.37) - pnorm(0) returns 0.4146565.
By hand: again we proceed by steps, using the fact that \(P(Z > 0) = 0.5\):
\[P(0 \le Z \le 1.37) = P(Z > 0) - P(Z > 1.37) = 0.5 - 0.0853 = 0.4147\]
Ex. 4
Recap the example presented in the empirical rule: suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). What fraction of the scores lies between 70 and 80?
In R: pnorm(80, mean = 67, sd = 9) - pnorm(70, mean = 67, sd = 9) returns approximately 0.2951.
By hand: remember that we are looking for \(P(70 \le X \le 80)\) where \(X \sim \mathcal{N}(\mu = 67, \sigma^2 = 9^2)\). The random variable \(X\) is in its "raw" format, meaning that it has not been standardized yet, since the mean is 67 and the variance is \(9^2\). We thus first standardize the endpoints 70 and 80 with the formula \[Z = \frac{X - \mu}{\sigma}\] After the standardization, \(x = 70\) becomes (in terms of \(z\), i.e., in terms of deviation from the mean expressed in standard deviations) \[z = \frac{70 - 67}{9} = 0.3333\] and \(x = 80\) becomes \[z = \frac{80 - 67}{9} = 1.4444\] Finding the probability \(P(0.3333 \le Z \le 1.4444)\) is then similar to exercises 1 to 3: \[P(0.3333 \le Z \le 1.4444) = P(Z > 0.3333) - P(Z > 1.4444) = 0.3707 - 0.0749 = 0.2958\]
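For completeness, a short base R sketch collecting the computations for exercises 1 to 4 (pnorm() gives the lower-tail probability by default):

# Ex. 1: P(Z > 1)
pnorm(1, lower.tail = FALSE)                                  # 0.1586553
# Ex. 2: P(-1 <= Z <= 1)
pnorm(1) - pnorm(-1)                                          # 0.6826895
# Ex. 3: P(0 <= Z <= 1.37)
pnorm(1.37) - pnorm(0)                                        # 0.4146565
# Ex. 4: P(70 <= X <= 80) with X ~ N(67, 9^2)
pnorm(80, mean = 67, sd = 9) - pnorm(70, mean = 67, sd = 9)   # ~0.2951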
The small difference between this value and the one found with R comes from the rounding of the table values. To conclude this exercise: given that the mean score is 67 and the standard deviation is 9, 29.58% of the students scored between 70 and 80.
Ex. 5
See another example in context here.
Why is the normal distribution so crucial in statistics?
The normal distribution is important for three main reasons: many variables measured in nature are approximately normally distributed; by the central limit theorem, the sampling distribution of the mean is approximately normal for sufficiently large samples; and many statistical tests and models rely on the normality assumption.
It is therefore useful to know how to test for normality in R, which is the topic of the next sections.
How to test the normality assumption
As mentioned above, some statistical tests require that the data follow a normal distribution, or the results of the test may be flawed. In this section, we show 4 complementary methods to determine whether your data follow a normal distribution in R.
Histogram
A histogram displays the spread and shape of a distribution, so it is a good starting point to evaluate normality. Consider the histogram of a distribution that we would expect to follow a normal distribution, the height of 1,000 adults in cm: with the normal curve of the corresponding mean and variance added, the histogram follows the curve closely, so the data seem to follow a normal distribution. In base R, a minimal histogram can be drawn with hist(). Histograms are however not sufficient, particularly in the case of small samples, because the number of bins greatly changes their appearance. Histograms are not recommended when the number of observations is less than 20, because they do not always correctly illustrate the distribution: two small datasets of 10 and 12 observations can look non-normal in a histogram and yet, surprisingly, both follow a normal distribution.
Density plot
Density plots also provide a visual judgment about whether the data follow a normal distribution. They are similar to histograms in that they also allow us to analyze the spread and the shape of the distribution, but they are a smoothed version of the histogram. In base R, a density plot can be drawn with plot(density(x)). Since it is hard to test for normality from histograms and density plots only, it is recommended to corroborate these graphs with a QQ-plot. The QQ-plot, also known as the normality plot, is the third method presented to evaluate normality.
QQ-plot
Like histograms and density plots, QQ-plots allow us to visually evaluate the normality assumption. Instead of looking at the spread of the data (as is the case with histograms and density plots), with QQ-plots we only need to ascertain whether the data points follow the reference line (sometimes referred to as Henry's line). If the points are close to the reference line and within the confidence bands, the normality assumption can be considered met. The bigger the deviation between the points and the reference line, and the more they lie outside the confidence bands, the less likely it is that the normality condition is met. The heights of the 12 adults discussed above seem to follow a normal distribution because all points lie within the confidence bands. When facing a non-normal distribution, shown in a QQ-plot as a systematic departure from the reference line, the first step is usually to apply the logarithm transformation to the data and recheck whether the log-transformed data are normally distributed. Applying the logarithm transformation can be done with the log() function. Note that QQ-plots are also a convenient way to assess whether residuals from regression analysis follow a normal distribution. (A sketch of these three visual checks follows below.)
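A minimal sketch of the three visual checks in base R; simulated data stand in for the article's height datasets:

# Simulated heights, standing in for the article's dataset.
set.seed(42)
height <- rnorm(1000, mean = 170, sd = 10)

# 1. Histogram with the corresponding normal curve overlaid
hist(height, probability = TRUE, main = "Histogram of height")
curve(dnorm(x, mean = mean(height), sd = sd(height)), add = TRUE)

# 2. Density plot
plot(density(height), main = "Density plot of height")

# 3. QQ-plot with reference line (confidence bands need e.g. car::qqPlot())
qqnorm(height)
qqline(height)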
Normality test
The 3 tools presented above were visual inspections of normality. Nonetheless, visual inspection may sometimes be unreliable, so it is also possible to formally test whether the data follow a normal distribution with statistical tests. These normality tests compare the distribution of the data to a normal distribution in order to assess whether observations show an important deviation from normality. The two most common normality tests are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Both tests have the same hypotheses, that is: the null hypothesis \(H_0\), the data follow a normal distribution; and the alternative hypothesis \(H_1\), the data do not follow a normal distribution.
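A minimal sketch of running the test in R on a small simulated sample (shapiro.test() is the base R implementation):

# Shapiro-Wilk test on 12 simulated heights.
set.seed(42)
height12 <- rnorm(12, mean = 170, sd = 10)
shapiro.test(height12)
# A p-value above 0.05 gives no evidence against normality.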
The Shapiro-Wilk test is recommended for testing normality, as it provides better power than the Kolmogorov-Smirnov test. In R, the Shapiro-Wilk test of normality can be done with the function shapiro.test(), as sketched above. From the output, we see that the \(p\)-value \(> 0.05\), implying that the data are not significantly different from a normal distribution. In other words, we can assume normality. This test confirms the QQ-plot, which also showed normality (as all points lay within the confidence bands). It is important to note that, in practice, normality tests are often considered too conservative, in the sense that for a large sample size (\(n > 50\)) even a small deviation from normality may cause the normality condition to be violated. A normality test is a hypothesis test, so as the sample size increases, its capacity to detect smaller differences increases. As the number of observations grows, the Shapiro-Wilk test therefore becomes very sensitive even to a small deviation from normality. As a consequence, it can happen that, according to the normality test, the data do not follow a normal distribution although the departures from the normal distribution are negligible and the data in fact do. For this reason, the normality condition is often verified based on a combination of all the methods presented in this article, that is, visual inspections (with histograms and QQ-plots) and a formal inspection (with the Shapiro-Wilk test, for instance). I personally tend to prefer QQ-plots over histograms and normality tests, so I do not have to bother about the sample size. This article showed the different methods that are available; your choice will of course depend on the type of your data and the context of your analyses. Thanks for reading. I hope the article helped you to learn more about the normal distribution and how to test for normality in R. See other articles about statistics or about R. As always, if you have a statistical question related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here. Get updates every time a new article is published by subscribing to this blog.
References
Wackerly, Dennis, William Mendenhall, and Richard L. Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.
To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.
The significance of the region on the salary of engineers in Sweden
Posted: 28 Jan 2020 04:00 PM PST [This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers.]
So far I have analysed the effect of experience, education, gender and year on the salary of engineers in Sweden. In this post, I will have a look at the effect of the region. Statistics Sweden uses NUTS (Nomenclature des Unités Territoriales Statistiques), the EU's hierarchical regional division, to specify the regions. First, define libraries and functions. The data table is downloaded from Statistics Sweden and saved as a comma-delimited file without heading, 000000CG.csv, from http://www.statistikdatabasen.scb.se/pxweb/en/ssd/. The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by region, sector, occupational group (SSYK 2012) and sex; year 2014 – 2018, monthly salary, all sectors. We expect the region to be an important factor in salaries. As a null hypothesis, we assume that the region is not related to the salary and examine whether we can reject this hypothesis with the data from Statistics Sweden. The F-value from the ANOVA table for the region is 40 (Pr(>F) < 2.2e-16), sufficient for rejecting the null hypothesis that the region has no effect on the salary, holding year constant. The adjusted R-squared value is 0.882, implying a good fit of the model. (A sketch of this kind of model fit follows below.)
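A hedged sketch of the kind of model the post describes; the data here are simulated stand-ins for the Statistics Sweden table, and the column names and model form are assumptions, not the post's exact code:

# Simulated stand-in data; column names (salary, year, region, sex) assumed.
set.seed(1)
tbl <- data.frame(
  salary = rnorm(200, mean = 44000, sd = 4000),
  year   = factor(sample(2014:2018, 200, replace = TRUE)),
  region = factor(sample(paste0("region", 1:8), 200, replace = TRUE)),
  sex    = factor(sample(c("men", "women"), 200, replace = TRUE))
)

model <- lm(salary ~ year + region, data = tbl)
anova(model)                  # F-test for region, holding year constant
summary(model)$adj.r.squared  # adjusted R-squared

model_int <- lm(salary ~ year + region * sex, data = tbl)
anova(model_int)              # includes the region:sex interaction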
Let's check what we have found, and also examine the interaction between gender and region. The F-value from the ANOVA table for the interaction is 2.7 (Pr(>F) < 0.017), implying a relation at the 95% significance level.
And the interaction between year and region: the F-value from the ANOVA table for this interaction is 0.57 (Pr(>F) < 0.77), implying no significant relation.
Finally, considering the interaction between gender, year and region, the only significant interaction is between region and gender; the F-value from the ANOVA table for this interaction is 2.4 (Pr(>F) < 0.033).
To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden.