[R-bloggers] rco: Make Your R Code Run Faster Today! (and 7 more aRticles)


rco: Make Your R Code Run Faster Today!

Posted: 30 Jan 2020 10:12 PM PST

[This article was first published on R-posts.com, and kindly contributed to R-bloggers].

The rco package can optimize R code in a variety of different ways. The package implements common subexpression elimination, constant folding, constant propagation, dead code elimination, among other very relevant code optimization strategies.

Currently, rco can be installed from GitHub. The rco package can be used as an RStudio addin, through a Shiny GUI, or programmatically via the optimize_files() and optimize_text() functions.
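As a minimal sketch of installation and use (the GitHub repository name below, jcrodriguez1989/rco, and the exact form of the optimize_text() call are assumptions; check the package documentation for details):

# Install rco from GitHub (repository name assumed to be jcrodriguez1989/rco)
# install.packages("remotes")
remotes::install_github("jcrodriguez1989/rco")

library(rco)

# Optimize a snippet of code passed as text, e.g. constant folding
optimize_text("hours_per_year <- 365 * 24")

# Or optimize one or more R script files
# optimize_files("my_script.R")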

Found an optimization method not currently supported by the rco package? Feel free to contribute! Your contribution is more than welcome.

Install rco now and make your R code run faster today!

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


15+ Resources to Get Started with R

Posted: 30 Jan 2020 10:11 PM PST

[This article was first published on R-posts.com, and kindly contributed to R-bloggers].

R is the second most sought after language in data science behind Python, so gaining mastery of R is a prerequisite to a thriving career in the field. Whether you're an experienced developer or a newbie considering a career move, here are some excellent resources so you can get started with R.

[Related Article: Data-Driven Exploration of the R User Community Worldwide]

What is R?

R is a programming language and environment designed for statistical analysis. It's used mainly by data miners and statisticians. It's a free resource and runs on a wide variety of platforms, including UNIX, Windows, and Mac OS. 

It has thousands of well-documented extensions and is cross-platform compatible. It's not quite as popular outside the field of data science, but it's one of the best options for deep-dive exploration of datasets or for one-off data insights.

Head over to the R site and download a copy of R so you're ready to get started.

Free R Resources for Beginners

Let's take a look at how a beginner might break into R. It's not quite as friendly as Python, but it's definitely accessible with good resources and practice. 

Platforms and Documentation

r-bloggers.com: R-bloggers is a collection of blogs designed by R experts that covers a wide range of R topics. No matter what you're curious about or have an issue with, R-bloggers has it covered.

Books

R for Data Science: This classic handbook provides resources and documentation. It's available for free on the website, or you can purchase a physical copy from Amazon.

Hands-on Programming with R: Garrett Grolemund's classic is a practical, hands-on approach to R programming. It gives you the instruction you need plus practical programming skills to begin with R right from the very beginning.

Courses

Codecademy: Codecademy's mission is to bring development knowledge to beginners, and its R offerings are no different. While many of the lessons require a membership, it does offer a basic set of courses to get you started.

edX.org: EdX offers a range of free R courses to get you started, but we recommend starting with Microsoft's Introduction to R for Data Science for a comprehensive overview. The courses cost nothing and are offered asynchronously. Some do come with official certification for a fee.

Free R Resources for Developers

If you've already got some development experience under your belt, these resources could be a great way to get started with R by utilizing your current experience. Even better, they're free.

Platforms and Documentation

storybench.com: Storybench is an experiential learning platform designed to provide exercises in digital storytelling. They offer projects in R, most notably "How to Explore Correlations in R." Once you've gotten the basics, the next logical step is to take on projects for hands-on learning.

Books

R Programming for Data Science: This option is available for free (though you can choose to donate in support of the project). It offers full resources for learning R and understanding key data science principles. If you upgrade the package, the online book comes with a full video suite.

Text Mining with R: Another book available for free on the website, this option offers a targeted approach to text mining with a full Github repository as support.

R in Action: Another entirely free resource for business developers. If you've already got an established career in development in the business world, this could be an excellent resource for getting started with R in the business world.

Courses

Coursera: Johns Hopkins's popular partnership with Coursera, "Data Science: Foundations Using R," is a great way for developers to build skills to break into the field of data science.

edX + Harvard: X Series Program in Data Analysis for Life Sciences is a course series designed to implement both learning R and real-life projects for a full learning experience. You can upgrade to an official learning certificate for a fee or take the individual courses for free.

Paid Resources for Beginners and Beyond

Sometimes, you've got to invest a little in your learning experience. Here are a couple of things that can really jumpstart your R-programming skills if you've got some wiggle room in your budget.

Getting Started with R: A primer on using R for the biological sciences. It contains valuable information for getting started on statistical analysis using the R programming language.

flowingdata.com: Flowingdata is a membership site designed to elevate your visualizations. It's another excellent way to get experiential learning with R projects.

RStudio: It's not cheap, but if you're serious about making a career in R, you'll want to get it. Save up and invest. They do, however, have a series of free webinars you can peruse.

Bonus! More Blogs and Newsletters

https://blog.revolutionanalytics.com/r/ : Blog designed to highlight milestones in Data Science and includes a range of topics including both R and Python for you multilingual developers out there.

https://journal.r-project.org/: Open-access, refereed journal detailing the latest in R programming news and projects. Papers have a focus on accessibility, and the articles are intended to reach a wide audience.

https://morningcupofcoding.com/: Offers a wide range of curated coding articles, including R programming. Check their back issues for articles of interest.

opendatascience.com: ODSC's general weekly newsletter provides members with trending topics in the fields of modeling, tools & platforms, and more.

Getting Started with R Programming

[Related Article: Where is Data Science Heading? Watching R's Most Popular Packages May Have the Answer]

While both Python and R are invaluable resources for getting started in Data Science, the statistical capabilities of R are in wide demand. Whether you're new to the world of coding or an experienced developer, R can open doors into the world of Data Science.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


Beginners guide to Bubble Map with Shiny

Posted: 30 Jan 2020 10:11 PM PST

[This article was first published on R-posts.com, and kindly contributed to R-bloggers].

Map Bubble
A bubble map is a type of map chart where the position of a bubble (circle) indicates geographical location and the bubble size shows differences in the magnitude of a quantitative variable such as population.

We will be using the Highcharter package to show earthquake magnitude and depth. Highcharter is a versatile library for building interactive charts, one of the easiest to learn and to integrate with Shiny.

Bubble Map

 

About dataset
The dataset used here is from the US Geological Survey website and covers recent earthquake events from one week. There are about 420 recorded observations with magnitude greater than 2.0 globally. The dataset has 22 variables, of which we will be using time, latitude, longitude, depth, magnitude (mag) and the nearest named place of the event.

Shiny Application
This application has a single app.R file and the earthquake dataset. Before we start with the UI function, we load the dataset and fetch the world json object from the highcharts map collection with the hcmap() function. See the app here.

library(shiny)
library(highcharter)
library(dplyr)

edata <- read.csv('earthquake.csv') %>%
  rename(lat = latitude, lon = longitude)

wmap <- hcmap()

Using the dplyr package, the latitude and longitude variables are renamed lat and lon with the rename verb. Column names are important: the map series expects lat and lon (and, later, z).

ui
It has a sidebar panel with 3 widgets and a main panel for displaying the map.

  • Two sliders, one for filtering out low-magnitude events and the other for adjusting bubble size.
  • One select widget for the bubble-size variable. The user can select the magnitude or the depth of the earthquake event; mag and depth are columns in the dataset.
  • The widget output function highchartOutput for use in Shiny.
ui <- fluidPage(
  titlePanel("MapBubble"), # Application title
  sidebarLayout(
    sidebarPanel(
      sliderInput('mag', 'Magnitude more than (Richter Scale)',
                  min = 1, max = 6, step = 0.5, value = 0),
      selectInput('bubble', 'Bubble Size indicates',
                  choices = c('Magnitude' = 'mag', 'Depth(in Km)' = 'depth')),
      sliderInput('bublesize', 'Adjust bubble Size',
                  min = 2, max = 10, step = 1, value = 6)
    ),
    # Display a Map Bubble
    mainPanel(
      highchartOutput('eqmap', height = "500px")
    )
  )
)

Server
Before rendering, we filter the dataset within a reactive context. The numeric column that we want the bubble size to represent must be named z. input$bubble comes from the select widget.

renderHighchart() is the render function for use in Shiny. We pass the filtered data and set the chart type to mapbubble in the hc_add_series() function. The place, time and z variables are displayed in the tooltip using the "point" format.
The subtitle is used to show the number of filtered observations on the map.

server <- function(input, output) {

  data <- reactive(
    edata %>%
      filter(mag >= input$mag) %>%
      rename(z = input$bubble)
  )

  output$eqmap <- renderHighchart(
    wmap %>%
      hc_legend(enabled = F) %>%
      hc_add_series(data = data(), type = "mapbubble", name = "",
                    maxSize = paste0(input$bublesize, '%')) %>% # bubble size in perc %
      hc_tooltip(useHTML = T, headerFormat = '',
                 pointFormat = paste('Location: {point.place}    Time: {point.time}   ',
                                     input$bubble, ': {point.z}')) %>%
      hc_title(text = "Global Seismic Activity") %>%
      hc_subtitle(text = paste('No of obs:', nrow(data()), sep = '')) %>%
      hc_mapNavigation(enabled = T)
  )
}

# Run the application
shinyApp(ui = ui, server = server)

The Shiny R file can be found here in the GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


An efficient way to install and load R packages

Posted: 30 Jan 2020 04:00 PM PST

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers].

What is a R package and how to use it?

Unlike other programs, only fundamental functionalities come by default with R. You will thus often need to install "extensions" to perform the analyses you want. These extensions, which are collections of functions and datasets developed and published by R users, are called packages. They extend existing base R functionalities by adding new ones. R is open source, so everyone can write code and publish it as a package, and everyone can install a package and start using the functions and datasets built inside it, all for free.

In order to use a package, it needs to be installed on your computer by running install.packages("name_of_package") (do not forget the "" around the name of the package, otherwise R will look for an object saved under that name!). Once the package is installed, you must load it, and only after it has been loaded can you use all the functions and datasets it contains. To load a package, run library(name_of_package) (this time the "" around the name of the package are optional, but can still be used if you wish).
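For instance, with the {ggplot2} package (used here purely as an illustration), this two-step workflow looks like:

# Install once per computer (quotes required)
install.packages("ggplot2")

# Load at the start of every session (quotes optional)
library(ggplot2)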

Inefficient way to install and load R packages

Depending on how long you have been using R, you may use a limited number of packages or, on the contrary, a large number of them. As you use more and more packages, you will soon start to have (too) many lines of code just for installing and loading them.

Here is a preview of the code from my PhD thesis showing how the installation and loading of R packages looked like when I started working on R (only a fraction of them are displayed to shorten the code):

# Installation of required packages
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("readxl")
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggfortify")
install.packages("DT")
install.packages("reshape2")
install.packages("knitr")
install.packages("lubridate")

# Load packages
library("tidyverse")
library("ggplot2")
library("readxl")
library("dplyr")
library("tidyr")
library("ggfortify")
library("DT")
library("reshape2")
library("knitr")
library("lubridate")

As you can guess, the code became longer and longer as I needed more and more packages for my analyses. Moreover, I tended to reinstall all packages as I was working on 4 different computers and I could not remember which packages were already installed on which machine. Reinstalling all packages every time I opened my script or R Markdown document was a waste of time.

More efficient way

Then one day, a colleague of mine shared some of his code with me. I am glad he did as he introduced me to a much more efficient way to install and load R packages. He gave me the permission to share the tip, so here is the code I now use to perform the task of installing and loading R packages:

# Package names
packages <- c("ggplot2", "readxl", "dplyr", "tidyr", "ggfortify", "DT",
              "reshape2", "knitr", "lubridate", "pwr", "psy", "car", "doBy",
              "imputeMissings", "RcmdrMisc", "questionr", "vcd", "multcomp",
              "KappaGUI", "rcompanion", "FactoMineR", "factoextra", "corrplot",
              "ltm", "goeveg", "corrplot", "FSA", "MASS", "scales", "nlme",
              "psych", "ordinal", "lmtest", "ggpubr", "dslabs", "stringr",
              "assist", "ggstatsplot", "forcats", "styler", "remedy",
              "snakecaser", "addinslist", "esquisse", "here", "summarytools",
              "magrittr", "tidyverse", "funModeling", "pander", "cluster")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
lapply(packages, library, character.only = TRUE) %>%
  invisible()

This code for installing and loading R packages is more efficient in several ways:

  1. The function install.packages() accepts a vector as argument, so what used to be one line of code per package is now a single line covering all packages
  2. In the second part of the code, it checks whether each package is already installed and then installs only the missing ones (see the short illustration after this list)
  3. Regarding the packages loading (the last part of the code), the lapply() function is used to call the library() function on all packages at once, which makes the code more concise
  4. The output when loading a package is rarely useful; the invisible() function removes this output
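To see what the check in step 2 returns, here is a small illustration (assuming {ggplot2} is installed on your machine and "notarealpkg" is not):

# Which of these packages are already installed?
c("ggplot2", "notarealpkg") %in% rownames(installed.packages())
## [1]  TRUE FALSE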

From that day on, every time I need to use a new package, I simply add it to the vector packages, which sits at the top of my scripts and R Markdown documents. No matter which computer I am working on, running the entire code will install only the missing packages and will load all of them. This greatly reduced the running time for the installation and loading of my R packages.

Thanks for reading. I hope the article helped you to install and load R packages in a more efficient way. See other articles about R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.


another easy Riddler

Posted: 30 Jan 2020 03:20 PM PST

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

A quick riddle from the Riddler

In a two-person game, Abigail and Zian both choose between a and z. Abigail wins one point with probability .9 if they choose (a,a) and with probability 1 if they choose (a,z), and two points with probability .4 if they choose (z,z) and with probability .6 if they choose (z,a). Find the optimal probabilities δ and ς of choosing a for Abigail and Zian, respectively, when δ is known to Zian.

Since the average gain for Abigail is δ(1 − .1ς) + 2(1 − δ)(.4 + .2ς), the riddle sums up as solving the minimax problem

\[\max_\delta \min_\varsigma \delta(1-.1\varsigma)+2(1-\delta)(.4+.2\varsigma)\]

The solution in ς is either 0 or 1 depending on δ being smaller or larger than 12/22, which leads to this value as the expected gain. The saddlepoint is hardly visible in the above image. While ς is either 0 or 1 in the optimal setting, a constant choice of 1 or 0 would modify the optimum for δ, except that Abigail must declare her value of δ!
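As a quick numerical sketch of the minimax (using only the gain formula stated above), a grid search over δ with Zian best-responding at ς in {0, 1} looks like this:

# Abigail's expected gain as a function of her and Zian's probabilities of playing a
gain <- function(delta, sigma) {
  delta * (1 - 0.1 * sigma) + 2 * (1 - delta) * (0.4 + 0.2 * sigma)
}
delta <- seq(0, 1, by = 0.001)
# For each declared delta, Zian picks the sigma in {0, 1} minimising Abigail's gain
worst <- pmin(gain(delta, 0), gain(delta, 1))
delta[which.max(worst)] # Abigail's maximin choice of delta
max(worst)              # the corresponding expected gain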

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


Supplement to ‘Nonparametric estimation of the service time distribution in discrete-time queueing networks’

Posted: 29 Jan 2020 04:00 PM PST

[This article was first published on R on Sastibe's Data Science Blog, and kindly contributed to R-bloggers].

Great news: a scientific article I have co-authored has been accepted for publication and can now be found online here or via the DOI 10.1016/j.spa.2020.01.011. Yes, my list of publications has been amended1. This article has been through quite a lengthy review process and was the main motivation for another one of my blog posts. That post dates to September 2018, yet I only started working on these simulations in the framework of the second round of peer review…

Anyway, the results presented in the paper are based on simulations I ran using a specifically designed R package, aptly named queueingnetworkR. This package was only ever designed for this singular purpose; for anyone who might be interested in repeating the experiments or using the code in any other fashion, the latest tarball can be found here, and the GitLab page is here.


  1. I am glad you asked. [return]

To leave a comment for the author, please follow the link and comment on their blog: R on Sastibe's Data Science Blog.


Do my data follow a normal distribution ? A note on the most widely used distribution and how to test for normality in R

Posted: 28 Jan 2020 04:00 PM PST

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers].

What is a normal distribution?

The normal distribution is a function that defines how a set of measurements is distributed around the center of these measurements (i.e., the mean). Many natural phenomena in real life can be approximated by a bell-shaped frequency distribution known as the normal distribution or the Gaussian distribution.

The normal distribution is a mound-shaped, unimodal and symmetric distribution where most measurements gather around the mean. Moreover, the further a measurement deviates from the mean, the lower its probability of occurring. In this sense, for a given variable, it is common to find values close to the mean, but less and less likely to find values as we move away from the mean. Last but not least, since the normal distribution is symmetric around its mean, extreme values in both tails of the distribution are equally unlikely. For instance, given that adult height follows a normal distribution, most adults are close to the average height and extremely short adults occur as infrequently as extremely tall adults.

In this article, the focus is on understanding the normal distribution, the associated empirical rule, its parameters and how to compute \(Z\) scores to find probabilities under the curve (illustrated with examples). As it is a requirement in some statistical tests, we also show 4 complementary methods to test the normality assumption in R.

Empirical rule

Data possessing an approximately normal distribution have a definite variation, as expressed by the following empirical rule:

  • \(\mu \pm \sigma\) includes approximately 68% of the observations
  • \(\mu \pm 2 \cdot \sigma\) includes approximately 95% of the observations
  • \(\mu \pm 3 \cdot \sigma\) includes almost all of the observations (99.7% to be more precise)

Normal distribution & empirical rule. Source: Wikipedia

where \(\mu\) and \(\sigma\) correspond to the population mean and population standard deviation, respectively.

The empirical rule is illustrated by the following 2 examples. Suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have, approximately, a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). It can then be deduced that approximately 68% of the scores are between 58 and 76, that approximately 95% of the scores are between 49 and 85, and that almost all of the scores (99.7%) are between 40 and 94. Thus, knowing the mean and the standard deviation gives us a fairly good picture of the distribution of scores. Now suppose that a single university student is randomly selected from those who took the exam. What is the probability that her score will be between 49 and 85? Based on the empirical rule, we find that 0.95 is a reasonable answer to this probability question.
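As a quick check of this example in R (using the pnorm() function introduced later in this article):

# Probability that a score falls within two standard deviations of the mean,
# i.e. between 49 and 85 when mu = 67 and sigma = 9
pnorm(85, mean = 67, sd = 9) - pnorm(49, mean = 67, sd = 9)
## [1] 0.9544997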

The utility and value of the empirical rule are due to the common occurrence of approximately normal distributions of measurements in nature. For example, IQ, shoe size, height, birth weight, etc. are approximately normally-distributed. You will find that approximately 95% of these measurements will be within \(2\sigma\) of their mean (Wackerly, Mendenhall, and Scheaffer 2014).

Parameters

Like many probability distributions, the shape and probabilities of the normal distribution are defined entirely by its parameters. The normal distribution has two parameters: (i) the mean \(\mu\) and (ii) the variance \(\sigma^2\) (i.e., the square of the standard deviation \(\sigma\)). The mean \(\mu\) locates the center of the distribution, that is, the central tendency of the observations, and the variance \(\sigma^2\) defines the width of the distribution, that is, the spread of the observations.

The mean \(\mu\) can take on any finite value (i.e., \(-\infty < \mu < \infty\)), whereas the variance \(\sigma^2\) can assume any positive finite value (i.e., \(\sigma^2 > 0\)). The shape of the normal distribution changes based on these two parameters. Since there is an infinite number of combinations of the mean and variance, there is an infinite number of normal distributions, and thus an infinite number of forms.

For instance, see how the shapes of the normal distributions vary when the two parameters change:

As you can see on the second graph, when the variance (or the standard deviation) decreases, the observations are closer to the mean. On the contrary, when the variance (or standard deviation) increases, it is more likely that observations will be further away from the mean.
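A minimal sketch to reproduce this kind of comparison yourself with the dnorm() function (the specific means and standard deviations below are chosen only for illustration):

# Compare normal densities with different means and standard deviations
x <- seq(-10, 10, length.out = 400)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", ylab = "Density")
lines(x, dnorm(x, mean = 0, sd = 2), lty = 2) # larger sd: flatter and wider
lines(x, dnorm(x, mean = 2, sd = 1), lty = 3) # larger mean: same shape, shifted right
legend("topright", legend = c("N(0, 1)", "N(0, 4)", "N(2, 1)"), lty = 1:3, bty = "n")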

A random variable \(X\) which follows a normal distribution with a mean of 430 and a variance of 17 is denoted \(X \sim \mathcal{N}(\mu = 430, \sigma^2 = 17)\).

We have seen that, although different normal distributions have different shapes, all normal distributions have common characteristics:

  • They are symmetric, 50% of the population is above the mean and 50% of the population is below the mean
  • The mean, median and mode are equal
  • The empirical rule detailed earlier is applicable to all normal distributions

Probabilities and standard normal distribution

Probabilities and quantiles for random variables with normal distributions are easily found using R via the functions pnorm() and qnorm(). Probabilities associated with a normal distribution can also be found using this Shiny app. However, before computing probabilities, we need to learn more about the standard normal distribution and the \(Z\) score.

Although there are infinitely many normal distributions (since there is a normal distribution for every combination of mean and variance), we need only one table to find the probabilities under the normal curve: the standard normal distribution. The standard normal distribution is a special case of the normal distribution where the mean is equal to 0 and the variance is equal to 1. A normal random variable \(X\) can always be transformed into a standard normal random variable \(Z\), a process known as "scaling" or "standardization", by subtracting the mean from the observation and dividing the result by the standard deviation. Formally:

\[Z = \frac{X - \mu}{\sigma}\]

where \(X\) is the observation, and \(\mu\) and \(\sigma\) are the mean and standard deviation of the population from which the observation was drawn. So the mean of the standard normal distribution is 0 and its variance is 1, denoted \(Z \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\).

From this formula, we see that \(Z\), referred to as the standard score or \(Z\) score, allows us to see how far away one specific observation is from the mean of all observations, with the distance expressed in standard deviations. In other words, the \(Z\) score corresponds to the number of standard deviations one observation is away from the mean. A positive \(Z\) score means that the specific observation is above the mean, whereas a negative \(Z\) score means that the specific observation is below the mean. \(Z\) scores are often used to compare an individual to her peers, or more generally, a measurement to its distribution.

For instance, suppose a student scores 60 on a statistics exam where the mean score of the class is 40, and 65 on an economics exam where the mean score of the class is 80. Given the "raw" scores, one would say that the student performed better in economics than in statistics. However, taking her peers into consideration, it is clear that the student performed relatively better in statistics than in economics. Computing \(Z\) scores allows us to take into consideration all other students (i.e., the entire distribution) and gives a better measure of comparison. Let's compute the \(Z\) scores for the two exams, assuming that the scores for both exams follow a normal distribution with the following parameters:

                    Statistics   Economics
Mean                        40          80
Standard deviation           8        12.5
Student's score             60          65

\(Z\) scores for:

  • Statistics: \(Z_{stat} = \frac{60 - 40}{8} = 2.5\)
  • Economics: \(Z_{econ} = \frac{65 - 80}{12.5} = -1.2\)

On the one hand, the \(Z\) score for the exam in statistics is positive (\(Z_{stat} = 2.5\)), which means that she performed better than average. On the other hand, her score for the exam in economics is negative (\(Z_{econ} = -1.2\)), which means that she performed worse than average. Below is an illustration of her grades on a standard normal distribution for better comparison:

Although the score in economics is better in absolute terms, the score in statistics is actually relatively better when comparing each score within its own distribution.
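The same standardisation is a one-liner in R; the values below simply reuse the means and standard deviations from the table above:

# z = (x - mu) / sigma for each exam
z_stat <- (60 - 40) / 8    # statistics exam
z_econ <- (65 - 80) / 12.5 # economics exam
c(statistics = z_stat, economics = z_econ)
## statistics  economics 
##        2.5       -1.2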

Furthermore, the \(Z\) score also enables us to compare observations that would otherwise be impossible to compare because, for example, they have different units. Suppose you want to compare a salary in € with a weight in kg. Without standardization, there is no way to conclude whether someone is more extreme in terms of her wage or in terms of her weight. Thanks to \(Z\) scores, we can compare two values that were not comparable to each other in the first place.

A final remark regarding the interpretation of a \(Z\) score: a rule of thumb is that an observation with a \(Z\) score between -3 and -2 or between 2 and 3 is considered a rare value. An observation with a \(Z\) score smaller than -3 or larger than 3 is considered an extremely rare value. A value with any other \(Z\) score is considered neither rare nor extremely rare.

Areas under the normal distribution in R and by hand

Now that we have covered the \(Z\) score, we are going to use it to determine the area under the curve of a normal distribution.

Note that there are several ways to arrive at the solution in the following exercises. You may therefore use other steps than the ones presented to obtain the same result.

Ex. 1

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(Z > 1)\).

We actually look for the shaded area in the following figure:


Standard normal distribution: \(P(Z > 1)\)

In R

pnorm(1,
  mean = 0,
  sd = 1, # sd stands for standard deviation
  lower.tail = FALSE
)
## [1] 0.1586553

We look for the probability of \(Z\) being larger than 1 so we set the argument lower.tail = FALSE. The default lower.tail = TRUE would give the result for \(P(Z < 1)\). Note that \(P(Z = 1) = 0\) so writing \(P(Z > 1)\) or \(P(Z \ge 1)\) is equivalent.

By hand

Note that the random variable \(Z\) already has a mean of 0 and a standard deviation of 1, so no transformation is required. To find the probability by hand, we need to refer to the standard normal distribution table shown below:


Standard normal distribution table (Wackerly, Mendenhall, and Scheaffer 2014).

From the illustration at the top of the table, we see that the values inside the table correspond to the area under the normal curve above a certain \(z\). Since we are looking precisely at the probability above \(z = 1\) (since we look for \(P(Z > 1)\)), we can simply proceed down the first (\(z\)) column in the table until \(z = 1.0\). The probability is 0.1587. Thus, \(P(Z > 1) = 0.1587\). This is similar to what we found using R, except that values in the table are rounded to 4 digits.

Ex. 2

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(−1 \le Z \le 1)\).

We are looking for the shaded area in the following figure:


Standard normal distribution: \(P(−1 \le Z \le 1)\)

In R

pnorm(1, lower.tail = TRUE) - pnorm(-1, lower.tail = TRUE)
## [1] 0.6826895

Note that the arguments by default for the mean and the standard deviation are mean = 0 and sd = 1. Since this is what we need, we can omit them.1

By hand

For this exercise we proceed by steps:

  1. The shaded area corresponds to the entire area under the normal curve minus the two white areas in both tails of the curve.
  2. We know that the normal distribution is symmetric.
  3. Therefore, the shaded area is the entire area under the curve minus two times the white area in the right tail of the curve, the white area in the right tail of the curve being \(P(Z > 1)\).
  4. We also know that the entire area under the normal curve is 1.
  5. Thus, the shaded area is 1 minus 2 times \(P(Z > 1)\):

\[P(-1 \le Z \le 1) = 1 - 2 \cdot P(Z > 1) = 1 - 2 \cdot 0.1587 = 0.6826\]

where \(P(Z > 1) = 0.1587\) has been found in the previous exercise.

Ex. 3

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(0 \le Z \le 1.37)\).

We are looking for the shaded area in the following figure:


Standard normal distribution: \(P(0 \le Z \le 1.37)\)

In R

pnorm(0, lower.tail = FALSE) - pnorm(1.37, lower.tail = FALSE)
## [1] 0.4146565

By hand

Again we proceed by steps for this exercise:

  1. We know that \(P(Z > 0) = 0.5\) since the entire area under the curve is 1, half of it is 0.5.
  2. The shaded area is half of the entire area under the curve minus the area from 1.37 to infinity.
  3. The area under the curve from 1.37 to infinity corresponds to \(P(Z > 1.37)\).
  4. Therefore, the shaded area is \(0.5 – P(Z > 1.37)\).
  5. To find \(P(Z > 1.37)\), proceed down the \(z\) column in the table to the entry 1.3 and then across the top of the table to the column labeled .07 to read \(P(Z > 1.37) = .0853\)
  6. Thus,

\[P(0 \le Z \le 1.37) = P(Z > 0) - P(Z > 1.37) = 0.5 - 0.0853 = 0.4147\]

Ex. 4

Recall the example presented in the empirical rule section: suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). What fraction of the scores lies between 70 and 80?

We are looking for the shaded area in the following figure:


\(P(70 \le X \le 80)\) where \(X \sim \mathcal{N}(\mu = 67, \sigma^2 = 9^2)\)

In R

pnorm(70, mean = 67, sd = 9, lower.tail = FALSE) - pnorm(80, mean = 67, sd = 9, lower.tail = FALSE)
## [1] 0.2951343

By hand

Remember that we are looking for \(P(70 \le X \le 80)\) where \(X \sim \mathcal{N}(\mu = 67, \sigma^2 = 9^2)\). The random variable \(X\) is in its "raw" format, meaning that it has not been standardized yet, since the mean is 67 and the variance is \(9^2\). We thus first need to standardize the endpoints 70 and 80 with the following formula:

\[Z = \frac{X - \mu}{\sigma}\]

After the standardization, \(x = 70\) becomes (in terms of \(z\), so in terms of deviation from the mean expressed in standard deviation):

\[z = \frac{70 - 67}{9} = 0.3333\]

and \(x = 80\) becomes:

\[z = \frac{80 - 67}{9} = 1.4444\]

The figure above in terms of \(X\) is now in terms of \(Z\):


\(P(0.3333 \le Z \le 1.4444)\) where \(Z \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\)

Finding the probability \(P(0.3333 \le Z \le 1.4444)\) is similar to exercises 1 to 3:

  1. The shaded area corresponds to the area under the curve from \(z = 0.3333\) to \(z = 1.4444\).
  2. In other words, the shaded area is the area under the curve from \(z = 0.3333\) to infinity minus the area under the curve from \(z = 1.4444\) to infinity.
  3. From the table, \(P(Z > 0.3333) = 0.3707\) and \(P(Z > 1.4444) = 0.0749\)
  4. Thus:

\[P(0.3333 \le Z \le 1.4444) = P(Z > 0.3333) - P(Z > 1.4444) = 0.3707 - 0.0749 = 0.2958\]

The difference from the probability found using R comes from rounding.

To conclude this exercise, we can say that, given that the mean score is 67 and the standard deviation is 9, 29.58% of the students scored between 70 and 80.

Ex. 5

See another example in a context here.

Why is the normal distribution so crucial in statistics?

The normal distribution is important for three main reasons:

  • Some statistical hypothesis tests assume that the data follow a normal distribution
  • The central limit theorem states that, for a large number of observations (\(n > 30\)), no matter the underlying distribution of the original variable, the distribution of the sample mean (\(\overline{X}_n\)) and of the sum (\(S_n = \sum_{i = 1}^n X_i\)) may be approximated by a normal distribution (see the short simulation sketch after this list)
  • Linear and nonlinear regression assume that the residuals are normally-distributed
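As a minimal illustration of the central limit theorem (the exponential distribution and the sample size of 40 below are arbitrary choices for this sketch):

# Means of n = 40 draws from a clearly non-normal (exponential) distribution
set.seed(42)
sample_means <- replicate(10000, mean(rexp(40, rate = 1)))
# The histogram of the 10,000 sample means is approximately bell-shaped
hist(sample_means, breaks = 50,
     main = "Distribution of sample means", xlab = "Sample mean")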

It is therefore useful to know how to test for normality in R, which is the topic of next sections.

How to test the normality assumption

As mentioned above, some statistical tests require that the data follow a normal distribution, or the result of the test may be flawed.

In this section, we show 4 complementary methods to determine whether your data follow a normal distribution in R.

Histogram

A histogram displays the spread and shape of a distribution, so it is a good starting point to evaluate normality. Let's have a look at the histogram of a distribution that we would expect to follow a normal distribution, the height of 1,000 adults in cm:

The normal curve with the corresponding mean and variance has been added to the histogram. The histogram follows the normal curve so the data seems to follow a normal distribution.

Below is the minimal code for a histogram in R with the dataset iris:

data(iris)
hist(iris$Sepal.Length)

In {ggplot2}:

ggplot(iris) +
  aes(x = Sepal.Length) +
  geom_histogram()

Histograms are however not sufficient, particularly in the case of small samples, because the number of bins greatly changes their appearance. Histograms are not recommended when the number of observations is less than 20 because they do not always correctly illustrate the distribution. See two examples below with datasets of 10 and 12 observations:

Can you tell whether these datasets follow a normal distribution? Surprisingly, both follow a normal distribution!

Density plot

Density plots also provide a visual judgment about whether the data follow a normal distribution. They are similar to histograms as they also allow us to analyze the spread and the shape of the distribution. However, they are a smoothed version of the histogram. Here is the density plot drawn from the dataset on the height of the 12 adults discussed above:

plot(density(dat_hist$value))

In {ggpubr}:

library("ggpubr") # package must be installed first  ggdensity(dat_hist$value,    main = "Density plot of adult height",    xlab = "Height (cm)"  )

Since it is hard to test for normality from histograms and density plots only, it is recommended to corroborate these graphs with a QQ-plot. The QQ-plot, also known as a normality plot, is the third method presented for evaluating normality.

QQ-plot

Like histograms and density plots, QQ-plots allow us to visually evaluate the normality assumption. Here is the QQ-plot drawn from the dataset on the height of the 12 adults discussed above:

library(car)
qqPlot(dat_hist$value)

## [1] 12  2

In {ggpubr}:

library(ggpubr)
ggqqplot(dat_hist$value)

Instead of looking at the spread of the data (as is the case with histograms and density plots), with QQ-plots we only need to ascertain whether the data points follow the line (sometimes referred to as Henry's line).

If points are close to the reference line and within the confidence bands, the normality assumption can be considered as met. The bigger the deviation between the points and the reference line, and the more they lie outside the confidence bands, the less likely it is that the normality condition is met. The heights of these 12 adults seem to follow a normal distribution because all points lie within the confidence bands.

When facing a non-normal distribution as shown by the QQ-plot below (systematic departure from the reference line), the first step is usually to apply the logarithm transformation on the data and recheck to see whether the log-transformed data are normally distributed. Applying the logarithm transformation can be done with the log() function.
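A minimal sketch of that step, where x stands for whatever skewed variable you are working with (a hypothetical placeholder, not an object from this article):

# Log-transform a right-skewed variable (x is a hypothetical placeholder)
# and recheck the QQ-plot on the transformed values
x_log <- log(x)
car::qqPlot(x_log)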

Note that QQ-plots are also a convenient way to assess whether residuals from regression analysis follow a normal distribution.

Normality test

The 3 tools presented above were visual inspections of normality. Nonetheless, visual inspection may sometimes be unreliable, so it is also possible to formally test whether the data follow a normal distribution with statistical tests. These normality tests compare the distribution of the data to a normal distribution in order to assess whether the observations show an important deviation from normality.

The two most common normality tests are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Both tests have the same hypotheses, that is:

  • \(H_0\): the data follow a normal distribution
  • \(H_1\): the data do not follow a normal distribution

The Shapiro-Wilk test is recommended for testing normality as it provides better power than the Kolmogorov-Smirnov test.2 In R, the Shapiro-Wilk test of normality can be done with the function shapiro.test():3

shapiro.test(dat_hist$value)
## 
##  Shapiro-Wilk normality test
## 
## data:  dat_hist$value
## W = 0.93968, p-value = 0.4939

From the output, we see that the \(p\)-value is \(> 0.05\), implying that the data are not significantly different from a normal distribution. In other words, we can assume normality. This test confirms the QQ-plot, which also indicated normality (as all points lay within the confidence bands).

It is important to note that, in practice, normality tests are often considered too conservative in the sense that for large sample sizes (\(n > 50\)), a small deviation from normality may cause the normality condition to be violated. A normality test is a hypothesis test, so as the sample size increases, its capacity to detect smaller differences increases. So as the number of observations increases, the Shapiro-Wilk test becomes very sensitive even to a small deviation from normality. As a consequence, it can happen that, according to the normality test, the data do not follow a normal distribution even though the departures from the normal distribution are negligible and the data in fact follow a normal distribution. For this reason, the normality condition is often verified based on a combination of all methods presented in this article, that is, visual inspections (with histograms and QQ-plots) and a formal inspection (with the Shapiro-Wilk test, for instance).

I personally tend to prefer QQ-plots over histograms and normality tests so I do not have to bother about the sample size. This article showed the different methods that are available; your choice will of course depend on the type of your data and the context of your analyses.

Thanks for reading. I hope the article helped you to learn more about the normal distribution and how to test for normality in R. See other articles about statistics or about R.

As always, if you have a statistical question related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

References

Wackerly, Dennis, William Mendenhall, and Richard L Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.


  1. The argument lower.tail = TRUE is also the default so we could omit it as well. However, for clarity and to make sure I compute the probabilities on the correct side of the curve, I prefer to keep this argument explicit by writing it.↩

  2. The Shapiro-Wilk test is based on the correlation between the sample and the corresponding normal scores.↩

  3. In R, the Kolmogorov-Smirnov test is performed with the function ks.test().↩

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.


The significance of the region on the salary of engineers in Sweden

Posted: 28 Jan 2020 04:00 PM PST

[This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers].

So far I have analysed the effect of experience, education, gender and year on the salary of engineers in Sweden. In this post, I will have a look at the effect of the region on the salary of engineers in Sweden.

Statistics Sweden use NUTS (Nomenclature des Unités Territoriales Statistiques), which is the EU's hierarchical regional division, to specify the regions.

First, define libraries and functions.

library (tidyverse) 
## -- Attaching packages --------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom)
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library (swemaps) # devtools::install_github('reinholdsson/swemaps')
library(sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car
##   influence.merMod                car
##   dfbeta.influence.merMod         car
##   dfbetas.influence.merMod        car
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
readfile <- function (file1){
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%
    drop_na() %>%
    mutate (year_n = parse_number (year))
}

nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 1, 4))

map_ln_n <- map_ln %>%
  mutate(lnkod_n = as.numeric(lnkod))

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000CG.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by region, sector, occupational group (SSYK 2012) and sex. Year 2014 – 2018. Monthly salary. All sectors.

We expect that the region is an important factor in salaries. As a null hypothesis, we assume that the region is not related to the salary and examine if we can reject this hypothesis with the data from Statistics Sweden.

tb <- readfile ("000000CG.csv") %>%    filter (`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%     left_join(nuts %>% distinct (NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en")) 
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,  ## coercing into character vector
tb_map <- readfile ("000000CG.csv") %>%
  filter (`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%
  left_join(nuts, by = c("region" = "NUTS2_en"))
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,  ## coercing into character vector
tb_map %>%    filter (sex == "men") %>%    right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%    ggplot() +      geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +        facet_grid(. ~ year) +       coord_equal() 


Figure 1: SSYK 214, Architects, engineers and related professionals, men, Year 2014 – 2018

tb_map %>%    filter (sex == "women") %>%    right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%    ggplot() +      geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +        facet_grid(. ~ year) +       coord_equal() 


Figure 2: SSYK 214, Architects, engineers and related professionals, women, Year 2014 – 2018

tb %>%
  ggplot () +
    geom_point (mapping = aes(x = year_n, y = salary, colour = region, shape = sex)) +
    labs(
      x = "Year",
      y = "Salary (SEK/month)"
    )


Figure 3: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The F-value from the Anova table for the region is 40 (Pr(>F) < 2.2e-16), sufficient for rejecting the null hypothesis that the region has no effect on the salary, holding year constant. The R-squared value is 0.882, implying a good fit of the model.

model <- lm (log(salary) ~ year_n + sex + region, data = tb)

tb <- bind_cols(tb, as_tibble(exp(predict(model, tb, interval = "confidence"))))

tb %>%
  ggplot () +
    geom_point (mapping = aes(x = year_n, y = fit, colour = region, shape = sex)) +
    labs(
      x = "Year",
      y = "Salary (SEK/month)"
    )


Figure 4: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

summary(model) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Summary from linear model fit')
Table 1: Summary from linear model fit
term estimate std.error statistic p.value
(Intercept) -29.2269335 3.7348614 -7.825440 0.00e+00
year_n 0.0198303 0.0018526 10.703985 0.00e+00
sexwomen -0.0587056 0.0052400 -11.203449 0.00e+00
regionSE12 East-Central Sweden -0.0574855 0.0104799 -5.485297 6.00e-07
regionSE21 Småland and islands -0.1286385 0.0104799 -12.274762 0.00e+00
regionSE22 South Sweden -0.0457725 0.0104799 -4.367642 4.26e-05
regionSE23 West Sweden -0.0513546 0.0104799 -4.900285 6.00e-06
regionSE31 North-Central Sweden -0.0948080 0.0104799 -9.046630 0.00e+00
regionSE32 Central Norrland -0.1099716 0.0104799 -10.493550 0.00e+00
regionSE33 Upper Norrland -0.1360235 0.0104799 -12.979443 0.00e+00
summary(model)$r.squared
## [1] 0.8817871
Anova(model, type=2) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit')
Table 1: Anova report from linear model fit
term sumsq df statistic p.value
year_n 0.0629183 1 114.57530 0
sex 0.0689270 1 125.51728 0
region 0.1548911 7 40.29419 0
Residuals 0.0384401 70 NA NA

Let's check what we have found.

model <- lm (log(salary) ~ year_n + sex + NUTS2_sh, data = tb)

plot_model(model, type = "pred", terms = c("NUTS2_sh"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 5: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)


Figure 6: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

tb[38,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
## 1 SE21 ~ 0 all~ 214 Engineering~ women 2016   34700   2016 SE21     38698.
## # ... with 2 more variables: lwr , upr 
tb[27,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
## 1 SE31 ~ 0 all~ 214 Engineering~ men   2015   38100   2015 SE31     41616.
## # ... with 2 more variables: lwr , upr 
tb[5,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
## 1 SE21 ~ 0 all~ 214 Engineering~ men   2014   41500   2014 SE21     39442.
## # ... with 2 more variables: lwr , upr 

Also examine the interaction between gender and region. The F-value from the Anova table for the interaction is 2.7 (Pr(>F) < 0.017), implying a relation at the 95% significance level.

model <- lm (log(salary) ~ year_n + sex * NUTS2_sh, data = tb)

summary(model) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Summary from linear model fit')
Table 2: Summary from linear model fit
term estimate std.error statistic p.value
(Intercept) -29.2212864 3.4559410 -8.4553776 0.0000000
year_n 0.0198303 0.0017142 11.5678970 0.0000000
sexwomen -0.0699998 0.0137140 -5.1042618 0.0000033
NUTS2_shSE12 -0.0755067 0.0137140 -5.5058132 0.0000007
NUTS2_shSE21 -0.1161490 0.0137140 -8.4693756 0.0000000
NUTS2_shSE22 -0.0520542 0.0137140 -3.7957018 0.0003332
NUTS2_shSE23 -0.0535387 0.0137140 -3.9039471 0.0002331
NUTS2_shSE31 -0.1099009 0.0137140 -8.0137791 0.0000000
NUTS2_shSE32 -0.1293018 0.0137140 -9.4284543 0.0000000
NUTS2_shSE33 -0.1327796 0.0137140 -9.6820490 0.0000000
sexwomen:NUTS2_shSE12 0.0360425 0.0193945 1.8583838 0.0677877
sexwomen:NUTS2_shSE21 -0.0249791 0.0193945 -1.2879441 0.2024762
sexwomen:NUTS2_shSE22 0.0125634 0.0193945 0.6477815 0.5194801
sexwomen:NUTS2_shSE23 0.0043683 0.0193945 0.2252313 0.8225283
sexwomen:NUTS2_shSE31 0.0301860 0.0193945 1.5564170 0.1246186
sexwomen:NUTS2_shSE32 0.0386605 0.0193945 1.9933706 0.0505553
sexwomen:NUTS2_shSE33 -0.0064879 0.0193945 -0.3345206 0.7390979
summary(model)$r.squared
## [1] 0.908906
Anova(model, type=2) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit')
Table 2: Anova report from linear model fit
term sumsq df statistic p.value
year_n 0.0629183 1 133.816241 0.0000000
sex 0.0689270 1 146.595736 0.0000000
NUTS2_sh 0.1548911 7 47.060909 0.0000000
sex:NUTS2_sh 0.0088184 7 2.679327 0.0170929
Residuals 0.0296216 63 NA NA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 7: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

Figure 8: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

tb[37,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
## 1 SE21 ~ 0 all~ 214 Engineering~  men   2016   40100   2016 SE21     41037.
## # ... with 2 more variables: lwr , upr

Now examine the interaction between year and region. The F-value from the Anova table for the interaction is 0.58 (Pr(>F) = 0.77), implying no significant interaction.

model <- lm(log(salary) ~ year_n * NUTS2_sh + sex, data = tb)
summary(model) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Summary from linear model fit')
Table 3: Summary from linear model fit
term estimate std.error statistic p.value
(Intercept) -32.1955668 10.7951966 -2.9823975 0.0040635
year_n 0.0213028 0.0053548 3.9782933 0.0001818
NUTS2_shSE12 -4.0204905 15.2667129 -0.2633501 0.7931402
NUTS2_shSE21 0.6481345 15.2667129 0.0424541 0.9662710
NUTS2_shSE22 18.2873103 15.2667129 1.1978551 0.2354607
NUTS2_shSE23 7.7565167 15.2667129 0.5080672 0.6131806
NUTS2_shSE31 -5.4895464 15.2667129 -0.3595762 0.7203666
NUTS2_shSE32 -3.2845607 15.2667129 -0.2151452 0.8303490
NUTS2_shSE33 9.2276482 15.2667129 0.6044293 0.5477290
sexwomen -0.0587056 0.0053548 -10.9632632 0.0000000
year_n:NUTS2_shSE12 0.0019658 0.0075728 0.2595848 0.7960305
year_n:NUTS2_shSE21 -0.0003853 0.0075728 -0.0508802 0.9595820
year_n:NUTS2_shSE22 -0.0090938 0.0075728 -1.2008536 0.2343036
year_n:NUTS2_shSE23 -0.0038730 0.0075728 -0.5114312 0.6108371
year_n:NUTS2_shSE31 0.0026760 0.0075728 0.3533662 0.7249937
year_n:NUTS2_shSE32 0.0015747 0.0075728 0.2079419 0.8359451
year_n:NUTS2_shSE33 -0.0046447 0.0075728 -0.6133392 0.5418602
summary(model)$r.squared
## [1] 0.8888956
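
Because the response is log-transformed, coefficients can be read as approximate relative differences in salary. For example, the sexwomen estimate of about -0.059 corresponds to exp(-0.059) - 1, roughly -5.7 %, i.e. women earn about 6 % less than men with year and region held fixed. A small sketch (not in the original post):

# Sketch: express the gender coefficient as a percentage salary difference.
coef_sex <- coef(model)["sexwomen"]
round(100 * (exp(coef_sex) - 1), 1)  # approximately -5.7 per cent
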
Anova(model, type = 2) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit')
Table 3: Anova report from linear model fit
term sumsq df statistic p.value
year_n 0.0629183 1 109.7152945 0.0000000
NUTS2_sh 0.1548911 7 38.5850132 0.0000000
sex 0.0689270 1 120.1931409 0.0000000
year_n:NUTS2_sh 0.0023115 7 0.5758246 0.7728945
Residuals 0.0361285 63 NA NA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 9: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

Figure 10: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

Finally, fit the full interaction between gender, year and region. The only significant interaction is the one between region and gender; its F-value from the Anova table is 2.4 (Pr(>F) = 0.033).

model <- lm(log(salary) ~ year_n * NUTS2_sh * sex, data = tb)
summary(model) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Summary from linear model fit')
Table 4: Summary from linear model fit
term estimate std.error statistic p.value
(Intercept) -34.6715204 14.5497427 -2.3829645 0.0211805
year_n 0.0225338 0.0072171 3.1222585 0.0030380
NUTS2_shSE12 5.2021839 20.5764435 0.2528223 0.8014851
NUTS2_shSE21 5.5082002 20.5764435 0.2676945 0.7900815
NUTS2_shSE22 25.5308721 20.5764435 1.2407816 0.2207168
NUTS2_shSE23 16.1162328 20.5764435 0.7832370 0.4373355
NUTS2_shSE31 -18.4371634 20.5764435 -0.8960326 0.3747072
NUTS2_shSE32 7.6855067 20.5764435 0.3735100 0.7104136
NUTS2_shSE33 5.0530040 20.5764435 0.2455723 0.8070603
sexwomen 4.8932017 20.5764435 0.2378060 0.8130438
year_n:NUTS2_shSE12 -0.0026179 0.0102066 -0.2564919 0.7986672
year_n:NUTS2_shSE21 -0.0027899 0.0102066 -0.2733393 0.7857651
year_n:NUTS2_shSE22 -0.0126899 0.0102066 -1.2433117 0.2197916
year_n:NUTS2_shSE23 -0.0080207 0.0102066 -0.7858392 0.4358239
year_n:NUTS2_shSE31 0.0090909 0.0102066 0.8906917 0.3775375
year_n:NUTS2_shSE32 -0.0038764 0.0102066 -0.3797940 0.7057736
year_n:NUTS2_shSE33 -0.0025723 0.0102066 -0.2520253 0.8020975
year_n:sexwomen -0.0024619 0.0102066 -0.2412080 0.8104213
NUTS2_shSE12:sexwomen -18.4453487 29.0994854 -0.6338720 0.5291736
NUTS2_shSE21:sexwomen -9.7201315 29.0994854 -0.3340310 0.7398112
NUTS2_shSE22:sexwomen -14.4871236 29.0994854 -0.4978481 0.6208644
NUTS2_shSE23:sexwomen -16.7194322 29.0994854 -0.5745611 0.5682714
NUTS2_shSE31:sexwomen 25.8952341 29.0994854 0.8898863 0.3779654
NUTS2_shSE32:sexwomen -21.9401348 29.0994854 -0.7539699 0.4545502
NUTS2_shSE33:sexwomen 8.3492884 29.0994854 0.2869222 0.7754068
year_n:NUTS2_shSE12:sexwomen 0.0091674 0.0144343 0.6351107 0.5283723
year_n:NUTS2_shSE21:sexwomen 0.0048091 0.0144343 0.3331727 0.7404549
year_n:NUTS2_shSE22:sexwomen 0.0071923 0.0144343 0.4982800 0.6205623
year_n:NUTS2_shSE23:sexwomen 0.0082955 0.0144343 0.5747113 0.5681705
year_n:NUTS2_shSE31:sexwomen -0.0128299 0.0144343 -0.8888492 0.3785170
year_n:NUTS2_shSE32:sexwomen 0.0109022 0.0144343 0.7552986 0.4537602
year_n:NUTS2_shSE33:sexwomen -0.0041447 0.0144343 -0.2871452 0.7752371
summary(model)$r.squared
## [1] 0.9231133
Anova(model, type = 2) %>%
  tidy() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Anova report from linear model fit')
Table 4: Anova report from linear model fit
term sumsq df statistic p.value
year_n 0.0629183 1 120.7946290 0.0000000
NUTS2_sh 0.1637127 14 22.4504512 0.0000000
sex 0.0777463 8 18.6578050 0.0000000
year_n:NUTS2_sh 0.0023115 7 0.6339728 0.7254512
year_n:sex 0.0000085 1 0.0163969 0.8986442
NUTS2_sh:sex 0.0088184 7 2.4186030 0.0332038
year_n:NUTS2_sh:sex 0.0022998 7 0.6307550 0.7280524
Residuals 0.0250018 48 NA NA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 11: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

Figure 12: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018
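
To summarise, the additive model and the three interaction models could also be compared in one place. The original post does not do this; a minimal sketch using AIC and a nested F-test against the gender-by-region model:

# Sketch only: refit the four models and compare them.
m1 <- lm(log(salary) ~ year_n + sex + NUTS2_sh, data = tb)
m2 <- lm(log(salary) ~ year_n + sex * NUTS2_sh, data = tb)
m3 <- lm(log(salary) ~ year_n * NUTS2_sh + sex, data = tb)
m4 <- lm(log(salary) ~ year_n * NUTS2_sh * sex, data = tb)
AIC(m1, m2, m3, m4)  # lower values indicate a better trade-off between fit and complexity
anova(m2, m4)        # do the year interactions add anything beyond sex * NUTS2_sh?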

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posting includes an audio/video/photo media file: Download Now