[R-bloggers] SR2 Chapter 2 Medium (and 4 more aRticles)


SR2 Chapter 2 Medium

Posted: 28 Feb 2020 04:00 PM PST

[This article was first published on Brian Callander, and kindly contributed to R-bloggers.]

Here are my solutions to the medium exercises in chapter 2 of McElreath's Statistical Rethinking, 1st edition. My intention is to move over to the 2nd edition when it comes out next month.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbernoulli}{Bernoulli} \DeclareMathOperator{\dpoisson}{Poisson} \DeclareMathOperator{\dnormal}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexponential}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

Globe Tossing

Start by creating a grid and the function posterior, which we will use for several calculations. This is analogous to the code provided in the chapter.

library(tidyverse) # not shown in the extracted post; needed for tibble, %>%, map_dfr, etc.

p_true <- 0.7       # assumed ground truth
granularity <- 1000 # number of points on grid

grid1 <- tibble(p = seq(0, 1, length.out = granularity)) %>%
  mutate(prior = 1)

posterior <- function(data, grid) {
  grid %>%
    mutate(
      likelihood = dbinom(sum(data == 'W'), length(data), p),
      unstd_posterior = prior * likelihood,
      posterior = unstd_posterior / sum(unstd_posterior)
    )
}

The exercise asks us to approximate the posterior for each of the following three datasets. To do this, we just apply our posterior function above to each of them.

data <- list(
  '1' = c('W', 'W', 'L'),
  '2' = c('W', 'W', 'W', 'L'),
  '3' = c('L', 'W', 'W', 'L', 'W', 'W', 'W')
)

m1 <- data %>%
  map_dfr(posterior, grid1, .id = 'dataset')
Solution 2M1

The posterior becomes gradually more concentrated around the ground truth.

For the second question, we do the same but with a different prior. More specifically, for any p below 0.5 we set the prior to zero, then map our posterior over each of the datasets with this new grid.

grid2 <- grid1 %>%
  mutate(prior = if_else(p < 0.5, 0, prior))

m2 <- data %>%
  map_dfr(posterior, grid2, .id = 'dataset')
Solution 2M2

Again we see the posterior concentrate around the ground truth. Moreover, the distribution is more peaked (at ~0.003) than with the uniform prior (which peaks at ~0.0025). The first dataset already gets pretty close to this peak, i.e. the more informative prior gets us better inferences sooner.

For the final question on globe tossing, we can use the counting method rather than grid approximation. We enumerate all possible events in proportion to how likely they are to occur: 10 L for Mars, 3 L and 7 W for Earth. Then we filter out any events inconsistent with our observation of land, and summarise the remaining possibilities.

m3 <- tibble(mars = rep('L', 10)) %>%
  mutate(earth = if_else(row_number() <= 3, 'L', 'W')) %>%
  gather(planet, observation) %>%        # all possible events
  filter(observation == 'L') %>%         # only those events consistent with observation
  summarise(mean(planet == 'earth')) %>% # fraction of possible events that are earth
  pull()

m3
[1] 0.2307692

We get around 23%.
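We can verify this with a direct application of Bayes' rule (a quick check of my own, not part of the original exercise): with P(land | Earth) = 0.3, P(land | Mars) = 1, and equal prior probability on each globe, the posterior probability of Earth is 0.15 / (0.15 + 0.5).

p_land_earth <- 0.3 # Earth is 30% land
p_land_mars  <- 1.0 # Mars is all land
prior_earth  <- 0.5 # both globes equally likely a priori

prior_earth * p_land_earth /
  (prior_earth * p_land_earth + (1 - prior_earth) * p_land_mars)

[1] 0.2307692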

Card Drawing

We make a list of all sides, filter out any inconsistent with our observation of a black side, then summarise the remaining card possibilities.

m4_events <- tibble(card = c("BB", "BW", "WW")) %>% # all the cards
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  gather(side, colour, -card) # all the sides

m4_possibilities <- m4_events %>%
  filter(colour == 'B') # just the possible events where there is a black side

m4 <- m4_possibilities %>%
  summarise(mean(card == 'BB')) %>%
  pull() # which fraction of possible events is a double black?

m4
[1] 0.6666667
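A quick side-counting check (my addition, not in the original post): BB has two black sides, BW one, and WW none, so BB accounts for 2 of the 3 equally likely black sides we could be looking at.

black_sides <- c(BB = 2, BW = 1, WW = 0) # black sides per card
unname(black_sides["BB"] / sum(black_sides))

[1] 0.6666667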

The next exercise is the same as the previous one but with more cards. Note that this is equivalent to using the three cards as before but with a larger prior probability on the BB card.

m5_events <- tibble(card = c("BB", "BW", "WW", "BB")) %>%
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  gather(side, colour, -card)

m5_possibilities <- m5_events %>%
  filter(colour == 'B')

m5 <- m5_possibilities %>%
  summarise(mean(card == 'BB')) %>%
  pull()

m5
[1] 0.8

Putting the prior on the cards is equivalent to having the cards in proportion to their prior. The rest of the calculation is the same.

m6_events <- c("BB", "BW", "WW") %>% # cards    rep(c(1, 2, 3)) %>% # prior: repeat each card the given number of times    tibble(card = .) %>%     separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%    gather(side, colour, -card)     m6_possibilities <- m6_events %>% # sides    filter(colour == 'B')     m6 <- m6_possibilities %>% # sides consistent with observation    summarise(mean(card == 'BB')) %>% # proportion of possible events that are BB    pull()    m6
[1] 0.5
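The same side-counting check works here (again my addition), weighting each card's black sides by its prior count: BB contributes 2 × 1 = 2 favourable sides out of 2 × 1 + 1 × 2 + 0 × 3 = 4 in total.

black_sides  <- c(BB = 2, BW = 1, WW = 0) # black sides per card
prior_counts <- c(BB = 1, BW = 2, WW = 3) # copies of each card
weighted     <- black_sides * prior_counts
unname(weighted["BB"] / sum(weighted))

[1] 0.5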

This last card drawing exercise is slightly more involved since we can observe either of the two sides of one card and either of the two sides of the other. Thus, we first generate the list of all possible pairs of cards, expand this into a list of all possible sides that could be observed for each card, filter out any event not consistent with our observations, then summarise whatever is left.

m7_card_pairs <- tibble(card = c("BB", "BW", "WW")) %>% # all the cards
  crossing(., other_card = .$card) %>%
  filter(card != other_card) # all card pairs (can't draw the same card twice)

m7_events <- m7_card_pairs %>%
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  separate(other_card, into = c('other_side1', 'other_side2'), sep = 1, remove = F) %>%
  gather(side, colour, side1, side2) %>% # all the sides for card of interest
  gather(other_side, other_colour, other_side1, other_side2) # all sides of other card

m7_possibilities <- m7_events %>%
  filter(
    colour == 'B',      # we observe that card of interest has a black side
    other_colour == 'W' # we observe that the other card has a white side
  )

m7 <- m7_possibilities %>%
  summarise(mean(card == 'BB')) %>% # which fraction of possible events is a double black?
  pull()

m7
[1] 0.75
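As a final counting check (my addition): weight each ordered pair by (black sides on the card of interest) × (white sides on the other card). The pairs (BB, BW) and (BB, WW) contribute 2 × 1 = 2 and 2 × 2 = 4 favourable events, while (BW, WW) contributes 1 × 2 = 2 events where the card of interest is not BB.

(2 + 4) / (2 + 4 + 2)

[1] 0.75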

To leave a comment for the author, please follow the link and comment on their blog: Brian Callander.


What to know before you adopt Hugo/blogdown

Posted: 28 Feb 2020 04:00 PM PST

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers.]

Fancy (re-)creating your website using Hugo, with or without blogdown?
Feeling a bit anxious?
This post is aimed at being the Hugo equivalent of "What to know before you adopt a pet".
We shall go through things that can/will break in the future, and what you can do to prevent future pain.

I'm writing this post with R users in mind, which means I shall use R analogies and mention blogdown,
but I hope some aspects are generalizable to other potential Hugo adopters.

Some aspects are specific to deploying your website using git and an online git platform such as GitHub.

Why Hugo/blogdown?

If you're reading this you've probably heard of Hugo somewhere.
It's getting quite popular in the R community thanks to the blogdown package, whose associated book features an excellent intro to why Hugo (and blogdown).

I myself have used Hugo for this website (with a bit of blogdown) and dived into more details whilst working on tweaks to the rOpenSci website.
I really enjoy working with this website generator, partly because it's fast and open-source1.

What can break or evolve?

When you have a website created with Hugo, there are two to four main actors:

  • Hugo itself, which you could think of as R;

  • A Hugo theme which defines the layout and style of your website. You could think of the theme as an R package. Your theme is a folder inside your website folder. You need one, no matter how minimal.

  • The other things your website theme might depend on, such as Font Awesome.

  • A potential fourth actor: your tweaks to the theme, which might be akin to editing an R package's source, or maintaining a script or a second package on top of that R package to add some syntactic sugar to its existing functions.

So what's going to break and evolve?

How to reduce the likelihood of breakages?

Choose your theme wisely and keep in touch!

When choosing a theme, i.e. a collection of layouts for your website,
you'll have aesthetics and practicalities in mind.
E.g. if you're a prolific blogger you'll want the posts to be quite prominent.
As the blogdown book mentions, also look at the theme's popularity and activity before adopting it.
This way you can be more confident the theme will respond to Hugo changes and to bug reports.

Now, you'll have to know when the theme gets updated. How?

  • You could go hardcore git and make the theme a git submodule of your website repo.

  • You could watch, i.e. subscribe to, the GitHub releases of that theme or, if the maintainer(s) use a less formal workflow, watch all activity on that repo.

  • Any other idea? Maybe involving GitHub Actions, Dependabot, something else?
    Maybe just a reminder to have a coffee date with your theme repo once in a while?

What if your theme gets orphaned?

What if you chose your theme wisely but it lost its maintainer(s)?
In that case you'll need to look into adopting it or rolling out your own version, or changing themes.

Make well-defined tweaks to the theme

Although you've adopted a theme, you'll probably want to personalize it a bit.
If you do so, do it with a good file structure and documentation hygiene!
As very well explained in the blogdown book,

  • some tweaks are directly supported by the theme via the website config file (think of it as an R function's parameters), e.g. adding your name to the homepage rather than Jane Doe's;

  • some tweaks require your writing layout files (think of it as writing a wrapper for an R package/re-writing your own version of some functions), e.g. adding some fun sentence to the footer which the original theme doesn't support. In this case you should store your custom layouts, like the fancy footer partial template, in a folder called layouts/ at the root of your website folder, not in the theme folder. Hugo will give priority to layouts/ stuff when defined, using it on top of theme stuff, and you'll easily see what you changed. Your future self will probably find much joy in your present self's documenting the why and how of your custom layouts in some sort of developer notes. The more custom layouts you write, the greater your future responsibility.

If you define CSS and JS files, they'd live in static/ and be referred to in the website config file.

IMHO Have your content as Markdown, not html files

With blogdown you can use .Rmd, .RMarkdown or .md as your website source, refer to this exhaustive and clear comparison.
I am strongly in favour of never using .Rmd in a blogdown site because its output is an html file, not a .markdown/.md file. html files seem less portable to me. I like the idea of being able to take my Markdown content and rather easily move it to a new theme (or even a new framework, like when I migrated this website from Jekyll).

Now, as explained in the table mentioned above, using .RMarkdown/.md has its limitations.
For html widgets I can recommend looking at the setup Steph Locke created for the RECON learn website.

Update your theme

So if you've tweaked cleanly, you can update your theme when needed!
If you work with version control, and you probably should, make a branch and work on the update in that branch.
If you don't, backup your website's current state first.

To "update your theme", you need to replace the theme folder of your website folder with the new theme files.

You could update the theme manually, or use blogdown::install_theme() with force = TRUE.
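A minimal sketch of the latter ("user/hugo-some-theme" is a placeholder for your theme's GitHub slug):

# force = TRUE overwrites the existing theme folder, so commit or back up first
blogdown::install_theme("user/hugo-some-theme", force = TRUE)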

Build your website, look at what needs to be changed:

  • maybe the config parameter names changed (like an R function which would have renamed one of the parameters);

  • your custom layout might need some work too.

To figure out what needs to be changed, you'll probably want to read the changelog (or commit history) of your theme, and maybe even the Hugo changelog.

Often, changes in your theme, and work needed on your website, won't be dramatic: the theme folder update, maybe one config parameter, a few lines diff in your custom layouts.

Follow Hugo news?

If you wrote no custom layouts and use a very well maintained theme, you might never need to keep up with Hugo changes yourself.
However, if you've written Hugo custom layouts, or strive to become a contributor to your theme, you might want to read Hugo changelogs, follow Hugo's source repository, or Hugo's Twitter account, etc.

Don't live on the edge

If you have a workflow on a continuous integration system updating your website every day from an external data source, like Noam Ross does (yes, that's a very cool and very fancy setup!), pin a specific Hugo version there; don't let the workflow install Hugo's latest version, because it could break your website without your noticing.
Update Hugo with intent, after reading a bit and/or testing it doesn't break your website.

What if I just never update Hugo or my theme?

No, it's not a good solution in my opinion.
Never updating Hugo (locally or on, say, Netlify) or your theme probably means your website will keep building as it used to.
However, updates to Hugo and themes contain both improvements and bug fixes, so it's better to plan on updating the tools at least once in a while.

Quick fixes to bad news

Imagine you made yourself a pretty website to showcase your cool posts and informative slidedecks.
In the meantime, you changed laptops and today Saturday you want to post the link to your talk from last week,
before you head out for a barbecue with friends.
Your laptop doesn't have Hugo installed so you install its latest version using blogdown::install_hugo(), add your talk page, build the website, look at the preview and… notice it all looks wonky! Your website is broken!
If you remember the beginning of this post, what happened is probably an update in Hugo breaking your theme version. You'll eventually need to update your theme.

But you don't have time right now before your barbecue to do that, let alone to learn how to do that if it's the first time, so what can you do apart from not posting your talk content?

If your site is deployed by gh-pages

I.e. you build your website locally and then push the rendered content to a gh-pages branch of a GitHub repo. In that case you need to downgrade Hugo before building. Note that blogdown::install_hugo() has a version argument; refer to the Hugo changelogs to see what version you last used.
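For instance (the version number below is illustrative; use whatever you last built with):

# pin Hugo to the version that last built your site correctly
blogdown::install_hugo(version = "0.60.1")
blogdown::hugo_version() # confirm which version is now in use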

If your site is deployed by Netlify

If you're lucky, you can just push your content, and since the Hugo version in your Netlify config file hasn't changed, your website will build smoothly. You could make a PR to your own website to get a preview there before merging.

Later

Eventually, one day soon, get to updating Hugo again, looking at your theme's changes and, if needed, extracting your tweaks from the theme folder if you made them there (I've done that, and now I sure stick to using the layouts/ folder as mentioned earlier).
Use some of the tips of the beginning of the post to reduce the likelihood of the scenario happening again.

Conclusion

In this blog post I presented what the maintenance of a Hugo website entails in my experience.
I haven't mentioned other possible sources of breakage that are not specific to Hugo:

  • URLs in the content getting obsolete;

  • Your changing deploy workflows (so use your own domain instead of a Netlify one?);

  • Your website source disappearing (so use backups);

  • etc.

Coming back to Hugo, if you encounter problems I'd unsurprisingly recommend Hugo docs and Hugo forum.

Now, if this all sounds overwhelming, I don't think these tech skills are harder than R skills, but time is a limited resource, so maybe you could outsource some of your website's creation and maintenance.
Maybe you can hire someone, or do a skill swap?

Don't hesitate to share your own experience and advice on maintaining Hugo websites!


  1. Also because I love the font of its docs website and looking at pages full of curly braces.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.


The significance of the sector on the salary in Sweden, a comparison between different occupational groups, part 3

Posted: 28 Feb 2020 04:00 PM PST

[This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers.]

To complete the analysis of the significance of the sector on the salary for different occupational groups in Sweden, I will in this post examine the correlation between salary and sector using statistics on education.

The F-value from the ANOVA table is used as the single measure of how strongly sector and salary correlate. For exploratory analysis, this value seems good enough.
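As a minimal sketch of the mechanics (toy data from base R, not this post's dataset), a per-term F-value can be pulled out of car::Anova() with broom::tidy():

library(car)
library(broom)

fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
tidy(Anova(fit, type = 2)) # the 'statistic' column holds each term's F-value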

First, define libraries and functions.

library (tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom)
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some
library (sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car
##   influence.merMod                car
##   dfbeta.influence.merMod         car
##   dfbetas.influence.merMod        car
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
readfile <- function (file1) {
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%
    drop_na() %>%
    mutate (year_n = parse_number (year))
}

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without a header, 000000CY.csv, from http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

I have renamed the file to 000000CY_sector.csv because the filename 000000CY.csv was used in a previous post.

The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by sector, occupational group (SSYK 2012), sex and educational level (SUN), year 2014 – 2018. Measure: monthly salary. Sectors 1–3: public sector; sectors 4–5: private sector.

In the plot and tables, you can also find information on how the increase in salaries per year for each occupational group is affected when the interactions are taken into account.

tb <- readfile("000000CY_sector.csv") %>%
  mutate(edulevel = `level of education`)

numedulevel <- read.csv("edulevel.csv")

numedulevel %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Initial approach, length of education')
Table 1: Initial approach, length of education
level.of.education eduyears
primary and secondary education 9-10 years (ISCED97 2) 9
upper secondary education, 2 years or less (ISCED97 3C) 11
upper secondary education 3 years (ISCED97 3A) 12
post-secondary education, less than 3 years (ISCED97 4+5B) 14
post-secondary education 3 years or more (ISCED97 5A) 15
post-graduate education (ISCED97 6) 19
no information about level of educational attainment NA
tbnum <- tb %>%
  right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
  filter(!is.na(eduyears)) %>%
  mutate(eduyears = factor(eduyears))
## Warning: Column `level of education`/`level.of.education` joining character
## vector and factor, coercing into character vector
summary_table = vector()
anova_table = vector()

for (i in unique(tbnum$`occuptional  (SSYK 2012)`)){
  temp <- filter(tbnum, `occuptional  (SSYK 2012)` == i)
  if (dim(temp)[1] > 90){
    model <- lm(log(salary) ~ edulevel + sex + year_n + sector, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "none"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "none"))

    model <- lm(log(salary) ~ edulevel * sector + sex + year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and edulevel"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and edulevel"))

    model <- lm(log(salary) ~ edulevel + sector * sex + year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and sex"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and sex"))

    model <- lm(log(salary) ~ edulevel + year_n * sector + sex, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector and year"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector and year"))

    model <- lm(log(salary) ~ edulevel * sector * sex * year_n, data = temp)
    summary_table <- rbind (summary_table, mutate (tidy (summary (model)), ssyk = i, interaction = "sector, year, edulevel and sex"))
    anova_table <- rbind (anova_table, mutate (tidy (Anova (model, type = 2)), ssyk = i, interaction = "sector, year, edulevel and sex"))
  }
}
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## (this note is emitted by car::Anova for each of the interaction models fitted in the loop above)
anova_table <- anova_table %>%
  rowwise() %>%
  mutate(contcol = str_count(term, ":"))

summary_table <- summary_table %>%
  rowwise() %>%
  mutate(contcol = str_count(term, ":"))

merge(summary_table, anova_table, by = c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (term.y == "sector") %>%
  filter (interaction == "none") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  ggplot () +
    geom_point (mapping = aes(x = estimate, y = statistic.y, colour = interaction)) +
    labs(
      x = "Increase in salaries (% / year)",
      y = "F-value for sector"
    )

Figure 1: The significance of the sector on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

merge(summary_table, anova_table, by = c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  # only look at the interactions between all four variables in the case with interaction sector, year, edulevel and sex
  filter (!(contcol.y < 3 & interaction == "sector, year, edulevel and sex")) %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  ggplot () +
    geom_point (mapping = aes(x = estimate, y = statistic.y, colour = interaction)) +
    labs(
      x = "Increase in salaries (% / year)",
      y = "F-value for interaction"
    )

Figure 2: The significance of the interaction between sector, edulevel, year and sex on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

The following tables show all occupational groups sorted by F-value in descending order.

merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (term.y == "sector") %>%
  filter (interaction == "none") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector) and the yearly increase in salaries')
Table 2: Correlation for F-value (sector) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
242 Organisation analysts, policy administrators and human resource specialists 1.609869 515.5748558 none
819 Process control technicians 2.250895 467.4732796 none
251 ICT architects, systems analysts and test managers 2.217446 465.2253589 none
331 Financial and accounting associate professionals 1.964825 340.2901702 none
962 Newspaper distributors, janitors and other service workers 1.971413 336.7676606 none
334 Administrative and specialized secretaries 2.410127 333.7196510 none
351 ICT operations and user support technicians 2.474549 305.6052807 none
241 Accountants, financial analysts and fund managers 2.534461 251.4058600 none
335 Tax and related government associate professionals 2.586686 250.8147705 none
515 Building caretakers and related workers 2.522386 250.0742967 none
321 Medical and pharmaceutical technicians 2.493038 230.0064060 none
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.303600 228.0668837 none
134 Architectural and engineering managers 3.161068 226.5669850 none
333 Business services agents 3.001597 222.9774594 none
411 Office assistants and other secretaries 2.227235 214.6750980 none
243 Marketing and public relations professionals 1.481519 179.3099186 none
264 Authors, journalists and linguists 2.046538 164.8394018 none
129 Administration and service managers not elsewhere classified 4.059900 158.7687951 none
342 Athletes, fitness instructors and recreational workers 1.586943 132.1816646 none
159 Other social services managers 2.541205 69.9268014 none
123 Administration and planning managers 3.849200 50.0164438 none
541 Other surveillance and security workers 2.460130 41.9287342 none
235 Teaching professionals not elsewhere classified 1.415591 40.1919620 none
911 Cleaners and helpers 1.938366 35.3850213 none
533 Health care assistants 2.157379 22.0488822 none
534 Attendants, personal assistants and related workers 1.959595 21.6985964 none
332 Insurance advisers, sales and purchasing agents 2.637486 19.1041252 none
131 Information and communications technology service managers 4.000609 17.0841502 none
311 Physical and engineering science technicians 2.325958 16.5471030 none
214 Engineering professionals 2.626260 15.6152029 none
432 Stores and transport clerks 1.231854 13.9155189 none
723 Machinery mechanics and fitters 2.362984 12.5847785 none
532 Personal care workers in health services 2.906578 6.3336230 none
512 Cooks and cold-buffet managers 2.483280 5.4612144 none
732 Printing trades workers 2.158854 5.2405535 none
234 Primary- and pre-school teachers 2.985653 4.6801860 none
422 Client information clerks 2.527801 1.7881181 none
531 Child care workers and teachers aides 1.881615 0.8334164 none
611 Market gardeners and crop growers 1.980288 0.3732826 none
341 Social work and religious associate professionals 2.357787 0.0276157 none
941 Fast-food workers, food preparation assistants 1.981512 0.0046670 none
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and sex") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and sex) and the yearly increase in salaries')
Table 3: Correlation for F-value (sector and sex) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
911 Cleaners and helpers 1.966955 183.3539258 sector and sex
331 Financial and accounting associate professionals 1.988628 84.3723061 sector and sex
342 Athletes, fitness instructors and recreational workers 1.609407 53.2856268 sector and sex
351 ICT operations and user support technicians 2.474549 53.1541368 sector and sex
241 Accountants, financial analysts and fund managers 2.549161 39.5233707 sector and sex
333 Business services agents 2.952973 32.5926396 sector and sex
611 Market gardeners and crop growers 1.936338 25.7248443 sector and sex
243 Marketing and public relations professionals 1.489860 24.3260135 sector and sex
533 Health care assistants 2.157379 19.0714901 sector and sex
732 Printing trades workers 2.207617 16.8065594 sector and sex
512 Cooks and cold-buffet managers 2.497408 15.7408059 sector and sex
159 Other social services managers 2.511558 15.1529257 sector and sex
123 Administration and planning managers 3.819995 14.2436097 sector and sex
532 Personal care workers in health services 2.899482 13.1974037 sector and sex
334 Administrative and specialized secretaries 2.416957 12.8317750 sector and sex
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.317893 10.2339741 sector and sex
134 Architectural and engineering managers 3.161068 9.7816667 sector and sex
321 Medical and pharmaceutical technicians 2.493038 8.6670096 sector and sex
941 Fast-food workers, food preparation assistants 1.989465 8.6486138 sector and sex
335 Tax and related government associate professionals 2.586686 8.4427211 sector and sex
242 Organisation analysts, policy administrators and human resource specialists 1.594999 8.1271264 sector and sex
723 Machinery mechanics and fitters 2.337983 6.4299201 sector and sex
332 Insurance advisers, sales and purchasing agents 2.627736 5.7400150 sector and sex
311 Physical and engineering science technicians 2.316138 5.3473607 sector and sex
234 Primary- and pre-school teachers 2.999161 5.0012138 sector and sex
129 Administration and service managers not elsewhere classified 4.121066 4.6687219 sector and sex
819 Process control technicians 2.250895 3.7631281 sector and sex
534 Attendants, personal assistants and related workers 1.959595 3.5292601 sector and sex
131 Information and communications technology service managers 4.029303 2.5233240 sector and sex
264 Authors, journalists and linguists 2.033203 2.5032667 sector and sex
341 Social work and religious associate professionals 2.357787 2.2988889 sector and sex
251 ICT architects, systems analysts and test managers 2.217446 2.1823519 sector and sex
422 Client information clerks 2.519631 2.1300830 sector and sex
432 Stores and transport clerks 1.231854 1.5040767 sector and sex
541 Other surveillance and security workers 2.457007 1.3945913 sector and sex
214 Engineering professionals 2.629057 1.3225970 sector and sex
962 Newspaper distributors, janitors and other service workers 1.971413 0.3480210 sector and sex
531 Child care workers and teachers aides 1.879581 0.1432868 sector and sex
515 Building caretakers and related workers 2.522844 0.1379141 sector and sex
235 Teaching professionals not elsewhere classified 1.424488 0.0670314 sector and sex
411 Office assistants and other secretaries 2.227235 0.0001653 sector and sex
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and edulevel") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and edulevel) and the yearly increase in salaries')
Table 4: Correlation for F-value (sector and edulevel) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
335 Tax and related government associate professionals 2.586686 48.2902588 sector and edulevel
234 Primary- and pre-school teachers 2.936162 32.3560802 sector and edulevel
214 Engineering professionals 2.636843 29.4462376 sector and edulevel
332 Insurance advisers, sales and purchasing agents 2.420632 26.7105578 sector and edulevel
432 Stores and transport clerks 1.231854 24.9194979 sector and edulevel
134 Architectural and engineering managers 3.161068 20.3665858 sector and edulevel
321 Medical and pharmaceutical technicians 2.493038 18.1556092 sector and edulevel
723 Machinery mechanics and fitters 2.358171 17.3272819 sector and edulevel
311 Physical and engineering science technicians 2.330300 12.9686992 sector and edulevel
123 Administration and planning managers 3.809019 11.7436022 sector and edulevel
732 Printing trades workers 2.220728 11.0453452 sector and edulevel
241 Accountants, financial analysts and fund managers 2.647068 10.5100504 sector and edulevel
235 Teaching professionals not elsewhere classified 1.396076 8.6264575 sector and edulevel
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.264110 7.7952190 sector and edulevel
941 Fast-food workers, food preparation assistants 1.981512 7.5766373 sector and edulevel
331 Financial and accounting associate professionals 1.980338 6.9758948 sector and edulevel
532 Personal care workers in health services 2.919933 6.6231457 sector and edulevel
534 Attendants, personal assistants and related workers 1.959595 5.7464596 sector and edulevel
962 Newspaper distributors, janitors and other service workers 1.971413 5.4435379 sector and edulevel
131 Information and communications technology service managers 3.908111 4.9661376 sector and edulevel
159 Other social services managers 2.565523 4.7197836 sector and edulevel
129 Administration and service managers not elsewhere classified 4.160995 4.3062655 sector and edulevel
251 ICT architects, systems analysts and test managers 2.209131 3.5539841 sector and edulevel
512 Cooks and cold-buffet managers 2.435903 3.4242806 sector and edulevel
333 Business services agents 2.955970 3.3635706 sector and edulevel
243 Marketing and public relations professionals 1.473064 3.2639790 sector and edulevel
264 Authors, journalists and linguists 2.043677 3.0449469 sector and edulevel
334 Administrative and specialized secretaries 2.387467 2.8528940 sector and edulevel
422 Client information clerks 2.527801 2.3007696 sector and edulevel
242 Organisation analysts, policy administrators and human resource specialists 1.612900 1.9599904 sector and edulevel
351 ICT operations and user support technicians 2.474549 1.5903279 sector and edulevel
819 Process control technicians 2.250895 1.5586112 sector and edulevel
341 Social work and religious associate professionals 2.357787 1.3263497 sector and edulevel
611 Market gardeners and crop growers 2.016345 1.2719035 sector and edulevel
541 Other surveillance and security workers 2.460130 1.0672924 sector and edulevel
411 Office assistants and other secretaries 2.227235 1.0114046 sector and edulevel
515 Building caretakers and related workers 2.526100 0.8254389 sector and edulevel
342 Athletes, fitness instructors and recreational workers 1.540952 0.8230839 sector and edulevel
531 Child care workers and teachers aides 1.897863 0.7969406 sector and edulevel
533 Health care assistants 2.157379 0.5425248 sector and edulevel
911 Cleaners and helpers 1.938366 0.0965584 sector and edulevel
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 0) %>%
  filter (interaction == "sector and year") %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector and year) and the yearly increase in salaries')
Table 5: Correlation for F-value (sector and year) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
129 Administration and service managers not elsewhere classified 5.8528187 17.1667457 sector and year
351 ICT operations and user support technicians 3.2455362 16.6101284 sector and year
334 Administrative and specialized secretaries 1.0672775 14.9508269 sector and year
534 Attendants, personal assistants and related workers 2.2769477 12.4036331 sector and year
422 Client information clerks 3.2733483 11.6812523 sector and year
962 Newspaper distributors, janitors and other service workers 2.4907346 11.5673142 sector and year
264 Authors, journalists and linguists 2.9905121 10.5945994 sector and year
531 Child care workers and teachers aides 2.4510286 9.5936100 sector and year
242 Organisation analysts, policy administrators and human resource specialists 2.3355826 7.7816291 sector and year
432 Stores and transport clerks 0.5332567 7.4196762 sector and year
243 Marketing and public relations professionals 2.1828057 5.5660180 sector and year
732 Printing trades workers 2.8825380 4.3716553 sector and year
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.8099783 3.6786224 sector and year
611 Market gardeners and crop growers 2.3161238 3.3017939 sector and year
131 Information and communications technology service managers 3.3504930 2.8750373 sector and year
532 Personal care workers in health services 2.7411030 2.8290655 sector and year
235 Teaching professionals not elsewhere classified 1.8047571 2.7841902 sector and year
311 Physical and engineering science technicians 2.8608741 2.6173373 sector and year
533 Health care assistants 1.9920113 2.2486864 sector and year
214 Engineering professionals 3.0209963 2.0446973 sector and year
723 Machinery mechanics and fitters 2.6258864 1.4272778 sector and year
515 Building caretakers and related workers 2.3423357 1.3492650 sector and year
321 Medical and pharmaceutical technicians 2.1376334 1.2494762 sector and year
941 Fast-food workers, food preparation assistants 2.1906312 1.1458297 sector and year
234 Primary- and pre-school teachers 3.1522849 0.8075167 sector and year
411 Office assistants and other secretaries 2.4217447 0.7799156 sector and year
241 Accountants, financial analysts and fund managers 2.2625663 0.7751592 sector and year
541 Other surveillance and security workers 2.3018747 0.6646888 sector and year
134 Architectural and engineering managers 3.3811821 0.5852232 sector and year
333 Business services agents 3.2520012 0.5424390 sector and year
335 Tax and related government associate professionals 2.3986233 0.3139375 sector and year
123 Administration and planning managers 4.0945884 0.2682143 sector and year
911 Cleaners and helpers 2.0082873 0.2021948 sector and year
341 Social work and religious associate professionals 2.3994292 0.0754320 sector and year
342 Athletes, fitness instructors and recreational workers 1.6673896 0.0633281 sector and year
512 Cooks and cold-buffet managers 2.4249631 0.0494622 sector and year
819 Process control technicians 2.2225873 0.0378303 sector and year
331 Financial and accounting associate professionals 2.0557505 0.0377336 sector and year
251 ICT architects, systems analysts and test managers 2.2536138 0.0294035 sector and year
332 Insurance advisers, sales and purchasing agents 2.6015664 0.0125275 sector and year
159 Other social services managers 2.5487758 0.0013436 sector and year
merge(summary_table, anova_table, c("ssyk", "interaction"), all = TRUE) %>%
  filter (term.x == "year_n") %>%
  filter (contcol.y > 1) %>%
  filter (interaction == "sector, year, edulevel and sex") %>%
  filter (!(contcol.y < 3 & interaction == "sector, year, edulevel and sex")) %>%
  mutate (estimate = (exp(estimate) - 1) * 100) %>%
  select (ssyk, estimate, statistic.y, interaction) %>%
  rename (`F-value` = statistic.y) %>%
  rename (`Increase in salary` = estimate) %>%
  arrange (desc (`F-value`)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Correlation for F-value (sector, year, edulevel and sex) and the yearly increase in salaries')
Table 6: Correlation for F-value (sector, year, edulevel and sex) and the yearly increase in salaries
ssyk Increase in salary F-value interaction
264 Authors, journalists and linguists 2.0717871 5.0646298 sector, year, edulevel and sex
311 Physical and engineering science technicians -1.9358596 3.3593399 sector, year, edulevel and sex
159 Other social services managers 2.1339049 2.5206357 sector, year, edulevel and sex
134 Architectural and engineering managers -0.2221716 2.4150753 sector, year, edulevel and sex
331 Financial and accounting associate professionals 0.2989757 2.2892768 sector, year, edulevel and sex
342 Athletes, fitness instructors and recreational workers 2.7799167 2.1976399 sector, year, edulevel and sex
214 Engineering professionals 5.2732904 1.9988132 sector, year, edulevel and sex
723 Machinery mechanics and fitters 2.1334983 1.8843643 sector, year, edulevel and sex
432 Stores and transport clerks -0.2883737 1.8215339 sector, year, edulevel and sex
241 Accountants, financial analysts and fund managers 2.6268377 1.8184489 sector, year, edulevel and sex
533 Health care assistants 1.3176280 1.6357939 sector, year, edulevel and sex
911 Cleaners and helpers 1.1875630 1.6319874 sector, year, edulevel and sex
129 Administration and service managers not elsewhere classified 16.0932403 1.5716145 sector, year, edulevel and sex
512 Cooks and cold-buffet managers 2.2335677 1.4882138 sector, year, edulevel and sex
242 Organisation analysts, policy administrators and human resource specialists 1.9377802 1.4494844 sector, year, edulevel and sex
234 Primary- and pre-school teachers 2.6286211 1.4408203 sector, year, edulevel and sex
235 Teaching professionals not elsewhere classified 1.4806649 1.4162230 sector, year, edulevel and sex
532 Personal care workers in health services 3.0809717 1.3439893 sector, year, edulevel and sex
962 Newspaper distributors, janitors and other service workers 1.7217534 1.3344044 sector, year, edulevel and sex
332 Insurance advisers, sales and purchasing agents 4.5515134 0.9604905 sector, year, edulevel and sex
123 Administration and planning managers 1.6860436 0.9155189 sector, year, edulevel and sex
541 Other surveillance and security workers 3.1372549 0.9113280 sector, year, edulevel and sex
213 Biologists, pharmacologists and specialists in agriculture and forestry 2.0956711 0.9044780 sector, year, edulevel and sex
422 Client information clerks 7.6231100 0.8611877 sector, year, edulevel and sex
515 Building caretakers and related workers 3.8284291 0.8343329 sector, year, edulevel and sex
334 Administrative and specialized secretaries -1.8021942 0.7940149 sector, year, edulevel and sex
531 Child care workers and teachers aides 2.9556278 0.7580341 sector, year, edulevel and sex
131 Information and communications technology service managers 6.2536436 0.7429188 sector, year, edulevel and sex
243 Marketing and public relations professionals -0.7939038 0.7338409 sector, year, edulevel and sex
351 ICT operations and user support technicians 3.2312302 0.7188959 sector, year, edulevel and sex
732 Printing trades workers 3.9993823 0.6905319 sector, year, edulevel and sex
321 Medical and pharmaceutical technicians 2.2164674 0.6326293 sector, year, edulevel and sex
611 Market gardeners and crop growers 2.2753348 0.6157880 sector, year, edulevel and sex
341 Social work and religious associate professionals 3.0666620 0.5266982 sector, year, edulevel and sex
941 Fast-food workers, food preparation assistants 2.7940568 0.4253642 sector, year, edulevel and sex
335 Tax and related government associate professionals 5.0470456 0.3217115 sector, year, edulevel and sex
251 ICT architects, systems analysts and test managers 2.9031256 0.2684425 sector, year, edulevel and sex
819 Process control technicians 1.6741717 0.2603997 sector, year, edulevel and sex
333 Business services agents 1.8143344 0.1540308 sector, year, edulevel and sex
411 Office assistants and other secretaries 5.5231019 0.0911334 sector, year, edulevel and sex
534 Attendants, personal assistants and related workers 2.6927719 0.0899580 sector, year, edulevel and sex

Let's check what we have found.

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "242 Organisation analysts, policy administrators and human resource specialists")

model <- lm (log(salary) ~ year_n + eduyears + sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 3: Highest F-value sector, Organisation analysts, policy administrators and human resource specialists

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "941 Fast-food workers, food preparation assistants")

model <- lm (log(salary) ~ year_n + eduyears + sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 4: Lowest F-value sector, Fast-food workers, food preparation assistants

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "911 Cleaners and helpers")

model <- lm (log(salary) ~ year_n + eduyears + sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 5: Highest F-value interaction sector and gender, Cleaners and helpers

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "411 Office assistants and other secretaries")

model <- lm (log(salary) ~ year_n + eduyears + sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sex", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 6: Lowest F-value interaction sector and gender, Office assistants and other secretaries

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "335 Tax and related government associate professionals")

model <- lm (log(salary) ~ year_n + eduyears * sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sector"))
## Warning in predict.lm(model, newdata = fitfram, type = "response", se.fit =
## se, : prediction from a rank-deficient fit may be misleading
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 7: Highest F-value interaction sector and edulevel, Tax and related government associate professionals

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "911 Cleaners and helpers")

model <- lm (log(salary) ~ year_n + eduyears * sector + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "sector"))
## Warning in predict.lm(model, newdata = fitfram, type = "response", se.fit =
## se, : prediction from a rank-deficient fit may be misleading
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 8: Lowest F-value interaction sector and edulevel, Cleaners and helpers

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "129 Administration and service managers not elsewhere classified")

model <- lm (log(salary) ~ year_n * sector + eduyears + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Figure 9: Highest F-value interaction sector and year, Administration and service managers not elsewhere classified

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "159 Other social services managers")

model <- lm (log(salary) ~ year_n * sector + eduyears + sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "year_n", "sector"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 10: Lowest F-value interaction sector and year, Other social services managers

temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "264 Authors, journalists and linguists")

model <- lm(log(salary) ~ year_n * eduyears * sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "year_n", "sex", "sector"))
## Warning in predict.lm(model, newdata = fitfram, type = "response", se.fit =
## se, : prediction from a rank-deficient fit may be misleading
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 11: Highest F-value interaction sector, edulevel, year and gender, Authors, journalists and linguists

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
temp <- tbnum %>%
  filter(`occuptional  (SSYK 2012)` == "534 Attendants, personal assistants and related workers")

model <- lm(log(salary) ~ year_n * eduyears * sector * sex, data = temp)

plot_model(model, type = "pred", terms = c("eduyears", "year_n", "sex", "sector"))
## Warning in predict.lm(model, newdata = fitfram, type = "response", se.fit =
## se, : prediction from a rank-deficient fit may be misleading
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.


Figure 12: Lowest F-value interaction sector, edulevel, year and gender, Attendants, personal assistants and related workers

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden.


How to Acquire Large Satellite Image Datasets for Machine Learning Projects

Posted: 28 Feb 2020 07:39 AM PST

[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Introduction

Historically, only governments and large corporations have had access to quality satellite images. In recent years, satellite image datasets have become available to anyone with a computer and an internet connection. The quality, quantity, and precision of these datasets are continuously improving, and there are many free and commercial platforms at your disposal for acquiring satellite images. On top of that, the cost of acquiring images has fallen significantly, and the tools that allow you to analyze the images for machine learning and data science projects have become cheaper and more widely available.

In this article, I hope to inspire you to start looking into the power and utility of the publicly available satellite image datasets of today. I will give a high-level overview of where these images come from, then dive deeper into the features you should think about when choosing the right data source. In a future article, I will give you an overview of the architecture that you need to have in place before you can start working with these images on your local computer.

Let's jump right into satellites. How is this kind of dataset unique? Why should you bother with satellite images? 

The difference between 30cm & 70cm satellite imagery. [Source: Maxar]

Satellite Image Data at Your Fingertips

First of all, you can get complete coverage of the Earth, which means that you can select virtually any location on the planet, and you will be able to see what that place looks like. Further, the images are readily available. You can go to a website and easily download an image for any location that you want because there are public space programs that offer free images to whoever wants them. So when you start your research, I strongly encourage you to take a look at the available free resources first. We've included a list of those resources at the bottom of this article. There are plenty of commercial options available that provide higher quality images for specialized purposes. You can reach out to Appsilon directly for assistance with acquiring commercial satellite datasets. 

One way to think about satellite image datasets is that they give you the ability to travel backwards in time. When you think of satellite images you might think of Google Maps, which provides satellite images that give you a snapshot of the surface of the Earth. But with access to the right provider, you can go back in time and access images for any day that you want, going back years (in some cases back to the 1980s). This added temporal dimension gives you additional abilities when it comes to analyzing data. Imagine that you can take a look at one point on Earth, and then go back in time and see how this place has changed. You can then build predictive models to forecast what this place is going to look like in the future.


A representation of the 4500 satellites currently orbiting the Earth

This visualization shows the scale of what is going on above our heads, beyond the stratosphere. Right now there are more than 4,500 satellites orbiting our planet, and over 600 of them are constantly taking photos. More and more are preparing to launch, especially since this area of technology has been accelerating rapidly in recent years. This means that the quantity and quality of satellite image datasets are rapidly improving.

Currently, the best resolution that you can get from a satellite image is 25 cm per pixel. This means that if you zoom in very closely on a quality satellite image, one pixel represents approximately 25 cm of the Earth's surface. Since a person seen from above spans well under a meter, a person in such an image is represented by approximately three pixels. Three pixels is not much to go on, but if you combine this rough representation of a person with their shadow, you can confirm that those three pixels are indeed a person.

How to Acquire Satellite Datasets

Now I would like to get more specific about satellite imagery: what kind of data it is, and how you can acquire these datasets.

There are two types of satellite data available: public datasets, which are freely available and of good enough quality for many use cases, and commercial datasets from several companies that offer even better images with more potential uses.

The best-known public datasets come from the Landsat and Sentinel programs. You can search for them right now and find the right image for you. One image is going to be about 1 GB of data. It's not immediately obvious how you can work with these images, but later on I'll explain how to do it easily. You can also feel free to reach out to us for more information on working with larger commercial datasets.

There are plenty of commercial companies acquiring satellite images. Commercial datasets are primarily provided by Maxar, Planet Labs, Airbus Defence & Space, Imagesat and SkyWatch. A year ago, Planet Labs launched 150 satellites, each the size of a shoebox. So right now there is a huge constellation of small satellites capturing images, and you can currently get a new image every two days. Another interesting company to watch is SkyWatch, a hub for satellite images: they aggregate images from all of the other providers rather than operating their own satellites, and they are a good place to find decent prices for commercial satellite images.

I am often asked about image prices. Prices range from a few dollars for a single image to around $1,000 for the highest possible quality image. So if you want to identify people in a lot of images, or you need a consistent and precise historical record for research, it is going to be quite expensive at the moment. However, given how fast the technology is progressing, prices should decline in the future. In a sense, a wave of new satellite image technology is coming, and it hasn't reached its peak yet. If you start researching now, you will be on top of the wave when satellite images are cheap and widely available. Now is the perfect time to start playing with satellite image datasets.

Satellite Images: Spatial and Temporal Resolution

When selecting datasets, the first consideration is image resolution. The finer the resolution, the more detail we are able to see, though there are some tradeoffs which we'll discuss soon. The plot below shows how spatial resolution has changed over the years: we started at 100 m in 1970, and now we're down to 25-30 centimeters.


Spatial and Temporal Resolution

Spatial resolution is not the only resolution we need to consider when designing solutions based on satellite imagery. Equally important, and sometimes even more important, is the temporal resolution. How often do we get a picture of a given area? What is the revisit time?

Landsat, one of the publicly available satellite image datasets, gives you 30 m resolution, with one picture every 14 days. Sentinel gives you 10 m resolution every 5 to 7 days. If you are willing to invest in your project, commercial providers offer much better resolution and image frequency.
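For quick reference, here is that comparison as a small R snippet; the figures are the approximate ones quoted above:

# Approximate spatial and temporal resolution of the two main public datasets
library(tibble)

tribble(
  ~dataset,   ~spatial_resolution, ~revisit_time,
  "Landsat",  "30 m",              "every 14 days",
  "Sentinel", "10 m",              "every 5-7 days"
)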

Satellite Images: Layers of Information

The way that sensors work in satellites is a really exciting topic. A satellite image is more than just a picture taken with a normal camera. Humans are able to perceive red, green, and blue, but a satellite can capture much more electromagnetic information than that. Some satellites have 12 sensors, which means that you get an image with 12 layers of information.


Satellite images can contain more layers of information than typical images

For us, an image is just a matrix with values for red, green, and blue, but from a satellite you get many more values than the human eye is able to process. For example, a satellite image can include an infrared channel, which can be used to assess the health of vegetation. This is completely new data that one would not be able to detect with the naked eye: chlorophyll in plants reflects near-infrared light strongly, so changes in the infrared channel allow the detection of sick plants from space. There is plenty more you can do with these extra layers of information. For instance, you can also detect moisture levels on the surface of the planet, which cannot be done easily with standard visual color information.
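As an illustration, here is a minimal sketch of the vegetation-health idea in R. The file name "scene.tif" is a placeholder, and the band numbers assume the Sentinel-2 layout (band 4 is red, band 8 is near-infrared); adjust both to your own data:

# Compute NDVI (normalized difference vegetation index) from a multi-band raster
library(terra)

scene <- rast("scene.tif")  # hypothetical multi-band satellite image
red <- scene[[4]]           # red band (band 4 in the Sentinel-2 layout)
nir <- scene[[8]]           # near-infrared band (band 8 in the Sentinel-2 layout)

# Healthy vegetation reflects near-infrared strongly,
# so higher NDVI values indicate healthier plants
ndvi <- (nir - red) / (nir + red)
plot(ndvi, main = "NDVI")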

There is also radar technology. Many of you know LIDAR, which tells you the height of a given surface. It is important to note that clouds are a huge problem when it comes to optical satellite images. There are plenty of people working on algorithms to eliminate the cloud problem in satellite imaging, but there is no ideal solution yet. Radar allows you to look through clouds, but you don't get all of the other layers of information mentioned above; you really only get quality information about elevation.

On the right we have a visible-band image of an area in Sumatra, Indonesia. As you can see, our view is obstructed by clouds. On the left we see the same area in a radar image. More and more radar images are becoming available, which is useful because radar can see through clouds. This can be crucial in many cases.


Radar mitigates cloud cover in a satellite image

One thing to keep in mind when you choose a data source: depending on your project, you may not necessarily need the best possible resolution. You might want to experiment with the free images at a resolution of 10 m or 30 m. You should also investigate what the image provider actually gives you, since some platforms do part of the pre-processing work for you. For example, I mentioned that one image can amount to 1 GB of data. You can ask some providers to clip the image to one particular shapefile, and in return you'll get just 1 MB of data: a small image that covers only the area you wanted. This can be very helpful if you're working with a large number of images.
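If a provider doesn't offer clipping, you can do the same thing locally. Below is a minimal sketch with the terra package, using the hypothetical file names "scene.tif" and "area_of_interest.shp"; it assumes the raster and the shapefile share the same coordinate reference system (reproject with project() otherwise):

# Clip a large satellite scene to an area of interest defined by a shapefile
library(terra)

scene <- rast("scene.tif")           # full scene, possibly ~1 GB
aoi <- vect("area_of_interest.shp")  # polygon(s) describing the area you want

# crop() cuts the raster down to the AOI's bounding box;
# mask() then blanks out the pixels outside the polygon itself
clipped <- mask(crop(scene, aoi), aoi)

writeRaster(clipped, "aoi_only.tif")  # a much smaller file to work with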

Conclusion 

I hope you now have a sense of the many current and developing options for satellite imagery. In the past, such datasets were accessible to only a select few. Now there are many free and commercial platforms at your disposal. You can leverage temporal resolution, spatial resolution, and a dozen bands of the electromagnetic spectrum to aid in your projects. On top of that, the cost of acquiring satellite images has fallen significantly, and the tools that allow you to analyze these images have become cheaper and more widely available. In my next article on satellites, we will further explore how to use satellite images in practice, and I will explain why R is an excellent tool for analyzing them. I will share our experiences of trial and error with satellite images to save you time and effort.

References

Public Data Sets

Landsat 

Sentinel 

Commercial Data Sets 

Maxar 

Planetlabs

Airbus Defence & Space 

Imagesat 

Skywatch 

Thanks for reading! For more, follow me on LinkedIn.

Follow Appsilon Data Science on Social Media

Article How to Acquire Large Satellite Image Datasets for Machine Learning Projects comes from Appsilon Data Science | End­ to­ End Data Science Solutions.

To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End­ to­ End Data Science Solutions.


Version 0.4.0 of nnetsauce, with fruits and breast cancer classification

Posted: 27 Feb 2020 04:00 PM PST

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new version of nnetsauce, version 0.4.0, is now available on PyPI and for R. As usual, you can install it for Python using the following command (command line):

pip install nnetsauce

And if you're using R, it's still (R console):

library(devtools)
devtools::install_github("thierrymoudiki/nnetsauce/R-package")
library(nnetsauce)

The R version may lag slightly behind on some features; feel free to report this on GitHub or contact me directly. This new release is accompanied by a few goodies:

1) New features, detailed in the changelog.

2) A refreshed web page containing all the information about package installation, usage, the interface's work-in-progress documentation, and contributing to package development.

3) A specific RSS feed related to nnetsauce on this blog (there's still a general feed containing everything here).

4) A working paper related to Bayesianrvfl2Regressor, Ridge2Regressor, Ridge2Classifier, and Ridge2MultitaskClassifier: Quasi-randomized networks for regression and classification, with two shrinkage parameters. About Ridge2Classifier specifically, you can also consult this other paper.

Among nnetsauce's new features, there's a new model class called MultitaskClassifier, briefly described in the first paper mentioned in point 4. It's a multitask classification model based on regression models, with shared covariates. What does that mean? We use the figure below to start the explanation:

Figure: a toy dataset of four fruits described by color and shape; the fruit-name response column (left) is one-hot encoded into one binary column per class (right)

Imagine that we have 4 fruits at our disposal, and we would like to classify them as avocados (is an avocado a fruit?), apples, or tomatoes by looking at their color and shape. What we called covariates earlier in the model description are color and shape, also known as explanatory variables or predictors. The column containing fruit names in the figure (on the left) is the response: the variable that MultitaskClassifier must learn to classify, and which in practice typically contains many more observations. This raw response is transformed into a one-hot encoded one (on the right).

Instead of one response vector, we now have three different responses, and instead of one classification problem on one response, three two-class classification problems on three responses: is this fruit an apple or not? Is this fruit a tomato or not? Is this fruit an avocado or not? All three problems share the same covariates: color and shape.
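As an illustration, here is a minimal sketch of that one-hot encoding step in R; it is illustrative only, since nnetsauce performs this transformation internally:

# One-hot encode a toy fruit response into one binary column per class
fruits <- factor(c("apple", "tomato", "avocado", "apple"))

# "- 1" drops the intercept, so we get one indicator column per level
one_hot <- model.matrix(~ fruits - 1)
colnames(one_hot) <- levels(fruits)
one_hot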

MultitaskClassifier can use any regressor (meaning, a statistical learning model for continuous responses) to solve these three problems, with the same regressor being used for all three of them, which is a priori a relatively strong hypothesis. The regressor's predictions on each response are interpreted as raw probabilities that the fruit belongs to that class or not.
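For example, if the three regressions returned the hypothetical raw scores below for a single fruit, the class with the largest score would be the natural prediction:

# Hypothetical raw per-class scores from the three regressions for one fruit
raw_scores <- c(apple = 0.70, tomato = 0.15, avocado = 0.15)

names(which.max(raw_scores))  # "apple": the class with the largest raw score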

We now use MultitaskClassifier on breast cancer data, as we did in this post for AdaBoostClassifier, to illustrate how it works. The R version of this code would be almost identical, essentially replacing the "."s with "$"s.

Import packages:

import nnetsauce as ns
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from time import time

Model fitting on training set:

breast_cancer = load_breast_cancer()
Z = breast_cancer.data
t = breast_cancer.target

# Training/testing datasets (using a reproducibility seed)
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(Z, t, test_size=0.2)

# Linear regression is used here (can be any regression model)
regr = LinearRegression()
fit_obj = ns.MultitaskClassifier(regr, n_hidden_features=5,
                                 n_clusters=2, type_clust="gmm")

# Model fitting on the training set
start = time()
fit_obj.fit(X_train, y_train)
print(time() - start)

# Model accuracy on the test set
print(fit_obj.score(X_test, y_test))

# Area under the curve
print(fit_obj.score(X_test, y_test, scoring="roc_auc"))

These results can be found in nnetsauce/demo/. MultitaskClassifier's accuracy on this dataset is 99.1%, and other indicators such as precision are also around 99%. Let's visualize the misclassification results on the same dataset, as we did in a previous post.

Figure: classification results on the breast cancer test set, by true class

In this case, with MultitaskClassifier and no advanced hyperparameter tweaking, there is one patient out of 114 who is misclassified. A robust way to understand MultitaskClassifier's accuracy on this dataset with these same parameters is to repeat the same procedure for multiple random reproducibility seeds (see the code; the training and testing sets change randomly when we change the seed, and we change MultitaskClassifier's seed too).

We obtain the results below for 100 reproducibility seeds. The accuracy is always at least 90%, mostly around 95%, and quite often higher than 98% (with no advanced hyperparameter tweaking).

Figure: distribution of accuracy over 100 reproducibility seeds



Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs' interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.
