[R-bloggers] Three ways of visualizing a graph on a map (and 6 more aRticles) |
- Three ways of visualizing a graph on a map
- Defining Marketing with the Rvest and Tidytext Packages
- Harry Potter and rankings with comperank
- Algorithmic Trading: Using Quantopian’s Zipline Python Library In R And Backtest Optimizations By Grid Search And Parallel Processing
- OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package
- New Version of ggplot2
- Does financial support in Australia favour residents born elsewhere? Responding to racism with data
Three ways of visualizing a graph on a map Posted: 31 May 2018 05:32 AM PDT (This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers) When visualizing a network with nodes that refer to a geographic place, it is often useful to put these nodes on a map and draw the connections (edges) between them. That way, we can directly see the geographic distribution of nodes and their connections in our network. This is different from a traditional network plot, where the placement of the nodes depends on the layout algorithm that is used (which may, for example, form clusters of strongly interconnected nodes). In this blog post, I'll present three ways of visualizing network graphs on a map using R with the packages igraph, ggplot2 and optionally ggraph. Several properties of our graph should be visualized along with the positions on the map and the connections between them. Specifically, the size of a node on the map should reflect its degree, the width of an edge between two nodes should represent the weight (strength) of this connection (since we can't use proximity to illustrate the strength of a connection when we place the nodes on a map), and the color of an edge should illustrate the type of connection (some categorical variable, e.g. a type of treaty between two international partners).
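To make this concrete, here is a minimal sketch of the kind of node and edge data the post builds on. The country selection, column names and edge categories below are illustrative assumptions, not the post's exact data:

    library(tidyverse)
    library(igraph)

    # Illustrative nodes: a few countries with an ID and geo-coordinates
    # (the post picks 15 random countries; these four are made-up examples)
    nodes <- tibble(
      id   = 1:4,
      name = c("Germany", "Canada", "Japan", "Brazil"),
      lon  = c(10.45, -106.35, 138.25, -51.93),
      lat  = c(51.17, 56.13, 36.20, -14.24)
    )

    # Illustrative edges: connections between node IDs with a weight and a
    # categorical type (e.g. a kind of treaty between two partners)
    edges <- tibble(
      from     = c(1, 1, 2, 3),
      to       = c(2, 3, 3, 4),
      weight   = c(3, 8, 5, 2),
      category = c("A", "B", "A", "C")
    )

    # Build the graph; the degree of each node will later drive its size on the map
    g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
    nodes$weight <- degree(g)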
Preparation

We'll need to load the following libraries first. Now, let's load some example nodes. I've picked some random countries with their geo-coordinates. So we now have 15 countries, each with an ID, geo-coordinates (longitude and latitude) and a name. Next we define the edges between them. Each of these edges defines a connection via the IDs of its two end nodes. Our nodes and edges fully describe a graph, so we can now generate a graph structure with igraph. We now create some data structures that will be needed for all the plots that we will generate. At first, we create a data frame for plotting the edges. This data frame is based on the edges, with the geo-coordinates of the start and end node joined to each edge so that the connections can be drawn on the map. Let's give each node a weight and use the degree metric for this. This will be reflected by the node sizes on the map later. Now we define a common ggplot2 theme that is suitable for displaying maps (sans axes and grids). Not only will the theme be the same for all plots, but they will also share the same world map as "background", using the country polygons from ggplot2's map data.

Plot 1: Pure ggplot2

Let's start simple by using ggplot2. We'll need three geometric objects (geoms) in addition to the country polygons from the world map: one for the edges (drawn as curves), one for the node points and one for the node labels. A warning will be displayed in the console saying "Scale for 'size' is already present. Adding another scale for 'size', which will replace the existing scale." This is because we used the "size" aesthetic and its scale twice, once for the node size and once for the line width of the curves. Unfortunately, you cannot use two different scales for the same aesthetic even when they're used for different geoms (here: "size" for both node size and the edges' line widths). There is also no alternative to "size" I know of for controlling a line's width in ggplot2. With ggplot2, we're left with deciding which geom's size we want to scale. Here, I go for a static node size and a dynamic line width for the edges.

Plot 2: ggplot2 + ggraph

Luckily, there is an extension to ggplot2 called ggraph with geoms and aesthetics added specifically for plotting network graphs. This allows us to use separate scales for the nodes and edges. By default, ggraph will place the nodes according to a layout algorithm that you can specify. However, we can also define our own custom layout using the geo-coordinates as node positions. We pass this custom layout to ggraph. The edges' widths can be controlled with ggraph's dedicated edge-width aesthetic, which is separate from the size aesthetic used for the nodes. Note that the plot's edges are drawn differently than with the ggplot2 graphics before. The connections are still the same, only the placement is different due to the different layout algorithms used by ggraph. For example, the turquoise edge line between Canada and Japan has moved from the very north to south across the center of Africa.

Plot 3: the hacky way (overlay several ggplot2 "plot grobs")

I do not want to withhold another option which may be considered a dirty hack: You can overlay several separately created plots (with transparent background) by annotating them as "grobs" (short for "graphical objects"). This is probably not how grob annotations should be used, but anyway it can come in handy when you really need to overcome the aesthetics limitation of ggplot2 described above in plot 1. As explained, we will produce separate plots and "stack" them. The first plot will be the "background" which displays the world map as before. The second plot will be an overlay that only displays the edges. Finally, a third overlay shows only the points for the nodes and their labels. With this setup, we can control the edges' line widths and the nodes' point sizes separately because they are generated in separate plots.
The two overlays need to have a transparent background, so we define it with a theme. The base or "background" plot is easy to make and only shows the map. Now we create the first overlay with the edges, whose line width is scaled according to the edges' weights. The second overlay shows the node points and their labels. Finally, we combine the overlays using grob annotations. Note that proper positioning of the grobs can be tedious; I found that it takes some trial and error to make the overlays line up exactly with the base map. As explained before, this is a hacky solution and should be used with care. Still, it can be useful in other circumstances, too: for example, when you need different scales for point sizes and line widths in line graphs, or different color scales in a single plot, this approach might be an option to consider. All in all, network graphs displayed on maps can be useful to show connections between the nodes in your graph on a geographic scale. A downside is that it can look quite cluttered when you have many geographically close points and many overlapping connections. It can be useful then to show only certain details of a map or add some jitter to the edges' anchor points. The full R script is available as a gist on GitHub.
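As a rough illustration of the overlay trick described above, here is a hedged sketch. The transparent theme and the annotation_custom() calls convey the idea; the object names p_edges and p_nodes are placeholders for the overlay plots, and the gist linked above has the authoritative code:

    library(ggplot2)

    # A theme with a fully transparent background for the overlay plots
    theme_transp_overlay <- theme(
      panel.background = element_rect(fill = "transparent", color = NA),
      plot.background  = element_rect(fill = "transparent", color = NA)
    )

    # Base plot: the world map (country polygons from ggplot2's map data;
    # requires the maps package to be installed)
    world <- map_data("world")
    p_base <- ggplot() +
      geom_polygon(data = world, aes(x = long, y = lat, group = group),
                   fill = "grey90", color = "grey70", size = 0.1)

    # p_edges and p_nodes would be built separately with theme_transp_overlay,
    # one drawing only the edge curves, the other only the node points/labels.
    # They are then stacked on top of the base map as grobs:
    # p_final <- p_base +
    #   annotation_custom(ggplotGrob(p_edges), xmin = -180, xmax = 180,
    #                     ymin = -90, ymax = 90) +
    #   annotation_custom(ggplotGrob(p_nodes), xmin = -180, xmax = 180,
    #                     ymin = -90, ymax = 90)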
To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – WZB Data Science Blog.
Defining Marketing with the Rvest and Tidytext Packages Posted: 30 May 2018 05:00 PM PDT (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers) I am preparing to facilitate another session of the marketing course for the La Trobe University MBA. The first lecture delves into the definition of marketing. Like most other social phenomena, marketing is tough to define. Definitions of social constructs often rely on the perspective taken by the person or group writing the definition. As such, definitions also change over time. While a few decades ago definitions of marketing revolved around sales and advertising, contemporary definitions are more holistic and reference creating value. Heidi Cohen wrote a blog post where she collated 72 definitions of marketing. So rather than arguing over which definition is the best, why not use all definitions simultaneously? This article attempts to construct a new definition of marketing using a data science approach. We can use the R language to scrape the 72 definitions from Heidi's website and apply text analysis to extract the essence of marketing from this data set. I have mentioned in a previous post about qualitative data science that automated text analysis is not always a useful method to extract meaning from a text. I decided to delve a little deeper into automated text analysis to see if we can find out anything useful about marketing using the rvest and tidytext packages. The presentation below shows the slides I use in my introductory lecture into marketing. The code and analyses are shown below the slideshow. You can download the most recent version of the code from my GitHub repository.
Scraping text with Rvest

Web scraping is a common technique to download data from websites where this data is not available as a clean data source. Web scraping starts with downloading the HTML code from the website and then filtering the wanted text from this file. The rvest package makes this process very easy. The code for this article uses a pipe (%>%) to chain the individual steps. The result of the scraping process is converted to a Tibble, which is a type of data frame used in the Tidyverse. The definition number is added to the data, and the Tibble is converted to the format required by the Tidytext package. The resulting data frame is much longer than the 72 definitions because there are other lists on the page. Unfortunately, I could not find a way to filter only the 72 definitions.

    library(tidyverse)
    library(rvest)

    definitions <- read_html("https://heidicohen.com/marketing-definition/") %>%
        html_nodes("ol li") %>%
        html_text() %>%
        as_data_frame() %>%
        mutate(No = 1:nrow(.)) %>%
        select(No, Definition = value)

Tidying the Text

The Tidytext package extends the tidy data philosophy to text. In this approach to text analysis, a corpus consists of a data frame where each word is a separate item. The code snippet below takes the first 72 rows and splits the Definition column into separate words. The last section of the pipe removes the trailing "s" from each word to convert plurals into single words. The mutate function in the Tidyverse creates or recreates a variable in a data frame.

    library(tidytext)

    def_words <- definitions[1:72, ] %>%
        unnest_tokens(word, Definition) %>%
        mutate(word = gsub("s$", "", word))

This section creates a data frame with two variables. The No variable indicates the definition number (1–72) and the word variable is a word within the definition. The order of the words is preserved in the row name. To check the data frame, you can print it to the console or view it in RStudio.

Using Rvest and Tidytext to define marketing

We can now proceed to analyse the definitions scraped from the website with Rvest and cleaned with Tidytext. The first step is to create a word cloud, which is a popular way to visualise word frequencies. This code creates a data frame for each unique word, excluding the word marketing itself, and uses the wordcloud package to visualise the fifty most common words.

    library(wordcloud)
    library(RColorBrewer)

    word_freq <- def_words %>%
        anti_join(stop_words) %>%
        count(word) %>%
        filter(word != "marketing")

    word_freq %>%
        with(wordcloud(word, n, max.words = 50, rot.per = .5,
                       colors = rev(brewer.pal(5, "Dark2"))))

While a word cloud is certainly a pretty way to visualise the bag of words in a text, it is not the most useful way to get the reader to understand the data. The words are jumbled, and the reader needs to search for meaning. A better way to visualise word frequencies is a bar chart. This code takes the data frame created in the previous snippet, determines the ten most frequent words and plots them as a horizontal bar chart. The mutate statement reorders the factor levels so that the words are plotted in order.

    word_freq %>%
        top_n(10) %>%
        mutate(word = reorder(word, n)) %>%
        ggplot(aes(word, n)) +
            geom_col(fill = "dodgerblue4") +
            coord_flip() +
            theme(text = element_text(size = 20))

A first look at the word cloud and bar chart suggests that marketing is about customers and products and services. Marketing is a process that includes branding and communication; a simplistic but functional definition.

Topic Modeling using Tidytext

Word frequencies are a weak method to analyse text because they treat each word as a solitary unit.
Topic modelling is a more advanced method that analyses the relationships between words, i.e. the distance between them. The first step is to create a Document-Term Matrix, which is a matrix that indicates how often a word appears in a text. As each of the 72 texts is very short, I decided to treat the collection of definitions as one text about marketing. The cast_dtm function converts the data frame to a Document-Term Matrix (the LDA function used below comes from the topicmodels package). The following pipe determines the top words in the topics. Just like k-means clustering, the analyst needs to choose the number of topics before analysing the text. In this case, I have opted for 4 topics. The code determines the contribution of each word to the four topics and selects the five most common words in each topic. The faceted bar chart shows each of the words in the four topics.

    library(topicmodels)

    marketing_dtm <- word_freq %>%
        mutate(doc = 1) %>%
        cast_dtm(doc, word, n)

    marketing_lda <- LDA(marketing_dtm, k = 4) %>%
        tidy(matrix = "beta") %>%
        group_by(topic) %>%
        top_n(5, beta) %>%
        ungroup() %>%
        arrange(topic, -beta)

    marketing_lda %>%
        mutate(term = reorder(term, beta)) %>%
        ggplot(aes(term, beta, fill = factor(topic))) +
            geom_col(show.legend = FALSE) +
            facet_wrap(~topic, scales = "free") +
            coord_flip() +
            theme(text = element_text(size = 20))

This example also does not tell me much more about what marketing is, other than giving a slightly more sophisticated version of the word frequency charts. This chart shows me that marketing is about customers that enjoy a service and a product. Perhaps the original definitions are not distinctive enough to be separated from each other. The persistence of the word "president" is interesting as it seems to suggest that marketing is something that occurs at the highest levels in the business.
What have we learnt? This excursion into text analysis using rvest and Tidytext shows that data science can help us to make some sense of an unread text. If I did not know what this page was about, then perhaps this analysis would enlighten me. This kind of analysis can assist us in wading through large amounts of text to select the pieces we want to read. I am still not convinced that this type of analysis will provide any knowledge beyond what can be obtained from actually reading and engaging with a text. Although I am a data scientist and want to maximise the use of code in analysing data, I am very much in favour of developing human intelligence before we worry about the artificial kind. The post Defining Marketing with the Rvest and Tidytext Packages appeared first on The Devil is in the Data.
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.
Harry Potter and rankings with comperank Posted: 30 May 2018 05:00 PM PDT (This article was first published on QuestionFlow, and kindly contributed to R-bloggers) Ranking Harry Potter books with the comperank package. Prologue. Package comperank is on CRAN now. It offers consistent implementations of several ranking and rating methods. Originally, it was intended to be my first CRAN package when I started to build it 13 months ago. Back then I was very curious to learn about the different ranking and rating methods that are used in sport. This led me to two conclusions:
These discoveries motivated me to write my first ever CRAN package. Things didn't turn out the way I was planning, and comperank ended up arriving later than intended. After diverging into creating this site and writing ruler in a pair with keyholder, a few months ago I returned to competition results and rankings. The experience gained helped me to improve the functional API of both packages, which eventually resulted in submitting them to CRAN. Overview. This post, as one of the previous ones, has two goals:
We will cover the following topics:
Another very interesting set of ranking methods is also implemented in comperank. The idea behind converting survey results into competition results is described in the aforementioned post. We will need the following setup (a minimal sketch is shown below). Functionality of comperank. A rating is considered to be a list (in the ordinary sense) of numerical values, one for each player, or the numerical value itself. Its interpretation depends on the rating method: either a bigger value indicates better player performance, or the other way around. A ranking is considered to be a rank-ordered list (in the ordinary sense) of players: rank 1 indicates the player with the best performance.
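A minimal setup sketch. The tiny data set below is invented purely for illustration (it is not the survey data used in the original post), but it shows the long competition-results format that comperank expects: every respondent is a "game", every book a "player" and the mark a "score":

    library(dplyr)
    library(comperes)   # competition results structures (as_longcr, to_pairgames)
    library(comperank)  # ranking and rating methods

    # Made-up example in long competition-results format
    hp_cr <- tibble(
      game   = rep(1:3, each = 2),
      player = c("HP_1", "HP_3", "HP_3", "HP_5", "HP_1", "HP_5"),
      score  = c(4, 5, 5, 3, 3, 4)
    ) %>%
      as_longcr()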
There are three sets of functions:
Exploration ranking. Previously we established that "Harry Potter and the Prisoner of Azkaban" seems to be "the best" book and "Harry Potter and the Chamber of Secrets" comes last. This was evaluated by mean score. As simple as it is, this approach might leave some available information unused. The survey was originally designed to obtain information not only about the books' performance as separate objects, but also about possible pairwise relationships between them. Maybe some book is considered generally "not the best" but it "outperforms" some other "better" book. This was partially studied in "Harry Potter and competition results with comperes" by computing different Head-to-Head values and studying them manually. Here we will attempt to summarise the books' performance based on their Head-to-Head relationships. Rankings with fixed H2H structure. In comperank, the methods with fixed Head-to-Head structure expect games between exactly two players (pairgames).
Being very upset for a moment, we realize that in the dataset under study there are games with a different number of players. Fortunately, comperes can convert such competition results into pairwise games (one game for every pair of players within the original game). Massey method. The idea of the Massey method is that the difference in ratings should be proportional to the score difference in direct confrontations. A bigger value indicates better competition performance. Colley method. The idea of the Colley method is that ratings should be proportional to the share of games the player has won. A bigger value indicates better player performance. Both Massey and Colley give the same result, differing from the exploration ranking only in how "HP_5" ("Order of the Phoenix") and "HP_7" ("Deathly Hallows") are treated: "HP_5" moved up from 6th to 4th place. Rankings with variable H2H structure. All algorithms with variable Head-to-Head structure depend on the user supplying a custom Head-to-Head expression for computing the quality of direct confrontations between all pairs of players of interest. There is much freedom in choosing a Head-to-Head structure appropriate for ranking. For example, it can be "number of wins plus half the number of ties" (one of the common Head-to-Head expressions shipped with comperes). Keener method. The Keener method is based on the idea of "relative strength" – the strength of the player relative to the strength of the players he/she has played against. This is computed based on the provided Head-to-Head values and some flexible algorithmic adjustments to make the method more robust. A bigger value indicates better player performance. Results for the Keener method again raised "HP_5" one step up, to third place. Markov method. The main idea of the Markov method is that players "vote" for other players' performance. Voting is done with Head-to-Head values: the bigger the value, the more "votes" player2 (the "column-player") gives to player1 (the "row-player"). For example, if the Head-to-Head value is "number of wins", then player2 "votes" for player1 proportionally to the number of times player1 won in a matchup with player2. The actual "voting" is done in Markov chain fashion: Head-to-Head values are organized in a stochastic matrix whose vector of stationary probabilities is declared to be the output rating. A bigger value indicates better player performance. We can see that the Markov method put "HP_4" ("Goblet of Fire") in second place. This is due to its reasonably good performance against the leader "HP_3" ("Prisoner of Azkaban"): the mean score difference is only 0.05 in "HP_3"'s favour. Doing well against the leader has a great impact on the output ranking of the Markov method, which somewhat resonates with common sense. Offense-Defense method. The idea of the Offense-Defense (OD) method is to account for different abilities of players by combining different ratings:
Offensive and defensive ratings describe different skills of players. In order to rate players fully, OD ratings are computed: the offensive rating divided by the defensive one. The higher the OD rating, the better the player's performance. All methods give almost equal results, again differing only in the ranks of "HP_5" and "HP_7". Combined rankings. To obtain averaged, and hopefully less "noisy", rankings we will combine the rankings produced with all of the methods above (a sketch of this step follows below). As we can see, although different ranking methods handle results differently for books with "middle performance", the combined rankings are only slightly different from the exploration ones. The only notable difference is the switched rankings of "Order of the Phoenix" and "Deathly Hallows". Conclusion
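The combination step described above could look roughly like the sketch below. It assumes the comperank/comperes interface as documented (to_pairgames(), rate_massey(), rate_colley()); the output column names (rating_massey, rating_colley) are assumptions, and the original post combines more methods than these two:

    # Convert games with more than two players into pairwise games (assumed
    # comperes helper), since fixed-H2H methods expect pairgames
    hp_pair <- to_pairgames(hp_cr)

    # Ratings from two fixed-structure methods (bigger value = better book)
    massey <- rate_massey(hp_pair)
    colley <- rate_colley(hp_pair)

    # Combine: turn each rating into a rank (1 = best) and average the ranks
    combined <- massey %>%
      left_join(colley, by = "player") %>%
      mutate(
        rank_massey   = rank(-rating_massey),
        rank_colley   = rank(-rating_colley),
        rank_combined = rank((rank_massey + rank_colley) / 2)
      ) %>%
      arrange(rank_combined)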
To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow.
Algorithmic Trading: Using Quantopian's Zipline Python Library In R And Backtest Optimizations By Grid Search And Parallel Processing Posted: 30 May 2018 05:00 PM PDT (This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers) We are ready to demo our new experimental package for algorithmic trading, flyingfox. New: Business Science Labs. I (Davis) am excited to introduce a new open source initiative called Business Science Labs. A lot of the experimental work we do is done behind the scenes, and much of it you don't see early on. What you do see is a "refined" version of what we think you need based on our perception, which is not always reality. We aim to change this. Starting today, we have created Business Science Labs, which is aimed at bringing our experimental software to you earlier so you can test it out and let us know your thoughts! Our first initiative is to bring Quantopian's open source algorithmic trading Python library, Zipline, to R through the flyingfox package. What We're Going To Learn. Introducing Business Science Labs is exciting, but we really want to educate you on some new packages! In this tutorial, we are going to go over how to backtest algorithmic trading strategies using parallel processing and Quantopian's Zipline infrastructure in R. You'll gain exposure to tibbletime, furrr, and our experimental flyingfox package.
Here's an example of the grid search we perform to determine which are the best combinations of short and long moving averages for the stock symbol JPM (JP Morgan). Here's an example of the time series showing the order (buy/sell) points determined by the moving average crossovers, and the effect on the portfolio value. Algorithmic Trading Strategies And Backtesting. Algorithmic trading is nothing new. Financial companies have been performing algorithmic trading for years as a way of attempting to "beat" the market. It can be very difficult to do, but some traders have successfully applied advanced algorithms to yield significant profits. Using an algorithm to trade boils down to buying and selling. In the simplest case, when an algorithm detects an asset (a stock) is going to go higher, a buy order is placed. Conversely, when the algorithm detects that an asset is going to go lower, a sell order is placed. Positions are managed by buying and selling all or part of the portfolio of assets. To keep things simple, we'll focus on just the full buy/sell orders. One very basic method of algorithmic trading is using short and long moving averages to detect shifts in trend. The crossover is the point where a buy/sell order would take place. The figure below shows the price of Halliburton (symbol "HAL"), in which a trader would have an initial position of, say, 10,000 shares. In a hypothetical case, the trader could use a combination of a 20 day short moving average and a 150 day long moving average and look for buy/sell points at the crossovers. If the trader hypothetically sold the position in full at the sell signal and bought it back in full at the buy signal, the trader would stand to avoid a loss of approximately $5/share during the downswing, or $50,000. Backtesting is a technique used to evaluate how a trading strategy would have performed in the past. It's impossible to know what the future will bring, but using trading strategies that worked in the past helps to instill confidence in an algorithm. Quantopian is a platform designed to enable anyone to develop algorithmic trading strategies. To help its community, Quantopian provides several open source tools. The one we'll focus on is zipline, its backtesting library. With the advent of the reticulate package, we can now use zipline from R. RStudio Cloud Experiment Sandbox. In this code-based tutorial, we'll use an experimental package called flyingfox, our R interface to Zipline. Packages Needed For Backtest Optimization. The meat of this code-tutorial is the section "Backtest Optimization Using tibbletime + furrr + flyingfox". However, before we get to it, we'll go over the three main packages used to do high-performance backtesting optimizations:
Putting It All Together: Install & Load Libraries. Install Packages: For this post, you'll need to install the development version of flyingfox from GitHub. If you are on Windows, you may also need development versions of some of its dependencies. Other packages you'll need include tibbletime, furrr and the tidyverse. Load Packages: We'll cover how a few packages work before jumping into backtesting and optimizations. 1. tibbletime. The tibbletime package makes tibbles time-aware, adding tools such as time-based filtering, period collapsing and rolling functions.
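A hedged install sketch; the GitHub location of flyingfox is an assumption (check the links in the original post for the authoritative source):

    # install.packages("devtools")
    devtools::install_github("DavisVaughan/flyingfox")   # assumed repository

    # CRAN packages used throughout the tutorial
    install.packages(c("tidyverse", "tibbletime", "furrr"))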
It's best to learn it now, and we'll go over the basics along with a few commonly used functions. First, let's get some data. We'll use the FANG data set that comes with tibbletime. Next, you'll need to convert this tibble to a time-aware tibble (a tbl_time object). Beautiful. Now we have a time-aware tibble. Let's test out some functions. collapse_by(): First, let's take a look at collapse_by(), which collapses the time index to a less granular period. rollify(): Next, let's take a look at rollify(), which turns any function into a rolling version of itself. filter_time(): Let's check out filter_time(), which filters rows using a succinct time-based syntax. as_period(): We can use the as_period() function to change the periodicity of the data, for example from daily to monthly. Next, let's check out a new package for parallel processing using purrr-style mapping: furrr. 2. furrr. The furrr package combines the mapping functions of purrr with the parallel processing capabilities of future. The purrr package provides the map() family of functions for iterating over lists and vectors. The future package provides a unified way of evaluating R expressions asynchronously, for example on multiple background R sessions.
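A short sketch of those four tibbletime functions on the FANG data; the window sizes and periods are illustrative, not necessarily the ones used in the post:

    library(tibbletime)
    library(dplyr)

    data(FANG)  # daily prices for FB, AMZN, NFLX, GOOG, shipped with tibbletime

    # Convert to a time-aware tibble, grouped by stock symbol
    fang_time <- FANG %>%
      group_by(symbol) %>%
      as_tbl_time(index = date)

    # collapse_by(): collapse the daily index to yearly end points, then summarise
    fang_time %>%
      collapse_by(period = "yearly") %>%
      group_by(symbol, date) %>%
      summarise(mean_adjusted = mean(adjusted))

    # rollify(): turn mean() into a rolling 20-day mean and apply it with mutate()
    roll_mean_20 <- rollify(mean, window = 20)
    fang_time %>% mutate(ma_20 = roll_mean_20(adjusted))

    # filter_time(): keep only rows from 2015 (formula-based time filtering)
    fang_time %>% filter_time("2015" ~ "2015")

    # as_period(): change the periodicity from daily to monthly
    fang_time %>% as_period("monthly")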
Now, the major point: if you're already familiar with purrr, then you know how to use furrr, because furrr = future + purrr.
Every furrr function mirrors a purrr function and is prefixed with future_ (for example, future_map() is the parallel counterpart of map()). Example: Multiple Moving Averages. Say you would like to process not a single moving average but multiple moving averages for a given data set. We can create a custom function that computes a rolling mean for a given window size. We can test the function out with the FB stock prices from FANG. We'll ungroup, filter by FB, and select the important columns, then pass the data frame to the function. We can apply this function to a whole column of window sizes. Next, we can perform a rowwise map using the combination of mutate() and map(). Great, we have our moving averages. But… what if instead of 10 moving averages, we had 500? This would take a really long time to run on many stocks. Solution: parallelize with furrr. There are two ways we could do this since there are two maps:
We'll choose the former (1) to show off furrr. To make the rowwise map run in parallel, we only need to set up a plan() for how the parallel processing should run and then swap map() for its furrr counterpart (see the sketch below).
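A hedged sketch of the whole idea: compute many moving averages with a rowwise map, then parallelize it by setting up a plan() and swapping map() for future_map(). The rolling_mean() helper and the column names are illustrative; the post's exact code may differ:

    library(dplyr)
    library(purrr)
    library(furrr)
    library(tibbletime)

    # FB prices from the FANG data that ships with tibbletime
    data(FANG)
    fb <- FANG %>%
      filter(symbol == "FB") %>%
      select(date, adjusted)

    # Helper: rolling mean of the adjusted price for a given window size
    rolling_mean <- function(data, window) {
      roll <- rollify(mean, window = window)
      mutate(data, ma = roll(adjusted), window = window)
    }

    # A column of window sizes, one row per moving average we want
    windows <- tibble(window = seq(10, 100, by = 10))

    # Sequential rowwise map: one moving-average data frame per window
    mas <- windows %>%
      mutate(ma_data = map(window, ~ rolling_mean(fb, .x)))

    # Parallel version: set up a plan and switch map() for future_map()
    plan(multisession)
    mas_parallel <- windows %>%
      mutate(ma_data = future_map(window, ~ rolling_mean(fb, .x)))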
That's it! In the previous rowwise map, switch out map() for future_map(). Bam! 500 moving averages run in parallel in a fraction of the time it would take running in series. 3. flyingfox. We have one final package we need to demo prior to jumping into our Algorithmic Trading Backtest Optimization: flyingfox. What is Quantopian? Quantopian is a company that has set up a community-driven platform for everyone (from traders to home-gamers), enabling the development of algorithmic trading strategies. The one downside is that they only use Python. What is Zipline? Zipline is a Python module open-sourced by Quantopian to help traders back-test their trading algorithms. Here are some quick facts about Quantopian's zipline module:
What is reticulate? The reticulate package from RStudio is an interface with Python. It smartly takes care of (most) conversion between R and Python objects. Can you combine them? Yes, and that's exactly what we did. We used reticulate to call Zipline from R. What is the benefit to R users? What if you could write your trading algorithm in R and still use Quantopian's battle-tested Zipline backtesting engine? That's the benefit. Introducing flyingfox: An R interface to Zipline. flyingfox integrates the Zipline backtesting module into R via reticulate.
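For intuition, the reticulate bridge looks roughly like this (assuming the zipline Python module is installed in the environment reticulate is configured to use):

    library(reticulate)

    # Import the Python module; reticulate converts most objects between
    # R and Python automatically
    zipline <- import("zipline")
    class(zipline)  # a python.builtin.module object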
Why "Flying Fox"? Zipline just doesn't quite make for a good hex sticker. A flying fox is a synonym for zipliners, and it's hard to argue that this majestic animal wouldn't create a killer hex sticker. Getting Started With flyingfox: Moving Average Crossover. Let's do a Moving Average Crossover example using the following strategy:
Setup. Setup can take a while and take up some computer space due to ingesting data (which is where Zipline downloads and stores the historical price data it needs for backtesting).
Initialize. First, write the R function for initializing the algorithm: it sets up the trading context, for example the asset to trade and any counters or state the strategy needs. Handle Data. Next, write a handle-data function that is called on every bar; it computes the short and long moving averages and places buy/sell orders when they cross (a hedged sketch of both pieces follows below).
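A hedged sketch of the pieces just described. fly_run_algorithm(), its initialize/handle_data arguments and the context object follow the Initialize / Handle Data / Run steps above, but the exact flyingfox signatures and the data-access/order helpers are assumptions here, so the body of the handle-data function is left as comments:

    library(flyingfox)

    # Initialize: set up state on the context (asset to trade, a bar counter, ...)
    my_initialize <- function(context) {
      context$asset <- "JPM"
      context$i     <- 0L
    }

    # Handle data: called on every bar; it would compute the short and long
    # moving averages from the price history and place buy/sell orders on a
    # crossover. The history/order helper calls are omitted because their names
    # are not confirmed by this excerpt -- see the flyingfox documentation.
    my_handle_data <- function(context, data) {
      context$i <- context$i + 1L
      # if (short_ma > long_ma) buy; if (short_ma < long_ma) sell
    }

    # Run the backtest over the period used in the post (argument names assumed)
    results <- fly_run_algorithm(
      initialize  = my_initialize,
      handle_data = my_handle_data,
      start       = as.Date("2013-01-01"),
      end         = as.Date("2016-01-01")
    )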
Run The Algorithm. Finally, run the algorithm from 2013 to 2016 by passing both functions to flyingfox's run function, as in the sketch above. If you got to this point, you've just successfully run a single backtest. Let's review the performance output. Reviewing The Performance. Let's glimpse the performance tibble returned by the run.
First, let's plot the asset (JPM) along with the short and long moving averages. We can see there are a few crossovers. Next, we can investigate the transactions. Stored within the performance output is the transaction history, i.e. the buy and sell orders that were placed. Finally, we can visualize the results with ggplot2. Last, let's look at the portfolio value over time. Backtest Optimization Via Grid Search. Now for the main course: optimizing our algorithm using the backtested performance. To do so, we'll combine what we learned from our three packages. Let's say we want to use backtesting to find the optimal combination, or several best combinations, of short and long term moving averages for our strategy. We can do this using Cartesian Grid Search, which is simply creating a combination of all of the possible "hyperparameters" (parameters we wish to adjust). Recognizing that running multiple backtests can take some time, we'll parallelize the operation too. Preparation. Before we can do grid search, we need to adjust our handle-data function so that the short and long moving-average windows become parameters we can vary. Making The Grid. Next, we can create a grid of values from all combinations of short and long window sizes. Now that we have the hyperparameters, let's create a new column holding the backtest function we wish to run for each combination. Running Grid Search In Parallel Using furrr. Now for the Grid Search. We use the furrr mapping functions so that each backtest in the grid runs in parallel. Inspecting The Backtest Performance Results. The performance results are stored in the "results" column as nested tibbles. We can also get the final portfolio value using a combination of unnesting and pulling out the last portfolio value. We can turn this into a function and map it to all of the columns to obtain the "final_portfolio_value" for each of the grid search combinations. Visualizing The Backtest Performance Results. Now let's visualize the results to see which combinations of short and long moving averages maximize the portfolio value. It's clear that short >= 60 days and long >= 200 days maximize the return. But why? Let's get the transaction information (buy/sell) by unnesting the results and determining which transactions are buys and sells. Finally, we can visualize the portfolio value over time for each combination of short and long moving averages. By plotting the buy/sell transactions, we can see the effect on a stock with a bullish trend. The portfolios with the optimal performance are those that were bought and held rather than sold using the moving average crossover. For this particular stock, the benefit of downside protection via the moving average crossover costs the portfolio during the bullish uptrend. Conclusions. We've covered a lot of ground in this article. Congrats if you've made it through. You've now been exposed to three cool packages: tibbletime, furrr and flyingfox.
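As a recap of the grid-search step described above, here is a hedged sketch. run_backtest() stands in for a hypothetical wrapper around the single-backtest code that takes the two window sizes and returns a results tibble; the window grids and the portfolio_value column name are assumptions:

    library(dplyr)
    library(tidyr)
    library(purrr)
    library(furrr)

    plan(multisession)  # run the backtests in parallel

    # Cartesian grid of the "hyperparameters": all short/long window combinations
    grid <- crossing(
      short_ma = c(20, 40, 60, 80),
      long_ma  = c(150, 200, 250, 300)
    )

    results <- grid %>%
      # one backtest per row, run in parallel across the grid
      mutate(backtest = future_map2(short_ma, long_ma,
                                    ~ run_backtest(short = .x, long = .y))) %>%
      # pull the last portfolio value out of each nested result
      mutate(final_portfolio_value = map_dbl(backtest,
                                             ~ last(.x$portfolio_value)))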
Further, you've seen how to apply all three of these packages to perform grid search backtest optimization of your trading algorithm. Business Science University. If you are looking to take the next step and learn Data Science For Business (DS4B), you should consider Business Science University. Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You'll learn:
All while solving a REAL WORLD CHURN PROBLEM: Employee Turnover! DS4B Virtual Workshop: Predicting Employee Attrition. Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don't realize this because it's an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here's an example of a Shiny app you will create. Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301. Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:
The Virtual Workshop is intended for intermediate and advanced R users. It's code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process. Don't Miss A Beat
Connect With Business Science. If you like our software, our courses, or our company, connect with us on social media to stay up to date.
To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.
OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package Posted: 30 May 2018 10:19 AM PDT (This article was first published on R – rud.is, and kindly contributed to R-bloggers) Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add similar attributes). Attributes are nothing more than a series of key/value pairs. The key must be a character value and unique, and it's fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc. When you're in a terminal session you can tell that a file has extended attributes by looking for an @ at the end of the permissions column in an ls -l listing:

    $ cd ~/Downloads
    $ ls -l
    total 264856
    -rw-r--r--@ 1 user staff 169062 Nov 27 2017 1109.1968.pdf
    -rw-r--r--@ 1 user staff 171059 Nov 27 2017 1109.1968v1.pdf
    -rw-r--r--@ 1 user staff 291373 Apr 27 21:25 1804.09970.pdf
    -rw-r--r--@ 1 user staff 1150562 Apr 27 21:26 1804.09988.pdf
    -rw-r--r--@ 1 user staff 482953 May 11 12:00 1805.01554.pdf
    -rw-r--r--@ 1 user staff 125822222 May 14 16:34 RStudio-1.2.627.dmg
    -rw-r--r--@ 1 user staff 2727305 Dec 21 17:50 athena-ug.pdf
    -rw-r--r--@ 1 user staff 90181 Jan 11 15:55 bgptools-0.2.tar.gz
    -rw-r--r--@ 1 user staff 4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to drop into a shell every time you want to inspect them? I didn't think so. Thus begat the xattrs package. Exploring Past Downloads. Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let's use the xattrs package to dig into those downloads. We're not going to work with the entire package in this post (it's really straightforward to use and has a README on the GitHub site along with extensive examples) but I'll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example. First, let's see what is hidden with the RStudio disk image. There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms, which stores the download URL(s). Why "raw"? Well, as noted above, the value component of these attributes can store anything, and this one definitely has embedded nul[l]s in it. So, we can kinda figure out the URL but it's definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references). There will eventually be a native Rust-backed property list reading package for R, but we can work with that binary plist data in two ways: first, via the reticulate package and Python's plist tooling, or, second, with a hack-y call out to a command-line converter. I like to prime the Python setup first so reticulate picks up the right environment. That's much better. Let's work with metadata for the whole directory. Now we can focus on the task at hand: recovering the URLs. (There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.) Note: if Python is not an option for you, you can use the hack-y command-line route instead. FIN. Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!
Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
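A hedged sketch of the workflow described above. list_xattrs() and get_xattr_raw() are assumed to be the relevant xattrs functions (check the package README); the file path is one of the downloads listed earlier, and the binary property list is decoded with Python's standard plistlib module via reticulate:

    library(xattrs)
    library(reticulate)

    path <- "~/Downloads/RStudio-1.2.627.dmg"

    # Assumed: list the extended attribute names attached to the file
    list_xattrs(path)

    # Assumed: pull the raw bytes of the macOS "where from" attribute
    raw_val <- get_xattr_raw(path, "com.apple.metadata:kMDItemWhereFroms")

    # Decode the binary plist into an R list of download URLs
    plistlib <- import("plistlib")
    plistlib$loads(raw_val)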
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
New Version of ggplot2 Posted: 30 May 2018 08:30 AM PDT (This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers) I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here's what choroplethr users need to know about this new version of ggplot2. Choroplethr Update Required. The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25. This change is the third or fourth time I've had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: "More time is spent maintaining old software than writing new software." Simple Features Support. One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been "I can't do that until ggplot2 adds support for Simple Features." Thankfully, this new version of ggplot2 introduces that support! Currently all maps in the choroplethr ecosystem are stored as ggplot2 "fortified" dataframes. This is a format unique to ggplot2. Storing the maps in this format makes it possible to render the maps as quickly as possible. The downside is that:
Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:
After that, I can start experimenting with adding interactive graphics support to choroplethr. Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved. Thoughts on the CRAN Ecosystem. This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. ggplot2 is used by about 1,700 packages. It's unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25? And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated. This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user's code.
The post New Version of ggplot2 appeared first on AriLamstein.com.
To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.
Does financial support in Australia favour residents born elsewhere? Responding to racism with data Posted: 30 May 2018 08:09 AM PDT (This article was first published on blogR, and kindly contributed to R-bloggers)
Australian racism goes viral, again. Australian racism went viral again this year when a man was filmed abusing staff at Centrelink, which delivers social security payments and services to Australians (see story here). The man yells that he didn't vote for multiculturalism and that Centrelink is supporting everyone except "Australians". It is distressing to watch, especially as someone whose ancestors found a home in Australia having escaped persecution. He can't take it back, but the man did publicly apologise and may be suffering from mental illness (see story here). This topic is still polarising. Many of us want to vilify this man while others may applaud him. But hate begets hate, and fighting fire with fire makes things worse. As a data scientist, the best way I know to respond to the assumptions and stereotypes that fuel racism is with evidence. So, without prejudice, let us investigate the data and uncover to whom the Australian government provides support through Centrelink. Centrelink supports Australians, so who are we talking about? With rare exceptions, Centrelink supports Australian residents living in Australia (see here and here). So, the claim that Centrelink supports everyone but Australians is misguided. Perhaps the reference to "multiculturalism" can direct us to a more testable question. Centrelink offers support to Australian residents who can be born anywhere in the world. So in this article, I'll use publicly accessible data to investigate differences in support given to residents born in Australia or elsewhere. Estimated Residential Population. The Figure below shows changes in Australia's Estimated Residential Population, which is an official value published by the Australian Bureau of Statistics and used for policy formation and decision making. The residential population has increased from about 17 million in 1992 to over 24 million in 2016. In contrast, the percentage of residents who are Australian-born has decreased from 76.9% to 71.5%. This will guide our sense of whether Centrelink payments are unbiased. As a side note, Census statistics reported that the percentage of Australian-born residents in 2016 was 66.7% (4.8% lower than the official estimate above). This discrepancy is the result of the Australian Bureau of Statistics making adjustments that you can learn about here. All Centrelink Payments. Centrelink data is published quarterly and has included country-of-birth breakdowns since December 2016 (which aligns with the last available population data reported above). At this time, Centrelink made 15 million payments to Australian residents.
The data shows that Centrelink payments are made to residents born in Australia or elsewhere in approximately the same proportions as these groups are represented in the population. The difference of a couple of percentage points indicates that slightly fewer payments were going to Australian-born residents than we'd expect. As we'll see in the following section, this difference can be almost wholly accounted for by the Age Pension. Still, the difference is small enough to negate the claim that Centrelink substantially favours residents born outside Australia. Breakdown by Type. It's also possible to break down these total numbers into the specific payment types shown below (detailed list here). It's expected that these payment types, which support specific needs, will show biases in favour of certain groups. For example, ABSTUDY supports study costs and housing for Aboriginal or Torres Strait Islander residents. This should mostly go to Australian-born residents. To investigate, we can extend the Figure above to include the number of Australian-born recipients. Looking at this Figure, most Centrelink payments fall along the dotted line, which is what we'd expect from a fair system (if 71.5% of the recipients were Australian-born). The outlier is the Age Pension, which falls below the line. More recipients of the Age Pension are born outside Australia than is reflected in the total population. I cannot say from this data alone why there is some bias in the Age Pension, and perhaps a knowledgeable reader can comment. Nonetheless, this discrepancy is large enough that removing the Age Pension from consideration results in 70.5% of all other Centrelink payments going to Australian-born residents – almost exactly the proportion in the population. Ignoring Total Numbers. The Figure below shows the percentage of Australian-born recipients for each payment type, ignoring totals. At the upper end of the scale, we can see Australian-born recipients being over-represented for ABSTUDY and Youth Allowance payments. At the lower end, residents who are born outside Australia are over-represented for the Wife and Widow pension and allowance. These payments with large biases (in either direction) have some common features. They have very specific eligibility criteria and are among the least-awarded services (shown in earlier Figures). Furthermore, the granting of payments to new recipients has been stopped in some cases, such as the Wife Pension. These findings are consistent with the expectation that specific types of payments should be biased in specific ways. It also shows that substantial biases only arise for specific payments that are awarded to very few individuals. Concluding remarks. In response to a racist outburst, I sought out publicly available data to investigate whether there was evidence that the Australian Government unfairly supported residents based on their country of origin. I found that the percentage of residents born outside Australia has increased over time. However, with the minor exception of the Age Pension (which the outraged man was not eligible for), residents born in Australia or elsewhere were fairly represented in the total number of Centrelink payments. I'd like to thank the Australian Government and Australian Bureau of Statistics for publicising this data and making it possible for me to respond to racism with evidence. If you'd like to reproduce this work or dig into the data yourself, everything, from where I got the data to the code used to create this article, is freely available on GitHub.
You can also keep in touch with me on LinkedIn or by following @drsimonj on Twitter.
To leave a comment for the author, please follow the link and comment on their blog: blogR.