[R-bloggers] Uncertainty in Data Science (Transcript) (and 6 more aRticles)

Uncertainty in Data Science (Transcript)

Posted: 24 Sep 2018 09:00 AM PDT

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is a link to the podcast.

Introducing Allen Downey

Hugo: Hi, there, Allen, and welcome to DataFramed.

Allen: Hey, Hugo. Thank you very much.

Hugo: Such a pleasure to have you on the show, and I'm really excited to have you here to talk about uncertainty in data science, how we think about prediction, and how we can think probabilistically, and how we do it right, and how we can get it wrong as well, but before we get into that, I'd love to find out a bit about you, and so I'm wondering what you're known for in the data community.

Allen: Right. Well, I'm working on a book series that's called Think X, for all X, so hopefully some people know about that. Think Python is kind of the starting point, and then there's Think Stats and Think Bayes, for data science and for Bayesian statistics.

Hugo: Great, and so why Think?

Allen: It came about in a roundabout way. The original book was called How to Think Like a Computer Scientist, and it was originally a Java book, and then it became a Python book, and then it wasn't really about programming. It was about bigger ideas, and so then when I started the other books, the premise of the books is that you're using computation as a tool to learn something else, so it's a way of thinking, it's an approach to the topic, and so that's how we got to the schema that's always Think Something, for various values of Something.

Computation

Hugo: Right. I like that a lot, and speaking to this idea of computation, I know you're a huge proponent of the role of computation in helping us to think, so maybe you can speak to that for a minute.

Allen: Sure. I mean, it partly comes … I've been teaching in an engineering program, and engineering education has been very math-focused for a long time, so the curriculum, you have to take a lot of calculus and linear algebra before you get to do any engineering, and it doesn't have to be that way at all. I think there are a lot of ideas in engineering that you can get to very quickly computationally that are much harder mathematically.

Allen: One of the examples that comes up all the time is integration, which is a little bit of a difficult idea. Students, when they see an integral sign, immediately there's gonna be some challenge there, but if you do everything discretely, you can take all of those integrals, you just turn them into summations, and then if you do it computationally, you take all of the summations and turn them into for loops, and then you can have very clear code where you're looping through space, you're adding up all of the elements. That's what an integral is.
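To make that concrete, here is a minimal R sketch of the idea (an illustration, not code from the conversation): the integral of f(x) = x^2 over [0, 1] computed as a sum over thin slices inside a for loop; the exact answer is 1/3.

```r
# Approximate the integral of f(x) = x^2 on [0, 1] by summing thin rectangles
f <- function(x) x^2

dx <- 0.001                    # width of each slice
total <- 0
for (x in seq(0, 1 - dx, by = dx)) {
  total <- total + f(x) * dx   # add the area of one thin rectangle
}
total                          # ~0.333, close to the exact value of 1/3
```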

Hugo: Absolutely, and I think another place that you've thought about a lot, and a lot of us have worked in where this rears its head is the idea of using computation and sampling and re-sampling datasets to get an idea about statistics. Right?

Allen: Right. Yeah. I think classical statistical inference, looking at things like confidence intervals and hypothesis tests, re-sampling is a very powerful tool. You're running simulations of the system, and you can compute things like sampling distribution or a p-value in a very straightforward way, meaning that it's easy to do, but it also just makes the concept transparent. It's really obvious what's going on.

Hugo: That's right, and you actually … We've had a segment on the podcast previously, which is the blog post of the week, and we had one on your blog post, There Is Only One Test, which really spells out the idea that in the world of statistical hypothesis testing, there is really only one test, and, this is one of your great points, you can actually see that when you take the sampling, re-sampling, bootstrapping approach. Right?

Allen: Right. Yeah. I think it makes the framework visible, that hypothesis tests, there's a model of the null hypothesis, and that's gonna be different for different scenarios, and there's the test statistic, and that's gonna be different for different scenarios, but once you've specified those two pieces, everything else is the same. You're running the same framework. So, I think it makes the concept much clearer.
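As a hedged illustration of that framework (a sketch in R, not Allen's own code): the model of the null hypothesis here is "shuffle the group labels", the test statistic is the difference in group means, and everything else stays the same from problem to problem.

```r
# "There is only one test": simulate the null, compute the test statistic,
# and count how often the simulated statistic is as extreme as the observed one.
set.seed(1)
group_a <- c(5.1, 6.0, 5.8, 6.3, 5.5)   # made-up data for illustration
group_b <- c(4.9, 5.2, 5.0, 5.6, 4.8)

test_stat <- function(a, b) abs(mean(a) - mean(b))
observed  <- test_stat(group_a, group_b)

pooled <- c(group_a, group_b)
n_a    <- length(group_a)

sim_stats <- replicate(10000, {
  shuffled <- sample(pooled)            # model of the null: the labels don't matter
  test_stat(shuffled[1:n_a], shuffled[-(1:n_a)])
})

mean(sim_stats >= observed)             # the p-value
```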

Hugo: Great, and we'll link to that in the show notes. We'll also link to your fantastic followup post called "There Is Still Only One Test".

Allen: Well, that's just because I didn't explain it very well the first time, so I had to try again.

How did you get into data science?

Hugo: It also proves the point, though, that there is still only one test, and I'll repeat that, that there is still only one test. So, how did you get into data science originally?

Allen: Well, my background is computer science, so there are a lot of ways, a lot of doors into data science, but I think computer science is certainly one of the big ones. I did … My master's thesis was on computer vision, so that was kind of a step in that direction. My PhD was all about measuring and modeling computational systems, so there are a lot of things that come in there like long tail distributions, and then in 2009 I did a sabbatical, and I was working at Google in a group that was working on internet performance, so we were doing a lot of measurement, modeling, statistical descriptions, and predictive modeling, so that's kind of where it started to get serious, and that's where I started when I was working on Think Stats for the first time.

Hugo: So, this origin story of you getting involved in data science I think makes an interesting point, that you've actually touched a lot of different types of data, and I know that you're a huge fan of the idea that data science isn't necessarily only for data scientists, that it actually could be of interest to everyone because it touches … There are so many touch points with the way we live and data science. Right?

Allen: Right. Yeah. This is one of my things that I get a little upset about, is when people talk about data science, and then they talk about big data, and then they talk about quantitative finance and business analytics, like that's all there is, and I use a broader notion of what data science is. I'd like to push the idea that it's any time that you're using data to answer questions and to guide decision making, because that includes a lot of science, which is often about answering questions, a lot about engineering where you're designing a system to achieve a particular goal, and of course, decision making, both on an individual or a business or a national public policy level. So, I'd like to see data science involved in all of those pieces.

Hugo: Absolutely. So, we're here to talk about uncertainty today. One part of data science is making predictions, which we'll get to, but the fact that we live in an uncertain world is incredibly interesting because what we do as a culture and a society is use probability to think about uncertainty, so I'm wondering your thoughts on whether we humans are actually good at thinking probabilistically.

Allen: Right. It's funny because we are and we are not at the same time.

Hugo: I'm glad you didn't say we probably are.

Allen: Right. Yeah. That would've been good. So, we do seem to have some instinct for probabilistic thinking, even for young children. We do something that's like a Bayesian update. When we get new data, if we're uncertain about something, we get new evidence, we update our beliefs, and in some cases we actually do a pretty good approximation of an accurate Bayesian update, typically for things that are kind of in the middling range of probability, maybe from about 25% to 75%. At the same time, we're terrible at very rare things. Small probabilities we're pretty bad at, and then there are a bunch of ways that we can be consistently fooled because we're not actually doing the math. We're doing approximations to it, and those approximations fail consistently in ways that behavioral psychologists have pointed out, things like confirmation bias and other cognitive failures like that.

"Why Are We So Surprised?""

Hugo: Absolutely. So, I want to speak to an article you wrote on your blog called Why Are We So Surprised?, in which you stated, "In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice, we are." So, I'm wondering why you think we shouldn't have been surprised.

Allen: Right. Well, a lot of the forecasts, a lot of the models coming from FiveThirtyEight and from The New York Times, they were predicting that Trump had about a 25% chance, maybe more, of winning the election. So, if something's got a 25% chance, that's the same as flipping a coin twice and getting heads twice. You wouldn't be particularly surprised by that. So, in theory a 25% risk shouldn't be surprising, but in practice, I think people still don't really understand probabilistic predictions.

Allen: One reason we can see that is the lack of symmetry, which is, if I tell you that Trump has a 25% chance of winning, you think, "Well, okay. That might happen," but when FiveThirtyEight said that Hillary Clinton had a 70% chance of winning, I think a lot of people interpreted that as a deterministic prediction, that FiveThirtyEight was saying, "Hillary Clinton is going to win," and then when that didn't happen, they said, "Well, then FiveThirtyEight was wrong," and I don't think that's the right interpretation of a probabilistic prediction. If someone tells you there's a 70% chance and it doesn't happen, that should be mildly surprising, but it doesn't necessarily mean that the prediction was wrong.

Hugo: Yeah, and in your article, you actually make a related point that everybody predicted at some level, well, predicted that Hillary had over a 50% chance of winning, and you made the point that people interpreted this as there was consensus that Hillary would win with different degrees of confidence, but that's … So, as you stated, that's interpreting it as deterministic predictions, not probabilistic predictions. Right?

Allen: Yeah, I think that's right, and it also … It fails the symmetry test again because different predictions, they ranged all the way from 70% to 99%, and people reacted as if that was a consensus, but that's not a consensus. If you flip it around, that's the range from saying that Trump has anywhere between 1% and 30% chance of winning, and if the predictions had been expressed that way, I think people would've looked at that and said, "Oh, clearly there's not a consensus there, because there's a big difference between 1% and 30%."

Hugo: I really like this analogy to flipping coins, because it puts a lot of things in perspective, and another example, as you mention in your article, The New York Times gave Trump a 9% chance of winning, and if you flip a coin four times in a row and get four heads, that's relatively surprising, but you wouldn't be like, "Oh, I can't believe that happened," and that has a 6.25% chance of happening. Right?
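The arithmetic behind those comparisons is just repeated halving, which is a one-liner to check in R:

```r
0.5^2   # 0.25: two heads in a row, roughly the 25% chance given to Trump
0.5^4   # 0.0625: four heads in a row, the 6.25% compared to the Times' 9% figure
```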

Allen: Right. Yeah, I think that's a good way to get a sense for what these probabilities mean.

Hugo: Absolutely. So, you mentioned also that these models were actually relatively credible models, so maybe you can speak to that.

Allen: Yeah. I think going in, two reasons to think that these predictions were credible, one of them was just past performance, that FiveThirtyEight and The New York Times had done well in previous elections, but maybe more important, their methodology was transparent. They were showing you all of the poll data that they were using as inputs, and I think they weren't actually publishing the algorithms, but they gave a lot of detail about how these things were working. Some polls are more believable than others. They were applying correction factors, and they also had … They were taking time into account. So, a more recent poll would be weighted more heavily than a poll that was farther into the past. So, all of those, I think ahead of the fact, we had good reasons to believe the predictions, and after the fact, even though the outcome wasn't what we expected, that really just doesn't mean that the models are wrong.

Hugo: So, with all of this knowledge around how uncertain we are about uncertainty and how we can be good and bad about thinking probabilistically, what approaches can we as a data reporting community take to communicate around uncertainty better in the future?

Allen: Right. I think we don't know yet, but one of the things that I think is good is that people are trying a lot of different things. So, again, taking the election as an example, The New York Times had the twitchy needle that was sort of famously maybe not the best way to represent that information. There were other examples. Nate Silver's predictions are based on running many simulations. So, he would show a histogram that would show the outcome of doing many, many simulations, and that I think probably works for some audiences. I think it's tough for other audiences.

Allen: One of the suggestions I made that I would love to see someone try is instead of running many simulations and trying to summarize the results, I'd love to see one simulation per day with the results of one simulation presented in detail. So, thinking back to 2016, suppose that every day you looked in the paper, and it showed you one possible outcome of the election, and let's say that Nate Silver's predictions were right, and there was a 70% chance that Clinton would win. So, in a given week, you would see Clinton win maybe four or five times. You would see Trump win two or three times, and I think at the end of that week, your intuition would actually have a good sense for that probability.
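A minimal sketch of that "one simulation per day" idea, assuming a fixed 70% win probability purely for illustration:

```r
# One simulated headline per day for a week, with a 70% win probability for Clinton
set.seed(2016)
daily_headline <- ifelse(runif(7) < 0.70, "Clinton wins", "Trump wins")
daily_headline
# Over a typical week you would see Clinton win four or five times and Trump two or three
```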

Hugo: I think that's an incredible idea, because what it speaks to for me personally is you're not really looking at these simulations or these results in the abstract. You're actually experiencing them firsthand in some way.

Allen: Exactly. So, you get the emotional effect of opening the paper and seeing that Trump won, and if that's already happened a few times in simulation, then the reality would be a lot less surprising.

Hugo: Absolutely. Are there any other types of approaches or ways of thinking that you'd like to see more in the future?

Allen: Well, as I said, I think there are a lot of experiments going on, so I think we will get better at communicating these ideas, and I think the audience is also learning, so visualizations that wouldn't have worked very well a few years ago work now because people are, I think, just better at interpreting data, interpreting visualizations, because it's become part of the media in a way that it wasn't. If you look back not that long ago, I don't know if you remember when USA Today started doing infographics, and that was a thing. People were really excited about those infographics, and you look back at those things now, and they're terrible. It'll be like-

Hugo: Mm-hmm (affirmative). We've come a long way.

Allen: It's something that's really just a bar chart, except that the bar is made up of stacked up apples and stacked up oranges, and that was data visualization, say, 20 years ago, and now you look at the things that The New York Times is doing with interactive visualizations. I saw one the other day, which is their three-dimensional visualization of the yield curve, which is a tough idea in finance and economics, and a 3-D visualization is tough, and interactive visualization is challenging, so maybe it doesn't work for every audience, but I really appreciated just the ambition of it.

Hugo: So, you mentioned the role of data science in decision making in general, and I think in a lot of ways, we make decisions based on all the data we have, and then a decision is made, but a lot of the time, the quality of the decision will be rated on the quality of the outcome, which isn't necessarily the correct way to think about these things. Right?

Allen: Right. I gave an example about Blackjack, that you can make the right play in Blackjack. You take a hit when you're supposed to take a hit, and if you go bust, it's tempting to say, "Oh. Well, I guess I shouldn't have done that," but that's not correct. You made the right play, and in the long run that's the right decision. Any specific outcome is not necessarily gonna go your way.

Hugo: Yeah, but we know that in that case because we can evaluate the predictions based on the theory we have and the simulations we have in our mind or computationally. Right? On long-term rates, essentially.

Allen: Right. Yeah. Blackjack is easy because every game of Blackjack is kind of the same, so you've got these identical trials. You've got long-term rates. We have a harder time with single-case predictions, single-case probabilities.

Hugo: Like election forecasting?

Allen: Like elections, right, but in that case, right, you can't evaluate a single prediction. You can't say specifically whether it's right or wrong, but you can evaluate the prediction process. You can check to make sure that probabilistic predictions are calibrated. So, maybe getting back to Nate Silver again, in The Signal and the Noise, he uses a nice example, which is the National Weather Service, which is, they make probabilistic predictions. They say, "20% chance of rain, 80% chance of rain," and on any given day, you don't know if they were wrong.

Allen: So, if they say 20% and then it rains, or if they say 80% and it doesn't rain, that's a little bit surprising, but it doesn't make them wrong. But in the long run, if you keep track of every single time that they say 20%, and then you count up how many times it actually rains on 20% days and how many times it rains on 80% days, if the answers are 20% and 80%, then that's a well-calibrated probabilistic prediction.
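A hedged sketch of what such a calibration check could look like in R, with simulated forecasts standing in for real weather records (the column names are made up for illustration):

```r
library(dplyr)

# Simulate a forecaster who issues 20%, 50%, and 80% rain forecasts
set.seed(3)
forecasts <- data.frame(predicted = sample(c(0.2, 0.5, 0.8), 1000, replace = TRUE))
forecasts$rained <- runif(1000) < forecasts$predicted   # a well-calibrated forecaster

forecasts %>%
  group_by(predicted) %>%
  summarise(observed_frequency = mean(rained), n = n())
# Calibration means observed_frequency tracks the predicted column in the long run
```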

Where is uncertainty prevalent in society?

Hugo: Absolutely. So, this is another example. The weather is one. We've talked about election forecasting, and these are both examples where we really need to think about uncertainty. I'm wondering what other examples in society there are where we need to think about uncertainty and why they're important.

Allen: Yep. Well, a big one … Anything that's related to health and safety, those are all cases where we're talking about risks, we're talking about interventions that have certain probabilities of good outcomes, certain probabilities of side effects, and those are other cases, I think, where sometimes our heuristics are good, and other times we make really consistent cognitive errors.

Hugo: There are a lot of cognitive biases, and one that I fall prey to constantly is, I'm not even sure what it's called, but it's when you have a small sample size, and I see something occur several times, I'm like, "Oh, that's probably the way things work."

Allen: Right. Yeah. I guess that's a form of over-fitting. In statistics, there's sort of a joke that people talk about the law of small numbers, but that's right. I think that's a version of jumping to conclusions. That's an example where I think doctors have had a version of that in the past, which is they make decisions often about treatment that are based on their own patients, so, "Such-and-such a drug has worked well for my patients, and I've seen bad outcomes with my patients," as contrasted with using large randomized trials, which we've got a lot of evidence now that randomized trials are a more reliable form of evidence than the example that you gave of generalizing from small numbers.

Hugo: So, health and safety, as you said, are two relevant examples. What can we do to combat this, do you think?

Allen: That one's tough. I'm thinking about some of the ways that we get health wrong, some of the ways that we get safety wrong. Certainly, one of the problems is that we're very bad at small risks, small probabilities. There's some evidence that we can do a little bit better if we express things in terms of natural frequencies, so if I tell you that something has a .01% probability, you might have a really hard time making sense of that, but if I tell you that it's something like one person out of 10,000, then you might have a way to picture that. You could say, "Well, okay. At a baseball game, there might be 30,000 people, so there could be three people here right now who have such-and-such a condition." So, I think expressing things in terms of natural frequencies might be one thing that helps.

Hugo: Interesting. So, essentially, these are, I suppose, linguistic technologies and adopting things that we know work in language.

Allen: Yeah, I think so. I think graphical visualizations are important, too. Certainly, we have this incredibly powerful tool, which is our vision system, that's able to take a huge amount of data and process it quickly, so that's, I think, one of the best ways to get information off a page and into someone's brain.

Hugo: Yeah. Look, this actually just reminded me of something I haven't thought about in years, but it must've been 10 or 15 years ago, I was at an art show in Melbourne, Australia, and there was an artwork which it was visualizing how many people had been in certain situations or done certain things using grains of rice. So, they had a bowl, like the total population of Australia, the total population of the US, and then the number of people who were killed during the Holocaust and the number of people who've stepped on the moon, and that type of stuff, and it was actually incredibly vivid and memorable, and you got a strong sense of magnitude there.

Allen: Yes. I think that works. There's a video I saw, we'll have to find this and maybe put in a link, about war casualties and showing a little individual person for each casualty, but then adding it up and showing colored rectangles of different casualties in different wars, the number of people from each country, and that was very effective, and then I'm reminded of XKCD has done several really nice examples to show the relative sizes of things, just by mapping them onto area on the page. One of the ones that I think is really good is different doses of radioactivity, where he was able to show many different orders of magnitude by starting with a small unit that was represented by a single square, and then scaling it up, and then scaling it up, so that you could see that there are orders of magnitude between things like dental x-rays that we really should not be worrying about, and other kinds of exposure that are actual health risks.

Uncertainty Misconceptions

Hugo: Incredible. So, what are the most important misconceptions regarding uncertainty that you think we, as data-oriented educators, need to correct?

Allen: Right. Well, we talked about probabilistic predictions. I think that's a big one. I think the other big one that I think about is the shapes of distributions, that when you try to summarize a distribution, if I just tell you the mean, then people generally assume that it's something like a bell-shaped curve, and we have some intuition for what that's like, that if I tell you that the average human being is about 165 centimeters tall, or I think it's more than that, but anyway, you get a sense of, "Okay. So, probably there are some people who are over 200, and probably there are some people who are less than 60, but there probably isn't anybody who is a kilometer tall." We have a sense of that distribution.

Allen: But then you get things like the Pareto distribution, and this is one of the examples I use in my book, is what I call Pareto World, which is same as our world, because the average height is about the same, but the distribution is shaped like a Pareto distribution, which is one of these crazy long-tailed distributions, and in Pareto World, the average height is between one and two meters, but the vast majority of people are only a centimeter tall, and if you have seven billion people in Pareto World, the tallest one is probably a hundred kilometers tall.
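A rough simulation of that contrast in R, with the Pareto parameters chosen here only to match the flavor of the description (they are not the exact parameters from the book):

```r
# A bell-shaped world versus a long-tailed Pareto world with the same theoretical mean
set.seed(42)
n <- 1e6

normal_world <- rnorm(n, mean = 1.65, sd = 0.1)   # heights in meters

alpha <- 1.2
x_min <- 0.275                                    # theoretical mean: alpha * x_min / (alpha - 1) = 1.65
pareto_world <- x_min / runif(n)^(1 / alpha)      # inverse-CDF sampling from a Pareto

median(normal_world)   # ~1.65 m: the mean describes a typical person
median(pareto_world)   # well under half a meter: a typical person is tiny
max(normal_world)      # ~2.1 m
max(pareto_world)      # kilometers tall: the long tail dominates the mean
```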

Pareto Distributions

Hugo: That's incredible, and just quickly, what types of phenomena are Pareto distributions known to model?

Allen: Right. Well, I think wealth and income are two of the big ones. In fact, I think that's the original domain where Pareto was looking at these long-tailed distributions, and that's the case where a few people have almost all of the wealth, and the vast majority of people have almost none. So, that's a case where if I tell you the mean and you are imagining a bell-shaped distribution, you have totally the wrong picture of what's going on. The mean is really not telling you what a typical person has. In fact, there may be no typical person.

Hugo: Absolutely, and in fact, that's a great example. Another example is if you have a bimodal distribution with nothing in the middle, the mean falls right where there's nothing. There could actually be no one with that particular quantity of whatever we're talking about.

Allen: Yeah, that's a good example.

Hugo: So Allen, when you were discussing the Pareto distribution and the normal distribution, then something really struck me that as stakeholders and decision makers and research scientists and data scientists, we seem to be more comfortable in thinking about summary statistics and concrete numbers instead of distribution. So what I mean by that is, we like to report the mean, the mode, the median and measures of spread such as the variance. And there seems to be some sort of discomfort we feel, and we're not great at thinking about distributions which seem kind of necessary to quantify and think about uncertainty.

Allen: No, I think that's right. It doesn't come naturally. You know, I work with students. It takes awhile to just understand the idea of what a distribution is. But I think it's important because it captures all of the information that you have about a prediction. You want to know all possible outcomes, and the probability for each possible outcome. That's what a distribution is. It captures exactly the information that you need as a decision maker.

Hugo: Exactly. So, I mean, instead of communicating, for example, P-values in hypothesis testing, we can actually show the distribution of the possible effect sizes, right?

Allen: Right, and this is the strength of Bayesian methods, because what you've got is a posterior distribution that captures this information. And if you now feed that into a decision making process, it answers all the questions that you might want to ask. If you only care about the central tendency you can get that, but very often there's a cost function that says, you know, if this value turns out to be very high, there's a cost associated with that. If it's low, there's a cost associated with that. So if you've got the whole distribution, you can feed that into a cost benefit analysis and make better decisions.

Hugo: Absolutely. And I love the point that you made, which I think about a lot of the time, and when I teach Bayesian thinking and Bayesian inference, I make this incredibly explicit all the time, that from the posterior, from the distribution, you can get out so many of the other things that you need and you would want to report.

Allen: Right, so maybe you care, you know, what's the probability of a given catastrophic output. So, in that case you would be looking at, you know, the tails of that distribution. Or something like, you know, what's the probability that I'll be off by a certain amount or again, you know, things like the mean and the spread. Whatever the number is, you can get it from the distribution.

What technologies are best suited for thinking and communicating around uncertainty?

Hugo: Absolutely. And this is actually … this leads to another question which I wanted to talk about. Bayesian inference I think of in a number of ways, as a technology that we've developed to deal with these types of questions and concepts. I think also we have reached a point in the past decades where Bayesian inference now, because of computational power we have, is actually far more feasible to do in a robust and efficient manner. And I think we may get to that in a bit. But I'm wondering in general, so what technologies, to your mind, are best suited for thinking and communicating around uncertainty, Allen?

Allen: Well, you know, a couple of the visualizations that people use all the time, and of course, you know, the classic one is a histogram. And that one, I think, is most appropriate for a general audience. Most people understand histograms. Violin plots are kinda similar, that's just two histograms back-to-back. And I think those are good because people understand them, but problematic. I mean, I've seen a number of articles of people pointing out that you kinda have to get histograms right. If the bin size is too big, then you're smoothing away a lot of information that you might care about. If the bin size is too small, you're getting a lot of noise and it can be hard to see the shape of the distribution through the noise.

Allen: So, one of the things I advocate for is using CDFs instead of histograms, or PDFs, as the default visualization. And when I'm exploring a data set, I'm almost always looking at CDFs because you get the best view of the shape of the distribution, you can see modes, you can see central tendencies, you can see spread. But also if you've got weird outliers, they jump out, and if you've got repeated values, you can see those clearly in a CDF, with less visual noise that distracts you from the important stuff. So I love CDFs. The only problem is that people don't understand them. But I think this is another case where the audience is getting educated, that the more people are consuming data journalism, the more they're seeing visualizations like this. And there's some implicit learning that's going on.

Allen: I saw one example very recently, someone showing the altitude that human populations live at. 'Cause they were talking about sea levels rising and talking about the fraction of people who live less than four meters above sea level. But the visualization was kind of a sneaky CDF, it was actually a CDF shown sideways. But it was done in a way where a person who doesn't necessarily have technical training would be able to figure out what that graph was showing. So I think that's a step in a good direction.

Hugo: I like that a lot. And just to clarify, a CDF is a cumulative distribution function?

Allen: Yes. Sorry, I should've said that.

Hugo: Yeah.

Allen: And in particular I'm talking about empirical CDFs, where you're just taking it straight from data and generating the cumulative distribution function.

Hugo: Fantastic. And one of the nice things there, so for each point on the x-axis, the y value will correspond to the proportion of data points less than or equal to that particular point. And one of the great things is, you can also read off all your percentiles, right?

Allen: Exactly, right. You can read it in both directions. So, if you start on the y-axis, you can pick the percentile you want, like the median, the 50th percentile, and then read off the corresponding x value. Or, the flip side is exactly what you said. If you want to know what fraction of the values are below a certain threshold, then you just read off that threshold and get the corresponding y-value.

Hugo: Yeah. And one of the other things that I love, you mentioned several very attractive characteristics of empirical CDFs, ECDFs. I also love that you can plot, you know, your control and a lot of different experiments just on the same figure and actually see how they differ, as opposed to when you try to plot a bunch of histograms together, you gotta do wacky transparencies and all this stuff, right?

Allen: Yes, that's exactly right. And you can stack lots of CDFs on the same axes, and the differences that you see are really the differences that matter. When you compare histograms, you're seeing a lot of noise and you can see differences between histograms that are just random. When you're looking at CDFs, you get a pretty robust view of what the differences are and where in the distribution those differences happen.
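A small sketch of that workflow in base R, using ecdf() with two made-up samples (ggplot2's stat_ecdf() does the same job):

```r
# Overlay two empirical CDFs and read the plot in both directions
set.seed(7)
control   <- rnorm(200, mean = 50, sd = 10)
treatment <- rnorm(200, mean = 55, sd = 10)

plot(ecdf(control), main = "Empirical CDFs", xlab = "Value", ylab = "Cumulative proportion")
plot(ecdf(treatment), add = TRUE, col = "blue")

quantile(treatment, 0.5)   # start at y = 0.5 and read across: the median
mean(treatment <= 60)      # start at x = 60 and read up: the fraction below that threshold
```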

Hugo: Yeah. Fantastic. Look, I'm very excited for a day in which the general populace appreciates CDFs and they appear in the mainstream media. I think that's a bright future.

Allen: Yeah, and I think we're close. I've seen one example, there have got to be more.

Hugo: Are there any other technologies or ways of thinking about uncertainty that you think are useful?

Allen: Well we talked a little bit about visualizing simulations, I think that matters. There's one example maybe getting back to … if we have to get back to the 2016 election, I think one of the issues that came up is that a lot of the predictions, when they showed you a map of the different states, they were showing a color scale where there would be a red state and a blue state, but also pink and light blue and purple. And they were trying to show uncertainty using that color map, but then that's, you know, and that's not how the electoral college works. The electoral college, every state is either all red or all blue, with just a couple of exceptions. So that was a case where the predictions ended up looking very different from what the final results looked like, and I think that's part of why we were uncomfortable with predictions and the results.

Hugo: Interesting. So what is a fix for that, do you think?

Allen: Well, again coming back to my suggestion about, you know, don't try to show me all possible simulation outcomes, but show me one simulation per day. And in that case, the result that you show me, the daily result, would be all red or all blue. So, the predictions in that sense would look exactly like the outcome. And then when you see the outcome, the chances are that it's gonna resemble at least one of the predictions that you made.

Hugo: Great. Now I just had kind of a future flash, a brainwave into a future where we can use virtual reality technologies to drop people into potential simulations. But that's definitely future music.

Allen: Yes. I think that's interesting.

What does the future of data science look like to you?

Hugo: Yeah. So speaking of the future, we've talked a lot about modern data science and uncertainty. I'm wondering what the future of data science looks like to you?

Allen: I think a big part of it looks like more people being involved. So not just highly trained technical statisticians, but we've been talking like data journalists, for example, who are people who have a technical skill to look at data, but also the storytelling skill to ask interesting questions, get answers, and then communicate those answers. I'd love to see all of that become more a part of general education, starting in primary school. Starting in secondary school, working with data, working with some of these visualizations we've been talking about. Using data to answer questions. Using data to explore and find out about the world, you know, at the stage that's appropriate at different levels of education.

Allen: There's a lot of talk about trying to get maybe less calculus in the world and more data science, and I think that's gotta be the direction we go. If you look at what people really need to know and what they're likely to use, practically everybody is going to be a consumer of data science and I think more and more people are gonna be producers of data science. So I think that's gotta be part of a core education. And calculus, I love calculus. But, it's just not as important for as many people.

Hugo: Yeah. And arguably, coming from your engineering background, calculus is incredibly important for engineers and physicists, but for other people who need to be quantitative, I think your point is very strong that learning how to actually work with data, and the statistics around that, is arguably a lot more essential.

Allen: Yeah. I think, as I said, more and more people are gonna be doing at least some kind of data science where they're taking advantage of all of the data now that's freely available, and that's, you know, government agencies are producing huge volumes of data and often they don't have the resources to really do anything with it. They've got a mandate to produce the data, but they don't have the people to do that. But the flip side of that is there's a huge opportunity for anyone with basic data skills to get in there and find interesting things. Often, you're one of the first people to explore a data set, you know, if you jump in there on the day it's published, you can find all kinds of things, not necessarily using, you know, powerful or complex statistical methods, just basic exploratory data analysis.

Hugo: Yeah, and the ability now to get, you know, learners, students, people in education institutions, involved in data science by making it or letting them realize that it's relevant to them, that there's data about their lives or about their physiological systems that they can analyze and explore, I think, is a huge win.

Allen: It is. It's really empowering, and this is one of the reasons that I … I call myself a data optimist. And what I mean by that is I think there are huge opportunities here to use data science for social good. Getting into these data sets, as you said, they are relevant to people's lives. You can find things. I saw a great example at a conference recently, I was talking to a young guy from Brazil, who had worked on an application that was going through government data that was available online and flagging evidence of corruption, evidence of budgets that were being misspent. And they would tweet about it. There was just a robot that would find suspicious things in these accounts, and tweet them out there, which is, you know, kind of transparency that I think makes governments better. So I think there's a lot of potential there.

Hugo: That's incredible. Actually, that reminded me. I met a lawyer a while ago who was non-technical and non-computational, but he was learning a bit of machine learning, a bit of Python. He was trying to figure out whether you could predict judgements handed down by the Supreme Court based on previous judgements, and who would vote in a particular way. And that's just because that's something that really interests him professionally and in terms of social justice, as well.

Allen: Right. And I think, you know, people who are not necessarily experts in that field, but amateurs for lack of a better word, can get in there and really do useful work. I think, you know, there are a lot of concerns, too. And this is getting a lot of attention right now, I'm actually in the middle of reading Weapons of Math Destruction, Cathy O'Neil's book. And there are a lot of concerns and I think there are things that are scary that we should be thinking about, but one of the things I'm actually thinking about now and trying to figure out is, how do we balance this discussion? 'Cause I think we're having, or at least starting, a good public discussion about this. It's good to get the problems on the table and address them, but how do we get the right balance between the optimism that I think is appropriate and the concerns that we should be dealing with?

Hugo: Yeah, absolutely. And as you say, there are more and more books being published, more and more conversations happening in public. I mean, it's just in the past several weeks that Mike Loukides, Hilary Mason, and DJ Patil have posted their series of articles on data ethics and what they would like to see adopted in culture and in tech, among other places. I do think Weapons of Math Destruction is very interesting as part of this conversation, because of course one of the key parts of Cathy O'Neil's definition of a Weapon of Math Destruction is that it's not transparent, right? So all the cases we're talking about kind of involve necessary transparency, so if we see more of that going forward, we'll at least be able to have a conversation around it.

Allen: Right, and I agree with both O'Neil and with you. I think that's a crucial part of these algorithms, and, you know, open science and reproducible science is based on transparency and open data, and, you know, also open code and open methodology.

Hugo: Absolutely. And this actually brings me to another question, which is a through line here is, the ability of everybody, every citizen to interact with data science in some sense. And I'm wondering for you in your practice, and as a data scientist and an educator, what is the role of the open source in the ability of everybody to interact with data science?

Allen: Right, I think it's huge. You know, reproducible science doesn't work if your code is proprietary. If you, you know, if you only share your data but not your methods, that only goes so far. It also doesn't help very much if I publish my code but it's in a language that's not accessible to everybody, you know, languages that are very expensive to get your hands on. Even among relatively affluent countries, you're not necessarily gonna have access to that code. And then when you go worldwide, there's, you know, a great majority of people in the world that are not gonna have access to that, as contrasted with languages like R and Python that are freely available. Now, you still have to have access to technology, and that's not universal, but it's better, and I think free software is an important part of that.

Hugo: Yeah.

Allen: This is, you know, part of the reason that I put my books up under free licenses: I know that there are a lot of people in the world who are not gonna buy hard copies of these books, but I want to make them available, and I do, you know, get a lot of correspondence from people who are using my books in electronic forms, who would not have access to them in hard copy.

Favorite Data Science Technique

Hugo: So, Allen, we've talked about a bunch of techniques that are dear to your heart. I'm wondering what one of your favorite data science-y techniques or methodologies is.

Allen: Right. I have a lot.

Hugo: Let's do it.

Allen: This might not be a short list.

Hugo: Sure.

Allen: So I am at heart a Bayesian. I do a certain amount of computational inference, you know, you do in classical statistical inference, but I'm really interested in helping Bayesian methods spread. And I think one of the challenges there is just understanding the ideas. It's one of these ideas that seems hard when you first encounter it, and then at some point there's a breakthrough, and then it seems obvious. Once you've got it, it is such a beautiful simple idea that it changes how you see everything. So that's what I want to help readers get to, and my students, is get that transition from the initial confusion into that moment of clarity.

Allen: One of the methods I use for that, and this is what I use in Think Bayes a lot, is just grid algorithms where you take everything that's continuous and break it up into discrete chunks, and then all the integrals become for loops, and I think it makes the ideas very clear. And then I think the other part of it that's important is the algorithms, particularly MCMC algorithms, which, you know, that's what makes Bayesian methods practical for substantial problems. You mentioned earlier that, you know, the computational power has become available. And that's a big part of what makes Bayes practical. But I think the algorithms are just as important, and particularly when you start to get up into higher dimensions. It's just not feasible without modern algorithms that are really quite new, developed in the last decade or so.
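A hedged sketch of the grid idea in R, estimating a coin's probability of heads from made-up data (this mirrors the style of the approach rather than reproducing any code from Think Bayes):

```r
# Grid approximation of a Bayesian update for a coin's probability of heads
grid  <- seq(0, 1, by = 0.001)         # discretize the continuous parameter
prior <- rep(1, length(grid))          # uniform prior over the grid

heads <- 140
flips <- 250
likelihood <- dbinom(heads, size = flips, prob = grid)

posterior <- prior * likelihood
posterior <- posterior / sum(posterior)   # the "integral" is just a sum

sum(grid * posterior)                     # posterior mean, ~0.56
grid[which.max(posterior)]                # most probable value
```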

Hugo: Yeah. And I just want to speak to the idea of grid methods and, as you said, integrals becoming for loops. And I think this is something which has actually been behind a lot of what we've been discussing, and something that actually attracted me to your pedagogy initially and all of your work: this idea of turning math into computation. And we see the same with techniques such as the bootstrap and resampling, taking concepts that seem, you know, relatively abstract and seeing how they actually play out in a computational structure and making that translational step there.

Allen: Right. Yeah, I've found that very powerful for me as a learner. I've had that experience over and over, of reading something expressed using mathematical concepts, and then I turn it into code and I feel like that's how I get to understand it. Partly because you get to see it happening, often it's very visual in a way that the math is not, at least for me. But the other is it's debuggable. That if you have a misunderstanding, then when you try to represent it in code, you're gonna see evidence of the misunderstanding. It's gonna pop up as a bug. So, when you're debugging your code, you're also debugging your understanding. Which, for me, builds the confidence that when I've got working code, it also makes me believe that I understand the thing.

Hugo: Absolutely, and a related concept is the idea that breaking it down into chunks of code allows you to understand smaller concepts and build up the entire concept in smaller steps.

Allen: Right, yeah. I think that's a good point, too.

Hugo: Great. So, are there any other favorite techniques? You can have one or two more if you'd like.

Allen: I'll mention one which is survival analysis. And partly because it doesn't come up in an introductory class most of the time, but it's something I keep coming back to. I've used it for several projects, not necessarily looking at survival or medicine, but things like a study I did of how long a marriage lasts. Or, how long it is until someone has a first child, or gets married for the first time, or how long the marriage itself lasts until a divorce. So, as I say, it's not an idea that everybody sees, but once you learn it, you start seeing a lot of applications for it.
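For readers who haven't met it, here is a minimal Kaplan-Meier sketch with R's survival package and its built-in lung dataset (a generic example, not one of the marriage studies Allen mentions):

```r
library(survival)

# Kaplan-Meier estimate of time to death in the built-in lung cancer data, by sex
fit <- survfit(Surv(time, status) ~ sex, data = lung)

summary(fit, times = c(180, 365))   # survival probabilities at roughly 6 and 12 months
plot(fit, xlab = "Days", ylab = "Proportion still alive")
```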

Hugo: Absolutely. And this did make it into your Think Stats book, do I recall correctly, or?

Allen: Yes. Yeah, I've got a section on survival analysis.

Call to Action

Hugo: Yeah, fantastic. So I'll definitely link to that in the show notes, as well. So, my last question is, do you have a call to action for our listeners out there?

Allen: Maybe two. I think if you have not yet had a chance to study data science, you should. And I think there are a lot of great resources that are available now that just weren't around not too long ago. And especially if you took a statistics class in high school or college, and it did not connect with you, the problem is not necessarily you. The standard curriculum in statistics for a long time I think has just not been right for most people. I think it's just spent way too much time on esoteric hypothesis tests. It gets bogged down in some statistical philosophy that's actually not very good philosophy, and not very good science.

Allen: If you come back to it now from a data science point of view, it's much more likely that you're gonna find classes and educational resources that are much more relevant. They're gonna be based on data. They're gonna be much more compelling. So give it another shot. I think that's my first call to action.

Hugo: I would second that.

Allen: And then the other is, for people who have got data science skills, there are a lot of ways to use that to do social good in the world. I think a lot of data scientists end up doing, you know, quantitative finance and business analytics, those are kinda the two big application domains. And there's nothing wrong with that, but I also think there are a lot of ways to use the skills that you've got to do something good, to, you know, find stories about what's happening and get those stories out. To, you know, use those stories as a way to effect change. Or if nothing else, just to answer questions about the world. If there's something that interests you, very often you can find data and answer questions.

Hugo: And there are a lot of very interesting data for social good programs out there, which we've actually had Peter Bull on the podcast to talk about data for good in general, and I'll put some links in the show notes as well.

Allen: Yes, and then I've got actually a talk that I want to link to that I've done a couple of times, and it's called Data Science, Data Optimism. And the last part of the talk is my call for data science for social good. I've got a bunch of links there that I've collected, that are just really the people that I know and groups that I know who are working in this area, but it's not complete by any means. So I would love to hear more from people, and maybe help me to expand my list.

Hugo: Fantastic. And people can reach out to you on Twitter, as well? Is that right?

Allen: Yes. I'm Allen Downey.

Hugo: Fantastic. Allen, it's been an absolute pleasure having you on the show.

Allen: Thank you very much. It's been great talking with you.


Save On an Annual DataCamp Subscription (Less Than 2 Days Left)

Posted: 24 Sep 2018 06:23 AM PDT

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

DataCamp is now offering a discount on unlimited access to their course curriculum. Access over 170 courses in R, Python, SQL and more, taught by experts and thought leaders in data science such as Mine Cetinkaya-Rundel (RStudio), Hadley Wickham (RStudio), Max Kuhn (caret) and more. Check out this link to get the discount!

Below are some of the tracks available. You can choose a career track which is a deep dive into a subject that covers all the skills needed. Or a skill track which focuses on a specific subject.

Tidyverse Fundamentals (Skill Track)
Experience the whole data science pipeline from importing and tidying data to wrangling and visualizing data to modeling and communicating with data. Gain exposure to each component of this pipeline from a variety of different perspectives in this tidyverse R track.

Finance Basics with R (Skill Track)
If you are just starting to learn about finance and are new to R, this is the right track to kick things off! In this track, you will learn the basics of R and apply your new knowledge directly to finance examples, start manipulating your first (financial) time series, and learn how to pull financial data from local files as well as from internet sources.

Data Scientist with R (Career Track)
A Data Scientist combines statistical and machine learning techniques with R programming to analyze and interpret complex data. This career track gives you exposure to the full data science toolbox.

Quantitative Analyst with R (Career Track)
In finance, quantitative analysts ensure portfolios are risk balanced, help find new trading opportunities, and evaluate asset prices using mathematical models. Interested? This track is for you.

And much more – the offer ends September 25th so don't wait!

About DataCamp:
DataCamp is an online learning platform that uses high-quality video and interactive in-browser coding challenges to teach you data science using R, Python, SQL and more. All courses can be taken at your own pace. To date, over 2.5 million data science enthusiasts have already taken one or more courses at DataCamp.


Data Science With R Course Series – Week 2

Posted: 23 Sep 2018 11:04 PM PDT

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Data Science and Machine Learning in business begins with R. Why? R is the premier language that enables rapid exploration, modeling, and communication in a way that no other programming language can match: SPEED! This is why you need to learn R. Time is money, and, in a world where you are measured on productivity and skill, R is your machine-learning powered productivity booster.

In this Data Science With R Course Series, we'll cover what life is like in our ground-breaking, enterprise-grade course called Data Science For Business With R (DS4B 201-R). The objective is to experience the qualities that make R great for business by following a real-world data science project. We review the course that will take you to an advanced level in 10 weeks.

In this article, we'll cover Week 2: Business Understanding, which is where we begin coding in R using exploratory techniques with the goal of sizing the business problem.

But, first, a quick recap of our trajectory and the course overview.

Data Science With R Course Series

You're in Week 2: Business Understanding. Here's our game plan over the next 10 articles in this series. We'll cover how to apply data science for business with R following our systematic process.

  • Week 1: Getting Started
  • Week 2: Business Understanding (You're Here)
  • Week 3: Data Understanding
  • Week 4: Data Preparation
  • Week 5: Predictive Modeling With H2O
  • Week 6: H2O Model Performance
  • Week 7: Machine Learning Interpretability With LIME
  • Week 8: Link Data Science To Business With Expected Value
  • Week 9: Expected Value Optimization And Sensitivity Analysis
  • Week 10: Build A Recommendation Algorithm To Improve Decision Making

Week 2: Business Understanding

Course and Problem Overview

Data Science For Business With R (DS4B 201-R) is a one-of-a-kind course designed to teach you the essential aspects for applying data science to a business problem with R.

We analyze a single problem: Employee Turnover, which is a $15M per year problem to an organization that loses 200 high performing employees per year. It's designed to teach you techniques that can be applied to any binary classification (Yes/No) problem such as:

  • Predicting Employee Turnover: Will the employee leave?

  • Predicting Customer Churn: Will the customer leave?

  • Predicting Risk of Credit Default: Will the loan applicant or company default?

Here's why our students consistently give it a 9 of 10 for satisfaction rating:

  • It's based on real-world experience

  • You apply our systematic framework that cuts project times in half. Refer to this testimonial from our student.

  • We focus on return on investment (ROI)

  • We cover high performance R packages: H2O, LIME, tidyverse, recipes, and more.

  • You get results!

DS4B 201-R, Course Overview

Next, let's experience what life is like in Week 2: Business Understanding.

Week 2: Business Understanding

Week 2 is where we begin our deep dive into data science for business. In Business Understanding, we learn how to size the business problem with the Business Science Problem Framework (BSPF), streamline repetitive exploratory code with Tidy Eval, and visualize the problem with ggplot2.

The first thing you'll do is log into Business Science University, and move to the Week 2 Module, which looks like this.

DS4B 201-R Week 2 Module

Week 2: Business Understanding Module, DS4B 201-R Course

We'll begin by analyzing the problem in R in the section titled, Problem Understanding with the BSPF.

Understand the problem using R Code and BSPF

Sizing the business opportunity or cost is OVERLOOKED by most data scientists. If the cost / benefit to the organization is not large, it's not worth your time. We need to be efficient, which is our second focus. ROI is first, efficiency is second.

If the cost / benefit to the organization is not large, it's not worth your time.

To size the problem, we lean on a tool we learned about in Week 1: The Business Science Problem Framework (BSPF). Specifically, you'll learn to:

  • View the business as a machine
  • Understand the drivers
  • Measure the drivers

Business Science Problem Framework (BSPF)

Walking Through The Business Science Problem Framework (BSPF)

As we walk through the BSPF, we focus our efforts on identifying (1) if the organization has a problem and (2) how large that problem is. We investigate:

  • How many high performance employees are turning over

  • What the true cost of their turnover is, converting the Excel calculation into a scalable R calculation (a rough sketch of this kind of function follows the list)

  • Key Performance Indicators (KPIs) for turnover

  • Potential drivers including common cohorts: Job Department and Job Role
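
To give a feel for what "converting the Excel calculation to a scalable R calculation" can look like, here is a minimal sketch. The function name, salary figure, and cost components below are made-up illustrations (chosen only so the defaults land near the $15M figure quoted above); they are not the course's actual numbers:

calculate_attrition_cost <- function(n_employees         = 200,
                                     salary_avg          = 80000,
                                     productivity_factor = 0.75,
                                     hiring_cost_per_emp = 15000) {
    # Lost productivity while the seat is empty or a new hire ramps up,
    # plus direct recruiting and onboarding costs (illustrative only)
    cost_per_employee <- salary_avg * productivity_factor + hiring_cost_per_emp
    n_employees * cost_per_employee
}

calculate_attrition_cost()                  # total annual cost, all defaults
calculate_attrition_cost(n_employees = 50)  # cost for a smaller cohort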

Here's a sample lecture showing what the code experience is like: "View the Business as a Machine".

View the Business As A Machine Lecture

As we go through the process of understanding and sizing the business problem, we realize that we are performing the same calculations repetitively. Any time code becomes repetitive, we should create a function. Next, we'll learn about a powerful set of tools for building tidy functions that reduce and simplify repetitive code: Tidy Eval.

Streamline repetitive employee attrition code using Tidy Eval

To this point you've sized the problem and even determined that it is larger in certain cohorts of the organization. Through this exploratory process, you've repeated the same code multiple times. Now it's time to streamline this code workflow with a powerful set of tools called Tidy Eval.

Tidy Eval

Learning Tidy Eval To Simplify Code Steps Repeated Frequently

You will use or create several functions that implement Tidy Eval and rlang, including the following (a rough sketch of the underlying pattern appears after the list):

  • count(): Summarizes the counts of grouped columns. Implemented in dplyr.
  • count_to_pct(): Converts counts to percentages (proportions). You create this.
  • assess_attrition(): Filters, arranges, and compares attrition rates to KPIs. You create this.
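
Here is a rough sketch of the tidy-eval pattern these functions rely on. This is an illustrative version of count_to_pct(), not the course solution, and the hr_data in the usage comment is hypothetical:

library(dplyr)
library(rlang)

count_to_pct <- function(data, ..., col = n) {
    # Capture the grouping columns and the count column as quosures,
    # then splice / unquote them back into the dplyr verbs
    grouping_vars <- quos(...)
    col_expr      <- enquo(col)

    data %>%
        group_by(!!! grouping_vars) %>%
        mutate(pct = (!! col_expr) / sum(!! col_expr)) %>%
        ungroup()
}

# Usage on counted data, e.g.:
# hr_data %>% count(Department, Attrition) %>% count_to_pct(Department)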

Armed with this streamlined code workflow, it's now time to visualize the problem using the ggplot2 library.

Visualize employee turnover with ggplot2

The best way to grab an executive decision maker's attention is to show him or her a business-themed plot that conveys the problem. In this section, we cover exactly how to do so using the ggplot2 package.

ggplot2

Using ggplot2 to create an impactful visualization of the problem

Next, you learn how to create a plotting function that can flexibly handle various grouped data within your code workflow.

Make our first custom plotting function, plot_attrition()

Once again, we're repetitively reusing code to plot different variations of the same information. In this section, we teach you how to create a custom plotting function called plot_attrition() that flexibly handles grouped features including the employee's Department and Job Role.

ggplot2

Create a flexible plotting function, plot_attrition()
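
As a flavour of what such a function can look like, here is a simplified, hypothetical sketch of a flexible plotting function built on tidy eval and ggplot2. The course's plot_attrition() has more options; hr_data and attrition_cost in the usage comment are placeholders:

library(dplyr)
library(ggplot2)
library(rlang)

plot_attrition_sketch <- function(data, group_col, value_col = n) {
    group_expr <- enquo(group_col)
    value_expr <- enquo(value_col)

    plot_data <- data %>%
        group_by(!! group_expr) %>%
        summarise(value = sum(!! value_expr)) %>%
        ungroup()

    # Rename the grouping column with base R to keep the sketch simple
    names(plot_data)[1] <- "group"

    ggplot(plot_data, aes(x = reorder(group, value), y = value)) +
        geom_col(fill = "#2c3e50") +
        coord_flip() +
        labs(x = NULL, y = "Attrition cost", title = "Attrition cost by group")
}

# e.g. hr_data %>% count(Department, wt = attrition_cost) %>%
#        plot_attrition_sketch(Department, value_col = n)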

By now, you have a serious set of dplyr and ggplot2 investigative skills. Next, we put them to use with your first challenge!

Challenge #1

Your first challenge is something that happens in the real world: your Subject Matter Experts (SMEs), in this case the Accounting and Human Resources departments, have provided you new data at a more granular level, which will make your analysis more accurate. Your job is to integrate the new information into your analysis. Are you up to the challenge?

DS4B 201-R: Challenge #1

Now It's Your Turn To Apply Your Knowledge!

At the end of the module, the challenge solution is provided for the learners along with the full code used in the course.

New Course Coming Soon: Build A Shiny Web App!

You're experiencing the magic of creating a high performance employee turnover risk prediction algorithm in DS4B 201-R. Why not put it to good use in an Interactive Web Dashboard?

In our new course, Build A Shiny Web App (DS4B 301-R), you'll learn how to integrate the H2O model, LIME results, and recommendation algorithm built in the 201 course into an ML-powered R + Shiny web app!

Shiny Apps Course Coming in October 2018!!! Sign up for Business Science University Now!


DS4B 301-R Shiny Application: Employee Prediction

Building an R + Shiny Web App, DS4B 301-R

Get Started Today!

To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.


A Subtle Flaw in Some Popular R NSE Interfaces

Posted: 23 Sep 2018 10:46 PM PDT

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

It is no great secret: I like value oriented interfaces that preserve referential transparency. It is the side of the public debate I take in R programming.

"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."

Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 1149, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).

Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.

Most of my work on non-standard-evaluation (NSE or code capturing interfaces, alternatives to value oriented or referentially transparent interfaces) is to help eliminate and contain use of NSE. For instance wrapr::let() is designed to adapt existing NSE interfaces into standard value oriented interfaces, not to create new NSE interfaces. I pretty much feel composing one NSE interface into another NSE interface is cleaning up one mess by adding another mess. If wrapr::let() fails to clean up a mess: it can be evidence of a limitation of wrapr::let(), evidence of user error, or evidence the initial mess was already a problem.

Let's take a look at a standard interface. For example we can use base-R to select a column from a data.frame by name (technically we are building a new data.frame with the selected column(s)). The clarity of the value oriented notation is most valuable when exposed to potentially unclear code (we most value help when we most need help). For example: suppose the user is selecting the column named x, and their selection is in a poorly-named variable called y (technically values are not "in" variables, but there are environments associating values with names).

y <- "x"  d <- data.frame(x = 1, y = 2)    d[, y, drop= FALSE] # returns a data.frame
##   x  ## 1 1
d[[y]]  # returns a column
## [1] 1

And if the column we are looking for ("x") is not present we still get reliable and easily predictable results.

y <- "x"  d <- data.frame(y = 2)    d[, y, drop = FALSE] # exception, signalling no such column
## Error in `[.data.frame`(d, , y, drop = FALSE): undefined columns selected
d[[y]]  # NULL, signalling no such column
## NULL

Beyond the uninformative choice of variable name there was nothing confusing or dangerous about the above code.

Now let's look at a non-standard or code capturing interface: "$".

y <- "x"  d <- data.frame(x = 1, y = 2)    d$y
## [1] 2
d <- data.frame(x = 1)  d$y
## NULL

Notice both of these operations used the name "y" (captured from user code) to try and retrieve a column named "y", regardless of what the value referred to by the variable "y" is. In these cases the interface is unambiguously taking the name from the user code. This non-standard interface is convenient and reliable, and has an obvious standard analogue in "[[]]", so it does not present problems.

Now let's look at a more problematic non-standard interface: base::subset().

y <- "x"  d <- data.frame(x = 1, y = 2)    subset(d, select = y)
##   y  ## 1 2

subset() appears to be using the name found in the code to look up columns. This is not quite the case: from help(subset) we see the select argument is defined as "expression, indicating columns to select from a data frame." What may not be obvious from a cursory read of the documentation is that these expressions are evaluated in a new environment that maps the data.frame column names to integers (a clever way to replace column names with column indices), but with the current execution environment as the enclosure (via a call to parent.frame()). This subtlety means that any names that are not column names of the data.frame are evaluated as expressions in the caller's environment. This in turn means they are replaced by values from this environment, which leads to very confusing outcomes such as the following.

y <- "x"  d <- data.frame(x = 1)    subset(d, select = y)
##   x  ## 1 1

Notice the column "x" was returned (not a NULL). Essentially each specified column name gets two chances to resolve to a data.frame column name or index: first from an environment mapping data.frame names to indices, and second from the calling environment.
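
To make the double look-up concrete, here is a small stand-alone sketch (mine, not from the original post) of what that evaluation amounts to:

d  <- data.frame(x = 1)
y  <- "x"                    # a variable in the calling environment

nl <- as.list(seq_along(d))  # list(x = 1L), standing in for the column map
names(nl) <- names(d)

# "y" is not among the column names, so the look-up falls through to the
# enclosing environment, where y resolves to the value "x" -- which
# subset() then happily treats as a column selection.
eval(quote(y), nl, environment())
## [1] "x"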

The above issue would be fixed if the expression evaluation occurred in a limited environment where only a few symbols (such as "-" and "c") were defined. Such a function is given below:

subset_strict <- function(x, subset, select, drop = FALSE) {
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    e <- substitute(subset)
    r <- eval(e, x, parent.frame())
    if (!is.logical(r))
      stop("'subset' must be logical")
    r & !is.na(r)
  }
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    env <- new.env(parent = emptyenv())
    for (sym in c("-", "c", ":")) {
      assign(sym, get(sym, envir = parent.frame()), envir = env)
    }
    eval(substitute(select), nl, env)
  }
  x[r, vars, drop = drop]
}

This function is stricter than base::subset(), reliably signalling errors when non-columns are requested.

y <- "x"  d <- data.frame(x = 1)    subset_strict(d, select = y)
## Error in eval(substitute(select), nl, env): object 'y' not found
d <- data.frame(x = 1, y = 2, z = 3)  subset_strict(d, select = y)
##   y  ## 1 2
subset_strict(d, select = -y)
##   x z  ## 1 1 3
subset_strict(d, select = c(x, y))
##   x y  ## 1 1 2
subset_strict(d, select = c(x:z))
##   x y z  ## 1 1 2 3

Roughly I would say the above issue is a reason to not use base::subset() when you are not around to check if it is correct (i.e. in scripts or packages). And help(subset) has some text about this:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

If you don't use base::subset() you would think this is a non-issue. Except a similar mal-feature seems to be also in dplyr::select(), and dplyr documentation doesn't currently seem to advise not using dplyr in non-interactive environments.

help(select, package = "dplyr") tells us that select() takes positional arguments from "..." and that they are to be "One or more unquoted expressions separated by commas. You can treat variable names like they are positions." What is unclear is: what environment "..." is actually evaluated in.

R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
library("dplyr")  packageVersion("dplyr")
## [1] '0.7.6'
packageVersion("rlang")
## [1] '0.2.2'
packageVersion("tidyselect")
## [1] '0.2.4'

Now consider the following code:

y <- "x"    data.frame(x = 1) %>%     select(y)

My theory is the typical dplyr user expectation is that the above should throw an exception (as that is what select(data.frame(x = 1), z) does in our clean environment). My opinion is: the above expresses that we are asking for a column named "y", and there is no such column in the data. The unfortunate coincidence that "y" has a value in our environment should be irrelevant to a select(). "y"'s value might (or might not) be relevant in a complex expression, such as a mutate()– but not to a select().

Yet the result actually is as before: the value that "y" refers to in the executing environment can unfortunately alter the result (depending on the columns present in the data.frame).

y <- "x"    data.frame(x = 1) %>%     select(y)
##   x  ## 1 1

Notice we got back a data.frame with a column named "x". There was no warning or error indicated. This wrong value could poison later results (especially if we had used pull() to convert the data into an unnamed vector). Maybe your code has never run into this, or maybe it already quietly has (as it is not a signalling error).

Notice how the above result differs from the following:

y <- "x"    data.frame(x = 1, y = 2) %>%     select(y)
##   y  ## 1 2

The name/string version of this issue possibly goes back to dplyr 0.7.0 (the introduction of rlang). A numeric-index version of the issue can be exhibited for dplyr 0.5.0 and dplyr 0.4.0 (and possibly earlier, dplyr 0.4.0 is the earliest version of dplyr that is convenient to install in R 3.5.0). Notice how results vary with data changes below.

# restart R
install.packages("dplyr_0.4.0.tar.gz", repos = NULL)
# restart R
packageVersion("dplyr")
# [1] '0.4.0'
library("dplyr")

y <- "x"

data.frame(x = 1, y = 2) %>%
  select(y)
#   y
# 1 2

data.frame(x = 1) %>%
  select(y)
# Error: All select() inputs must resolve to integer column positions.
# The following do not:
# *  y

y <- 1L

data.frame(x = 1, y = 2) %>%
  select(y)
#   y
# 1 2

data.frame(x = 1) %>%
  select(y)
#   x
# 1 1

# restart R
install.packages("dplyr")
# restart R

It is possible this is in fact designed or intentional behavior. However, if that is the case it still violates some principles of safety through isolation and least astonishment.

In dplyr 0.7.6: select(.data[[y]]) shows the same issue, while select(.data[[!!y]]) does not (and select(.data$y) appears to always use the name "y", which seems right). I suspect ".data[y]", while commonly taught, may not be well-formed dplyr code (i.e. the user should not type that so the results are possibly user error, and not package error).

In conclusion:

  • base::subset() in addition to having known issues with the non-standard evaluation of its subset argument has a dangerous double look-up mal-feature in its select argument.
  • dplyr::select() has variations of a similar mal-feature. In the presence of this issue the only safe way to use name-capturing code with dplyr::select() appears to be select("y") or select(.data$y) (a short sketch follows this list), which are both inconvenient enough to defeat the purpose of non-standard evaluation (the eliding of quotation marks in this case).
  • It is possible select(.data[[y]]) is not well-formed dplyr code, or is ambiguous (is it meant to operate on names, on values, or both?), or has a similar bug.
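
For completeness, here is a short sketch of the two work-arounds named above (my example; the behaviour described in the comments is as I understand it for the dplyr 0.7.x series discussed here):

library(dplyr)

y <- "x"
d <- data.frame(x = 1, y = 2)

d %>% select("y")       # string literal: the column literally named "y"
d %>% select(.data$y)   # .data pronoun: also the column named "y"

# Neither form consults the caller's variable `y`, so a missing column
# should surface as an error rather than silently selecting `x`.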

I think the above shows a small bit of how NSE interfaces can be confusing and hard to get right both for users and package developers. This is why I advise avoiding introducing NSE interfaces unless they are buying you something much larger than leaving out a few quote marks.

"Those who would give up essential Referential Transparency, to purchase a little temporary Notational Conveneince, deserve neither Referential Transparency nor Notational Conveneince."

(not) Ben Franklin

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


What Are We Plotting, What Are We Animating

Posted: 23 Sep 2018 05:00 PM PDT

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

This is my first blog post about gganimate — a package I've been working on since mid-spring this year. I have many thoughts and lots to say about animation and gganimate, so much in fact that it has seemed too big a task to begin writing about. Further, I felt like I had to spend my time developing the thing in the first place.

So this is an alternative entrance into writing about gganimate — sort of a tech-note about a specific problem. There will still come a time for some more formal writing about the theory and use of gganimate but until then I'll refer to my useR keynote for any words on my thoughts behind it all.

The Problem

When we animate data visualisations we often do it by calculating intermediary data points resulting in a smooth transition between the states represented by the raw data. In gganimate this is done by adding a transition which defines how data should be expanded across the animation frames. Underneath it all most transitions calculate intermediary data representations using tweenr and transformr — so far, so good.

What we have glanced over, and what is at the center of the problem, is what state of the data we decide to use as basis for our expansion. If you are not familiar with ggplot2 and the grammar of graphics this might be a strange phrasing — data is data — but if you are, you'll know that data can undergo several statistical transformations before it is encoded into a visual property and put on paper (or screen). Some of the states the data undergo are:

  1. Raw data as it is passed into the plotting function
  2. Raw data with only the columns mapped to aesthetics present
  3. Data transformed by a statistic
  4. Data with aesthetics mapped to a scale
  5. Data with default aesthetic values added
  6. Data transformed by the geom

If you prepare your data for animation beforehand (e.g. using tweenr), you're only able to touch the data at the first state and thus limited in what you can do. If there is a one-to-one mapping between the raw data and the final visual encoding this might not be a problem, but it breaks down spectacularly when the statistic transformation imposes a grouping of the data into a shared visual encoding, e.g. a box-plot. Consider the task of calculating intermediary data for a transition from one box-plot showing statistics for 10 points, to another box-plot showing statistics for 15 points. If you could only use the raw data your atomic observations would suddenly have to change from 10 to 15 values in a smooth manner. On the other hand, if you could calculate the statistics used to draw the two box-plots and then calculate intermediary statistics instead, this discrepancy in the underlying data would not pose any problem. Indeed, the latter approach is what is done in gganimate — all data expansion is performed after statistics have been calculated. In fact, all expansion is done when data has reached state 5.

Why wait so long? A simple example to explain this is the case of colour (or fill) aesthetics. If they are mapped to a categorical variable there will be no way to create a smooth transition based on the raw data. On the other hand, if we wait until the raw data has been mapped to its final colour value, we may smoothly transition the colour itself, ignoring the fact that the intermediary colours do not correspond to any meaningful category in the raw data.
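
As a tiny illustration of that last point (my example, not from the post): two categorical levels have no meaningful "in between", but once they are mapped to colours the interpolation becomes trivial, for instance with the scales package:

library(scales)

cols <- hue_pal()(2)                # default ggplot2 hues for two levels
ramp <- colour_ramp(cols)           # interpolating function over [0, 1]
ramp(seq(0, 1, length.out = 5))     # intermediary colours for a transition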

The Curious Case of Tesselation

So, "what is the problem?", you may ask. Indeed, this approach is almost universally good, to the extend that you might just ignore the existence of other approaches… But the devils in the detail — let's make a plot:

library(ggplot2)
library(ggforce)

data <- data.frame(
  x = runif(20),
  y = runif(20),
  state = rep(c('a', 'b'), 10)
)

ggplot(data, aes(x = x, y = y)) +
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1)) +
  geom_point() +
  facet_wrap(~state)

Now, think about what you would expect a transition between the two panels to look like – my guess is that it is nothing like below:

library(gganimate)

ggplot(data, aes(x = x, y = y)) +
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1)) +
  geom_point() +
  transition_states(state, transition_length = 3, state_length = 1) +
  ease_aes('cubic-in-out')

Okay, what is going on? To be honest I had a different expectation about how this would fail when I started writing this. The reason why the voronoi tiles are static (and calculated based on all the points) is that the voronoi tessellation is calculated on the full panel data. At the time the voronoi tile statistic receives the data it all just belongs to the same panel since gganimate differentiates states using the group aesthetic. To show you how I expected this example to break down we'll have to tell the voronoi stat to tessellate based on the groups instead:

ggplot(data, aes(x = x, y = y)) +
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1),
                    by.group = TRUE) +
  geom_point() +
  transition_states(state, transition_length = 3, state_length = 1) +
  ease_aes('cubic-in-out')

Now, at least it is wrong in the way that I expected it to be. Why is this wrong? The tessellation stat outputs polygon data that is then drawn by a polygon geom, so gganimate does the best it can to transition these polygons smoothly between the states. In this example this is not what we expected though. We expect a tessellation to always be true, even during the transition so the tessellation should be calculated for each frame, based on intermediary point positions. In other words, here we want the expansion to happen on the raw data.

library(tweenr)
library(magrittr)

data <- split(data, data$state)

data <- tween_state(data[[1]], data[[2]], 'cubic-in-out', 40) %>%
  keep_state(10) %>%
  tween_state(data[[1]], 'cubic-in-out', 40) %>%
  keep_state(10)

ggplot(data, aes(x = x, y = y)) +
  geom_voronoi_tile(fill = 'grey', colour = 'black', bound = c(0, 1, 0, 1),
                    by.group = TRUE) +
  geom_point() +
  transition_manual(.frame)

Ah, we have finally arrived at the expected animation, but what a mess of a journey.

Who Plots Tesselation Anyway?

You may think the above example is laughably construed — this may even be the first time you've heard of voronoi tessellation. Hold my beer, because it is about to get even worse, even using a geom from ggplot2 itself. We'll start with a plot again:

data <- data.frame(
  x = c(rnorm(50, mean = 5, sd = 3), rnorm(40, mean = 2, sd = 1)),
  y = c(rnorm(50, mean = -2, sd = 7), rnorm(40, mean = 6, sd = 4)),
  state = rep(c('a', 'b'), c(50, 40))
)

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') +
  facet_wrap(~state)

And how might this look if we transition between a and b?

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') +
  transition_states(state, transition_length = 3, state_length = 1) +
  ease_aes('cubic-in-out')

Oh my… The problem is more or less the same as with the tessellation – the stat creates a primitive data representation (here paths and not polygons) and gganimate does its best at transitioning those, but in doing this the intermediary frames do not resemble contour lines at all, but rather a bowl of spaghetti.

So, could we fix it in the same way? Just prepare the data beforehand. Well, not really as we run into the first problem discussed, way up at the beginning of the blog. There is really no meaningful way of transitioning 50 points into 40. We could remove 10 and move the remaining 40, but in terms of the derived density this would look messy (but let's try anyway):

data2 <- split(data, data$state)

data2 <- tween_state(data2[[1]], data2[[2]], 'cubic-in-out', 40) %>%
  keep_state(10) %>%
  tween_state(data2[[1]], 'cubic-in-out', 40) %>%
  keep_state(10)

ggplot(data2, aes(x = x, y = y)) +
  geom_contour(stat = 'density_2d') +
  transition_manual(.frame)

It sort of does the right thing, but there is a noticeable switch in the density as the 10 points disappear and reappear.

What we really want to do is to calculate intermediary states of the 2D densities that the contours are derived from. The densities remove the point discrepancy while presenting a statistic that can be truthfully transitioned. Unfortunately the density data is only present ephemerally inside the stat function and is not accessible to the outside world (where gganimate resides). We could rewrite the density_2d stat to postpone the contour transformation:

StatDensityContour <- ggproto('StatDensityContour', StatDensity2d,
  compute_group = function (data, scales, na.rm = FALSE, h = NULL, contour = TRUE,
                            n = 100, bins = NULL, binwidth = NULL) {
    StatDensity2d$compute_group(data, scales, na.rm = na.rm, h = h, contour = FALSE,
                                n = n, bins = bins, binwidth = binwidth)
  },
  finish_layer = function(self, data, params) {
    names(data)[names(data) == 'density'] <- 'z'
    do.call(rbind, lapply(split(data, data$PANEL), function(d) {
      StatContour$compute_panel(d, scales = NULL, bins = params$bins,
                                binwidth = params$binwidth)
    }))
  }
)

ggplot(data, aes(x = x, y = y)) +
  geom_contour(stat = 'density_contour') +
  transition_states(state, transition_length = 3, state_length = 1) +
  ease_aes('cubic-in-out')

What to make of this?

You might feel like Alice who has stepped through the looking glass at this point. Should you always second guess whatever gganimate is doing? Of course not. The choice of interpolating the statistically transformed data is sound and will just work for most of what you want to do. I certainly want to allow gganimate to expand based on the raw data as well, though this has proven harder than expected as it is often only a subset of aesthetics you want to expand at that state (remember the problem with unmapped colour/fill).

Even if early expansion gets implemented it will only solve problems such as the voronoi example. The last contour example runs deeper and touches upon the theory of the grammar of graphics and how ggplot2 implements it itself. Statistical transformations are often envisioned as a single operation, but can just as well be thought of as a chain of transformation (here density_2d -> contour). Alternatively one could think that it was the responsibility of the geom to calculate the contour lines. All-in-all the dichotomy of stat+geom is not so clear cut as it might appear, which has not been much of a problem when generating static plots. With the advent of gganimate this problem becomes more pertinent and I honestly don't know the best way to address it. In a perfect world, all stats would return the data-state best fitted for expansion but this would require the finish_layer() hook to be more powerful, and would obviously require rewrites of a slew of geoms/stats. Then comes the question of whether it is even the responsibility of geom/stat developers to consider gganimate in the first place…

No matter the eventual solution to all this, I hope this post has made you a bit more aware of what happens to the data you plot as you pass it into ggplot2. Visualisations are after all first and foremost about data transformations…

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.


Why do we use arrow as an assignment operator?

Posted: 23 Sep 2018 05:00 PM PDT

(This article was first published on Colin Fay, and kindly contributed to R-bloggers)

A Twitter thread turned into a blog post.

In June, I published a little thread on Twitter about the history of the <- assignment operator in R. Here is a blog post version of this thread.

Historical reasons

As you all know, R comes from S. But you might not know a lot about S (I don't). This language used <- as an assignment operator. It's partly because it was inspired by a language called APL, which also had this sign for assignment.

But why again? APL was designed on a specific keyboard, which had a key for <-:

At that time, it was also chosen because there was no == for testing equality: equality was tested with =, so assigning a variable needed to be done with another symbol.

From the APL Reference Manual

Until 2001, in R, = could only be used for assigning function arguments, like fun(foo = "bar") (remember that R was born in 1993). So before 2001, <- was the standard (and only) way to assign a value to a variable.

Before that, _ was also a valid assignment operator. It was removed in R 1.8:

(So no, at that time, no snake_case_naming_convention)

Colin Gillespie published some of his code from early 2000, where assignment was made like this 🙂

The main reason "equal assignment" was introduced is that other languages use = as an assignment method, and that it increased compatibility with S-Plus.

And today?

Readability

Nowadays, there are seldom any cases when you can't use one in place of the other. It's safe to use = almost everywhere. Yet, <- is preferred and advised in R coding style guides:

One reason, if not historical, to prefer <- is that it clearly states on which side you are making the assignment (you can assign from left to right or from right to left in R):

a <- 12
13 -> b

a
## [1] 12

b
## [1] 13

a -> b
a <- b

The RHS assignment can for example be used for assigning the result of a pipe:

library(dplyr)

iris %>%
  filter(Species == "setosa") %>%
  select(-Species) %>%
  summarise_all(mean) -> res

res
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1        5.006       3.428        1.462       0.246

Also, it's easier to distinguish equality comparison and assignment in the last line of code here:

c <- 12
d <- 13
e = c == d
f <- c == d

Note that <<- and ->> also exist:

create_plop_pouet <- function(a, b){
  plop <<- a
  b ->> pouet
}

create_plop_pouet(4, 5)

plop
## [1] 4  
pouet  
## [1] 5  

And that Ross Ihaka uses =: https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf

Environments

There are some environment and precedence differences. For example, inside a function call, assignment with = only matches the function's arguments, whereas <- performs the assignment in the calling (here, top-level) environment.

median(x = 1:10)  
## [1] 5.5  
x  
## Error in eval(expr, envir, enclos): object 'x' not found
median(x <- 1:10)  
## [1] 5.5  
x  
##  [1]  1  2  3  4  5  6  7  8  9 10  

In the first call, you're passing x as the parameter of the median function, whereas the second one creates a variable x in the environment and uses it as the first argument of median. Note that this works because x is the name of the function's parameter, and it won't work with y:

median(y = 12)  
## Error in is.factor(x): argument "x" is missing, with no default
median(y <- 12)  
## [1] 12  

There is also a difference in parsing when it comes to both these operators (but I guess this never happens in the real world), one failing and not the other:

x <- y = 15
## Error in x <- y = 15: could not find function "<-<-"

x = y <- 15
c(x, y)
## [1] 15 15

It is also good practice because it clearly indicates the difference between function arguments and assignment:

x <- shapiro.test(x = iris$Sepal.Length)
x
## 
##  Shapiro-Wilk normality test
## 
## data:  iris$Sepal.Length
## W = 0.97609, p-value = 0.01018

And this weird behavior:

rm(list = ls())

data.frame(
  a = rnorm(10),
  b <- rnorm(10)
)
##             a b....rnorm.10.
## 1   0.9885196      1.3809205
## 2  -0.2810080     -1.4165648
## 3  -0.6709831     -1.6203407
## 4  -1.3055656     -1.0713406
## 5   1.2297421      2.2558878
## 6  -1.5333307      0.5194378
## 7  -0.1011028     -0.3651725
## 8  -0.3976268     -1.0814520
## 9  -0.3924576     -0.7030822
## 10 -1.1745994     -0.7090015

a
## Error in eval(expr, envir, enclos): object 'a' not found

b
##  [1]  1.3809205 -1.4165648 -1.6203407 -1.0713406  2.2558878  0.5194378
##  [7] -0.3651725 -1.0814520 -0.7030822 -0.7090015

A little bit unrelated, but I love this one:

g <- 12 -> h

g
## [1] 12  
h  
## [1] 12  

Which of course is not doable with =.

Other operators

Some users pointed out on Twitter that this could make the code a little bit harder to read if you come from another language. <- is "only" used in F#, OCaml, R and S (as far as Wikipedia can tell). Even if <- is rare in programming, I guess its meaning is quite easy to grasp, though.

Note that the second most used assignment operator is := (= being the most common). It's used in {data.table} and {rlang} notably. The := operator is not defined in the current R language, but has not been removed, and is still understood by the R parser. You can't use it on the top level:

a := 12  
## Error in `:=`(a, 12): could not find function ":="

But as it is still understood by the parser, you can use := as an infix without any %%, for assignment, or for anything else:

`:=` <- function(x, y){
  x$y <- NULL
  x
}

head(iris := Sepal.Length)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

You can see that := was used as an assignment operator at https://developer.r-project.org/equalAssign.html:

All the previously allowed assignment operators (<-, :=, _, and <<-) remain fully in effect

Or in R NEWS 1:

See also

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.


By-Group Summary with SparkR – Follow-up for A Reader Comment

Posted: 23 Sep 2018 02:52 PM PDT

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

A reader of my previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), Mr. Wayne Zhang, made a good comment: "Why not use directly either Spark or H2O to derive such computations without involving detailed map/reduce".

Although Spark is not as flexible as R in statistical computation (in my opinion), it does have advantages for munging large data sets, such as aggregating, selecting, filtering, and so on. The demonstration below shows how to do the same by-group calculation using SparkR.

In SparkR, the most convenient way to do the by-group calculation is to use the agg() function after grouping the Spark DataFrame based on the specific column (or columns) with the groupBy() function.

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))

sc <- sparkR.session(master = "local",
                     sparkConfig = list(spark.driver.memory = "10g",
                                        spark.driver.cores = "4"))

df <- as.DataFrame(iris)

summ1 <- agg(
  groupBy(df, alias(df$Species, "species")),
  sl_avg = avg(df$Sepal_Length),
  sw_avg = avg(df$Sepal_Width)
)

showDF(summ1)
+----------+-----------------+------------------+
|   species|           sl_avg|            sw_avg|
+----------+-----------------+------------------+
| virginica|6.587999999999998|2.9739999999999998|
|versicolor|            5.936|2.7700000000000005|
|    setosa|5.005999999999999| 3.428000000000001|
+----------+-----------------+------------------+
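
For reference (an aside of mine, not part of the original post), the corresponding local computation with dplyr looks like this; the SparkR aggregation above reproduces it at scale:

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(sl_avg = mean(Sepal.Length),
            sw_avg = mean(Sepal.Width))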

Alternatively, we can also use the gapply() function to apply an anonymous function calculating statistics to each chunk of the grouped Spark DataFrame. What's more flexible in this approach is that we can define the schema of the output data, such as names and formats.

summ2 <- gapply(
  df,
  df$"Species",
  function(key, x) {
    data.frame(key, mean(x$Sepal_Length), mean(x$Sepal_Width), stringsAsFactors = F)
  },
  "species STRING, sl_avg DOUBLE, sw_avg DOUBLE"
)

showDF(summ2)
+----------+------+------+
|   species|sl_avg|sw_avg|
+----------+------+------+
| virginica| 6.588| 2.974|
|versicolor| 5.936|  2.77|
|    setosa| 5.006| 3.428|
+----------+------+------+

Lastly, we can take advantage of the Spark SQL engine after registering the DataFrame as a temporary view.

createOrReplaceTempView(df, "tbl")

summ3 <- sql("select Species as species, avg(Sepal_Length) as sl_avg, avg(Sepal_Width) as sw_avg from tbl group by Species")

showDF(summ3)
+----------+-----------------+------------------+
|   species|           sl_avg|            sw_avg|
+----------+-----------------+------------------+
| virginica|6.587999999999998|2.9739999999999998|
|versicolor|            5.936|2.7700000000000005|
|    setosa|5.005999999999999| 3.428000000000001|
+----------+-----------------+------------------+

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

