[R-bloggers] Network model trees (and 3 more aRticles)
- Network model trees
- Microsoft ML Server 9.4 now available
- Grades Aren’t Normal
- EARLy bird ticket offer ends tomorrow!
Network model trees
Posted: 30 Jul 2019 03:00 PM PDT
(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)
The effect of covariates on correlations in psychometric networks is assessed with either model-based recursive partitioning (MOB) or conditional inference trees (CTree).
Citation
Jones PJ, Mair P, Simon T, Zeileis A (2019). "Network Model Trees", OSF ha4cw, OSF Preprints. doi:10.31219/osf.io/ha4cw
Abstract
In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. In this paper we define a statistical model for correlation-based networks and propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the network structure. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. This approach is implemented in the networktree R package. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.
Software
CRAN package: https://CRAN.R-project.org/package=networktree
Illustrations
Network model trees are illustrated using data from the Open Source Psychometrics Project:
The TIPI network is partitioned using MOB based on three covariates: engnat (English as native language), gender, and education. Generally, the structure of the network is characterized by strong negative relationships between the normal and reverse measurements of each domain, with complex relationships between separate domains. When partitioning the network, interesting differences are revealed. For example, native English speakers without a university degree showed a negative relationship between agreeableness and agreeableness-reversed that was significantly weakened in non-native speakers and in native speakers with a university degree. Among native English speakers with a university degree, males and other genders showed a stronger relationship between conscientiousness and neuroticism-reversed compared to females.
In the network plots, edge thicknesses are determined by the strength of regularized partial correlations between nodes. Node labels correspond to the first letter of each Big Five personality domain, with the character "r" indicating items that measure the domain in reverse.
The DASS network is partitioned using MOB based on a larger variety of covariates in a highly exploratory scenario: engnat (English as native language), gender, marital status, sexual orientation, and race. Again, the primary split occurred between native and non-native English speakers. Among native English speakers, two further splits were found with the race variable. Among the non-native English speakers, a split was found by gender. These results indicate various sources of potential heterogeneity in network structure. For example, among non-native speakers, the connection between worthlife (I felt that life wasn't worthwhile) and nohope (I could see nothing in the future to be hopeful about) was stronger in males compared to females and other genders.
In native English speaking Asians, the connection between getgoing (I just couldn't seem to get going) and lookforward (I felt that I had nothing to look forward to) was stronger compared to all other racial groups.
To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Microsoft ML Server 9.4 now available
Posted: 30 Jul 2019 10:41 AM PDT
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Microsoft Machine Learning Server, the enhanced deployment platform for R and Python applications, has been updated to version 9.4. This update includes the open source R 3.5.2 and Python 3.7.1 engines, and supports integration with Spark 2.4. Microsoft ML Server also includes specialized R packages and Python modules focused on application deployment, scalable machine learning, and integration with SQL Server.
Microsoft Machine Learning Server is used by organizations that need to use R and/or Python code in production applications. For some examples of deployments, take a look at these open-source solution templates for credit risk estimation, energy demand forecasting, fraud detection and many other applications.
Microsoft ML Server 9.4 is available now. For more details on this update, take a look at the announcement at the link below.
SQL Server Blog: Microsoft Machine Learning Server 9.4 is now available
Grades Aren’t Normal
Posted: 30 Jul 2019 08:00 AM PDT
(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)
These grades are displayed in Figures 2 and 3. Here are some things to notice from doing this:
- The students in the tail benefited considerably from the curve, rising to D grades when they probably should have failed.
- Most students were penalized by the curve, and in hard-to-understand, seemingly arbitrary ways.
- Only one student will get an A. Another who would have gotten an A got an A-, and several other A students were pushed to the high Bs.
The curve has a very strong effect at the top of the distribution; two students with likely equivalent skill got very different grades, and the student in third place, who appears to be just as skilled as the other two but for luck, got a B instead of an A. This appears to be very unfair.
Now we could screw around with the parameters and perhaps get a better distribution at the top of the curve, but that raises the question of why any distribution should be forced onto the data, let alone a Normal one. We could just as easily have swapped qnorm() with qcauchy() and gotten a very different distribution for our scores. The data itself doesn't suggest it came from a Normal distribution, so what makes the Normal distribution special, above all others?
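To make the point about swapping quantile functions concrete, here is a minimal sketch of rank-based curving. It is an illustration under stated assumptions, not a reproduction of the procedure used earlier in the article: the scores are made up, and the helper name curve_scores and its center and spread parameters are hypothetical choices.

```r
# Hypothetical raw scores (not the article's data); no ties, for simplicity.
scores <- c(41, 58, 63, 67, 70, 72, 75, 78, 81, 84, 88, 91, 94, 97)

# Curve scores by sending each student's rank to the matching quantile of a
# target distribution. `qfun` is any quantile function (qnorm, qcauchy, ...).
# `center` and `spread` are assumed location and scale values.
curve_scores <- function(x, qfun, center = 75, spread = 10) {
  n <- length(x)
  p <- (rank(x) - 0.5) / n      # probabilities strictly inside (0, 1)
  center + spread * qfun(p)     # standard quantiles, shifted and scaled
}

normal_curve <- curve_scores(scores, qnorm)
cauchy_curve <- curve_scores(scores, qcauchy)

# Side-by-side view of raw and curved scores.
print(round(data.frame(raw = scores, normal = normal_curve, cauchy = cauchy_curve), 1))
```

Both versions preserve the ranking of the students, but the Cauchy version flings the extreme students much farther from the center. Nothing in the data picks one target distribution over the other.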
What's So Special about the Normal Distribution
The Normal distribution has a long history, dating back to the beginning of probability theory. It is the prominent distribution in the Central Limit Theorem and many well-known statistical tests, such as the t-test and ANOVA. When people talk about "the bell curve" they are almost always referring to the Normal distribution (there is more than one "bell curve" in probability theory). The Fields Medalist Cédric Villani once said in a 2016 TED talk that if the Greeks had known of the Normal distribution they would have worshipped it like a god.
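So why does the Normal distribution hold the place it does? For reference, below is the formula for the PDF of the Normal distribution with mean $\mu$ and variance $\sigma^2$:
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$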
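As a quick check on the Normal density, it can be typed out by hand in R and compared with the built-in dnorm(). This is only a sketch; the names normal_pdf, mu and sigma2 are chosen here for illustration and do not come from the article.

```r
# Normal PDF written out by hand; `mu` is the mean and `sigma2` the variance.
normal_pdf <- function(x, mu = 0, sigma2 = 1) {
  exp(-(x - mu)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
}

x <- seq(-4, 4, by = 0.5)

# dnorm() is parameterized by the standard deviation, hence sd = 2 for variance 4.
max_diff <- max(abs(normal_pdf(x, mu = 1, sigma2 = 4) - dnorm(x, mean = 1, sd = 2)))
print(max_diff)

# A density should integrate to 1 over the whole real line.
total_area <- integrate(normal_pdf, -Inf, Inf, mu = 1, sigma2 = 4)$value
print(total_area)
```

Note that R's Normal functions take the standard deviation rather than the variance, which is an easy place to slip when moving between notation and code.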
A plot of the Normal distribution is given in Figure 4. At first glance the formula looks complicated, but it's actually well-behaved and easy to work with. It's rich in mathematical properties. While in principle any number could be produced by a Normally distributed random variable, in practice seeing anything farther than three standard deviations from the mean is unlikely. It is closed under addition; the sum of two (jointly) Normal random variables is a Normal random variable. And of course it features prominently in the Central Limit Theorem; the sum of IID random variables with finite variance starts to look Normally distributed, and this can happen even when these assumptions are relaxed. Additionally, Central Limit Theorems exist for vectors, functions, and partial sums, and in those cases the limiting distribution is some version of a Normal distribution.
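Two of these properties are easy to check numerically. The sketch below computes the probability of landing within three standard deviations of the mean, then simulates standardized sums of IID Exponential draws to see the Central Limit Theorem at work. The seed, the Exponential(1) choice, and the sizes n_terms and n_sums are arbitrary assumptions made for the illustration.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Probability that a Normal variable lands within three standard deviations
# of its mean (about 0.9973).
within_3sd <- pnorm(3) - pnorm(-3)
print(within_3sd)

# CLT illustration: sums of IID Exponential(1) draws (mean 1, variance 1),
# which are individually very skewed.
n_terms <- 50      # summands per sum (assumed)
n_sums  <- 10000   # number of simulated sums (assumed)
draws <- matrix(rexp(n_terms * n_sums), nrow = n_sums)

# Standardize each sum: subtract its mean (n_terms) and divide by its
# standard deviation (sqrt(n_terms)).
z <- (rowSums(draws) - n_terms) / sqrt(n_terms)

# Share of standardized sums within two standard deviations, next to the
# corresponding Normal probability.
sim_within_2sd <- mean(abs(z) <= 2)
print(c(simulated = sim_within_2sd, normal = pnorm(2) - pnorm(-2)))
```

The simulated share should sit close to the Normal value even though each individual draw is far from Normal.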
Most practitioners, though, do not appreciate the mathematical "beauty" of the Normal distribution; I doubt this is why people would insist grades should be Normally distributed. Well, perhaps that's not quite true; people may know that the Normal distribution is special even if they themselves cannot say why, and they may want to see Normal distributions appear, in keeping with a fad that's been strong since eugenics. But "fad" feels like a cop-out answer, and I think there are better explanations.6
Many people get rudimentary statistical training, and the result is "cargo-cult statistics", as described by (1);7 they practice something that on the surface looks like statistics but lacks true understanding of why the statistical methods work or why certain assumptions were made. People in statistics classes learned about the Normal distribution, and their instructors (rightly) drilled its features and its importance into their heads. The result is that they think data should be Normally distributed, since that is what they know, when in reality data can follow any distribution, usually a non-Normal one.
Additionally, statistics' most popular tests, in particular the t-test and ANOVA, call for Normally distributed data in order to be applied. And in the defense of practitioners, there are a lot of tests calling for Normally distributed data, especially the ones they learned. But they don't appreciate why these procedures use the Normal distribution.
The t-test and ANOVA, in particular, are some of the oldest tests in existence, developed by Student and Fisher in the early twentieth century, and they prompted a revolution in science. But why did these tests use the Normal distribution? I speculate that a parametric test that worked for Normally distributed data was simply low-hanging fruit; assuming the data was Normally distributed was the easiest way to produce a meaningful, useful product. Many tests with the same objectives as the t-test and ANOVA have been developed that don't require Normality, but these tests came later and they're harder to do. (That said, with software it's just as easy these days to do the t-test as it is to do an equivalent non-parametric test, but software is new, and it's also harder to explain to novices what the non-parametric test does.) Additionally, results such as the Central Limit Theorem cause tests requiring Normality to work anyway in large data sets.
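To show how little extra effort the non-parametric route takes in software, here is a small sketch that runs a two-sample t-test and a Wilcoxon rank-sum test on the same data. The samples are simulated and hypothetical, and the Wilcoxon test is used here as one common non-parametric counterpart; the article does not name a specific test.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical skewed (non-Normal) samples; the second group is shifted upward.
group_a <- rexp(40, rate = 1)
group_b <- rexp(40, rate = 1) + 0.75

# Welch two-sample t-test (R's default), which leans on Normality or on
# large-sample behavior.
t_result <- t.test(group_a, group_b)

# Wilcoxon rank-sum test, which uses only the ranks of the observations.
w_result <- wilcox.test(group_a, group_b)

print(c(t_test = t_result$p.value, wilcoxon = w_result$p.value))
```

Each test is a single function call. The harder part, as noted above, is explaining to a novice what the rank-based test is actually comparing.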
Good products often come for Normal things first; generalizations are more difficult and may take more time to be produced and be used. That said, statisticians appreciate the fact that most phenomena are not Normally distributed and that tweaks will need to be made when working with real data. Most people practicing statistics, though, are not statisticians; cargo-cult statistics flourishes.
Conclusion
Since statistics became prominent in science, statisticians have struggled with how to handle their own success and the fact that most statistics is done by non-statisticians. Statistical education is a big topic, since statistics is hard to teach well. Also, failure to understand statistics produces real-world problems, from junk statistics to junk science and policy motivated by it. Assuming grades are Normally distributed is but one aspect of this phenomenon, and one that some students unfortunately feel personally.
Perhaps the first step to dealing with such problems is reading an article like this one and appreciating the message. Perhaps it will change an administrator's mind (but I'm a pessimist). But perhaps the student herself reading this will see the injustice she suffers from such a policy and appreciate why the statisticians are on her side, then commit to never being so irresponsible herself.
Bibliography
- (1) Philip B. Stark and Andrea Saltelli. Cargo-cult statistics and scientific crisis. Significance, 15 (4): 40-43, 2018. doi: 10.1111/j.1740-9713.2018.01174.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1740-9713.2018.01174.x.
Footnotes
- … material.1
- This is not the only possible objective of grading. Other grading objectives could be to rate student growth (how much a student improved since the beginning of the class) or to simply say which students were best and which were worst in the class. I take issue personally with either of these alternative "objectives" of grading; the first can be arbitrary and not produce useful information on the student since they could get an "A" for going from "terrible" to "mediocre", while the second not only suffers from a similar problem (perhaps the "best" student knows little about the material) but also feels elitist.
- … distributions.2
- Here is an argument for why grades might appear Normally distributed; since a final grade is the sum of grades from assignments, quizzes, tests, and so on, and grades generally emerge from summation of points, we could apply a version of the Central Limit Theorem to argue that the end result should be grades that appear Normally distributed. But there are still assumptions on how strong a dependence there is in points and assignments, and in fact while an individual student's grades might start to look Normally distributed if such a process were to continue for a long time, this says nothing about the student body since one could say each student is her own data-generating process with her own parameters and those parameters do not follow some larger distribution.
- … 60.3
- Scoring above 100 is possible in many of the classes I teach due to extra credit opportunities. The low grades are often those belonging to students who have effectively given up on the class, or at least are only moderately connected to it.
- … distribution.4
- In the data set grades, there are ties. This data was rounded; real data would not have such an issue, and presumably an instructor would have access to the original data that wasn't rounded.
- ….5
- I use the variance notation for the Normal distribution; that is, $N(\mu, \sigma^2)$ for a Normal random variable with mean $\mu$ and variance $\sigma^2$.
- … explanations.6
- Tests such as IQ tests and SAT/ACT tests often seem to produce scores following a Normal distribution, but this seems to follow more from construction than from nature. One can see why an educator would start to think that student grades, which also assess "intelligence," should also be Normally distributed, but this appears to just be accepting a deception.
- … (1);7
- The term "cargo-cult" describes a phenomenon observed after the American Pacific campaign of World War II at the islands that once were American bases; island natives would build replicas of the bases and imitate their operations after the military left. They saw and imitated the activities without understanding why they worked. Since then the term has been used to suggest "imitation without understanding"; for example, Nobel physicist Richard Feynman coined the term "cargo-cult science" to mean activity that looks like science but does not honestly apply the scientific method.
Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)!
EARLy bird ticket offer ends tomorrow!
Posted: 30 Jul 2019 07:32 AM PDT
R fans, you have just one more day to get your hands on discounted EARL London 2019 tickets. Our early bird offer gets you £100 off the full price ticket, so it makes persuading your boss easier!
Visit the EARL website for more details and see 2018's highlights below: