[R-bloggers] SR2 Chapter 2 Medium (and 4 more aRticles)
- SR2 Chapter 2 Medium
- What to know before you adopt Hugo/blogdown
- The significance of the sector on the salary in Sweden, a comparison between different occupational groups, part 3
- How to Acquire Large Satellite Image Datasets for Machine Learning Projects
- Version 0.4.0 of nnetsauce, with fruits and breast cancer classification
Posted: 28 Feb 2020 04:00 PM PST [This article was first published on Brian Callander, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.) Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.

SR2 Chapter 2 Medium

Here are my solutions to the medium exercises in chapter 2 of McElreath's Statistical Rethinking, 1st edition. My intention is to move over to the 2nd edition when it comes out next month.
\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbernoulli}{Bernoulli} \DeclareMathOperator{\dpoisson}{Poisson} \DeclareMathOperator{\dnormal}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexponential}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

Globe Tossing

Start by creating a grid and the posterior function. The exercise asks us to approximate the posterior for each of the following three datasets. To do this, we just apply our grid approximation to each dataset. The posterior becomes gradually more concentrated around the ground truth. For the second question, we simply do the same but with a different prior. More specifically, for any p below 0.5 we set the prior to zero, then map our posterior over each of the datasets with this new grid. Again we see the posterior concentrate around the ground truth. Moreover, the distribution is more peaked (at ~0.003) than with the uniform prior, which peaks at around ~0.0025. The first dataset already gets pretty close to this peak, i.e. this more informative prior gets us better inferences sooner.

For the final question on globe tossing, we can just use the counting method rather than grid approximation. We enumerate all possible events in proportion to how likely they are to occur: 10 L for Mars, 3 L and 7 W for Earth. Then we filter out any events inconsistent with our observation of land, and summarise the remaining possibilities. We get around 23%.

Card Drawing

We make a list of all sides, filter out any inconsistent with our observation of a black side, then summarise the remaining card possibilities. The next exercise is the same as the previous but with more cards.
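The grid approximation described above can be sketched as follows (the original post uses R; this Python sketch and its function names are my own, and the grid size is illustrative):

```python
import numpy as np

def grid_posterior(n_water, n_tosses, prior, n_points=101):
    """Grid-approximate the posterior for the water proportion p.

    prior: array of prior weights over the grid (need not be normalised).
    """
    p = np.linspace(0, 1, n_points)
    # Binomial likelihood up to a constant: p^W * (1 - p)^(n - W)
    likelihood = p**n_water * (1 - p)**(n_tosses - n_water)
    posterior = prior * likelihood
    return p, posterior / posterior.sum()

p_grid = np.linspace(0, 1, 101)
uniform_prior = np.ones(101)
step_prior = np.where(p_grid < 0.5, 0.0, 1.0)  # zero below p = 0.5

# The three datasets: W W W; W W W L; L W W L W W W
for w, n in [(3, 3), (3, 4), (5, 7)]:
    p, post_unif = grid_posterior(w, n, uniform_prior)
    _, post_step = grid_posterior(w, n, step_prior)
```

With the step prior, all posterior mass below 0.5 is zero, so the surviving region is more sharply peaked, matching the observation above.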
Note that this is equivalent to using the three cards as before, but with a larger prior probability on the BB card. Putting the prior on the cards is equivalent to having the cards in proportion to their prior. The rest of the calculation is the same. This last card-drawing exercise is slightly more involved, since we can observe either of the two sides of the one card and either of the two sides of the other. Thus, we first generate the list of all possible pairs of cards, expand this into a list of all possible sides that could be observed for each card, filter out any event not consistent with our observations, then summarise whatever is left.

To leave a comment for the author, please follow the link and comment on their blog: Brian Callander.
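The counting method used for both the globe and the card exercises can be sketched in a few lines (the post uses R; this Python version is my own illustration):

```python
from fractions import Fraction

# Globe: enumerate equally likely toss outcomes per globe.
# Earth: 3 L and 7 W out of 10; Mars: 10 L. Each globe has prior 1/2.
events = [("Earth", "L")] * 3 + [("Earth", "W")] * 7 + [("Mars", "L")] * 10
consistent = [globe for globe, side in events if side == "L"]
p_earth = Fraction(consistent.count("Earth"), len(consistent))  # 3/13, about 23%

# Cards: B/B, B/W, W/W. List every side that could face up,
# keep only the events where a black side is observed.
sides = [("BB", "B"), ("BB", "B"), ("BW", "B"),
         ("BW", "W"), ("WW", "W"), ("WW", "W")]
black_up = [card for card, up in sides if up == "B"]
p_other_side_black = Fraction(black_up.count("BB"), len(black_up))  # 2/3
```

Filtering the enumerated events and counting what remains is exactly the "filter then summarise" step described above.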
What to know before you adopt Hugo/blogdown

Posted: 28 Feb 2020 04:00 PM PST [This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers].

Fancy (re-)creating your website using Hugo, with or without blogdown? I'm writing this post with R users in mind, which means I shall use R analogies.

Why Hugo/blogdown?

If you're reading this you've probably heard of Hugo somewhere. I myself have used Hugo for this website.

What can break or evolve?

When you have a website created with Hugo, there are two to four main actors:
So what's going to break and evolve?
How to reduce the likelihood of breakages?

Choose your theme wisely and keep in touch!

When choosing a theme, i.e. a collection of layouts for your website, choose wisely. Now, you'll have to know when the theme gets updated. How?
What if your theme gets orphaned?

What if you chose your theme wisely but it lost its maintainer(s)?

Make well-defined tweaks to the theme

Although you've adopted a theme, you'll probably want to personalize it a bit.
If you define CSS and JS files, they'd live in their own folder, separate from the theme's files.

Have your content as Markdown, not HTML files

Now, as explained in the table mentioned above, using .RMarkdown/.md has its limitations.

Update your theme

So if you've tweaked cleanly, you can update your theme when needed! To "update your theme", you need to replace the theme folder of your website folder with the new theme files. You could update the theme manually, or use a dedicated tool. Build your website, look at what needs to be changed:
To figure out what needs to be changed, you'll probably want to read the changelog (or commit history) of your theme, and maybe even the Hugo changelog. Often, changes in your theme, and the work needed on your website, won't be dramatic: the theme folder update, maybe one config parameter, a few lines' diff in your custom layouts.

Follow Hugo news?

If you wrote no custom layouts and use a very well maintained theme, you might never need to keep up with Hugo changes yourself.

Don't live on the edge

If you have a workflow on a continuous integration system updating your website every day from an external data source, like Noam Ross does (yes, that's a very cool and very fancy setup!), use a specific Hugo version there; don't let the workflow install Hugo's latest version, because it could break your website without your noticing.

What if I just never update Hugo or my theme?

No, it's not a good solution in my opinion.

Quick fixes to bad news

Imagine you made yourself a pretty website to showcase your cool posts and informative slidedecks. But you don't have time right now before your barbecue to do that, let alone to learn how to do that if it's the first time, so what can you do apart from not posting your talk content?

If your site is deployed by gh-pages

I.e. you build your website locally and then push the rendered content to a gh-pages branch of a GitHub repo. You need to downgrade Hugo before doing that.

If your site is deployed by Netlify

If you're lucky, you can just push your content, and since the Hugo version in your Netlify config file hasn't changed, your website will build smoothly. You could make a PR to your own website to get a preview there before merging.
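Pinning the Hugo version in Netlify's config is one way to avoid the surprise breakage described above. A minimal `netlify.toml` sketch (the version number is illustrative; use whatever version your site currently builds with):

```toml
[build]
  publish = "public"
  command = "hugo"

[build.environment]
  # Pin Hugo so Netlify doesn't silently upgrade it under you
  HUGO_VERSION = "0.65.3"
```

Netlify reads the `HUGO_VERSION` environment variable at build time, so your deployed site keeps building with the same Hugo even as new releases come out.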
Later

Eventually, one day soon, get to updating Hugo again, looking at your theme's changes, and if needed extracting your tweaks from the theme folder if you made them there (I've done that).

Conclusion

In this blog post I presented what the maintenance of a Hugo website entails in my experience.
Coming back to Hugo, if you encounter problems, I'd unsurprisingly recommend the Hugo docs and the Hugo forum. Now, if this all sounds overwhelming: I don't think these tech skills are harder than R skills, but time is a limited resource, so maybe you could outsource some of your website's creation and maintenance. Don't hesitate to share your own experience and advice on maintaining Hugo websites!
To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.
The significance of the sector on the salary in Sweden, a comparison between different occupational groups, part 3

Posted: 28 Feb 2020 04:00 PM PST [This article was first published on R Analystatistics Sweden, and kindly contributed to R-bloggers].

To complete the analysis on the significance of the sector on the salary for different occupational groups in Sweden, I will in this post examine the correlation between salary and sector using statistics for education. The F-value from the ANOVA table is used as the single value to discriminate how much the sector and salary correlate. For exploratory analysis, the ANOVA value seems good enough. First, define libraries and functions. The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000CY.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/. I have renamed the file to 000000CY_sector.csv because the filename 000000CY.csv was used in a previous post. The table: Average basic salary, monthly salary and women's salary as a percentage of men's salary by sector, occupational group (SSYK 2012), sex and educational level (SUN), Year 2014 – 2018.

- Monthly salary
- 1-3 public sector
- 4-5 private sector

In the plot and tables, you can also find information on how the increase in salaries per year for each occupational group is affected when the interactions are taken into account.
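To make concrete what the F-value measures, here is a minimal one-way ANOVA F-statistic computed by hand (the post itself uses R's `anova()`; this Python sketch and its salary numbers are hypothetical): a large F means the salary variance between sectors is large relative to the variance within each sector.

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F-statistic: between-group vs within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate([np.asarray(g) for g in groups]))
    # Between-group sum of squares (df = k - 1)
    ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ssw = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical monthly salaries (kSEK) for one occupational group
public = [31.0, 32.5, 33.1, 31.8, 32.2]
private = [34.0, 35.2, 36.1, 34.8, 35.5]
f_value = anova_f([public, private])
```

Sorting occupational groups by this F-value, as the post does, ranks them by how strongly sector separates their salaries.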
Figure 1: The significance of the sector on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

Figure 2: The significance of the interaction between sector, edulevel, year and sex on the salary in Sweden, a comparison between different occupational groups, Year 2014 – 2018

The tables with all occupational groups sorted by F-value in descending order.
Let's check what we have found.

Figure 3: Highest F-value sector, Organisation analysts, policy administrators and human resource specialists
Figure 4: Lowest F-value sector, Fast-food workers, food preparation assistants
Figure 5: Highest F-value interaction sector and gender, Cleaners and helpers
Figure 6: Lowest F-value interaction sector and gender, Office assistants and other secretaries
Figure 7: Highest F-value interaction sector and edulevel, Tax and related government associate professionals
Figure 8: Lowest F-value interaction sector and edulevel, Cleaners and helpers
Figure 9: Highest F-value interaction sector and year, Administration and service managers not elsewhere classified
Figure 10: Lowest F-value interaction sector and year, Other social services managers
Figure 11: Highest F-value interaction sector, edulevel, year and gender, Authors, journalists and linguists
Figure 12: Lowest F-value interaction sector, edulevel, year and gender, Attendants, personal assistants and related workers

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden.
How to Acquire Large Satellite Image Datasets for Machine Learning Projects

Posted: 28 Feb 2020 07:39 AM PST [This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers].

Introduction

Historically, only governments and large corporations have had access to quality satellite images. In recent years, satellite image datasets have become available to anyone with a computer and an internet connection. The quality, quantity, and precision of these datasets are continuously improving, and there are many free and commercial platforms at your disposal to acquire satellite images. On top of that, the prices of acquiring the images have fallen significantly, as have the prices and availability of the tools that will allow you to analyze the images for machine learning and data science projects. In this article, I hope to inspire you to start looking into the power and utility of publicly available satellite image datasets available today. I will show you a high-level overview of where these images come from, then I will dive deeper into the details about which features you should think about when choosing the right data source. In a future article, I will give you an overview of the architecture that you need to have in place before you can start working with them on your local computer. Let's jump right into satellites. How is this kind of dataset unique? Why should you bother with satellite images?

The difference between 30cm & 70cm satellite imagery. [Source: Maxar]

Satellite Image Data at Your Fingertips

First of all, you can get complete coverage of the Earth, which means that you can select virtually any location on the planet, and you will be able to see what that place looks like. Further, the images are readily available.
You can go to a website and easily download an image for any location that you want, because there are public space programs that offer free images to whoever wants them. So when you start your research, I strongly encourage you to take a look at the available free resources first. We've included a list of those resources at the bottom of this article. There are plenty of commercial options available that provide higher-quality images for specialized purposes. You can reach out to Appsilon directly for assistance with acquiring commercial satellite datasets. One way to think about satellite image datasets is that they give you the ability to travel backwards in time. When you think of satellite images you might think about Google Maps, which provides you with satellite images that give you a snapshot of the surface of the Earth. But with access to the right provider, you can go back in time and access images for any day that you want, going back years – in some cases back to the 1980s. This added temporal dimension gives you additional abilities when it comes to analyzing data. Imagine that you can take a look at one point on Earth, and then go back in time and see how this place has changed. You can then build predictive models to forecast what this place is going to look like in the future.

A representation of the 4,500 satellites currently orbiting the Earth

This visualization shows the scale of what is going on above our heads, outside the stratosphere. Right now there are more than 4,500 satellites orbiting our planet, and over 600 of them are constantly taking photos. There are more and more preparing to launch, especially since this area of technology has been accelerating very rapidly in recent years. This means that the quantity and quality of satellite image datasets is rapidly improving. Currently, the best resolution that you can get from a satellite image is 25cm per pixel.
This means that if you zoom in very closely on a quality satellite image, one pixel is going to represent approximately 25cm of the Earth's surface. If a satellite image shows a person, then that person will be represented by approximately three pixels. Three pixels is not much to go on, but if you combine this rough representation of a person with their shadow, then you can confirm that those three pixels are indeed a person.

How to Acquire Satellite Datasets

Now I would like to jump into more specific aspects of satellite imagery — what kind of dataset it is and how you can acquire these datasets. There are two types of available satellite data. There are public datasets that are freely available, with quality that is good enough for many use cases. And there are several commercial outfits that offer even better images with more potential uses. The best-known public datasets are provided by Landsat and Sentinel. You can Google those programs right now and find the right image for you. One image is going to be about 1GB of data. It's not immediately obvious how you can work with these images, but later on I'll explain how to do it easily. You can also feel free to reach out to us for more information on working with larger commercial datasets. There are plenty of commercial companies acquiring satellite images. Commercial datasets are primarily provided by Maxar, Planet Labs, Airbus Defence & Space, Imagesat and SkyWatch. A year ago, Planet Labs launched 150 satellites, each the size of a shoebox. So right now there is a huge constellation of small satellites capturing images. Currently you can get a new image every two days. Another interesting company to watch is SkyWatch. SkyWatch is a hub for satellite images. They gather images from all of the other providers – they don't have their own satellites. SkyWatch is a good place to find decent prices for commercial satellite images. I am often asked about image prices.
The prices range from a few dollars for a single image to ~$1000 for the highest possible quality image. So if you want to identify people in a lot of images, or you need a consistent and precise historical record for research, it is going to be quite expensive for you at the moment. However, given how fast the technology is progressing, the prices should decline in the future. In a sense we are at a moment where there is a wave of new satellite image technology coming. The wave hasn't reached its peak yet. If you start researching right now, you will be on top of the wave when satellite images are cheap and available. Now is the perfect time to start playing with satellite image datasets.

Satellite Images: Spatial and Temporal Resolution

When selecting datasets, the first consideration is image resolution. The bigger the resolution, the more details we're able to see. But there are some tradeoffs, which we'll discuss soon. In this plot you can see how spatial resolution has changed over the years. We started with 100m in 1970, and now we're down to 25-30 centimeters.

Spatial and Temporal Resolution

Spatial resolution is not the only resolution we need to consider when designing solutions based on satellite imagery. Equally important, and sometimes even more important, is the temporal resolution. How often do we get a picture of a given area? What is the revisit time? Landsat, one of the publicly available satellite image datasets, gives you 30-meter resolution, and you get one picture every 14 days. Sentinel gives you 10m resolution every 5 to 7 days. So if you want to invest in your project, you have the option of much better resolution and frequency of images.

Satellite Images: Layers of Information

The way that sensors work in satellites is a really exciting topic. When you think about a satellite image, it's more than just taking a picture with a normal camera. Humans are able to decode red, green, and blue.
But a satellite can decode much more electromagnetic information than that. Some satellites have 12 sensors, which means that you get an image that has 12 layers of information.

Satellite images can contain more layers of information than typical images

For us, an image is just a matrix that has values for red, green, and blue, but from the satellite you get many more values that the human eye is not able to process. For example, with a satellite image you can have an infrared channel, which can be used to detect the health of vegetation. So this is completely new data that one would not be able to detect with the naked eye. The infrared channel reflects differently from the chlorophyll in the plants, allowing for detection of sick plants from space. There is plenty more you can do with these extra layers of information. For instance, you can also detect moisture levels on the surface of the planet, which cannot be done very easily with standard visual color information. There is also radar technology. Many of you know LIDAR, which tells you the height of a given surface. It is important to note that clouds are a huge problem when it comes to satellite images. There are plenty of people working on algorithms to eliminate the cloud problem in satellite imaging, but there is no ideal solution yet. Radar technology allows you to look through clouds, but you don't get all of the other layers of information that I mentioned above. You only really get quality information about elevation. On the right we have a visible-band image of a certain area in Sumatra, Indonesia. As you can see, our view is obstructed by clouds. Now, on the left we see the same area in a radar image. We now have more and more radar images available, which is useful because radar can see through clouds. This can be crucial in many cases.
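The vegetation-health idea above is commonly computed as NDVI (normalized difference vegetation index) from the near-infrared and red bands. A minimal NumPy sketch with made-up reflectance values (a real pipeline would read these bands from the satellite product, e.g. a GeoTIFF):

```python
import numpy as np

# Hypothetical reflectance values for the red and near-infrared bands
# of a tiny 2x2 scene; real values come from the satellite's sensors.
red = np.array([[0.10, 0.12], [0.40, 0.08]])
nir = np.array([[0.50, 0.55], [0.42, 0.45]])

# NDVI = (NIR - Red) / (NIR + Red); healthy vegetation reflects strongly
# in near-infrared, so values near +1 indicate dense, healthy plants.
ndvi = (nir - red) / (nir + red)
```

Bare soil or stressed plants reflect relatively less near-infrared, pushing NDVI toward zero, which is how sick vegetation can be flagged from space.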
LIDAR mitigates cloud cover in a satellite image

One thing to keep in mind when you choose a data source — depending on your project, you may not necessarily need the best possible resolution. You might want to experiment with the free images at a resolution of 10m or 30m. You may also investigate what the image provider actually gives you. Some platforms do some of the pre-processing work for you. For example, I mentioned that one image can include 1GB of data. You can actually ask some providers to cut the image to one particular shapefile, and in return you'll get just 1MB of data — a small image that consists only of the area that you wanted. This can be very helpful if you're working with a large number of images.

Conclusion

I hope you now have a sense of the many current and developing options for satellite imagery. In the past, such datasets were accessible to only a select few. Now there are many free and commercial platforms at your disposal. You can leverage temporal resolution, spatial resolution, and a dozen bands of the electromagnetic spectrum to aid in your projects. On top of that, the prices of acquiring satellite images have fallen significantly, as have the prices and availability of the tools that will allow you to analyze these images. For my next article on satellites, we will further explore how to use satellite images in practice, and I will explain why R is an excellent tool for analyzing satellite images. I will share our experiences of trial and error with satellite images to save you time and effort.

References

Public Data Sets

Commercial Data Sets

Thanks for reading! For more, follow me on LinkedIn. Follow Appsilon Data Science on Social Media
Article How to Acquire Large Satellite Image Datasets for Machine Learning Projects comes from Appsilon Data Science | End to End Data Science Solutions. To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End to End Data Science Solutions.
Version 0.4.0 of nnetsauce, with fruits and breast cancer classification

Posted: 27 Feb 2020 04:00 PM PST [This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers].

English version / Version en français

English version

A new version of nnetsauce, version 0.4.0, is now available on Pypi and for R. As usual, you can install it on Python using the following commands (command line): And if you're using R, it's still (R console): The R version may be slightly lagging behind on some features; feel free to signal it on GitHub or contact me directly. This new release is accompanied by a few goodies:

1) New features, detailed in the changelog.
2) A refreshed web page containing all the information about package installation, use, the interface's work-in-progress documentation, and contribution to package development.
3) A specific RSS feed related to nnetsauce on this blog (there's still a general feed containing everything here).
4) A working paper related to Bayesianrvfl2Regressor, Ridge2Regressor, Ridge2Classifier and Ridge2MultitaskClassifier: Quasi-randomized networks for regression and classification, with two shrinkage parameters. About Ridge2Classifier in particular, you can also consult this other post.

Among nnetsauce's new features, there's a new model class called MultitaskClassifier, briefly described in the first paper from point 4). It is a multitask classification model based on regression models, with shared covariates. What does that mean? Imagine that we have 4 fruits at our disposal, and we would like to classify them as avocados (is an avocado a fruit?), apples or tomatoes, by looking at their color and shape. What we called covariates before in the model description are the color and shape, also known as explanatory variables or predictors. The column containing the fruit names in the figure – on the left – is the response, the variable that MultitaskClassifier must learn to classify (which would typically contain many more observations). This raw response is transformed into an encoded response – on the right. Instead of one response vector, we now have three different responses. And instead of one classification problem on one response, three different two-class classification problems on three responses: is this fruit an apple or not? Is this fruit a tomato or not? Is this fruit an avocado or not? All these three problems share the same covariates: the color and shape. MultitaskClassifier can use any regression model (that is, a statistical learning model for continuous responses) to solve these three problems simultaneously, with the same regression model used for all three – which is a priori a relatively strong hypothesis. The regression model's predictions on each response are then interpreted as raw probabilities that the fruit belongs, or not, to each class.
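The one-hot encoding and per-class regression idea can be illustrated in a few lines. This is my own toy sketch of the multitask scheme, not nnetsauce's actual implementation: the fruit response becomes three 0/1 columns, one linear regression is fit per column on the shared covariates, and the predicted class is the one with the largest raw score.

```python
import numpy as np

# Made-up data: covariates are (color, shape) encoded numerically.
X = np.array([[0.90, 0.20],   # red-ish, round   -> tomato
              [0.80, 0.80],   # red-ish, oblong  -> apple
              [0.20, 0.90]])  # green, oblong    -> avocado
labels = ["tomato", "apple", "avocado"]

# One-hot encode the response: one 0/1 column per class.
classes = sorted(set(labels))
Y = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])

# Fit one linear regression per column, all sharing the same covariates
# (plus an intercept). lstsq solves all three tasks at once.
X1 = np.hstack([np.ones((len(X), 1)), X])
coefs, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# Raw "probabilities" per class; the predicted class is the argmax.
scores = X1 @ coefs
predicted = [classes[i] for i in scores.argmax(axis=1)]
```

In nnetsauce, any regression model can play the role that plain least squares plays here, and the same fitted model is shared across the three binary tasks.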
We now use MultitaskClassifier on breast cancer data, as we did in this post for AdaBoostClassifier. The R version of this code would be almost identical, essentially replacing the "."'s with "$"'s. Import packages: Model fitting on the training set: These results can be found in nnetsauce/demo/. MultitaskClassifier's accuracy on this dataset is 99.1%, and other indicators are also around 99% on these data. Let's now visualize how observations are correctly or incorrectly classified as a function of their true class, as we did in a previous post on the same dataset. In this case, with MultitaskClassifier and no advanced hyperparameter tweaking, only one patient out of 114 is misclassified. A robust way to understand MultitaskClassifier's accuracy on this dataset with these same parameters is to repeat the same procedure for several random reproducibility seeds (see the code; the training and test sets change randomly when the seed changes). We obtain the results below for 100 reproducibility seeds. The accuracy is always at least 90%, mostly 95%, and quite often higher than 98% (with no advanced hyperparameter tweaking).

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs' interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.