[R-bloggers] htmlunitjars Updated to 2.34.0 (and 3 more aRticles)

htmlunitjars Updated to 2.34.0

Posted: 28 Feb 2019 03:10 PM PST

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The in-dev htmlunit package for javascript-"enabled" web-scraping (no Selenium, Splash or headless Chrome required) relies on the HtmlUnit library. That library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-"enabled" sites. The Chrome emulation is now also on par with the Chrome 72 series (my Chrome beta is at 73.0.3683.56, so it's very close to current).

In reality, the update was to the htmlunitjars package where the main project JAR and dependent JARs all received a refresh.

The README and tests were all re-run on both packages and Travis is happy.

If you've got a working rJava installation (aye, it's 2019 and that's still "a thing") then you can just do:

install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is/")  

to get them installed and start playing with the DSL or work directly with the Java classes.

FIN

As usual, use your preferred social coding site to log feature requests or problems.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

EARL London early bird tickets now on sale

Posted: 28 Feb 2019 07:23 AM PST

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Early bird tickets for the Enterprise Applications of the R Language Conference are now on sale!

The EARL Conference is in its sixth year. It's a cross-sector conference that focuses on the commercial use of the R programming language.

Take a look at our highlights from last year:

We are busy putting together another brilliant agenda, but there's still time to submit your abstract before 31 March.

We hope to hear from you!

Keep up-to-date on all things EARL via Twitter or our EARL mailing list.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

drat All The 📦! : Enabling Easier Package Discovery and Installation with Your Own CRAN-like Repo for Your Packages

Posted: 28 Feb 2019 06:34 AM PST

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I've got a work-in-progress drat-ified CRAN-like repo for (eventually) all my packages over at CINC ("CINC is not CRAN" and it also sounds like "sync"). This is in parallel with a co-location/migration of all my packages to SourceHut (just waiting for the sr.ht alpha API to be baked) and a self-hosted public Gitea instance. Everything will still be on that legacy social coding site y'all use but the ultimate goal is to have all installs be possible via the CINC repository (i.e. install.packages()) or via a remotes::install_git() install from this standalone or any social coding site.
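As a rough illustration of the install.packages() route, one option is to register CINC as an extra repository for the session. This is only a sketch: the CINC URL is the one used in the install command earlier in this digest, while the git URL in the last comment is a made-up placeholder, not a real repository.

```r
# Keep whatever repositories are already configured and add CINC alongside them.
# "https://cinc.rud.is/" is the repository URL from the htmlunitjars post.
old_repos <- getOption("repos")
options(repos = c(CINC = "https://cinc.rud.is/", old_repos))

# With that in place, a plain install.packages() call also consults CINC:
# install.packages("htmlunit")

# The git-based route, using a placeholder URL (hypothetical, not a real repo):
# remotes::install_git("https://git.example.com/USER/PKG.git")
```

The install calls are left commented out because they need network access (and, for htmlunit, a working rJava setup).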

I'll eventually publish the workflow but the idea is to customize a pkgdown YAML file in each package repo so the navbar has links back to CINC and other pages (this will take some time as I seem to have made a lot of little packages over the years) and then to add a package to the CINC repo:

The above processes helped shine a light on some bad README practices I've had and also about how to make it a bit easier (in the future) to install C[++]-backed packages. Speaking of READMEs, I also need to get all the READMEs updated to use either install.packages() from CINC or a remotes install from Gitea.

Another couple of goals are to possibly get binary package versions added (though that's going to be an interesting orchestration exercise) and see if I can't get some notary concepts implemented.

It's actually been a fun mini-project since the drat part is as simple as drat::insertPackage('PKG', '/path/to/cinc') (#ty Dirk!). I still need to think through some logic around maintaining Archive versions and also deleting packages, which drat doesn't do yet but which is as simple as removing tarballs and running tools::write_PACKAGES().
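The "remove tarballs and re-index" step could be sketched roughly as follows. This is a minimal sketch under assumptions: remove_from_repo() is a hypothetical helper name (it is not part of drat), and it assumes source packages live under src/contrib, the usual CRAN-style layout.

```r
# Hypothetical helper (not part of drat): delete every source tarball for
# `pkg` from a CRAN-style repo and regenerate the PACKAGES index files.
remove_from_repo <- function(pkg, repo_dir) {
  contrib <- file.path(repo_dir, "src", "contrib")
  files <- list.files(contrib, pattern = "\\.tar\\.gz$")

  # Match on the literal "<pkg>_" prefix so "foo" does not also hit "foobar"
  hits <- files[startsWith(files, paste0(pkg, "_"))]
  file.remove(file.path(contrib, hits))

  # Rebuild the index from whatever tarballs remain
  tools::write_PACKAGES(contrib, type = "source")

  # Return the names of the removed tarballs without printing them
  invisible(hits)
}
```

A call such as remove_from_repo("PKG", "/path/to/cinc") would then mirror the insertPackage() call above. Note that this removes all versions of the package, so it does not address the Archive question.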

As an aside, I also drat-ified all our $WORK packages and made that repo work-internally-accessible via static S3 web hosting. At $0.023 USD per GB (per-month) for just hosting the objects and $0.0004 USD per 1,000 GET requests (plus minimal setup charges for SSL) it's super cheap and also super-easy to maintain. Drop a note in the comments if you're interested in more details of the S3 drat setup.
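To put those rates in perspective, here is a small back-of-the-envelope calculation. The rates are the ones quoted above; the workload (2 GB stored, 50,000 GET requests a month) is a made-up example, and the one-off SSL setup charges are ignored.

```r
# Estimate monthly S3 cost from the two rates quoted in the post.
s3_monthly_cost <- function(gb_stored, get_requests,
                            usd_per_gb = 0.023,
                            usd_per_1000_get = 0.0004) {
  storage  <- gb_stored * usd_per_gb
  requests <- (get_requests / 1000) * usd_per_1000_get
  storage + requests
}

# Hypothetical workload: 2 GB of tarballs, 50,000 GET requests per month
s3_monthly_cost(gb_stored = 2, get_requests = 50000)
#> [1] 0.066
```

That is 0.046 USD for storage plus 0.02 USD for requests, so well under a dime a month for this assumed workload.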

FIN

After a few more weeks' baking period, the self-hosted Gitea and CINC sites will have all non-error web-logging disabled, and error logs won't save IP addresses or referrers (I welcome anyone who wants to third-party audit the nginx configs). Another goal is to help folks not be a product for tech startups or giant, soulless, global multi-national companies with a history of being horrendously evil.

Be on the lookout for a full writeup with code in the coming weeks.

P.S.

For Safari-users on 10.14+ I've made some tweaks to the "batman mode" version of the site. If you do use Safari (but…why?!) and have any issues with readability in "dark mode" just drop a note in the comments and I'll see what I can do.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

handlr: convert among citation formats

Posted: 26 Feb 2019 04:00 PM PST

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Citations are a crucial piece of scholarly work. They hold metadata on each scholarly work, including what people were involved, what year the work was published, where it was published, and more. The links between citations facilitate insight into many questions about scholarly work.

Citations come in many different formats including BibTeX, RIS, JATS, and many more. This is not to be confused with citation styles such as APA vs. MLA and so on.

Those that deal with or do research on citations often get citations in one format (e.g., BibTeX), but they would like them in a different format (e.g., RIS). One recent tool that does a very nice job of this is bolognese from Martin Fenner of DataCite. bolognese is written in Ruby. I love the Ruby language, but it does not play nicely with R; thus it wasn't an option to wrap or call bolognese from R.

handlr is a new R package modeled after bolognese.

The original motivation for starting handlr comes from this thread in the rorcid package, in which the citations retrieved from source A had mangled characters with accents, but source B gave un-mangled characters but in the wrong format. Thus the need for a citation format converter.

handlr converts citations from one format to another. It currently supports reading the following formats:

And the following writers are supported:

handlr has not yet focused on performance, but we will do so in future versions.

Installation

Install the latest from CRAN

install.packages("handlr")  

Some binaries are not up yet on CRAN – you can also install from GitHub.
There's no compiled code though, so source install should work.
I somehow forgot to export the print.handl() function in the CRAN version, so
if you try this with the CRAN version you won't get the compact output seen below.

remotes::install_github("ropensci/handlr")  

Load handlr

library(handlr)  

The R6 approach

There's a single R6 interface to all readers and writers

grab an example file that comes with the package

z <- system.file("extdata/citeproc.json", package = "handlr")  

initialize the object

x <- HandlrClient$new(x = z)
x
#>
#>   doi:
#>   ext: json
#>   format (guessed): citeproc
#>   path: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/handlr/extdata/citeproc.json
#>   string (abbrev.): none

read the file

x$read(format = "citeproc")
x
#>
#>   doi:
#>   ext: json
#>   format (guessed): citeproc
#>   path: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/handlr/extdata/citeproc.json
#>   string (abbrev.): none

inspect the parsed content

x$parsed
#>
#>   from: citeproc
#>   many: FALSE
#>   count: 1
#>   first 10
#>     id/doi: https://doi.org/10.5438/4k3m-nyvg

write out to bibtex. by default does not write to a file; you can
also specify a file path.

cat(x$write("bibtex"), sep = "\n")
#> @article{https://doi.org/10.5438/4k3m-nyvg,
#>   doi = {10.5438/4k3m-nyvg},
#>   author = {Martin Fenner},
#>   title = {Eating your own Dog Food},
#>   journal = {DataCite Blog},
#>   pages = {},
#>   publisher = {DataCite},
#>   year = {2016},
#> }

Function approach

If you prefer not to use the above approach, you can use the various
functions that start with the format (e.g., bibtex) followed by
_reader or _writer.

Here, we play with the bibtex format.

Get a sample file and use bibtex_reader() to read it in.

z <- system.file('extdata/bibtex.bib', package = "handlr")
bibtex_reader(x = z)
#>
#>   from: bibtex
#>   many: FALSE
#>   count: 1
#>   first 10
#>     id/doi: https://doi.org/10.1142%2fs1363919602000495

What this returns is a handl object, just a list with attributes.
The handl object is what we use as the internal representation that we
convert citations to and from.
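Since the handl object is the shared intermediate form, a conversion is just a reader followed by a writer. The sketch below assumes that a ris_writer() function exists, following the format_writer naming pattern described above; this text does not list the supported writers, so treat that function name as an assumption to check against the package documentation.

```r
library(handlr)

# Read the bundled BibTeX example into a handl object
z <- system.file("extdata/bibtex.bib", package = "handlr")
h <- bibtex_reader(x = z)

# Write the same citation back out as RIS
# (ris_writer() is assumed from the *_writer naming pattern)
ris <- ris_writer(h)
cat(ris, sep = "\n")
```

This mirrors the BibTeX-in, RIS-out use case that motivated the package.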

Each reader and writer supports handling many citations at once. For all
formats, this means many citations in the same file.

z <- system.file('extdata/bib-many.bib', package = "handlr")
bibtex_reader(x = z)
#>
#>   from: bibtex
#>   many: TRUE
#>   count: 2
#>   first 10
#>     id/doi: https://doi.org/10.1093%2fbiosci%2fbiw022
#>     id/doi: https://doi.org/10.1890%2f15-1397.1

To do

  • There are definitely still some improvements that need to be made to various parts of citations in some of the formats. Do open an issue/let me know if you find anything off.
  • Performance could be improved for sure
  • Problems with very large files, e.g., ropensci/handlr#9
  • Documentation, there is very little thus far

Get in touch

Get in touch if you have any handlr questions in the
issue tracker or the
rOpenSci discussion forum.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.
