Our R package roundup

It’s that time of the year again when one eats too much and gets in a reflective mood! 2015 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly about which R package was the best/neatest/most fun/most useful/most whatever of this year!

Since we are in a festive mood, we decided we would not fight it out but rather present our top five new R packages: a purely subjective list of packages we (and Chuck Norris) approve of.

But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.

Let’s start with some factual data before we go into our personal favourites of 2015. We’ll pull the titles of the new 2015 R packages from cranberries, and parse the CRAN downloads per day using the cranlogs package.

Using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or, it could have the complication that younger package releases are still on the early “hump” of their download curve (let’s assume downloads spike shortly after release and then decay, as most of these things do), thus skewing the data in favour of younger releases. I don’t know, and this is an interesting question I think we’ll tackle in a later blog post…

For now, let’s just assume that average downloads per day is a relatively stable metric to gauge package success with. We’ll grab the titles of the newly released packages using rvest:

```
library(rvest)

berries <- read_html("http://dirk.eddelbuettel.com/cranberries/2015/")
titles <- berries %>% html_nodes("b") %>% html_text()
new <- titles[grepl("^New package", titles)] %>%
  gsub("^New package (.*) with initial .*", "\\1", .) %>%
  unique()
```

and then lapply() over these titles, querying CRAN for each package and computing its average downloads per day (we use pblapply() from the pbapply package to get a progress bar):

```
library(cranlogs)
library(pbapply)

logs <- pblapply(new, function(x) {
  down <- cran_downloads(x, from = "2015-01-01")$count
  if (sum(down) > 0) {
    # only count days from the first recorded download onwards
    public <- down[which(down > 0)[1]:length(down)]
  } else {
    public <- 0
  }
  return(data.frame(package = x, sum = sum(down), avg = mean(public)))
})

logs <- do.call(rbind, logs)
```

With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2015, by average number of daily downloads:
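
The plot boils down to something like this minimal sketch (the exact code, with proper labels and theming, is on GitHub as noted below):

```
library(dplyr)
library(ggplot2)

logs %>%
  arrange(desc(avg)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(package, avg), y = avg)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "average downloads per day")
```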

The full code is available on GitHub, of course.

As we can see, the main bias does not come from our choice of ranking metric, but from the fact that some packages are more “under the hood” and are pulled in by many other packages as dependencies, thus inflating their download statistics.

The top four packages (rversions, xml2, git2r, praise) are all technical packages. Although I have to say I did not know praise before, and it looks like a very fun package indeed: you can automatically add randomly generated praise to your output! Fun times ahead, I’d say.
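
Usage looks roughly like this (the ${...} placeholders are praise’s templating syntax; output is randomly generated, so yours will differ):

```
library(praise)

praise()
#> [1] "You are awesome!"

praise("${exclamation}! Your plot is ${adjective}!")
#> [1] "yay! Your plot is groovy!"
```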

Excluding these, the clear winners among “frontline” packages are readxl and readr, both packages by Hadley Wickham dealing with importing data into R. Well-deserved, in our opinion. These are packages nearly everybody working with data will need on a daily basis. Although one hopes that contact with Excel sheets is kept to a minimum to preserve one’s sanity, and thus readxl is needed less often in daily life!

The next two packages (DiagrammeR and visNetwork) relate to network diagrams, something that seems to be en vogue currently. R is getting some much-needed features in this area, it seems.

plotly is the R interface to the recently open-sourced, popular plot.ly JavaScript library for interactive charts. A well-deserved top-ranking entry! We also see packages that build on and improve the ever-popular shiny package (DT and shinydashboard), leaflet for interactive mapping, and packages around Stan, the Bayesian statistical inference language (rstan, StanHeaders).

But now, this blog’s authors’ personal top five of new R packages for 2015:

readr (safferli’s pick)

readr is our package pick that also made it into the top downloads metric above. Small wonder, as it’s written by Hadley and aims to make importing data easier and, especially, more consistent. It is thus immediately useful for most, if not all, R users out there, and it also received a tremendous “fame kickstart” from Hadley’s reputation within the R community. For extremely large datasets I still like to use data.table’s fread() function, but for anything else the new read_* functions make your life considerably easier. They’re faster than base R, and no longer having to worry about stringsAsFactors alone is a godsend.
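
A quick taste (read_csv() happily accepts literal data, which makes for a self-contained example):

```
library(readr)

df <- read_csv("fruit,count\napple,3\nbanana,5\n")
df$fruit
#> [1] "apple"  "banana"   # character, not factor -- no stringsAsFactors worries
```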

Since the package is written by Hadley, it is not only great but also comes with fantastic documentation. If you’re not using readr currently, you should head over to the package README and check it out.

infuser (Yuki’s pick)

R already has many template engines, but infuser is simple yet quite useful if you work on data exploration, visualisation, and statistics in R and deploy your findings in Python, reusing the same SQL queries with as similar a syntax as possible.

Code transition from R to Python is now quick and easy with infuser:

```
# R
library(infuser)

template <- "SELECT {{var}} FROM {{table}} WHERE month = {{month}}"

query <- infuse(template, var = "apple", table = "fruits", month = 12)
cat(query)
# SELECT apple FROM fruits WHERE month = 12
```

```
# Python
template = "SELECT {var} FROM {table} WHERE month = {month}"
query = template.format(var="apple", table="fruits", month=12)
print(query)
# SELECT apple FROM fruits WHERE month = 12
```

googlesheets (Kirill’s pick)

googlesheets by Jennifer Bryan finally allows me to write output directly to Google Sheets, instead of exporting to xlsx and then pushing the file (mostly manually) to Google Drive. At our company we use Google Drive as a data communication and storage tool for management, so outputting data science results to Google Sheets is important. We even have some small reports stored in Google Sheets. The package makes creating, finding, filling, and reading Google Sheets remarkably simple.
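
A minimal sketch of the workflow (the sheet title below is made up; gs_auth() walks you through authorising your Google account):

```
library(googlesheets)

gs_auth()                                        # authorise with Google
ss <- gs_new("ds-results", input = head(iris))   # create a sheet and fill it
gs_read(ss)                                      # read it back into R
```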

AnomalyDetection (Kirill’s second pick. He gets to pick two since he is so indecisive)

AnomalyDetection was developed by Twitter’s data scientists and introduced to the open-source community in the first week of the year. It is a very handy, beautiful, well-developed tool for finding anomalies in data. It is very important for a data scientist to be able to find anomalies fast and reliably, before real damage occurs. The package allows you to get a good first impression of what is going on in your KPIs (Key Performance Indicators) and react quickly. Building alerts with it is a no-brainer if you want to monitor your data and assure data quality.
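
The example from the package README gives the flavour (the package lives on GitHub rather than CRAN):

```
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)   # example time series shipped with the package
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)
res$plot         # the series with detected anomalies highlighted
```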

emoGG (Jess’s pick)

emoGG definitely falls into the “most whatever” category of R packages of the year. What this package does is fairly simple: it allows you to display emojis in your ggplot2 plots, either as plotting symbols or as a background. Under the hood, it adds a geom_emoji layer to your ggplot2 plots, in which you specify one or more emoji codes corresponding to the emojis you wish to plot. emoGG can be used to make visualisations more compelling and help plots convey more meaning, no doubt. But before anything else, it’s fun, and a must-have for an avid emoji fan like me.
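
A minimal example in the spirit of the package README (“1f337” is the emoji code for a tulip; the package installs from GitHub):

```
# devtools::install_github("dill/emoGG")
library(ggplot2)
library(emoGG)

# plot tulips instead of points
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_emoji(emoji = "1f337")
```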