At Jumping Rivers we run a lot of R courses. Some of our most popular courses revolve around the tidyverse, in particular our Introduction to the tidyverse and our more advanced mastering course. We have even trained over 200 data scientists at the NHS; see our case study for more details.
As you can imagine, when giving an on-site course, a reasonable question is what version of R is required for the course. We always have an RStudio cloud back-up, but it’s nice for participants to run code on their own laptop. If participants bring their own laptop, it’s trivial for them to update R. But many of our clients are financial institutions or government departments, where an upgrade is a non-trivial process.
So, what version of R is required for a tidyverse course? For the purposes of this blog post, we will store the names of the packages of interest in a character vector called pkgs.
The code below will work with any packages of interest. In fact, you can set pkgs to all R packages on CRAN; it just takes a while.
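A minimal sketch of such a vector follows. The particular package names are an assumption chosen for illustration (a handful of tidyverse packages), not necessarily the exact list used in this analysis.

```r
# Assumed, illustrative list of packages; swap in any package names you like.
pkgs <- c("ggplot2", "tibble", "tidyr", "readr", "purrr", "dplyr")

# To analyse everything on CRAN instead (slow, needs network access):
# pkgs <- rownames(available.packages())
```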
Package descriptions
In R, there is a handy function called available.packages() that returns a matrix of details corresponding to packages currently available at one or more repositories. Unfortunately, the format isn’t initially amenable to manipulation. For example, consider the entry for the readr package.
I immediately converted the data to a tibble, as that
- changed the rownames to a proper column
- changed the matrix to a data frame/tibble, which made selecting easier
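As a rough sketch of that conversion: the snippet below uses a base R data frame in place of a tibble so that it has no package dependencies, and a tiny hand-made matrix in place of the real available.packages() output, which needs network access. The names toy_db, desc_to_df(), pkgA and pkgB are all hypothetical and exist only for illustration.

```r
# Toy stand-in for available.packages(): a character matrix with one row
# per package and the package names as row names. pkgA/pkgB are made up.
toy_db <- matrix(
  c("pkgA", "1.0.0", "R (>= 3.0.2)", "pkgB (>= 0.2), methods",
    "pkgB", "0.3.1", "R (>= 3.1.0)", NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB"),
                  c("Package", "Version", "Depends", "Imports"))
)

# Convert the matrix to a data frame and copy the row names into a column.
desc_to_df <- function(db) {
  df <- as.data.frame(db, stringsAsFactors = FALSE)
  df$rowname <- rownames(db)
  rownames(df) <- NULL
  df
}

desc <- desc_to_df(toy_db)

# Selecting a single package is now an ordinary row filter.
pkga_desc <- desc[desc$Package == "pkgA", ]
pkga_desc$Depends
```

With real data, toy_db would be replaced by available.packages(), and tibble::as_tibble() could take the place of the base R conversion in a tidyverse pipeline.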
Looking at read_desc, we see that it has a minimum R version, but because that version is stored as free text inside the Depends field, it would be difficult to compare it with other R versions. The list of imports has a similar problem: it is a single comma-separated string. For example, with the data in this format, it would be difficult to select packages that depend on tibble.
Tidy package descriptions
We currently have five dependency columns
- Imports, Depends, LinkingTo, Suggests, Enhances
Each entry in these columns contains multiple packages, with possible version numbers. To tidy the data set I’m going to create four new columns:
- depend_type: one of Imports, Depends, Suggests, Enhances and LinkingTo
- depend_package: the package name
- depend_version: the package version
- depend_condition: something like equal to, less than or greater than
The hard work is done by the function clean_dependencies(), which is at the end of the blog post. It essentially just does a bit of string manipulation to separate out the columns. The function works per package, so we iterate over packages using map_df().
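To give a flavour of the string manipulation involved, here is a minimal base R sketch of a clean_dependencies()-style function. It is an assumed reimplementation for illustration, not the function used for this post; the argument names, the toy packages pkgA and pkgB, and the use of do.call(rbind, Map()) in place of map_df() are all assumptions.

```r
# Split one dependency field, e.g. "R (>= 3.0.2), pkgB (>= 0.2), methods",
# into one row per dependency. A sketch only; not the function from the post.
clean_dependencies <- function(package, depend_type, field) {
  empty <- data.frame(
    package = character(0), depend_type = character(0),
    depend_package = character(0), depend_condition = character(0),
    depend_version = character(0), stringsAsFactors = FALSE
  )
  if (is.na(field) || !nzchar(trimws(field))) return(empty)

  # Dependencies are comma separated; strip surrounding whitespace.
  parts <- trimws(strsplit(field, ",", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0) return(empty)

  # The package name is everything before an optional "(...)" constraint.
  depend_package <- trimws(sub("\\(.*$", "", parts))

  # The constraint looks like "(>= 1.2.3)"; NA when none is given.
  has_constraint <- grepl("\\(", parts)
  constraint <- ifelse(has_constraint,
                       gsub("^.*\\(|\\).*$", "", parts), NA_character_)
  depend_condition <- trimws(gsub("[^<>=!]", "", constraint))
  depend_version <- trimws(gsub("[<>=!]", "", constraint))

  data.frame(
    package = package, depend_type = depend_type,
    depend_package = depend_package,
    depend_condition = depend_condition,
    depend_version = depend_version,
    stringsAsFactors = FALSE
  )
}

# Iterate over (made-up) packages and stack the per-package results.
fields <- c(pkgA = "R (>= 3.0.2), pkgB (>= 0.2), methods",
            pkgB = "R (>= 3.1.0)")
tidy_deps <- do.call(
  rbind,
  unname(Map(clean_dependencies, names(fields), "Depends", fields))
)
rownames(tidy_deps) <- NULL
tidy_deps
```

Dependencies without a version constraint, such as methods above, end up with NA in both depend_condition and depend_version.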
After this step, we now have a tibble with tidy columns, and we can see the minimum R version the package authors have indicated for their package. However, this isn’t the minimum version actually required. Each package imports a number of other packages, e.g. readr imports 4 packages, and each of those packages also imports other packages. Clearly, the minimum version required to install a package such as dplyr is the maximum R version over the package itself and everything it imports, directly or indirectly.
Some interesting things
Before we work out the maximum R version for each set of imports, we should first investigate how many imports each package has, using a bit of dplyr.
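A minimal sketch of that count, assuming a tidy dependency table with the package, depend_type and depend_package columns. The table below is made up (pkgA and pkgB are hypothetical packages) so that the snippet runs without network access.

```r
library(dplyr)

# Toy tidy dependency table with made-up packages.
tidy_deps <- data.frame(
  package        = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  depend_type    = c("Depends", "Imports", "Imports", "Depends", "Imports"),
  depend_package = c("R", "pkgB", "methods", "R", "utils"),
  stringsAsFactors = FALSE
)

# Number of imports per package, largest first (the count lands in column n).
n_imports <- tidy_deps %>%
  filter(depend_type == "Imports") %>%
  count(package, sort = TRUE)

n_imports
```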
Using histograms we get a better idea of the overall numbers. Note that here we’re using the ipsum theme from the hrbrthemes package.
Maximum overall imports
As I’ve mentioned, we need to obtain not just the imported packages, but also their dependencies. Fortunately, the tools package comes to our rescue.
Using the package_dependencies() function, we simply
- Obtain a list of dependencies for a given package
- Extract the maximum version of R for all packages in the list
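The first of these steps can be sketched as follows. To keep the snippet runnable offline, it passes a small hand-made matrix as the db argument instead of the real available.packages() output; toy_db and the packages pkgA, pkgB and pkgC are hypothetical.

```r
# Toy package database (made-up packages): pkgA imports pkgB,
# which in turn imports pkgC.
toy_db <- matrix(
  c("pkgA", "R (>= 3.0.2)", "pkgB", NA,
    "pkgB", "R (>= 3.1.0)", "pkgC", NA,
    "pkgC", NA,             NA,     NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB", "pkgC"),
                  c("Package", "Depends", "Imports", "LinkingTo"))
)

# All direct and indirect dependencies of pkgA; the result is a named
# list with one element per requested package.
deps <- tools::package_dependencies(
  "pkgA",
  db = toy_db,
  which = c("Depends", "Imports", "LinkingTo"),
  recursive = TRUE
)[["pkgA"]]

deps
```

For real packages, db = available.packages() would replace toy_db. Setting recursive = TRUE is what makes the function follow dependencies of dependencies.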
At the end of this post, there are two helper functions:
- max_r_version(): takes a vector of R versions and returns the maximum version. E.g. max_r_version(c("3.2.0", "3.3.2", "3.2.0")) returns 3.3.2
- get_r_ver(): calls package_dependencies() and returns the maximum R version out of all of the dependencies.
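A minimal sketch of what these two helpers might look like is given below. It is an assumed reimplementation, not the code used for this post: the extra helper extract_r_version(), the db argument, and the toy database with made-up packages are all introduced here for illustration. Versions are compared with numeric_version() rather than as strings, because string comparison would rank "3.9.0" above "3.10.0".

```r
# Largest version in a character vector of versions; NA if none are given.
max_r_version <- function(versions) {
  versions <- versions[!is.na(versions)]
  if (length(versions) == 0) return(NA_character_)
  as.character(max(numeric_version(versions)))
}

# Pull the version out of "R (>= x.y.z)" constraints in Depends strings.
# (Hypothetical helper, not mentioned in the post.)
extract_r_version <- function(depends) {
  depends <- depends[!is.na(depends)]
  if (length(depends) == 0) return(character(0))
  pattern <- paste0("(^|,)[[:space:]]*R[[:space:]]*",
                    "\\(>=?[[:space:]]*[0-9.]+[[:space:]]*\\)")
  hits <- regmatches(depends, regexpr(pattern, depends))
  gsub("[^0-9.]", "", hits)
}

# Maximum stated R version over a package and all of its dependencies.
get_r_ver <- function(pkg, db) {
  deps <- tools::package_dependencies(
    pkg, db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )[[pkg]]
  in_tree <- db[, "Package"] %in% c(pkg, deps)
  max_r_version(extract_r_version(db[in_tree, "Depends"]))
}

# Toy database with made-up packages: pkgA -> pkgB -> pkgC.
toy_db <- matrix(
  c("pkgA", "R (>= 3.0.2)", "pkgB", NA,
    "pkgB", "R (>= 3.1.0)", "pkgC", NA,
    "pkgC", NA,             NA,     NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB", "pkgC"),
                  c("Package", "Depends", "Imports", "LinkingTo"))
)

max_r_version(c("3.2.0", "3.3.2", "3.2.0"))  # "3.3.2"
get_r_ver("pkgA", toy_db)
```

In the toy database pkgA only states R 3.0.2, but it imports pkgB, which states R 3.1.0, so the version returned for pkgA is driven by pkgB.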
Also, we have simplified some of the details, so what we’ve done isn’t quite right; it’s more of a first approximation. See the end of the post for details.
Now that’s done, we can pass in the list of tidyverse packages and compare their stated R version with the R version actually required.
We then select the packages where there is a difference.
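The comparison and selection can be sketched like this. The table is entirely hypothetical (made-up packages and versions) and only shows the shape of the step; the column names stated and required are assumptions.

```r
# Hypothetical table: the R version each package states in its own
# DESCRIPTION versus the maximum over its whole dependency tree.
versions <- data.frame(
  package  = c("pkgA", "pkgB", "pkgC"),
  stated   = c("3.0.2", "3.1.0", "3.2.0"),
  required = c("3.1.0", "3.1.0", "3.2.0"),
  stringsAsFactors = FALSE
)

# Compare as versions rather than as strings, then keep the rows that differ.
differs <- numeric_version(versions$stated) != numeric_version(versions$required)
diff_pkgs <- versions[differs, ]

diff_pkgs
```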
The largest difference in R versions is for readr (which feeds into the tidyverse). readr claims to need only R version 3.0.2, but a bit more investigation shows that readr depends on the tibble package, which requires R version 3.1.0. That said, it is worth noting that R 3.1.0 is fairly old!
Take away lessons
The takeaway message is that dependencies matter. A single change affects everything in the package dependency tree. The other lesson is that the tidyverse team have been very careful about their dependencies. In fact, all of their packages are checked on R 3.1, 3.2, …, devel.
Simplifications: skipping package versions
In this analysis, we’ve completely ignored package version numbers and always assumed we need the latest version of a package. This clearly isn’t correct. To do this analysis properly, we would need the historical DESCRIPTION files for each package and use those to determine versions.
Thanks to Jim Hester who spotted an error in a previous version of this post.
Functions
The post What R version do you really need for a package? appeared first on Jumping Rivers.