Teaching R to New Users - From tapply to the Tidyverse

Roger Peng, 2018/07/12


The intentional ambiguity of the R language, inherited from the S language, is one of its defining features. Is it an interactive system for data analysis or is it a sophisticated programming language for software developers? The ability of R to cater to users who do not see themselves as programmers, but then allow them to slide gradually into programming, is an enduring quality of the language and is what has allowed it to gain significance over time. As the R community has grown in size and diversity, R’s ability to match the needs of the community has similarly grown. However, this growth has raised interesting questions about R’s value proposition today and how new users to R should be introduced to the system.

NOTE: A video of this keynote address is now available on YouTube if you would prefer to watch it instead.


If we go to the R web site in order to discover what R is all about, the first sentence we see is

R is a free software environment for statistical computing and graphics.

I haven’t been to the R web site in quite some time, but it struck me that the word “data” does not appear in that first sentence.

If we similarly travel to the tidyverse web site, we find that

The tidyverse is an opinionated collection of R packages designed for data science

It turns out that these two sentences, found on these two web sites, say a lot about the past, present, and future of R.

Intentional ambiguity

R inherits many features from the original S language developed at Bell Labs and subsequently sold as S-PLUS by Insightful (and now owned by Tibco). So it’s useful to look back at some of the S history to gain insight into R. John Chambers, one of the creators of the S language, said in “Stages in the Evolution of S” (sadly no longer available online except via the Internet Archive):

The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

Chambers was referring to the difficulty in naming and characterizing the S system. Is it a programming language? An environment? A statistical package? Eventually, it seems they settled on “quantitative programming environment”, or in other words, “it’s all the things.” Ironically, for a statistical environment, the first two versions did not contain much in the way of specific statistical capabilities. In addition to a more full-featured statistical modeling system, versions 3 and 4 of the language added the class/methods system for programming (outlined in Chambers’ Programming with Data).

In discussing the rationale for developing the S system, Chambers writes

We were looking for a system to support the research and the substantial data analysis projects in the statistics research group at Bell Labs.

He further writes

…little or none of our analysis was standard, so flexibility and the ability to program were essential from the start.

From the beginning, flexibility and the ability to program were two key needs that had to be satisfied by the S language.

R Enters the Fray

It was into this world that R came to be. As a dialect of the S language, R proved easy to adopt for previous users of S-PLUS. The fact that R was also free software didn’t hurt adoption either. But it’s worth noting that for the most part, people already had tools for analyzing data. They came in the form of SAS, Stata, SPSS, Minitab, Microsoft Excel, and my personal favorite, XLisp-Stat (thanks Luke Tierney!). But the commonly used data analysis packages had some key downsides:

  1. The graphics were too “quick and dirty” and did not allow much control over the details; they plotted the data, but that was about it;

  2. There was relatively little ability to build custom tools on top of what was available (although some capability was added to most packages later).

R solved these two key problems by providing an “environment for statistical computing and graphics”. Like with S, R provided flexibility and the ability to program whatever one needed. The graphical model was similar to the S-PLUS system, whereby total control was given to the user and data graphics could be built up piece by piece. In other words, few decisions were made by the system when creating data graphics. The user was free to make all the decisions.
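As a small illustration of that piece-by-piece style (my own sketch using the built-in airquality dataset, not an example from the original systems):

```r
# Base graphics: every element of the plot is an explicit decision
data(airquality)
with(airquality, plot(Temp, Ozone, pch = 19, col = "grey40",
                      xlab = "Temperature (deg F)", ylab = "Ozone (ppb)"))
title(main = "Ozone vs. Temperature")                 # add the title ourselves
abline(lm(Ozone ~ Temp, data = airquality), lty = 2)  # overlay a fitted line
```

Nothing appears unless you ask for it; the flip side is that every detail is under your control.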

With the non-graphics aspects of the R system, the user was similarly free to handle data however the language would allow. This feature was well-suited to the academic audience that initially adopted R for implementing new statistical methodology. Furthermore, those who already knew programming languages like C (or Lisp!) found R’s programming model very familiar and easily extended. R’s package system allowed developers to extend the existing functionality by adding new kinds of models, plots, and tools.

R for Data Analysis?

Going back to Chambers’ philosophy of S and the “user-developer” spectrum, it’s interesting to think about how R fit into this spectrum, given the population of users that were drawn to it.

While Chambers thought of S as covering the entire spectrum at the time, it seems clear in retrospect that R actually fell quite squarely on the “developer” end of the spectrum. That is, the features that made R stand apart, given the context in which R was initially released, were really developer features. The programming language and the ability to take fine control over all the details are things that would appeal to people who are developing new things.

What this meant is that initially, R was not a great system for doing data analysis. The truth is it was harder to do data analysis in R than it was in most other systems. Programs like SAS, Stata, and SPSS had been designed as interactive data analysis environments and had simple ways to do complex data wrangling.

Imagine a new user to R who’s interested in doing data analysis. Consider the basic task of splitting a data frame according to the levels of some grouping (factor) variable and then taking the mean of another variable within the groups. Today, you might do something like

library(dplyr)

group_by(airquality, Month) %>% 
    summarize(o3 = mean(Ozone, na.rm = TRUE))

This code uses the dplyr package from the tidyverse set of packages to take the monthly mean of ozone in the airquality dataset in R. To understand it, you need to understand the concept of a data frame and of tidy data. But beyond that, the code is reasonably self-explanatory.

Using just the functionality that came with the base R system, you’d have to do something like

aggregate(airquality[, "Ozone"], 
          list(Month = airquality[, "Month"]), 
          mean, na.rm = TRUE)

Right away, questions arise:

  1. What are the square brackets [ for? They are used for subsetting a data frame.

  2. What is list()? It’s a type of R object.

  3. Why is the mean() function just sitting there all by itself? It is being passed to the subsets of the data frame via a kind of internal lapply(). (Follow up: What is lapply()?)

  4. Is na.rm = TRUE an argument to aggregate()? No, it is an argument to mean(), but it is being passed to mean() via the ... argument of aggregate().

In other words, in order to take the mean within groups of a variable in a data frame, a pretty basic data summary operation for almost all other packages, you had to explain

  1. different kinds of R objects;

  2. subset operators;

  3. functional mapping;

  4. and variable argument lists.
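To make those four ideas concrete, here is roughly what aggregate() is doing under the hood, sketched with base R’s split() and sapply():

```r
# 1-2. split() subsets the Ozone column by Month and returns a list
#      (an R object type), one element per month
groups <- split(airquality$Ozone, airquality$Month)

# 3-4. sapply() maps mean() over each list element; na.rm = TRUE is
#      forwarded to mean() through sapply()'s ... argument
sapply(groups, mean, na.rm = TRUE)
```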

At this point, you were basically describing R as a programming language to people who likely had no interest in programming (at least not yet).

As confusing (and possibly ridiculous) as this list of requirements might sound, it often was not a barrier to people back in the year 2000. The reason is that people who were new to R had not chosen to learn R because they needed a system for doing data analysis. They already had a system for doing data analysis. Rather, they were looking for a system to do the things their current system didn’t allow them to do. In other words, they were looking for a system that had “flexibility and the ability to program.” R was a great system for that.

Making R a “Real Programming Language” Almost Killed It

As is so often the case with technological products, the aspects of a product that are considered its advantages can quickly become its key weaknesses. R’s flexibility and focus on programming language power served as a major barrier to those who simply wanted to analyze data. However, as with many classic case studies of disruption, this barrier was not immediately recognized as such. The result was continuing efforts to make R a better programming language.

One of the major developments of the 2000s was the S4 class/methods system. In many ways, S4 was a major advance, because it provided a “real” system for doing object-oriented programming in R. The system was formal, unlike the ad hoc S3 system, which contained many ambiguities. Other features added to R during this time (to name a few) were the namespace system for packages and the localization/internationalization features. All of these features made up for key deficiencies in the system and provided powerful new tools for developers to create more sophisticated R packages. They brought R closer to a “real” programming language that could be discussed in the same breath as Python or Perl. While these features benefited users (particularly the localization and internationalization efforts), none of them directly empowered non-programming users to do more with R. They were primarily developer-focused.
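To give a flavor of that formality, a toy S4 class and method (my own illustration, not drawn from the keynote) might look like this:

```r
library(methods)  # attached by default in interactive sessions

# A formal class declares its slots and their types up front
setClass("TimeSeries", slots = c(times = "numeric", values = "numeric"))

# Generics and methods are registered explicitly
setGeneric("duration", function(x) standardGeneric("duration"))
setMethod("duration", "TimeSeries", function(x) diff(range(x@times)))

ts1 <- new("TimeSeries", times = c(0, 5, 10), values = rnorm(3))
duration(ts1)  # 10
```

Powerful machinery for package authors, but a lot of ceremony for someone who just wants a group mean.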

At this point in R’s development the user who simply wanted to analyze some data was being “over-served”. They did not need a formal system for object-oriented programming and did not need a mechanism to allow others to provide translations for their R packages. They just wanted to take the mean of a bunch of columns within groups in a data frame. We still had no simple solution for that problem. This state of affairs left a substantial gap to be filled for simple R-based data analytic tools.

Tidyverse Disruption with a Twist

When tidyverse-related tools, including ggplot2, first started showing up (long before the name “tidyverse” existed), long-time users of R did not see why they were needed. It was difficult for many of us to see the perspective of the new user who had basic data analytic needs. These new users

  1. were not familiar with a collection of existing data analysis packages;

  2. had not used S-PLUS previously;

  3. did not necessarily have experience with other programming languages like C or Perl;

  4. were likely encountering data analysis and statistics for the first time.

The people with this set of characteristics represented a new and fast-growing population of potential R users. Many were being turned away from R because of

  1. flexibility -> complexity; and

  2. ability to program -> requirement for programming.

Suddenly, the flexibility and ability to program so valued by a previous population of R users were lead weights dragging the system down for new users. A different interface was needed.

The tidyverse set of tools, in addition to providing that different interface for new users, adopted a particular point of view on how the various tools would work together. The focus on “tidy data” as a unifying principle allowed a relatively small set of tools to provide a wide range of operations when it came to data wrangling. The opinionated nature of the tools naturally limited somewhat the flexibility of how things could be done. But this reduced complexity was what made the tools so appealing to new users. There were just fewer things to learn. The use of non-standard evaluation was (perhaps ironically) more intuitive for data wrangling tasks because there was no need to quote variable names and see everything from the perspective of a developer or programmer.
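For example (my own illustration, assuming dplyr is installed): in the tidyverse, bare column names are evaluated inside the data frame, with no quoting and no repeated references back to the data:

```r
library(dplyr)

# Tidyverse: bare column names, evaluated inside the data frame
tidy_way <- filter(airquality, Month == 5, !is.na(Ozone))

# Base R: the data frame must be referenced explicitly each time
base_way <- airquality[airquality$Month == 5 & !is.na(airquality$Ozone), ]

identical(nrow(tidy_way), nrow(base_way))  # TRUE
```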

Furthermore, the tidyverse did not abandon all that had come before it. Rather, it built on top of the infrastructure that had been built previously:

  1. The tidyverse tools are built as R packages, eliminating the need to make direct changes to the core R system (and to get permission to do so);

  2. Although the packages eschew the formal S4 class/methods system, they make extensive use of the S3 class/method system, which, when coupled with the namespace mechanism for R packages, is a powerful alternative.
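The S3 system at work is little more than a naming convention plus dispatch; a minimal example (my own, for illustration):

```r
# An S3 generic dispatches on the class of its first argument
describe <- function(x) UseMethod("describe")
describe.default <- function(x) cat("a vector of length", length(x), "\n")
describe.data.frame <- function(x) cat("a data frame with", nrow(x), "rows\n")

describe(1:10)        # a vector of length 10
describe(airquality)  # a data frame with 153 rows
```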

In addition to all this, the ggplot2 package exploded in popularity, largely because it provided an easy way to make good graphics quickly (one might dare say “quick and dirty”). It “shortened the distance between the mind and the page” and removed the need for R users to learn intricate plot arguments and incantations. Rather than requiring plots to be carefully built piece by piece as in the base R system, ggplot2’s defaults were mostly good enough, and the beginner did not have to make many decisions. Between ggplot2 and the initial tidyverse tools, R users finally had a data analysis system in the manner of the packages that came before it. But that earlier functionality was not simply replicated; it was greatly improved upon, as the popularity of both ggplot2 and the tidyverse shows.
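For instance (my own example, assuming ggplot2 is installed), a complete, presentable plot comes from a single expression, with the defaults doing most of the work:

```r
library(ggplot2)

# Axes, scales, and theme are all chosen by sensible defaults;
# the user only states the mapping and the geometry
ggplot(airquality, aes(x = factor(Month), y = Ozone)) +
    geom_boxplot() +
    labs(x = "Month", y = "Ozone (ppb)")
```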

With traditional technological disruption, the members of the old class are left behind. But far from wiping the slate clean, as perhaps might have occurred with other software packages, the tidyverse provided functionality that all R users could choose to take advantage of or not. People with long-standing workflows could keep their workflows. The tidyverse essentially built a new “language” on top of the existing language without breaking anything fundamental for existing users. Ultimately, the tidyverse was able to disrupt the existing R system without needing to leave previous users behind. Everyone could come along for the ride and benefit.

Intentional Ambiguity Redux

With the tidyverse “interface” to the R language, along with its existing programming paradigm dating back to the system’s origin, we could argue that R has achieved Chambers’ original vision of “intentional ambiguity”. For some users, R is a flexible system allowing the ability to program and develop new things. For other users, R is a powerful data analysis system that contains tools for data wrangling, visualization, and statistical modeling. The system spans the entire “user-developer” spectrum without anyone having to make a hard choice between either end. Rather, R users are free to wander back and forth between the two ends.

But nothing ever stands still, and users of the tidyverse system will need a system for programming too. However, they will not want to program in the style of the original R system. This would be like asking a user of laptop computers to go back to using a desktop computer. No, they want more power in their laptop.

One thing to be wary of going forward is that one of the tidyverse’s greatest assets–the generous use of non-standard evaluation–could end up being a critical liability. While non-standard evaluation greatly simplifies interactive work in R, programming in a world with non-standard evaluation can be confusing, especially when one must simultaneously deal with other tools that make use of standard evaluation. Tools and infrastructure are needed to allow users to become developers in this new paradigm without too much pain. Thankfully, many are working on this and I have little doubt that a new programming environment will exist in the near future that allows for the easy development of “tidyverse tools” by a large segment of the R community.
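Some of that infrastructure already exists in the rlang package’s tidy evaluation tools. A sketch of the enquo()/!! pattern for writing a function that accepts bare column names, as dplyr verbs themselves do (the wrapper function here is hypothetical, assuming dplyr and rlang are installed):

```r
library(dplyr)
library(rlang)

# A hypothetical wrapper: group means of any column, callable with
# bare column names just like dplyr verbs themselves
group_mean <- function(data, group, var) {
    group <- enquo(group)  # capture the expression, not its value
    var <- enquo(var)
    data %>%
        group_by(!!group) %>%                    # unquote at the use site
        summarize(avg = mean(!!var, na.rm = TRUE))
}

group_mean(airquality, Month, Ozone)
```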

Selling R

At this point, one might have some difficulty describing the value proposition of R to someone who had never seen it before. Is it an interactive system for data analysis or is it a sophisticated programming language for software developers? Or is it a system for developing reproducible workflows in data analysis? Or is it a platform for developing interactive graphics, dashboards, and web apps? Or is it a language for doing complex data wrangling and data management? Or…

The confusion over how to “sell R” is reflective of the increasingly diverse audience of people using R. The open source nature of R gives it quick entry into many different fields, as long as there are interested users intent on adapting R to their needs. Over time we have seen R enter many scientific communities, including ecology, astronomy, high throughput biology, neuroimaging, and many others. Outside academia, we have seen R adopted in journalism, tech companies, and various businesses for data analysis. The spread of R is in part because members of those communities saw something attractive about R that could be applied in their field. Yes, free and open source is a big win, but not every organization is short of cash and there are other systems (like Python, or even XLisp-Stat) that share those same properties. So there must be other reasons to adopt R.

In the distant past, my pitch for using R usually involved three things:

  • Free. R was both free as in cost and free as in free software. The free cost made it a highly accessible package, and the free software part allowed anyone to tinker with the package, inspect its code, and make improvements. This was particularly important in the old days, when R was unknown and its “accuracy” was put in question.

  • Graphics. R was able to produce high quality “publication ready” graphics and it gave you detailed control over all the graphical elements. S-PLUS could also do this but S-PLUS didn’t come with the free part mentioned above.

  • Programming language. Unlike SAS, Stata, or SPSS, R came with a robust and sophisticated Lisp-like programming language that was well suited to data analysis applications. In addition, you could use it to build packages that extended the core R system. Because it was a language designed for doing data analysis, it could appear weird to people familiar with traditional programming languages.

Much has changed since those early days and I’ve varied my pitch quite a bit to focus on a few things that didn’t exist back then. In particular, the audience has changed. I talk to many more people who are just getting started in data analysis and who are therefore more open-minded about which software to use. They don’t have years of SAS baked into their brains; at most, they may have heard of Python as an alternative system.

Some of the things I focus on now are

  • Reproducibility, Reporting, and Automation. With the development of knitr and its combination with R Markdown, the writing of reproducible reports was made infinitely easier. (Markdown itself probably deserves its own discussion, but it’s not specifically R-related.)

  • Graphics. R still has the ability to make great data graphics and with the introduction of ggplot2, it has become easier to make and extend good graphics.

  • R Packages and Community. With over 10,000 packages on CRAN alone, there’s pretty much a package to do anything. More importantly, the people contributing those packages and the greater R community have expanded tremendously over time, bringing in new users and pushing R to be useful in more applications. Every year now there are probably hundreds if not thousands of meetups, conferences, seminars, tutorials, and workshops all around the world, all related to R.

  • RStudio. The development of the RStudio IDE has made getting started with R much easier. Having a powerful IDE was important to me for learning other languages and I’m glad R finally has something solid for itself. RStudio has significantly simplified the development of R packages via devtools and roxygen2. While these tools are not yet perfect, they have changed what used to be a labor-intensive and finicky process into a more manageable and easier-to-learn workflow. In addition, RStudio has funded the development of many critical R packages, including the members of the tidyverse.

At a recent unconference held in Melbourne, a number of people had different approaches to convincing others to use R. Here is just a summary:

  • Someone mentioned Bioconductor, which is a huge resource for those doing research in the world of high throughput biology. Not many other analysis packages have something similar so it makes an obvious selling point to people working in this area.

  • The idea of using R end-to-end came up, meaning using R to clean up messy data and taking it all the way to some interactive Shiny app on the other end. The idea that you can use the same tool to do all the things in between made for a compelling case for using R.

  • For the spreadsheet audience, the dplyr package specifically was sometimes a good selling point. The idea here was that you could show people how much time could be saved by automating analyses and using dplyr to clean up data.

  • The open source nature of R came up a few times, primarily as a means for developing transferable skills. If you work at a company/institution that specializes in some proprietary package, it’s often difficult to transfer those skills somewhere else if your new job doesn’t use that package. The fact that R is open source means that, in theory, you could use it anywhere and the skills that you build up in R are (again, in theory) applicable everywhere.

  • Someone mentioned that if you want to convince someone to learn/use R, just show them the multitude of jobs available to R programmers (and in particular, the salaries attached to them).

  • The fact that R is obtainable for free is still important, given that Matlab and SAS licenses have not gotten any cheaper over time. In my experience, this is particularly important in non-industrialized countries where for many people paying for expensive licenses is not an option.

The reasons for using R are as varied as the people using R, and that’s a great thing. Moving forward, we must resist the urge to view R through a narrow window and to sell it as a single thing with a few features. The freedom given to the R community to characterize it the way they see best fit is what will give R increased relevance over time.


R has grown significantly over the past 20 years, but only recently has it developed all the elements of John Chambers’ original vision of a full-fledged data analysis system coupled with a sophisticated programming language. With the variety of tools that have been tacked on to the core system via packages, R’s chimera-like appearance can be a bit baffling to long-time users. But this appearance reflects R’s greatest feature, and its most valuable selling point: its ability to disrupt and reinvent itself to suit the needs of a new population of users.

It’s worth noting in closing that R is a language for data analysis. If R seems a bit confusing, disorganized, and perhaps incoherent at times, in some ways that’s because so is data analysis. The fact that we can have one language serve the infinite possible ways that people might approach any data analysis problem is pretty remarkable. But it does lead to a little bit of head scratching for people who are used to more “normal” programming languages.

As time goes on, R will be confronted with a new population of users who take different things for granted (e.g. the tidyverse) and will have very different requirements. When this happens, we must ensure that the things that make R great now do not become the very reasons for resisting change. That said, I believe we can be confident that the R language is flexible enough to meet the requirements for future new users.