Here’s a feature of dplyr that occasionally bites me (most recently while making these graphs). It’s about to change mostly for the better, but is also likely to bite me again in the future. If you want to follow along there’s a GitHub repo with the necessary code and data.
Say we have a data frame or tibble and we want to get a frequency table or set of counts out of it. In this case, each row of our data is a person serving a congressional term for the very first time, for the years 2013 to 2019. We have information on the term year, the party of the representative, and whether they are a man or a woman.
When we load our data into R with read_csv, the columns for party and sex are parsed as character vectors. If you’ve been around R for any length of time, and especially if you’ve worked in the tidyverse framework, you’ll be familiar with the drumbeat of “stringsAsFactors=FALSE”: we don’t class character variables as factors by default, and do so only when we have a good reason (there are several good reasons). Thus our df tibble shows us <chr> instead of <fct> for party and sex.
Now, let’s say we want a count of the number of men and women elected by party in each year. (Congressional elections happen every two years.) We write a little pipeline to group the data by year, party, and sex, count up the numbers, and calculate a frequency that’s the proportion of men and women elected that year within each party. That is, the frequencies of M and F will sum to 1 for each party in each year.
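As a minimal sketch of that pipeline, here is a version that runs on a small made-up tibble. The toy data and the column names year, party, and sex are assumptions for illustration; they stand in for the real data in the repo and are not copied from it.

```r
library(dplyr)

# Hypothetical toy data (NOT the data from the post): one row per newly
# elected representative. No women appear in 2015.
df <- tibble(
  year  = c(2013, 2013, 2013, 2013, 2015, 2015, 2017, 2017, 2017, 2017),
  party = c("Democrat", "Democrat", "Republican", "Republican",
            "Democrat", "Republican",
            "Democrat", "Democrat", "Republican", "Republican"),
  sex   = c("F", "M", "F", "M", "M", "M", "F", "M", "F", "M")
)

counts <- df %>%
  group_by(year, party, sex) %>%
  summarize(n = n()) %>%       # one row per observed year/party/sex combination
  mutate(freq = n / sum(n))    # still grouped by year and party, so freq sums to 1 there

counts
```

With this toy data the summary has ten rows rather than twelve, because the two F rows for 2015 are never created.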
You can see that, in 2015, neither party had a woman elected to Congress for the first time. Thus, the freq is 1 in row 5 and row 6. But you can also see that, because there are no observed Fs in 2015, they don’t show up in the table at all. The zero values are dropped. The rows that would record a count and frequency of zero for F in each party in 2015, call them 5' and 6', don’t appear.
How is that going to bite us? Let’s add some graphing instructions to the pipeline, first making a stacked column chart:
Stacked column chart based on character-encoded values.
That looks fine. You can see in each panel the 2015 column is 100% Men. If we were working on this a bit longer we’d polish up the x-axis so that the dates were centered under the columns. But as an exploratory plot it’s fine.
But let’s say that you looked at a line plot instead of a column plot. This would be a natural thing to do given that time is on the x-axis and so you’re looking at a trend, albeit one over a small number of years.
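A rough sketch of such a line plot, again on made-up data, might look like this. The toy tibble, the object names counts and p_line, and the particular aesthetics are my assumptions, not the code behind the figure.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the data from the post); no women in 2015.
df <- tibble(
  year  = c(2013, 2013, 2013, 2013, 2015, 2015, 2017, 2017, 2017, 2017),
  party = c("Democrat", "Democrat", "Republican", "Republican",
            "Democrat", "Republican",
            "Democrat", "Democrat", "Republican", "Republican"),
  sex   = c("F", "M", "F", "M", "M", "M", "F", "M", "F", "M")
)

# Character-encoded grouping: the 2015 F combinations are absent from counts.
counts <- df %>%
  group_by(year, party, sex) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n))

# One line per sex, one panel per party.
p_line <- ggplot(counts, aes(x = year, y = freq, color = sex, group = sex)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ party)

p_line
```

Because counts has no F row for 2015, geom_line() has nothing to draw there and simply connects the 2013 and 2017 points.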
A line graph based on character-encoded variables for party and sex. The trend line for Women joins up the observed (or rather, the included) values, which don’t include the zero values for 2015.
That’s not right. The line segments join up the data points in the summary tibble, but because those don’t include the zero-count rows in the case of women, the lines join the 2013 and 2017 values directly. So we miss that the count (and thus the frequency) went to zero in that year.
This issue has been recognized in dplyr for some time. It happened whether your data was encoded as character or as a factor. There’s a huge thread about it on dplyr’s GitHub repository, going back to 2014. In the upcoming version 0.8 release of dplyr, the behavior for zero-count rows will change, but as far as I can make out it will change for factors only. Let’s see what happens when we change the encoding of our data frame. We’ll make a new one, called df_f.
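Here is a hedged sketch of the factor version on made-up data. The toy tibble and the name counts_f are assumptions. One caveat: in the released versions of dplyr I am familiar with (0.8.0 and later), group_by() still drops empty groups unless you ask for them with .drop = FALSE, so the sketch passes that argument explicitly rather than relying on a default.

```r
library(dplyr)

# Hypothetical toy data (NOT the data from the post); no women in 2015.
df <- tibble(
  year  = c(2013, 2013, 2013, 2013, 2015, 2015, 2017, 2017, 2017, 2017),
  party = c("Democrat", "Democrat", "Republican", "Republican",
            "Democrat", "Republican",
            "Democrat", "Democrat", "Republican", "Republican"),
  sex   = c("F", "M", "F", "M", "M", "M", "F", "M", "F", "M")
)

# Re-encode the two character columns as unordered factors.
df_f <- df %>%
  mutate(party = factor(party), sex = factor(sex))

counts_f <- df_f %>%
  group_by(year, party, sex, .drop = FALSE) %>%  # keep empty factor-level groups
  summarize(n = n()) %>%                         # n() is 0 for the empty groups
  mutate(freq = n / sum(n))

counts_f
```

On this toy data the summary has twelve rows, and the zero-count rows for F in 2015 land in rows 5 and 7.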
Now we have party and sex encoded as unordered factors. This time, our zero rows are present (here as rows 5 and 7). The grouping and summarizing operation has preserved all the factor values by default, instead of dropping the ones with no observed values in any particular year. Let’s run our line graph code again:
A line graph based on factor-encoded variables for party and sex. Now the trend line for Women does include the zero values, as they are preserved in the summary.
Now the trend line goes to zero, as it should. (And by the same token the trend line for Men goes to 100%.)
What if we want to keep working with our variables encoded as characters rather than factors? There is a workaround, using the complete() function. You will need to ungroup() the data after summarizing it, and then use complete() to fill in the implicit missing values. You have to re-specify the grouping variables for complete(), and then tell it what you want the fill-in value to be for your summary variables. In this case it’s zero.
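The workaround can be sketched as follows, using tidyr’s complete() on made-up data. The toy tibble and the name counts_c are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the data from the post); no women in 2015.
df <- tibble(
  year  = c(2013, 2013, 2013, 2013, 2015, 2015, 2017, 2017, 2017, 2017),
  party = c("Democrat", "Democrat", "Republican", "Republican",
            "Democrat", "Republican",
            "Democrat", "Democrat", "Republican", "Republican"),
  sex   = c("F", "M", "F", "M", "M", "M", "F", "M", "F", "M")
)

counts_c <- df %>%
  group_by(year, party, sex) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup() %>%                          # complete() should see the whole table
  complete(year, party, sex,             # every year x party x sex combination
           fill = list(n = 0L, freq = 0))  # zero, not NA, for the added rows

counts_c
```

The party and sex columns stay as character vectors throughout; only the summary table is expanded.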
If we re-draw the line plot with the ungroup() ... complete() step included, we’ll get the correct output in our line plot, just as in the factor case.
Same as before, but based on the character-encoded version.
The new zero-preserving behavior of group_by() for factors will show up in the upcoming version 0.8 of dplyr. It’s already there in the development version if you like to live dangerously. In the meantime, if you want your frequency tables to include zero counts, then make sure you ungroup() and then complete() the summary tables.