Since the U.S. midterm elections I’ve been playing around with some Congressional Quarterly data about the composition of the House and Senate since 1945. Unfortunately I’m not allowed to share the data, but here are two or three things I had to do with it that you might find useful.
The data comes as a set of CSV files, one for each congressional session. You download the data by repeatedly querying CQ’s main database by year. In its initial form, the top of each file looks like this:
1 |
|
The bottom of each file looks like this:
1 |
|
To make the files readable in R, the first thing we’ll want to do is strip the first two lines of each file and the last three lines of each file. (Of course I checked first to make sure each file was the same in this regard.) There are several ways to get rid of specific lines from files. The venerable sed
command is one. We loop it over each CSV file, telling it to delete (d
) lines 1 and 2:
1 |
|
The -i.orig
option makes a copy of the original file first, appending a .orig
extension to the filename.
We do the same thing to delete the last three lines of each file. You can use some versions of head
to do this quite easily, because they accept a negative number to their -n
argument. Thus, while head -n 3
usually returns the first three lines of a file, head -n -3
will show you all but the last three lines. But the version of head
that ships with macOS won’t do this. So I used sed again, this time taking advantage of Stack Overflow to find the following grotesque incantation:
1 |
|
The -e :a
is a label for the expression, and the '1,3!{P;N;D;};N;ba'
is where the work gets done, streaming through the file till it locates the end, deletes that line, and then branches (b
) back to the labeled script again (a
) until it’s done it three times. Gross.
You could also do this using a combination of wc
(to get a count of the number of lines in the file) and awk
, like this:
1 |
|
There’s a reason people used to say sed
and awk
had those names because of the sounds people made when forced to use them.
Anyway, now we have a folder full of clean CSV files. Time to fire up R and get to the fun part.
Inside R, we get a vector of our filenames:
1 |
|
Than, instead of writing a for
loop and doing a bunch of rbind
-ing, we can pipe our vector of filenames to the map_dfr()
function and we’re off to the races:
1 |
|
A little data-cleaning later and the congress
variable is properly numbered and we’re good to go. The to_snake_case()
function comes from the snakecase
package.
The data are observed at the level of congressional terms. So, for example, we can draw a heatmap of the age distribution of U.S. representatives across the dataset:
1 |
|
Age distribution heatmap
Or we can look at it a different way, using the ggbeeswarm
package. We layer a few different pieces here: a trend line for average age, a ribbon showing the 25th and 75th percentiles of the age distribution, the distribution itself (exlcuding its oldest and youngest 1%), and the names of the representatives in the oldest and youngest percentiles. We’ll create a separate dataset for each of these pieces.
1 |
|
Here’s what they look like:
1 |
|
Now we can draw a graph, faceted by Party:
1 |
|
Age trends, distributions, and outliers.
That one might be easier to see as a PDF.
Finally, here’s a neat trick. One thing I was interested in was changes in the composition of the so-called “Freshman Class” of representatives over time—that is, people elected to the House for the very first time. To extract that subset, I needed to create a term_id
nested with each person’s unique identifier (their pid
). I knew what Congressional session each person-term was in, but just needed to count from the first to the last. I’m sure there’s more than one way to do it, but here’s a solution:
1 |
|
The trick here is that mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress))))
line, which nests one mutate call inside another. We group the data by pid
and nest()
it so it’s as if we have a separate table for each representative. Then we use map()
to add a term_id
column to each subtable. Once we have a per-person term_id
, and we grab everyone’s first term, we can e.g. take a look at the breakdown of freshman representatives by gender for every session since 1945:
First-term representatives by gender, 1945-2019
And also to break that out by Party:
First-term representatives by gender and party, 1945-2019
Related