Congress Over Time

Since the U.S. midterm elections I’ve been playing around with some Congressional Quarterly data about the composition of the House and Senate since 1945. Unfortunately I’m not allowed to share the data, but here are two or three things I had to do with it that you might find useful.

The data comes as a set of CSV files, one for each congressional session. You download the data by repeatedly querying CQ’s main database by year. In its initial form, the top of each file looks like this:

Results for 79th Congress
,
Last,First,Middle,Suffix,Nickname,Born,Death,Sex,Position,Party,State,District,Start,End,Religion,Race,Educational Attainment,JobType1,JobType2,JobType3,JobType4,JobType5,Mil1,Mil2,Mil3
Abernethy,Thomas,Gerstle,,,05/16/1903,01/23/1953,M,U.S. Representative,Democrat,MS,4,01/03/1945,01/03/1953,Methodist,White,Professional degree,Law,,,,,Did not serve,,
Adams,Sherman,,,,01/08/1899,10/27/1986,M,U.S. Representative,Republican,NH,2,01/03/1945,01/03/1947,Not specified,White,Bachelor's degree,Construction/building trades,,,,,,,

The bottom of each file looks like this:

Young,Milton,Ruben,,,12/06/1897,05/31/1983,M,U.S. Senator,Republican,ND,,03/19/1945,01/03/1981,Mormon,White,Unknown,Agriculture,,,,,Did not serve,,
Zimmerman,Orville,,,,12/31/1880,04/07/1948,M,U.S. Representative,Democrat,MO,10,01/03/1945,04/07/1948,Methodist,White,Professional degree,Education,Law,,,,Army,,
,
,
"Export list of members by biographical characteristics. Washington: CQ Press. Dynamically generated November 10, 2018, from CQ Press Electronic Library, CQ Press Congress Collection: http://library.cqpress.com/congress/export.php?which=memberbioadv&congress=198&yearlimit=0"

To make the files readable in R, the first thing we’ll want to do is strip the first two lines of each file and the last three lines of each file. (Of course I checked first to make sure each file was the same in this regard.) There are several ways to get rid of specific lines from files. The venerable sed command is one. We loop it over each CSV file, telling it to delete (d) lines 1 and 2:

## Remove first two lines from each csv file
for i in *.csv; do
 sed -i.orig '1,2d' $i
done

The -i.orig option makes a copy of the original file first, appending a .orig extension to the filename.

We do the same thing to delete the last three lines of each file. You can use some versions of head to do this quite easily, because they accept a negative number to their -n argument. Thus, while head -n 3 usually returns the first three lines of a file, head -n -3 will show you all but the last three lines. But the version of head that ships with macOS won’t do this. So I used sed again, this time taking advantage of Stack Overflow to find the following grotesque incantation:

## Remove last three lines from each csv file
for i in *.csv; do
 sed -i.orig -e :a -e '1,3!{P;N;D;};N;ba' $i
done

The -e :a is a label for the expression, and the '1,3!{P;N;D;};N;ba' is where the work gets done, streaming through the file till it locates the end, deletes that line, and then branches (b) back to the labeled script again (a) until it’s done it three times. Gross.

You could also do this using a combination of wc (to get a count of the number of lines in the file) and awk, like this:

awk -v n=$(($(wc -l < file) - 3)) 'NR file

There’s a reason people used to say sed and awk had those names because of the sounds people made when forced to use them.

Anyway, now we have a folder full of clean CSV files. Time to fire up R and get to the fun part.

Inside R, we get a vector of our filenames:

filenames <- dir(path = "data/clean",
 pattern = "*.csv",
 full.names = TRUE)

filenames

#> [1] "data/clean/01_79_congress.csv" "data/clean/02_80_congress.csv" 
#> [3] "data/clean/03_81_congress.csv" "data/clean/04_82_congress.csv" 
#> [5] "data/clean/05_83_congress.csv" "data/clean/06_84_congress.csv" 
#> [7] "data/clean/07_85_congress.csv" "data/clean/08_86_congress.csv" 
#> [9] "data/clean/09_87_congress.csv" "data/clean/10_88_congress.csv" 
#> [11] "data/clean/11_89_congress.csv" "data/clean/12_90_congress.csv" 
#> [13] "data/clean/13_91_congress.csv" "data/clean/14_92_congress.csv" 
#> [15] "data/clean/15_93_congress.csv" "data/clean/16_94_congress.csv" 
#> [17] "data/clean/17_95_congress.csv" "data/clean/18_96_congress.csv" 
#> [19] "data/clean/19_97_congress.csv" "data/clean/20_98_congress.csv" 
#> [21] "data/clean/21_99_congress.csv" "data/clean/22_100_congress.csv"
#> [23] "data/clean/23_101_congress.csv" "data/clean/24_102_congress.csv"
#> [25] "data/clean/25_103_congress.csv" "data/clean/26_104_congress.csv"
#> [27] "data/clean/27_105_congress.csv" "data/clean/28_106_congress.csv"
#> [29] "data/clean/29_107_congress.csv" "data/clean/30_108_congress.csv"
#> [31] "data/clean/31_109_congress.csv" "data/clean/32_110_congress.csv"
#> [33] "data/clean/33_111_congress.csv" "data/clean/34_112_congress.csv"
#> [35] "data/clean/35_113_congress.csv" "data/clean/36_114_congress.csv"
#> [37] "data/clean/37_115_congress.csv" "data/clean/38_116_congress.csv"

Than, instead of writing a for loop and doing a bunch of rbind-ing, we can pipe our vector of filenames to the map_dfr() function and we’re off to the races:

data <- filenames %>% map_dfr(read_csv, .id = "congress")

colnames(data) <- to_snake_case(colnames(data))

data 

#> # A tibble: 20,111 x 26
#> congress last first middle suffix nickname born death sex position party
#> 
#> 1 1 Aber… Thom… Gerst… NA NA 05/1… 01/2… M U.S. Re… Demo…
#> 2 1 Adams Sher… NA NA NA 01/0… 10/2… M U.S. Re… Repu…
#> 3 1 Aiken Geor… David NA NA 08/2… 11/1… M U.S. Se… Repu…
#> 4 1 Allen Asa Leona… NA NA 01/0… 01/0… M U.S. Re… Demo…
#> 5 1 Allen Leo Elwood NA NA 10/0… 01/1… M U.S. Re… Repu…
#> 6 1 Almo… J. Linds… Jr. NA 06/1… 04/1… M U.S. Re… Demo…
#> 7 1 Ande… Herm… Carl NA NA 01/2… 07/2… M U.S. Re… Repu…
#> 8 1 Ande… Clin… Presba NA NA 10/2… 11/1… M U.S. Re… Demo…
#> 9 1 Ande… John Zuing… NA NA 03/2… 02/0… M U.S. Re… Repu…
#> 10 1 Andr… Augu… Herman NA NA 10/1… 01/1… M U.S. Re… Repu…
#> # ... with 20,101 more rows, and 15 more variables: state ,
#> # district , start , end , religion , race ,
#> # educational_attainment , job_type_1 , job_type_2 ,
#> # job_type_3 , job_type_4 , job_type_5 , mil_1 ,
#> # mil_2 , mil_3 

A little data-cleaning later and the congress variable is properly numbered and we’re good to go. The to_snake_case() function comes from the snakecase package.

The data are observed at the level of congressional terms. So, for example, we can draw a heatmap of the age distribution of U.S. representatives across the dataset:

age_counts <- data_all %>%
 filter(position == "U.S. Representative",
 party %in% c("Democrat", "Republican")) %>%
 mutate(binned_age = Hmisc::cut2(start_age, g = 30),
 binned_age2 = Hmisc::cut2(start_age, cuts = c(30:80))) %>%
 group_by(party, start_year, binned_age2) %>%
 tally() %>%
 mutate(freq = n / sum(n),
 pct = round((freq*100), 1))

age_counts

#> # A tibble: 3,952 x 6
#> # Groups: party, start_year [76]
#> party start_year binned_age2 n freq pct
#> 
#> 1 Democrat 1945-01-03 [24,30) 0 0 0 
#> 2 Democrat 1945-01-03 30 1 0.00385 0.4
#> 3 Democrat 1945-01-03 31 2 0.00769 0.8
#> 4 Democrat 1945-01-03 32 1 0.00385 0.4
#> 5 Democrat 1945-01-03 33 1 0.00385 0.4
#> 6 Democrat 1945-01-03 34 4 0.0154 1.5
#> 7 Democrat 1945-01-03 35 2 0.00769 0.8
#> 8 Democrat 1945-01-03 36 4 0.0154 1.5
#> 9 Democrat 1945-01-03 37 4 0.0154 1.5
#> 10 Democrat 1945-01-03 38 12 0.0462 4.6
#> # ... with 3,942 more rows


p <- ggplot(age_counts, aes(x = start_year, y = binned_age2, fill = n))

p_out <- p + geom_tile() +
 scale_fill_viridis_c(option = "A") +
 scale_x_date(breaks = int_to_year(seq(1950, 2010, by = 10), 01, 03),
 date_labels = "%Y",
 limits = c(int_to_year(1945), int_to_year(2019, 2, 1))) +
 scale_y_discrete(breaks = c(29, 30, 40, 50, 60, 70, 79, 80)) +
 facet_wrap(~ party) +
 labs(title = "Age Distribution of U.S. Representatives, 1945-2019",
 x = "Year", y = "Age", fill = "Number of Representatives",
 caption = caption_text) +
 theme(legend.position = "top", legend.box.just = "top")

p_out

Age distribution heatmap

Or we can look at it a different way, using the ggbeeswarm package. We layer a few different pieces here: a trend line for average age, a ribbon showing the 25th and 75th percentiles of the age distribution, the distribution itself (exlcuding its oldest and youngest 1%), and the names of the representatives in the oldest and youngest percentiles. We’ll create a separate dataset for each of these pieces.

age_counts <- data_all %>%
 filter(position == "U.S. Representative",
 party %in% c("Democrat", "Republican")) %>%
 group_by(party, start_year, start_age) %>%
 tally() %>%
 mutate(freq = n / sum(n),
 pct = round((freq*100), 1)) %>%
 arrange(desc(start_year))

mean_age_swarm <- data_all %>%
 filter(position == "U.S. Representative",
 party %in% c("Democrat", "Republican")) %>%
 group_by(congress, party) %>%
 summarize(year = first(start_year),
 mean_age = mean(start_age),
 lo = quantile(start_age, 0.25),
 hi = quantile(start_age, 0.75)) %>%
 mutate(yr_fac = factor(year(year)))

oldest_group_by_year <- data_all %>% filter(party %in% c("Democrat", "Republican"),
 position == "U.S. Representative") %>%
 group_by(congress, party) %>% filter(start_age > quantile(start_age, 0.99))

youngest_group_by_year <- data_all %>% filter(party %in% c("Democrat", "Republican"),
 position == "U.S. Representative") %>%
 group_by(congress, party) %>% filter(start_age < quantile(start_age, 0.01))

Here’s what they look like:

age_counts 

#> # A tibble: 3,410 x 6
#> # Groups: party, start_year [76]
#> party start_year start_age n freq pct
#> 
#> 1 Democrat 2019-01-03 29 1 0.00448 0.4
#> 2 Democrat 2019-01-03 30 1 0.00448 0.4
#> 3 Democrat 2019-01-03 31 1 0.00448 0.4
#> 4 Democrat 2019-01-03 32 2 0.00897 0.9
#> 5 Democrat 2019-01-03 34 2 0.00897 0.9
#> 6 Democrat 2019-01-03 35 2 0.00897 0.9
#> 7 Democrat 2019-01-03 37 3 0.0135 1.3
#> 8 Democrat 2019-01-03 38 5 0.0224 2.2
#> 9 Democrat 2019-01-03 39 4 0.0179 1.8
#> 10 Democrat 2019-01-03 40 5 0.0224 2.2
#> # ... with 3,400 more rows


mean_age_swarm

#> # A tibble: 76 x 7
#> # Groups: congress [38]
#> congress party year mean_age lo hi yr_fac
#> 
#> 1 79 Democrat 1945-01-03 51.5 42 59 1945 
#> 2 79 Republican 1945-01-03 52.8 46 59 1945 
#> 3 80 Democrat 1947-01-03 50.5 43 58 1947 
#> 4 80 Republican 1947-01-03 52.0 45 59 1947 
#> 5 81 Democrat 1949-01-03 49.4 42 56 1949 
#> 6 81 Republican 1949-01-03 53.7 47 61 1949 
#> 7 82 Democrat 1951-01-03 50.8 43 57 1951 
#> 8 82 Republican 1951-01-03 53.8 46.5 61 1951 
#> 9 83 Democrat 1953-01-03 50.7 43 57 1953 
#> 10 83 Republican 1953-01-03 52.9 46 60 1953 
#> # ... with 66 more rows


oldest_group_by_year

#> # A tibble: 181 x 38
#> # Groups: congress, party [76]
#> congress last first middle suffix nickname born death sex 
#> 
#> 1 79 Doug… Robe… Lee NA NA 1863-11-07 1954-10-01 M 
#> 2 79 Mans… Jose… J. NA NA 1861-02-09 1947-07-12 M 
#> 3 79 Eaton Char… Aubrey NA NA 1868-03-29 1953-01-23 M 
#> 4 79 Welch Rich… Joseph NA NA 1869-02-13 1949-09-10 M 
#> 5 80 Doug… Robe… Lee NA NA 1863-11-07 1954-10-01 M 
#> 6 80 Mans… Jose… J. NA NA 1861-02-09 1947-07-12 M 
#> 7 80 Saba… Adol… Joach… NA NA 1866-04-04 1952-11-06 M 
#> 8 80 Eaton Char… Aubrey NA NA 1868-03-29 1953-01-23 M 
#> 9 80 Lewis Will… NA NA NA 1868-09-22 1959-08-08 M 
#> 10 81 Bloom Sol NA NA NA 1870-03-09 1949-03-07 M 
#> # ... with 171 more rows, and 29 more variables: position , party ,
#> # state , district , start , end , religion ,
#> # race , educational_attainment , job_type_1 ,
#> # job_type_2 , job_type_3 , job_type_4 , job_type_5 ,
#> # mil_1 , mil_2 , mil_3 , start_year , end_year ,
#> # name_dob , pid , start_age , poc , days_old ,
#> # months_old , full_name , end_career , entry_age ,
#> # yr_fac 


youngest_group_by_year

#> # A tibble: 163 x 38
#> # Groups: congress, party [76]
#> congress last first middle suffix nickname born death sex 
#> 
#> 1 79 Beck… Lind… Gary NA NA 1913-06-30 1984-03-09 M 
#> 2 79 Foga… John Edward NA NA 1913-03-23 1967-01-10 M 
#> 3 79 Ryter John Franc… NA NA 1914-02-04 1978-02-05 M 
#> 4 79 Benn… Mari… Tinsl… NA NA 1914-06-06 2000-09-06 M 
#> 5 79 Byrn… John Willi… NA NA 1913-06-12 1985-01-12 M 
#> 6 80 Bent… Lloyd Milla… Jr. NA 1921-02-11 2006-05-23 M 
#> 7 80 Kenn… John Fitzg… NA NA 1917-05-29 1963-11-22 M 
#> 8 80 Will… John Bell NA NA 1918-12-04 1983-03-25 M 
#> 9 80 Nodar Robe… Joseph Jr. NA 1916-03-23 1974-09-11 M 
#> 10 80 Pott… Char… Edward NA NA 1916-10-30 1979-11-23 M 
#> # ... with 153 more rows, and 29 more variables: position , party ,
#> # state , district , start , end , religion ,
#> # race , educational_attainment , job_type_1 ,
#> # job_type_2 , job_type_3 , job_type_4 , job_type_5 ,
#> # mil_1 , mil_2 , mil_3 , start_year , end_year ,
#> # name_dob , pid , start_age , poc , days_old ,
#> # months_old , full_name , end_career , entry_age ,
#> # yr_fac 

Now we can draw a graph, faceted by Party:

## Don't show points for the people we're naming
exclude_pid <- c(oldest_group_by_year$pid, youngest_group_by_year$pid)


party_names <- c(`Democrat` = "Democrats",
 `Republican` = "Republicans")

p <- ggplot(data = subset(data_all,
 party %in% c("Democrat", "Republican") &
 position == "U.S. Representative" &
 pid %nin% exclude_pid),
 mapping = aes(x = yr_fac, y = start_age,
 color = party,
 label = last))

p_out <- p + geom_quasirandom(size = 0.1, alpha = 0.4,
 method = "pseudorandom", dodge.width = 1) +
 # mean age trend 
 geom_line(data = mean_age_swarm,
 mapping = aes(x = yr_fac, y = mean_age,
 color = party, group = party),
 inherit.aes = FALSE, size = 1, alpha = 0.5) +
 # 25/75 percentile ribbon 
 geom_ribbon(data = mean_age_swarm,
 mapping = aes(x = yr_fac, ymin = lo, ymax = hi,
 color = NULL, fill = party, group = party),
 inherit.aes = FALSE, alpha = 0.2) +
 # Named outliers 
 geom_text(data = oldest_group_by_year,
 geom_text(data = youngest_group_by_year,
 size = 0.9, alpha = 1, position = position_jitter(width = 0.4, height = 0.4)) +
 # Hackish compromise to label years:
 # we can't use a date object with the beeswarm plot, only a factor
 scale_x_discrete(breaks = levels(data_all$yr_fac)[c(T, rep(F, 4))]) +
 scale_color_manual(values = party_colors) +
 scale_fill_manual(values = party_colors) +
 guides(color = FALSE, fill = FALSE) +
 labs(x = "Year", y = "Age", title = "Age Distribution of Congressional Representatives, 1945-2019",
 subtitle = "Trend line is mean age; bands are 25th and 75th percentiles of the range.\n\n Youngest and oldest percentiles are named instead of being shown by points.",
 caption = caption_text) +
 facet_wrap( ~ party, nrow = 1, labeller = as_labeller(party_names)) +
 theme(plot.subtitle = element_text(size = 10))

Age trends, distributions, and outliers.

That one might be easier to see as a PDF.

Finally, here’s a neat trick. One thing I was interested in was changes in the composition of the so-called “Freshman Class” of representatives over time—that is, people elected to the House for the very first time. To extract that subset, I needed to create a term_id nested with each person’s unique identifier (their pid). I knew what Congressional session each person-term was in, but just needed to count from the first to the last. I’m sure there’s more than one way to do it, but here’s a solution:

first_terms <- data_all %>%
 filter(position == "U.S. Representative", start > "1945-01-01") %>%
 group_by(pid) %>% nest() %>%
 mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress)))) %>%
 unnest() %>% filter(term_id == 1)

first_terms

#> > # A tibble: 2,998 x 39
#> pid congress last first middle suffix nickname born death sex 
#> 
#> 1 1 79 Aber… Thom… Gerst… NA NA 1903-05-16 1953-01-23 M 
#> 2 2 79 Adams Sher… NA NA NA 1899-01-08 1986-10-27 M 
#> 3 4 79 Allen Asa Leona… NA NA 1891-01-05 1969-01-05 M 
#> 4 5 79 Allen Leo Elwood NA NA 1898-10-05 1973-01-19 M 
#> 5 6 79 Almo… J. Linds… Jr. NA 1898-06-15 1986-04-14 M 
#> 6 7 79 Ande… Herm… Carl NA NA 1897-01-27 1978-07-26 M 
#> 7 9 79 Ande… John Zuing… NA NA 1904-03-22 1981-02-09 M 
#> 8 10 79 Andr… Augu… Herman NA NA 1890-10-11 1958-01-14 M 
#> 9 13 79 Andr… Walt… Gresh… NA NA 1889-07-16 1949-03-05 M 
#> 10 14 79 Ange… Homer Daniel NA NA 1875-01-12 1968-03-31 M 
#> # ... with 2,988 more rows, and 29 more variables: position , party ,
#> # state , district , start , end , religion ,
#> # race , educational_attainment , job_type_1 ,
#> # job_type_2 , job_type_3 , job_type_4 , job_type_5 ,
#> # mil_1 , mil_2 , mil_3 , start_year , end_year ,
#> # name_dob , start_age , poc , days_old ,
#> # months_old , full_name , end_career , entry_age ,
#> # yr_fac , term_id 

The trick here is that mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress)))) line, which nests one mutate call inside another. We group the data by pid and nest() it so it’s as if we have a separate table for each representative. Then we use map() to add a term_id column to each subtable. Once we have a per-person term_id, and we grab everyone’s first term, we can e.g. take a look at the breakdown of freshman representatives by gender for every session since 1945:

First-term representatives by gender, 1945-2019

And also to break that out by Party:

First-term representatives by gender and party, 1945-2019

Related