Talk like a pirate day 2017

Q: What has 8 legs, 8 arms and 8 eyes? A: 8 pirates.

Avast, ye scurvy scum! Today be September 19, the International Talk Like A Pirate Day. While it’s a silly “holiday”, it’s a great chance to switch your Facebook to Pirate English (“Arr, this be pleasing to me eye!”). Also, it’s as good a reason as any to go sailing the Internet Main, scouring for booty (data)!

Well, any excuse, y’know…

What is known today (more or less) as “Pirate English” (the bit with all the “arrr!”, “ahoy, ye matey!”, etc.) is in fact the accent employed by Robert Newton, the actor who played Long John Silver in the 1950 Disney film Treasure Island. He speaks with a Bristol accent, and Bristol was in fact the main trading port with the West Indies. So the accent we associate with pirates today could well be historically accurate!

More useless information on pirates (such as that there is no historical evidence of a pirate ever owning a parrot as a pet – again this stereotype evolved from the Treasure Island movie) is on the wonderful as always treasure trove of superfluous information, the QI page on pirates.

First, let’s get some data on pirates. I have some hazy memories of seeing data on historical shipping and piracy somewhere a while ago, but I cannot find it any more. I must be getting older. So, I’ll do what any self-respecting young whippersnapper would do and grab data from Wikipedia. There is a list of famous pirates; it’s biased, of course. I see relatively few entries on Barbary pirates, for instance. No matter, the list covers most of the Golden Age of Piracy in the Caribbean, and I’ll use it to generate a dataset on these pirates.

We’ll grab the wiki page, linking to the static ID of the page for reproducibility.

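## load the packages used throughout this post
library(tidyverse) # dplyr, tidyr, purrr, readr, stringr, ggplot2
library(rvest) # read_html(), html_nodes(), html_table()

## retrieve wikipedia list of pirates page (oldid pins the page revision)
pirates.url <- "https://en.wikipedia.org/w/index.php?title=List_of_pirates&oldid=799791558"

## read the page into R
pirates.wiki <- read_html(pirates.url)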

We then use purrr::map_dfr to grab all tables in the HTML, and push them into a single dataframe and do some basic cleaning. Man, I hate Windows; utf-8 and Windows still in 2017 is a major pain in the stern.

## grab all tables on the page; tables 3-5 are the ones of main interest
# 3: Rise of the English Sea Dogs and Dutch Corsairs: 1560-1650
# 4: Age of the Buccaneers: 1650-1690
# 5: Golden Age of Piracy: 1690-1730
pirates.raw <- pirates.wiki %>% 
 # grab all data tables; they all have the class "wikitable"
 html_nodes(".wikitable") %>% 
 # extract all tables and put them into one dataframe
 purrr::map_dfr(html_table, fill = TRUE) %>% 
 # clean column names
 setNames(make.names(names(.))) %>% 
 mutate(
 # stupid wiki has capital and non-capital letters in headers...
 Years.active = if_else(is.na(Years.active), Years.Active, Years.active),
 # empty table cells should be NA
 Years.active = parse_character(Years.active),
 Life = parse_character(Life), 
 # god, I hate windows... R on Win will choke on ‒ and – characters, so replace them with ASCII
 # http://unicode-search.net/unicode-namesearch.pl?term=dash
 # U+2012 to U+2015 are dashes
 Life = stringr::str_replace_all(Life, "(\u2012|\u2013)", "-"),
 Years.active = stringr::str_replace_all(Years.active, "(\u2012|\u2013)", "-")
 ) %>% 
 # keep columns we want and re-order
 select(Name, Life, Country.of.origin, Years.active, Comments)
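
The dash replacement is easy to try out in isolation. Below is a minimal base-R sketch, assuming a made-up toy vector in place of the scraped Life column; unlike the pipeline above, it covers all four dashes from U+2012 to U+2015.

```r
# Toy vector standing in for the scraped "Life" column (values are made up).
life <- c("1650\u20131700", "b. 1680", "1690\u20121720", "1701\u20141718")

# Alternation over U+2012 (figure dash) through U+2015 (horizontal bar).
dash.pattern <- paste(c("\u2012", "\u2013", "\u2014", "\u2015"), collapse = "|")

# Replace every Unicode dash with a plain ASCII hyphen.
life.ascii <- gsub(dash.pattern, "-", life)
print(life.ascii)
```

Entries without a dash, such as "b. 1680", pass through unchanged.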

With the raw data pulled, we now need to clean it a bit. Our goal is to generate a data set of the “pirating times” of each individual pirate. Wiki lists the active times, the “flourished” times, and/or the life times. We’ll split the age columns (usually yyyy-yyyy) into two, providing some padding if only one of the dates is available. After this, we’ll convert the columns to integer, ignoring the rows that did not provide accurate enough year data (e.g. “17th century”). This is a fun project, not a scientific one.

## generate our pirating times dataset
pirates.dta <- pirates.raw %>% 
 mutate(
 # born and died forced into the "yyyy-yyyy" schedule for later tidyr::separate()
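 # "\\2" is the backreference to the captured year (a single backslash would be read as an octal escape)
 Life.parsed = gsub("(b.|born) *([[:digit:]]+)", "\\2-", x = Life),
 Life.parsed = gsub("(d.|died) *([[:digit:]]+)", "-\\2", x = Life.parsed),
 Years.active = gsub("(to) *([[:digit:]]+)", "-\\2", x = Years.active),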
 # move the "flourished" dates to its own column
 flourished = if_else(grepl("fl.", Life.parsed), gsub("fl. *", "", Life.parsed), as.character(NA)),
 # remove "flourished" dates from life
 Life.parsed = if_else(grepl("fl.", Life.parsed), as.character(NA), Life.parsed)
 ) %>% 
 # split the life, active, and flourished years into 2 columns (start/stop), each
 separate(Life.parsed, c("born", "died"), sep = "-", extra = "merge", fill = "right") %>% 
 separate(Years.active, c("active.start", "active.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>% 
 separate(flourished, c("fl.start", "fl.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>% 
 # 
 mutate_at(
 vars(born, died, active.start, active.stop, fl.start, fl.stop), 
 f.keep.last.four.digits
 ) %>% 
 mutate_at(
 vars(born, died, active.start, active.stop, fl.start, fl.stop),
 #as.integer
 parse_integer
 ) %>%
 # get the best guess of piracing time: active > flourished > estimated from life
 mutate(
 # take active if available, if not take flourished
 piracing.start = if_else(is.na(active.start), fl.start, active.start),
 piracing.stop = if_else(is.na(active.stop), fl.stop, active.stop),
 # neither active, nor flourished date, estimate from life
 # (flag these rows first: piracing.start is overwritten before piracing.stop is computed)
 est.from.life = is.na(piracing.start)&is.na(piracing.stop),
 piracing.start = if_else(est.from.life, f.activity.from.life(b = born, d = died), piracing.start),
 piracing.stop = if_else(est.from.life, f.activity.from.life(b = born, d = died, return.b = FALSE), piracing.stop),
 # finally, let's get single-year start/stops cleaned up
 piracing.start = if_else(is.na(piracing.start), piracing.stop, piracing.start), 
 piracing.stop = if_else(is.na(piracing.stop), piracing.start, piracing.stop)
 ) %>% 
 # reorder
 select(Name, Life, piracing.start, piracing.stop, Country.of.origin, everything()) 
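
The pipeline above calls two helpers, f.keep.last.four.digits() and f.activity.from.life(), which are not shown in this post. The sketch below is a hypothetical reconstruction in base R, not the original code. In particular, the assumptions that a pirate starts at age 20 and stays active for 10 years when a date is missing (start.age, career) are made up for illustration.

```r
# HYPOTHETICAL helper: keep only the last run of four digits in each string,
# e.g. "c. 1650" -> "1650"; strings without a four-digit year become NA.
f.keep.last.four.digits <- function(x) {
  x <- as.character(x)
  has.year <- grepl("[0-9]{4}", x)
  # the greedy ".*" pushes the capture group to the last four-digit run
  out <- sub(".*([0-9]{4}).*", "\\1", x)
  out[!has.year] <- NA_character_
  out
}

# HYPOTHETICAL helper: guess the active period from birth and death years.
# Assumptions (made up): piracy starts at start.age; a missing date is
# padded with a career of `career` years; both dates missing gives NA.
f.activity.from.life <- function(b, d, return.b = TRUE,
                                 start.age = 20L, career = 10L) {
  begin <- ifelse(is.na(b), d - career, b + start.age)
  end <- ifelse(is.na(d), begin + career, d)
  # a pirate who died young cannot start after the end
  begin <- pmin(begin, end)
  as.integer(if (return.b) begin else end)
}

f.keep.last.four.digits(c("c. 1650", "17th century", "1650/1651"))
f.activity.from.life(b = 1680L, d = 1718L)
f.activity.from.life(b = 1680L, d = 1718L, return.b = FALSE)
```

Both helpers are vectorised, so they can be used inside mutate_at() and if_else() as in the pipeline above.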

With the dataset now sufficiently cleaned up, we will round the dates down to decades (so 1643 becomes 1640), because we’ll graph our data by decade. We’ll use some rowwise() %>% do() magic to generate one row per pirate per activity decade (so two rows if he was active from 1648 to 1651). Adding up the pirates active in each decade gives us our visualisation of pirate activity. Note that some pirates were active for far longer than others!

## building the pirate activity, with each pirate having one row per decade in which he was active
pirate.activity <- pirates.dta %>% 
 mutate(
 piracing.start = piracing.start %/%10L*10L,
 piracing.stop = piracing.stop %/%10L*10L,
 arrr = piracing.stop - piracing.start,
 country = if_else(Country.of.origin %in% top.n.countries(10), Country.of.origin, "other")
 ) %>% 
 filter(piracing.start > 1000) %>% 
 rowwise() %>% 
 do(
 data.frame(
 Name = .$Name,
 Life = .$Life, 
 piracing.start = .$piracing.start,
 piracing.stop = .$piracing.stop,
 # need one row per decade active 
 piracing.years = seq(.$piracing.start, .$piracing.stop, by = 10L),
 arrr = .$arrr,
 Country.of.origin = .$Country.of.origin,
 country = .$country, 
 stringsAsFactors = FALSE
 )
 )
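
To see what the decade expansion does, here is a small base-R sketch of the same idea on two made-up pirates. It does not use the scraped data, and lapply() stands in for rowwise() %>% do().

```r
# Toy input: one pirate active 1648-1651, one active 1716-1718 (made-up rows).
toy <- data.frame(
  Name = c("Pirate A", "Pirate B"),
  piracing.start = c(1648L, 1716L),
  piracing.stop = c(1651L, 1718L),
  stringsAsFactors = FALSE
)

# Integer division floors each year to its decade: 1648 -> 1640, 1651 -> 1650.
toy$piracing.start <- toy$piracing.start %/% 10L * 10L
toy$piracing.stop <- toy$piracing.stop %/% 10L * 10L

# One row per pirate per decade in which that pirate was active.
expanded <- do.call(rbind, lapply(seq_len(nrow(toy)), function(i) {
  data.frame(
    Name = toy$Name[i],
    piracing.years = seq(toy$piracing.start[i], toy$piracing.stop[i], by = 10L),
    stringsAsFactors = FALSE
  )
}))

# Number of pirates active in each decade.
activity <- table(expanded$piracing.years)
print(expanded)
print(activity)
```

Pirate A ends up with two rows (1640 and 1650), Pirate B with one (1710), which is exactly the shape the bar chart needs.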

With the data now set up, we can set out to graph the pirate activity, and then also to build a random pirate generator!

Let’s take a look at the countries these famous pirates came from:

pirates.raw %>% 
 group_by(Country.of.origin) %>% 
 tally(sort = TRUE) %>% 
 head(15)

Country.of.origin	n
England	108
Netherlands	47
France	39
Unknown	25
Colonial America	22
United States	19
Spain	10
China	9
Venezuela	8
Germany	7
Ireland	7
Wales	7
Scotland	6
Ottoman Empire	4
Portugal	4
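
The same tally is presumably what sits behind the top.n.countries() helper used in the activity code, which is not shown in this post. A hypothetical base-R sketch, with a made-up toy vector in place of pirates.raw$Country.of.origin:

```r
# Toy stand-in for pirates.raw$Country.of.origin (counts are made up).
country.of.origin <- c(rep("England", 5), rep("Netherlands", 3),
                       rep("France", 2), "Spain")

# HYPOTHETICAL reconstruction of top.n.countries(): names of the n most
# frequent countries, most frequent first.
top.n.countries <- function(n, countries = country.of.origin) {
  counts <- sort(table(countries), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

top.n.countries(2)

# Lump everything outside the top n into "other", as the activity code does.
country <- ifelse(country.of.origin %in% top.n.countries(2),
                  country.of.origin, "other")
table(country)
```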

England is over-represented by quite a large margin, especially if you group Wales/Scotland/Ireland, and potentially Colonial America. The Netherlands and France come in second and third place, respectively. No big surprises here.

We’ll pull the data on pirate activity to build a graph of the overall pirate activity over the decades. This is a rather straightforward bar chart:

## pirate activity per decade
pirate.activity %>% 
 ungroup() %>% 
 #ungroup.rowwise_df() %>% 
 group_by(piracing.years, country) %>% 
 tally() %>% 
 ggplot(aes(x=piracing.years, y=n))+
 geom_bar(aes(fill=country, colour=country), stat = "identity", position = "stack")+
 # scale_fill_brewer(type = "qual", palette = 3)+
 # scale_colour_brewer(type = "qual", palette = 3)+
 scale_fill_manual(values = my.colours)+
 scale_colour_manual(values = my.colours)+
 theme_bw()+
 theme(plot.title = element_text(lineheight=.8, face="bold"))+
 labs(
 title = "Arrr!tivity of famous pirates per decade\n \u2620\u2620\u2620", 
 x = "", 
 y = "number of pirates active"
 )
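
The my.colours vector used in the plot is not defined in this post. A hypothetical stand-in is a named colour vector with one entry per country level plus "other". The level names below are placeholders, and the "Dark 3" palette is just one possible choice (hcl.colors() needs R 3.6 or later).

```r
# HYPOTHETICAL palette: placeholder levels; in the post they would come
# from top.n.countries(10) plus "other".
country.levels <- c("England", "Netherlands", "France", "other")

# Named vectors let scale_fill_manual(values = ...) match colours by level name.
my.colours <- stats::setNames(
  grDevices::hcl.colors(length(country.levels), palette = "Dark 3"),
  country.levels
)
my.colours
```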

One can nicely see the increase in pirate activity during the Golden Age (roughly 1560-1700). It drops briefly around 1700 and builds up again shortly thereafter. The middle of the 18th century is more or less free of pirates (at least of those infamous enough to end up on Wikipedia), until piracy picks up again around 1800 with the English-French wars. The Somali pirates of the 2000s are the last on the graph!

This leads us to the next question: Somalia was the heyday of pirates in the 2000s, but we cannot reasonably expect it to be in the 1600s. When was a country’s individual golden age of pirating?

To answer this, we’ll build a ridgeplot (formerly also known as a joyplot, from the Joy Division band t-shirt; the name is now discouraged for obvious reasons), showcasing the density plots of pirate activity per country.

## piracy golden age by top n (15) countries
library(ggridges) # provides geom_density_ridges()
pirates.dta %>% 
 filter(piracing.start > 1000) %>% 
 filter(Country.of.origin %in% top.n.countries(15)) %>% 
 ggplot()+
 geom_density_ridges(aes(x = piracing.start, y = Country.of.origin, fill = Country.of.origin))+
 theme_bw()+
 theme(plot.title = element_text(lineheight=.8, face="bold"))+
 labs(
 title = "Pirate activity by country\n \u2620\u2620\u2620", 
 x = "", 
 y = "country"
 )

The “usual suspects” show up nicely: England, the Netherlands, France. It’s interesting to see Germany show up so early (around 1350-1400); these are mostly the pirates of the Hansa. Venezuela had a “late blooming” pirate life, mostly in the 19th century. Interesting, I had not known about that!

With the historical data on pirates and their activity times, we can also try to generate a random pirate story! For this, we will restrict the data to the golden age of piracy, dropping all pirates who started their pirate life before 1500 or after 1750. We want to create a random pirate with a statistically “correct” starting year of piracy, a statistically “correct” piracy tenure, and varnish him (or her) with a random pirate name!

Golden Age Pirate: starting piracy year

Let’s start with the pirating starting activity year.

pirates.fit <- pirates.dta %>% 
 filter(piracing.start > 1500) %>% 
 filter(piracing.start < 1750) %>% 
 mutate(
 arrr = piracing.stop - piracing.start
 )
 
pirates.fit %>% 
 ggplot()+
 geom_density(aes(x=piracing.start))

Unfortunately, it’s a heavily left-skewed distribution. After some hours of looking into distributions I couldn’t find a good fit, so we’ll have to resort to pulling from the empirical distribution (i.e., the data itself) to generate a random start year for a pirate:

f.gen.pirating.startyear <- function() {
 # generate a startyear by pulling from the empirical distribution (pirates.fit$piracing.start)
 base::sample(pirates.fit$piracing.start, 1)
}
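
Drawing from the empirical distribution simply means that years which occur often in the data are drawn often. A minimal sketch with made-up start years (not the Wikipedia data) illustrates this:

```r
# Toy start years standing in for pirates.fit$piracing.start (made-up values).
start.years <- c(1560L, 1660L, 1690L, 1715L, 1715L, 1715L, 1718L, 1720L)

set.seed(2017) # only to make the sketch reproducible
draws <- sample(start.years, size = 10000, replace = TRUE)

# Relative frequencies of the draws track the frequencies in the data:
# 1715 makes up 3/8 of the toy data, so it should make up roughly 3/8 of the draws.
round(prop.table(table(draws)), 3)
```

One caveat of base::sample(): if the vector has length one, sample(x, 1) draws from 1:x instead of returning x, so this approach assumes more than one pirate in the data.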

Golden Age Pirate: piracing tenure

We have better luck with the tenure of a pirate: the distribution follows a nice count-data distribution, I’m guessing a negative binomial.

pirates.fit %>% 
 ggplot()+
 # plot the tenure (arrr), not the start year
 geom_density(aes(x=arrr))

As it’s discrete data, we’ll use the negative binomial distribution to estimate the theoretical distribution of a pirate’s tenure using library(fitdistrplus).

#library(fitdistrplus)
# check which distribution would be a good fit: 
#fitdistrplus::descdist(pirates.fit$arrr, boot = 1000)

# negative binomial
f1 <- fitdistrplus::fitdist(pirates.fit$arrr, "nbinom")
fitdistrplus::plotdist(pirates.fit$arrr, "nbinom", para = list(size=f1$estimate[1], mu=f1$estimate[2]))

The results are excellent: we can clearly see the oversampling of “round” years (10, in particular), which comes from the fact that with uncertain dates, Wikipedia noted a pirate’s activity in decades, rather than actual years. But still, our distribution of rnbinom(x, size = 0.3876683, mu = 6.395578) works nicely.

We can thus draw a random tenure for a pirate by using this function:

f.gen.pirating.tenure <- function() {
 # generate a tenure time from the estimated theoretical neg binomial distribution (pirates.fit$arrr)
 rnbinom(1, size=f1$estimate[1], mu=f1$estimate[2])
}
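
To get a feeling for what these parameters imply, the sketch below plugs the estimates quoted above into base R’s negative binomial functions. It only assumes the size/mu parameterisation that rnbinom() uses; the 20-year threshold is an arbitrary choice for illustration.

```r
# Parameters quoted in the text above.
size <- 0.3876683
mu <- 6.395578

# Mean and variance implied by the mu/size parameterisation.
nb.mean <- mu
nb.var <- mu + mu^2 / size

# Probability of a tenure of zero years (start and stop in the same year).
p.zero <- dnbinom(0, size = size, mu = mu)

# Probability of a tenure longer than 20 years.
p.long <- pnbinom(20, size = size, mu = mu, lower.tail = FALSE)

round(c(mean = nb.mean, variance = nb.var, p.zero = p.zero, p.long = p.long), 3)
```

The variance is far larger than the mean, and roughly a third of the probability mass sits on a tenure of zero, so a generated pirate will quite often start and stop in the same year.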

Golden Age Pirate: silly name generator

All that is missing now is a random pirate name generator. The Fantasy Name Generators site has a nice collection of name generators, and a pirate one is included! The site allows the use of the names “to name anything in any of your own works”, so let’s use them here!

We source the names into R and create a simple function pulling a random name, allowing you to specify male or female.

source("pirate-names.R")

f.gen.pirate.name <- function(female = FALSE) {
 # generate a pirate name from a list of 
 # - first names (male/female)
 # - nick names (male/female)
 # - last names
 if(female) {
 paste(
 base::sample(names.female, 1), 
 base::sample(names.nickf, 1), 
 base::sample(names.last, 1)
 )
 } else {
 paste(
 base::sample(names.male, 1), 
 base::sample(names.nickm, 1), 
 base::sample(names.last, 1)
 )
 }
}
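
The file pirate-names.R is not shown here. Judging from the function above, it presumably defines five character vectors; the sketch below is a hypothetical layout, filled with the handful of names that appear in the sample stories in this post (the real lists are much longer). The gen.name() helper is my own compact variant, not the original function.

```r
# HYPOTHETICAL layout of pirate-names.R: five character vectors.
# The entries are taken from the sample stories in this post.
names.male <- c("Gerard", "Orton", "Ascot")
names.female <- c("Ethyl", "Christine", "Huldah")
names.nickm <- c("'The Fierce'", "'Grommet'", "'Shadow'")
names.nickf <- c("'Temptress'", "'Daffy'", "'Twitching'")
names.last <- c("Eastoft", "Drachen", "Artemis", "Mitchell", "Ward", "Vome")

# Same logic as f.gen.pirate.name() above, written with a single paste() call.
gen.name <- function(female = FALSE) {
  first <- if (female) names.female else names.male
  nick <- if (female) names.nickf else names.nickm
  paste(sample(first, 1), sample(nick, 1), sample(names.last, 1))
}

set.seed(19) # only to make the sketch reproducible
gen.name()
gen.name(female = TRUE)
```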

Golden Age Pirate: a pirate’s story!

With all the pieces set up, we can now generate a random pirate:

f.pirate.story <- function(female = FALSE) {
 # build the pirate story by joining name, startyear, and tenure time
 yy <- f.gen.pirating.startyear()
 rr <- paste0(
 "Your name is ", 
 f.gen.pirate.name(female), 
 " and you roamed the Caribbean from ", 
 yy, " to ", yy+f.gen.pirating.tenure(),
 ". Arrr! \u2620"
 )
 return(rr)
}

\u2620 is the Unicode escape for the skull-and-crossbones symbol (☠). Calling the function several times then results in:

female	story
FALSE	Your name is Gerard ‘The Fierce’ Eastoft and you roamed the Caribbean from 1603 to 1605. Arrr! ☠
FALSE	Your name is Orton ‘Grommet’ Drachen and you roamed the Caribbean from 1657 to 1657. Arrr! ☠
FALSE	Your name is Ascot ‘Shadow’ Artemis and you roamed the Caribbean from 1650 to 1652. Arrr! ☠
TRUE	Your name is Ethyl ‘Temptress’ Mitchell and you roamed the Caribbean from 1610 to 1625. Arrr! ☠
TRUE	Your name is Christine ‘Daffy’ Ward and you roamed the Caribbean from 1657 to 1657. Arrr! ☠
TRUE	Your name is Huldah ‘Twitching’ Vome and you roamed the Caribbean from 1718 to 1718. Arrr! ☠

The code is available on GitHub, as always.

I wish I could find those historical shipping and piracy data again – any hints or pointers are very much appreciated!