Build your own offshore company

Hackathons are not alike

Recently, a number of this blog’s authors were at a data hackathon, the strangest one we’ve been to so far. It was more of a startup pitch gathering, complete with pitch training and whatnot. I was repeatedly asked by other participants “so, how do you want to monetise your idea?”. My answer was simple: I don’t. I already have a job.

The topic of the hackathon was sufficiently vague (“ideas on economy, media, news, content, social media, digitilisation, knowledge management, content creation and distribution, and media transformation”), so we decided we wanted to take a look at the recently published Panama Papers data.

Since the entire setup for the project is in R – the data preparation, the analysis, the dashboard, and even the presentation – I think it might be of interest to other people of the R community, hence this blogpost. For those impatient, here’s the direct link to the shiny app, and you can ignore my rambings: https://safferli.shinyapps.io/hackathon_shiny/. I will only post the most intersting code snippets, the full code (for all preparatory data hacking, the shiny dashboard, and the presentation) is available on github, as always.

Panama Papers

For those that have been living under a rock in April, the Panama Papers are a leak of data from Mossack Fonseca, a law firm in Panama specialising in setting up offshore companies. It is so far the largest leak (source: The Economist) in history with 2.6TB of (original) data, and 11.5 million documents. It contains references to 29 billionaires (as ranked in Forbes), and 12 current or former country leaders.

The data on the network of firms and persons has recently been made available, and contains roughly 320 thousand companies. The hard work of cleaning the raw data and putting it into a useful format (.csv and a graph database format (neo4j) are provided) was done by newspaper companies and journalists world-wide.

Random Name Generator

My initial impulse to take a look at the data more closely came when a friend of mine and I realised that a lot of the company names in the Panama Papers actually kind of funny – and that it is extremely clear that nothing reputable can be expected from these companies. For instance, there is a “Moonlight Import/Export”, and a “You’ll See Ltd.”.

So, we agreed that we’d build a random name generator for offshore companies, using the most popular company names from the Panama papers. To make things more “web-two-zero-y”, we would build one of these “modern Facebook viral thingies”, where you can generate the name from your birth date – e.g. your day of birth will always point to a fixed name part. We built a nice infographic for this, which is also available on github to download.

How do we get to the names? Quite easily. We first grab the names from provided data, use the excellent tidyr::separate to split the names into their parts (e.g. “My Company” becoming “My” and “Company”), remove a couple of common stopwords and then pick the 31 most prevalent company name parts.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# read company data provided by the Panama Papers
Entities <- read_csv(paste0(csv_folder, "Entities.csv"))

Entities %<>%
 # get lower-case names
 mutate(n = name %>% tolower) %>%
 # get last "word" of name for corporation form (e.g. "ltd.")
 mutate(form = gsub(".* ", "", n))

# split the names into one column per "word"
Entities %<>% separate(n, paste0("n", 1:20), fill = "right")

# stopwords -- we don't want these in our companies
stop <- c(letters, "the", "com", "and", "of", "int", "pty", "samoa", "sdn", "europe", "ptc", "")

# first "word" of the company name
n <- "n1"
filter_criteria <- lazyeval::interp(~ ! col %in% stop & ! col %in% Entities$form, col = as.name(n))
first <- Entities %>% 
 group_by_(.dots = n) %>% 
 summarize(n = n()) %>% 
 filter_(filter_criteria) %>% 
 arrange(-n) %>% 
 .[[n]] %>% 
 head(31)
# birthdates: 31 days in a month, get top 31
paste(1:31, "=", first, "\n") %>% cat

This we do for the first three parts of the company name, and voilà, the company name generator is done!

Shiny dashboard

The next logical step was to build an app for this.

  1. randomly generate a name, until you’re satisfied with the result

pick an intermediary – the person who will help you set up the offshore company

  1. we’ll let you pick a country you want to start with, and then show the top five intermediaries, ranked by who opened up the most offshore companies

  2. for convenience, we add a google maps link to the intermediary’s address, so you can fire up your GPS easily

  3. check out intermediaries_from_country.R on github

pick a jurisdiction for the offshore company by showing pictures of the available beaches

  1. beach pictures were taken from the first result of a google image search of "$countryname+beach", with some fine tuning by hand (the US beach had girls in bikinis, which we did not think were appropriate for such a serious business proposal).

  2. check out jurisdiction_images.R on github

  • beach pictures were taken from the first result of a google image search of "$countryname+beach", with some fine tuning by hand (the US beach had girls in bikinis, which we did not think were appropriate for such a serious business proposal).

  • check out jurisdiction_images.R on github

All output is parsed into a fluidRow() UI element, step by step adding a new “row” to the dashboard.

The Shiny app is available on shinyapps.io, on the free plan, so it will be a bit slow: https://safferli.shinyapps.io/hackathon_shiny/. Head on over and build your own offshore company!

Conclusion

Ironically, we won the “Best Pitch” award for our project, even though we didn’t really pitch anything, and I also did not participate in the pitch training. If you speak German and are not put off by a shaky handcam recording, you can check out my presentation here. My talk starts at 1hr15min, roughly.

Writing this article, I just found out that there is a Panama Papers R package on github. So there you go, you can also get the data the easy way!

Code and data for this analysis is available on github, as always.