In this blog post, similar to a previous blog postI am going to show you how we can go from an Excel workbook that contains data to flat file. I willtaking advantage of the structure of the tables inside the Excel sheets by writing a functionthat extracts the tables and then mapping it to each sheet!
Last week, October 14th, Luxembourguish nationals went to the polls to elect the Grand Duke! No,actually, the Grand Duke does not get elected. But Luxembourguish citizen did go to the pollsto elect the new members of the Chamber of Deputies (a sort of parliament if you will).The way the elections work in Luxembourg is quite interesting; you can vote for a party, or votefor individual candidates from different parties. The candidates that get the most votes willthen seat in the parliament. If you vote for a whole party,each of the candidates get a vote. You get as many votes as there are candidates to vote for. So,for example, if you live in the capital city, also called Luxembourg, you get 21 votes to distribute.You could decide to give 10 votes to 10 candidates of party A and 11 to 11 candidates of party B.Why 21 votes? The chamber of Deputies is made up 60 deputies, and the country is divided into fourlegislative circonscriptions. So each voter in a circonscription gets an amount of votes that isproportional to the population size of that circonscription.
Now you certainly wonder why I put the flag of Gambia on top of this post? This is because thegovernment that was formed after the 2013 elections was made up of a coalition of 3 parties;the Luxembourg Socialist Worker’s Party, the Democratic Party and The Greens.The LSAP managed to get 13 seats in the Chamber, while the DP got 13 and The Greens 6,meaning 32 seats out of 60. So because they made this coalition, they could form the government,and this coalition was named the Gambia coalition because of the colors of these 3 parties:red, blue and green. If you want to take a look at the ballot from 2013 for the southern circonscription,click here.
Now that you have the context, we can go back to some data science. The results of the electionsof last week can be found on Luxembourg’s Open Data portal, right here.The data is trapped inside Excel sheets; just like I explained in a previous blog postthe data is easily read by human, but not easily digested by any type of data analysis software.So I am going to show you how we are going from this big Excel workbook to a flat file.
First of all, if you open the Excel workbook, you will notice that there are a lot of sheets; thereis one for the whole country, named “Le Grand-Duché de Luxembourg”, one for the four circonscriptions,“Centre”, “Nord”, “Sud”, “Est” and 102 more for each commune of the country (a commune is anadministrative division). However, the tables are all very similarly shaped, and roughly at thesame position.
This is good, because we can write a function to extracts the data and then map it overall the sheets. First, let’s load some packages and the data for the country:
1 |
|
1 |
|
{brotools}
is my own package. You can install it with:
1 |
|
it contains a function that I will use down below. The function I wrote to extract the tablesis not very complex, but requires that you are familiar with how {tidyxl}
imports Excel workbooks.So if you are not familiar with it, study the imported data frame for a few minutes. It will makeunderstanding the next function easier:
1 |
|
This function has three arguments, dataset
, starting_col
and target_rows
. dataset
is thedata I loaded with xlsx_cells
from the {tidyxl}
package. I think the following picture illustrateseasily what the function does:
So the function first filters only the rows we are interested in, then the cols. I then selectthe columns I want which are called character
and numeric
(if the Excel cell contains characters thenyou will find them in the character column, if it contains numbers you will them in the numericcolumn), then I fill the empty cells with the values from the numeric
column and the I removethe NA’s. These two last steps might not be so clear; this is how the data looks like up until theselect()
function:
1 |
|
So by filling the NA’s in the numeric the data now looks like this:
1 |
|
And then I filter out the NA’s from the character column, and that’s almost it! I simply needto add a new column with the party’s name and rename the other columns. I also add a “Year” colmun.
Now, each party will have a different starting column. The table with the data for the first partystarts on column 1, for the second party it starts on column 4, column 7 for the third party…So the following vector contains all the starting columns:
1 |
|
(If you study the Excel workbook closely, you will notice that I do not extract the last two parties.This is because these parties were not present in all of the 4 circonscriptions and are very, very,very small.)
The target rows are always the same, from 11 to 19. Now, I simply need to map this function tothis list of positions and I get the data for all the parties:
1 |
|
I also added the locality
and division
columns to the data.
Let’s take a look:
1 |
|
1 |
|
Very nice.
Now we need to do the same for the 4 electoral circonscriptions. First, let’s load the data:
1 |
|
Now things get trickier. Remember I said that the number of seats is proportional to the populationof each circonscription? We simply can’t use the same target rows as before. For example, for the“Centre” circonscription, the target rows go from 12 to 37, but for the “Est” circonscriptiononly from 12 to 23. Ideally, we would need a function that would return the target rows.
This is that function:
1 |
|
This function needs a dataset
, a sheet_to_extract
and a reference_address
. The referenceaddress is a cell that actually contains the number of seats in that circonscription, in ourcase “B5”. We can easily get the list of target rows now:
1 |
|
1 |
|
Now, let’s split the data we imported into a list, where each element of the list is a dataframewith the data from one circonscription:
1 |
|
Now I can easily map the function I defined above, extract_party
to this list of datasets. Well,I say easily, but it’s a bit more complicated than before because I have now a list of datasetsand a list of target rows:
1 |
|
The way to understand this is that for each element of list_data_districts
and list_targets
,I have to map extract_party
to each element of position_parties_national
. This gives the intentedresult:
1 |
|
1 |
|
I now need to add the locality
and division
columns:
1 |
|
We’re almost done! Now we need to do the same for the 102 remaining sheets, one for each communeof Luxembourg. This will now go very fast, because we got all the building blocks from before:
1 |
|
Let me introduce the following function: %-l%
. This function removes elements from lists:
1 |
|
1 |
|
You can think of it as “minus for lists”. This is called an infix operator.
So this function is very useful to get the list of communes, and is part of my package, {brotools}
.
As before, I load the data:
1 |
|
Then get my list of targets, but I need to change the reference address. It’s “B8” now, not “B7”.
1 |
|
I now create a list of communes by mapping a filter function to the data:
1 |
|
And just as before, I get the data I need by using extract_party
, and adding the “locality” and“division” columns:
1 |
|
The steps are so similar for the four circonscriptions and for the 102 communes that I couldhave write a big wrapper function and the use it for the circonscription and communes at once.But I was lazy.
Finally, I bind everything together and have a nice, tidy, flat file:
1 |
|
1 |
|
This blog post is already quite long, so I will analyze the data now that R can easily ingest itin a future blog post.
If you found this blog post useful, you might want to follow me on twitterfor blog post updates.
Related