Looking into 19th century ads from a Luxembourguish newspaper with R

The national library of Luxembourg publishedsome very interesting data sets; scans of historical newspapers! There are several data sets thatyou can download, from 250mb up to 257gb. I decided to take a look at the 32gb “ML Starter Pack�.It contains high quality scans of one year of the L’indépendence Luxembourgeoise (Luxembourguishindependence) from the year 1877. To make life easier to data scientists, the national libraryalso included ALTO and METS files (which is a XML schema that is used to describe the layout andcontents of physical text sources, such as pages of a book or newspaper) which can be easily parsedby R.

L’indépendence Luxembourgeoise is quite interesting in that it is a Luxembourguish newspaper writtenin French. Luxembourg always had 3 languages that were used in different situations, French, Germanand Luxembourguish. Luxembourguish is the language people used (and still use) for day to day lifeand to speak to their baker.Historically however, it was not used for the press or in politics. Instead it was German thatwas used for the press (or so I thought) and French in politics (only in1984 was Luxembourguish madean official Language of Luxembourg).It turns out however that L’indépendence Luxembourgeoise, a daily newspaper that does not existanymore, was in French. This piqued my interest, and it also made analysis easier, for 2 reasons:I first started with the Luxemburger Wort (Luxembourg’s Word I guess would be a translation), whichstill exists today, but which is in German. And at that time, German was written using the Frakturfont, which makes it barely readable. Look at the alphabet in Fraktur:

1
2
� � � � � � � � � � � � � � � � � � � � � � � � � �
� � � � � � � � � � � � � � � � � � � � � � � � � �

It’s not like German is already hard enough, they had to invent the least readable font ever to writeGerman in, to make extra sure it would be hell to decipher.

So basically I couldn’t be bothered to try to read a German newspaper in Fraktur. That’s when I noticedthe L’indépendence Luxembourgeoise… A Luxembourguish newspaper? Written in French? Soundsinteresting.

And oh boy. Interesting it was.

19th century newspapers articles were something else. There’s this article for instance:

For those of you that do not read French, this article relates that in France, the ministry ofjustice required priests to include prayers on the Sunday that follows the start of the new seasonof parliamentary discussions, in order for God to provide senators his help.

There this gem too:

This article presents the tallest soldier of the German army, called Emhke, and nominated by theGerman Emperor himself to accompany him during his visit to Palestine. Emhke was 2.08 meters talland weighted 236 pounds (apparently at the time Luxembourg was not fully sold on the metric system).

Anyway, I decided to take a look at ads. The last paper of this 4 page newspaper always containedads and other announcements. For example, there’s this ad for a pharmacy:

that sells tea, and mineral water. Yes, tea and mineral water. In a pharmacy. Or this one:

which is literally upside down in the newspaper (the one from the 10th of April 1877). I don’tknow if it’s a mistake or if it’s a marketing ploy, but it did catch my attention, 140 years later,so bravo. This is an announcement made by a shop owner that wants to sell all his merchandisefor cheap, perhaps to make space for new stuff coming in?

So I decided brush up on my natural language processing skills with R and do topic modeling on these ads.The challenge here is that a single document, the 4th page of the newspaper, contains a lot of ads.So it will probably be difficult to clearly isolate topics. But let’s try nonetheless.First of all, let’s load all the .xml files that contain the data. These files look like this:

1
2
3

 
 

I’m interested in the “CONTENT� tag, which contains the words. Let’s first get that into R.

Load the packages, and the files:

1
2
3
4
5
6
7
library(tidyverse)
library(tidytext)
library(topicmodels)
library(brotools)

ad_pages <- str_match(list.files(path = "./", all.files = TRUE, recursive = TRUE), ".*4-alto.xml") %>%
 discard(is.na)

I save the path of all the pages at once into the ad_pages variables. To understand how and whythis works, you must take a look at the hierarchy of the folder:

Inside each of these folder, there is a text folder, and inside this folder there are the .xmlfiles. Because this structure is bit complex, I use the list.files() function with theall.files and recursive argument set to TRUE which allow me to dig deep into the folderstructure and list every single file. I am only interested into the 4th page though, so that’s whyI use str_match() to only keep the 4th page using the ".*4-alto.xml" regular expression. Thisis the right regular expression, because the files are named like so:

1
1877-12-29_01-00004-alto.xml

So in the end, ad_pages is a list of all the paths to these files. I then write a functionto extract the contents of the “CONTENT� tag. Here is the function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
get_words <- function(page_path){
 page <- read_file(page_path)
 page_name <- str_extract(page_path, "1.*(?=-0000)") 
 
 page %>% 
 str_split("\n", simplify = TRUE) %>% 
 keep(str_detect(., "CONTENT")) %>% 
 str_extract("(?<=CONTENT)(.*?)(?=WC)") %>% 
 discard(is.na) %>% 
 str_extract("[:alpha:]+") %>% 
 tolower %>% 
 as_tibble %>% 
 rename(tokens = value) %>% 
 mutate(page = page_name)
}

This function takes the path to a page as argument, and returns a tibble with the two columns: onecontaining the words, which I called tokens and the second the name of the document this wordwas found. I uploaded on .xml filehereso that you can try the function yourself. The difficult part is str_extract("(?<=CONTENT)(.*?)(?=WC)")which is were the words inside the “CONTENT� tag get extracted.

I then map this function to all the pages, and get a nice tibble with all the words:

1
ad_words <- map_dfr(ad_pages, get_words)
1
ad_words
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 1,114,662 x 2
## tokens page 
## 
## 1 afin 1877-01-05_01/text/1877-01-05_01
## 2 de 1877-01-05_01/text/1877-01-05_01
## 3 mettre 1877-01-05_01/text/1877-01-05_01
## 4 mes 1877-01-05_01/text/1877-01-05_01
## 5 honorables 1877-01-05_01/text/1877-01-05_01
## 6 clients 1877-01-05_01/text/1877-01-05_01
## 7 à 1877-01-05_01/text/1877-01-05_01
## 8 même 1877-01-05_01/text/1877-01-05_01
## 9 d 1877-01-05_01/text/1877-01-05_01
## 10 avantages 1877-01-05_01/text/1877-01-05_01
## # ... with 1,114,652 more rows

I then do some further cleaning, removing stop words (French and German, because there are someads in German) and a bunch of garbage characters and words, which are probably when the OCR failed.I also remove some German words from the few German ads that are in the paper, because they havea very high tf-idf (I’ll explain below what that is).I also remove very common words in ads that were just like stopwords. Every ad of a shop mentioned theirclients with honorable clientèle, or used the word vente, and so on. This is what you see belowin the very long calls to str_remove_all. I also compute the tf_idf and I am grateful toThinkR blog post on that, which you can read here.It’s in French though, but the idea of the blog post is to present topic modeling with Wikipediaarticles. You can also read the section on tf-idf from the Text Mining with R ebook, here.tf-idf gives a measure of how common words are. Very common words, like stopwords, have a tf-idfof 0. So I use this to further remove very common words, by only keeping words with a tf-idfgreater than 0.01. This is why I manually remove garbage words and German words below, because theyare so uncommon that they have a very high tf-idf and mess up the rest of the analysis. To find these wordsI had to go back and forth between the tibble of cleaned words and my code, and manually add allthese exceptions. It took some time, but definitely made the results of the next steps better.I then use cast_dtm to cast the tibble into a DocumentTermMatrix object, whichis needed for the LDA() function that does the topic modeling:

1
2
stopwords_fr <- read_csv("https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt",
 col_names = FALSE)
1
2
3
4
1
2
3
4
## Parsed with column specification:
## cols(
## X1 = col_character()
## )
1
2
stopwords_de <- read_csv("https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt",
 col_names = FALSE)
1
2
3
4
1
2
3
4
## Parsed with column specification:
## cols(
## X1 = col_character()
## )
1
2
3
## Warning: 1 parsing failure.
## row col expected actual file
## 157 -- 1 columns 2 columns 'https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
ad_words2 <- ad_words %>% 
 filter(!is.na(tokens)) %>% 
 mutate(tokens = str_remove_all(tokens, 
 '[|\|!|"|#|$|%|&|\*|+|,|-|.|/|:|;|<|=|>|?|@|^|_|`|’|

To read more details on this, I suggest you take a look at the following section of theText Mining with R ebook: Latent Dirichlet Allocation.

I choose to model 10 topics (k = 10), and set the alpha parameter to 5. This hyperparamater controls howmany topics are present in one document. Since my ads are all in one page (one document), Iincreased it. Let’s fit the model, and plot the results:

1
lda_model_long <- LDA(dtm_long, k = 10, control = list(alpha = 5))

I plot the per-topic-per-word probabilities, the “beta� from the model and plot the 5 words thatcontribute the most to each topic:

1
2
3
4
5
6
7
8
9
10
11
12
13
result <- tidy(lda_model_long, "beta")

result %>%
 group_by(topic) %>%
 top_n(5, beta) %>%
 ungroup() %>%
 arrange(topic, -beta) %>% 
 mutate(term = reorder(term, beta)) %>%
 ggplot(aes(term, beta, fill = factor(topic))) +
 geom_col(show.legend = FALSE) +
 facet_wrap(~ topic, scales = "free") +
 coord_flip() +
 theme_blog()

So some topics seem clear to me, other not at all. For example topic 4 seems to be about shoes madeout of leather. The word semelle, sole, also appears.Then there’s a lot of topics that reference either music, bals, or instruments.I guess these are ads for local music festivals, or similar events. There’s also an ad for whatseems to be bundles of sticks, topic 3: chêne is oak, copeaux is shavings and you knowwhat fagots is. The first word stère which I did not know is a unit of volume equal to onecubic meter (see Wikipedia). So they were likely sellingbundle of oak sticks by the cubic meter. For the other topics, I eitherlack context or perhaps I just need to adjust k, the number of topics to model, and alpha to get betterresults. In the meantime, topic 1 is about shoes (chaussures), theatre, fuel (combustible)and farts (pet). Really wonder what they were selling in that shop.

In any case, this was quite an interesting project. I learned a lot about topic modelingand historical newspapers of my country! I do not know if I will continue exploring it myself,but I am really curious to see what others will do with it!

Hope you enjoyed! If you found this blog post useful, you might want to followme on twitter for blog post updates andbuy me an espresso or paypal.me.

|‘|(|)|\||~|=|]|°|<|>|«|»|\d{1,100}|©|®|•|—|�|“|-|¦\\|�')) %>% mutate(tokens = str_remove_all(tokens, "j'|j’|m’|m'|n’|n'|c’|c'|qu’|qu'|s’|s'|t’|t'|l’|l'|d’|d'|luxembourg|honneur|rue|prix|maison|frs|ber|adresser|unb|mois|vente|informer|sann|neben|rbudj|artringen|salz|eingetragen|ort|ftofjenb|groifdjen|ort|boch|chem|jahrgang|uoa|genannt|neuwahl|wechsel|sittroe|yerlorenkost|beichsmark|tttr|slpril|ofto|rbudj|felben|acferftücf|etr|eft|sbege|incl|estce|bes|franzosengrund|qne|nne|mme|qni|faire|id|kil")) %>% anti_join(stopwords_de, by = c("tokens" = "X1")) %>% filter(!str_detect(tokens, "§")) %>% mutate(tokens = ifelse(tokens == "inédite", "inédit", tokens)) %>% filter(tokens != "") %>% anti_join(stopwords_fr, by = c("tokens" = "X1")) %>% count(page, tokens) %>% bind_tf_idf(tokens, page, n) %>% arrange(desc(tf_idf)) dtm_long <- ad_words2 %>% filter(tf_idf > 0.01) %>% cast_dtm(page, tokens, n)

To read more details on this, I suggest you take a look at the following section of theText Mining with R ebook: Latent Dirichlet Allocation.

I choose to model 10 topics (k = 10), and set the alpha parameter to 5. This hyperparamater controls howmany topics are present in one document. Since my ads are all in one page (one document), Iincreased it. Let’s fit the model, and plot the results:

1
lda_model_long <- LDA(dtm_long, k = 10, control = list(alpha = 5))

I plot the per-topic-per-word probabilities, the “beta� from the model and plot the 5 words thatcontribute the most to each topic:

1
2
3
4
5
6
7
8
9
10
11
12
13
result <- tidy(lda_model_long, "beta")

result %>%
 group_by(topic) %>%
 top_n(5, beta) %>%
 ungroup() %>%
 arrange(topic, -beta) %>% 
 mutate(term = reorder(term, beta)) %>%
 ggplot(aes(term, beta, fill = factor(topic))) +
 geom_col(show.legend = FALSE) +
 facet_wrap(~ topic, scales = "free") +
 coord_flip() +
 theme_blog()

So some topics seem clear to me, other not at all. For example topic 4 seems to be about shoes madeout of leather. The word semelle, sole, also appears.Then there’s a lot of topics that reference either music, bals, or instruments.I guess these are ads for local music festivals, or similar events. There’s also an ad for whatseems to be bundles of sticks, topic 3: chêne is oak, copeaux is shavings and you knowwhat fagots is. The first word stère which I did not know is a unit of volume equal to onecubic meter (see Wikipedia). So they were likely sellingbundle of oak sticks by the cubic meter. For the other topics, I eitherlack context or perhaps I just need to adjust k, the number of topics to model, and alpha to get betterresults. In the meantime, topic 1 is about shoes (chaussures), theatre, fuel (combustible)and farts (pet). Really wonder what they were selling in that shop.

In any case, this was quite an interesting project. I learned a lot about topic modelingand historical newspapers of my country! I do not know if I will continue exploring it myself,but I am really curious to see what others will do with it!

Hope you enjoyed! If you found this blog post useful, you might want to followme on twitter for blog post updates andbuy me an espresso or paypal.me.