In the last article in our series on data science applied to Philippine politics, we extracted Senate bill data from the Senate website and did some basic token visualization using word clouds. We saw which words appeared frequently in the bills filed per Congress, but we also noted word clouds don’t really allow us to identify the topics and quantify the topic proportions each Congress focused on. Since word clouds are mainly for visualization and not for actual modeling, we have to resort to a different method.
Enter topic modeling.
Topic modeling is an unsupervised learning technique to discover hidden topics in a set of text documents. The math behind topic models is, dare I say, quite involved, even for someone with a math degree like me. In practice, it’s more important that one knows how to use the model and understand what the results mean. Let’s leave the math to the academics.
To make it easier for everyone, let me give an concrete example of a topic model in action. Suppose you have a digitized set of Philippine news articles from Manila Bulletin. Feed it into a topic model, and it will discover clusters of words that constitute ‘topics’. For example, one cluster could contain words like ‘GDP’, ‘stocks’, ‘PSEI’ – these words signify the topic ‘economy’. Another cluster could contain words like ‘celebrity’, ‘Kris Aquino’, ‘KathNiel,” which are all about entertainment.
*A technical note: a topic model doesn’t actually “cluster” words together. For each topic, the model defines a probability distribution over all words and assign high probability to words that are relevant to the topic.
For an example study on a topic model applied to Hackernews, check out Diving into Data.
So what can you do with this set of discovered topics? For one, you can decompose an article into its constituent topics. An article talking about Kris Aquino entering Philippine politics might be labeled by a topic model to be 60% entertainment and 40% politics. This is somewhat quite remarkable since you didn’t give the topic model any context on entertainment and politics – you just fed it a set of unstructured and unlabeled text. The combination of probability theory and a ton of text data allow us to discover meaning behind text.
When someone talks about topic modeling, 95% of the time he means the Latent Dirichlet Allocation (LDA) model, a specific type of topic model that works quite well with minimal training effort. For a very intuitive explanation of how LDA works, check out Edwin Chen’s article. Python has a very nice library called Gensim, dubbed ‘Topic Modeling for Humans’, that makes it 100x easier to build topic models out of raw text data.
For some additional information on topic models, check out Text Mining with Probabilistic Topic Models by Chemudugunta.
Implementation and Technical Notes
To use a topic model, you need to specify the number of topics K to discover. Here I arbitrarily set K = 15. Not all topics discovered will be interpretable, and I will only retain those that appear to be coherent.
I’m using a cool tool called pyLDAvis to visualize the discovered topics in a two-dimensional bubble map. pyLDAvis also helps in the interpretation of the discovered topics.
I didn’t really focus that much on tuning in this post. But assuming I had unlimited computing power, I would probably try out different random seeds and pick the model that has the most interpretable topics. Some would advocate using the coherence model to arrive at the optimal model, but that didn’t really work that well by experience.
The focus of this post is to demonstrate how advanced NLP techniques can be used in quantitative social science, as applied in the Philippine setting. As such, we can make do with a less-than-perfect model.
My Jupyter notebook contains code detailing how to construct the pyLDAvis bubble plot of the discovered topics. Let’s look at some sample topics.
Based on the top-ranking words, it’s not difficult to see that the first topic is on education.
Topic two looks like it focuses on health.
We can go through each bubble and interpret what the discovered topics mean. Note that not all topics are coherent. For example, topic six seems to be confused.
From the 15 topics discovered, I was able to interpret 12 of them. Based on the numbering in the bubble chart, these are:
3: public works
5: tax and benefits
7: child/women protection
8: family/human/cultural rights
12: roads/ transport
13: regional court
Temporal Dynamics of Senate Bill Topics
Given the discovered topics in the last section, we can plot how bill topics change from Congress to Congress.
Remember how the topic model is able to split a text into its constituent topics? In our analysis, a bill is considered to be of topic X if it is at least 40% topic X. By counting up all the bills belonging to topic X for each Congress, we are able to see to what extent Senate focuses on topic X. Note that the number of bills filed by each Congress differs (due to a number of reasons). In order to put everything in same terms, I normalized counts by total bills filed.
Here are the top six bill topics. I didn’t plot everything because a lot of the lines overlapped and I didn’t want to confuse people.
In terms of magnitude, education bills dominate the Senate, followed by health. We can see here that focus on education picked up during the 15th and 16th Congress, whereas health and crime dwindled over the years. There seems to be a resurgence in public works and worker bills based on the analysis. Lastly, tax and benefits seem to be quite stagnant.
Some sample bills the model discovered to be focused on education are as follows:
On public works:
On tax and benefits:
What’s the Point?
One may think, why not just label each bill’s topics manually? Why go through the hassle of learning Gensim and topic models to discover and label topics?
Well, go right ahead then. Label the topics for all 14,000 bills. Let’s see how long that will take you.
Although the topic model cannot guarantee 100% performance in accurate labeling (it is unsupervised afterall), the knowledge discovered and the speed associated with automatic labeling make it worth the effort.
This type of study generates a lot of discussion points. Do you agree with the model? Or does it need some fine-tuning? Is the focus of the Congress right for the nation, or is there something else that needs to be given more attention? These are some of the topics interested readers can sound off on.
Some Other Things to Do
Again, I have a newfound interest in quantitative social science. I had tremendous fun playing around with this sort of data. Maybe I should pivot the blog to Data Meets Social Science? (Probably not.)
Here are some other things that I’m planning to explore in future articles in this series.
Take into account the temporal differences of the Senate bills. In other words, we can integrate the information on which Congress filed which bill into the model. This will allow us to see how topics change over the years. For this case, we can use dynamic topic models or DTMs.
Use author (senator) information in some way. It would also be interesting to see author information used in the model. This could be something basic like a comparison of the number of bills proposed by each senator, to something advanced like an application of the author-topic model, which discovers a distribution of topics filed by each author similar to how a vanilla topic model assigns a distribution of words to each topic.
Leave a comment below if you have any insights or questions.