Let’s take a detour from the main focus of this blog (which, if you don’t know, is data science applied to entertainment) to something closer to home: data science on the Philippine government.
Don’t get me wrong. I have no intention of turning my blog into a political platform. In fact, I hate talking about politics. There are two reasons why I’m starting a series on data science on Philippine politics. One, I found a fabulous dataset, which is rare in the Philippines as my last intern can attest. Two, I have a newfound interest in quantitative social science, particularly of the applied NLP flavor.
In this series of posts, I zoom in on Senate bills from the 13th Congress (which began in 2004) to the 17th Congress (which began in 2016). Because of how Philippine elections work, the composition of Congress changes every three years: senators serve six-year terms, with half of them elected every three years.
I focus on two ideas:
- What general topics do Philippine senators write bills on?
- How do these topics change from Congress to Congress?
By topic in the first bullet, I mean the general gist of the bill. Is the bill about education? About prohibition? By the second bullet, I mean whether the number of bills about, say, health changes as the composition of Congress changes. Does it increase from 2004 to 2016?
Answers to these two questions give us insight into the issues that the Philippine Senate is focusing on. Does the Senate prioritize new taxes over healthcare? New holidays over new public projects? These are some of the questions that I aim to answer in this series.
## Web Scraping and Text Preprocessing
Conveniently, the Philippine government maintains a database of proposed Senate and House bills from the 13th to 17th Congress. With a wave of my BeautifulSoup wand, I was able to extract the pertinent information from the website: the title of the bill, the description of the bill, the year it was filed, and the authors of the bill. In total, I recovered 14,741 Senate bills proposed from June 30, 2004 to May 23, 2018.
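For the curious, here is a minimal sketch of what that kind of extraction can look like. Everything specific in it is made up for illustration: the HTML snippet, the class names (`bill`, `title`, `description`, `filed`, `authors`), the toy bill entries, and the `parse_bills` helper. The real website's markup is different and would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and toy data, NOT the real website's structure or bills.
SAMPLE_HTML = """
<div class="bill">
  <span class="title">Sample Education Act</span>
  <p class="description">An act providing scholarship grants</p>
  <span class="filed">2004</span>
  <span class="authors">Author One</span>
</div>
<div class="bill">
  <span class="title">Sample Health Act</span>
  <p class="description">An act establishing a health program</p>
  <span class="filed">2016</span>
  <span class="authors">Author Two; Author Three</span>
</div>
"""


def parse_bills(html):
    """Extract title, description, filing year and authors from each bill entry."""
    soup = BeautifulSoup(html, "html.parser")
    bills = []
    for node in soup.find_all("div", class_="bill"):
        authors_text = node.find(class_="authors").get_text()
        bills.append({
            "title": node.find(class_="title").get_text(strip=True),
            "description": node.find(class_="description").get_text(strip=True),
            "year": int(node.find(class_="filed").get_text(strip=True)),
            # Assumes multiple authors are separated by semicolons.
            "authors": [name.strip() for name in authors_text.split(";")],
        })
    return bills


bills = parse_bills(SAMPLE_HTML)
for bill in bills:
    print(bill["year"], bill["title"], bill["authors"])
```

In practice the HTML would come from an HTTP request per results page rather than from a string constant, but the parsing step would have the same shape.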
For this article, my focus is on the textual content of the bill, and so I concatenated the title and description of each bill and used that to characterize the bill. There is rich information contained in the author data, but I will save that for another blog post. For now I focus on the title and description.
A standard toolkit for Natural Language Processing in Python is NLTK. For a first course on the subject, check out University of Michigan’s Coursera course. Craving more? Check out Natural Language Processing with Python by Bird et al.
Standard preprocessing steps I applied are as follows:
- removal of punctuation
- word tokenization
- part-of-speech (POS) tagging (retaining only nouns and adjectives)
- lemmatization with WordNet
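The steps above can be sketched roughly as follows. This is an illustration under a few assumptions, not my exact pipeline: the function names are made up, lowercasing is my own addition, punctuation is replaced with spaces (so hyphenated names split in two), and a plain whitespace split stands in for a proper word tokenizer. The last function does use NLTK's `pos_tag` and `WordNetLemmatizer`, and it assumes NLTK is installed with its tagger and WordNet data already downloaded via `nltk.download`.

```python
import string


def remove_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace.

    A simple stand-in for a real word tokenizer.
    """
    return remove_punctuation(text).lower().split()


def keep_nouns_and_adjectives(tokens):
    """POS-tag the tokens, keep nouns and adjectives, and lemmatize them.

    Requires NLTK plus its POS tagger and WordNet resources.
    """
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    kept = []
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):    # Penn Treebank noun tags: NN, NNS, NNP, NNPS
            kept.append(lemmatizer.lemmatize(word, pos="n"))
        elif tag.startswith("JJ"):  # adjective tags: JJ, JJR, JJS
            kept.append(lemmatizer.lemmatize(word, pos="a"))
    return kept


# Toy bill: title and description are concatenated, as described earlier.
title = "Sample Education Act"
description = "An act providing scholarship grants, and for other purposes."
tokens = tokenize(title + " " + description)
print(tokens)
```

Calling `keep_nouns_and_adjectives(tokens)` on the result would then drop words such as "an", "and" and "for" and reduce plurals like "grants" to their lemma.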
After getting the data in clean and pristine form, I removed some words that appeared very frequently in the corpus. For example, I removed the word ‘act’ since it appeared in almost all Senate bills. These are called domain-specific stop words, and it’s important to remove them because they tend to drown out less frequent but more informative words.
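One way to find such words automatically is to flag anything that shows up in nearly every document. The sketch below does that with a document-frequency cutoff. The 0.9 threshold, the function names, and the toy documents are assumptions for illustration; picking the words by hand after eyeballing the counts works just as well.

```python
from collections import Counter


def find_domain_stop_words(docs, max_doc_fraction=0.9):
    """Return the words that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word at most once per document
    cutoff = max_doc_fraction * len(docs)
    return {word for word, count in doc_freq.items() if count > cutoff}


def remove_stop_words(docs, stop_words):
    """Drop every stop word from every tokenized document."""
    return [[t for t in tokens if t not in stop_words] for tokens in docs]


# Toy tokenized documents (illustrative only).
docs = [
    ["act", "child", "vision", "care"],
    ["act", "child", "neglect", "penalty"],
    ["act", "warrant", "arrest", "penalty"],
    ["act", "trial", "court"],
]
stop_words = find_domain_stop_words(docs, max_doc_fraction=0.9)
cleaned = remove_stop_words(docs, stop_words)
print(stop_words)
print(cleaned)
```

Here only "act" crosses the cutoff, since it is the only word present in all four toy documents.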
After preprocessing, I converted the words into their Bag-of-Words representation for further processing.
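In case Bag-of-Words is new to you: each document becomes a vector of word counts over a shared vocabulary, and word order is thrown away. A minimal pure-Python sketch, with made-up function names and toy documents, looks like this:

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct word in the corpus to a column index (alphabetical order)."""
    words = sorted({token for tokens in docs for token in tokens})
    return {word: index for index, word in enumerate(words)}


def to_bag_of_words(tokens, vocabulary):
    """Return a count vector whose i-th entry is the count of vocabulary word i."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokens).items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Toy tokenized documents (illustrative only).
docs = [
    ["child", "vision", "care", "child"],
    ["trial", "court", "court"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_bag_of_words(tokens, vocabulary) for tokens in docs]
print(vocabulary)
print(vectors)
```

For a corpus of thousands of bills, a sparse representation is a better fit than plain lists, since most entries in each vector are zero.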
## Word Cloud Analysis
For this first post, let me just do something basic to understand the data: visualization. Let us look at the word clouds for each Congress.
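A word cloud simply sizes each word by how often it appears, so the only real computation is a frequency count per Congress. The sketch below shows that counting step on toy data; the function name and the `(congress, tokens)` input format are assumptions, and the drawing itself is left to whatever plotting or word-cloud library you prefer.

```python
from collections import Counter, defaultdict


def word_frequencies_by_congress(bills):
    """Count word occurrences per Congress.

    bills: iterable of (congress_number, tokens) pairs.
    """
    freqs = defaultdict(Counter)
    for congress, tokens in bills:
        freqs[congress].update(tokens)
    return freqs


# Toy preprocessed bills (illustrative only).
bills = [
    (13, ["child", "vision", "care"]),
    (13, ["child", "neglect", "penalty"]),
    (14, ["trial", "court"]),
    (14, ["court", "judiciary"]),
]
freqs = word_frequencies_by_congress(bills)
for congress in sorted(freqs):
    # The most frequent words are the ones drawn largest in the cloud.
    print(congress, freqs[congress].most_common(3))
```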
Keywords for the 13th Congress are service, child and penalty. Some bills filed by this Congress are:
- Vision Care for Kids Act by Miriam Defensor-Santiago: “… establish a grant program to provide vision care to children …”
- Anti-Neglected Children’s Act of 2007 by Miriam Defensor-Santiago: “… penalize the neglect of a child by parents …”
- Service of Warrants of Arrests in Certain Cases Prohibition by Juan Flavier: “… prohibiting the service of warrants of arrests in certain cases and providing penalties for violations …”
Most words in the 14th Congress are on an even playing field, but notice that trial, court and judiciary get repeated a bunch of times. This is because quite a number of bills filed during this time, mostly by Chiz Escudero, are concerned with the addition of regional or metropolitan trial courts.
For the 15th Congress, the apparent keywords are education, service and local. Some example bills filed during this time are:
- Scholarship Grants to PNP, AFP, BFP, BJMP Family Members by Ramon Revilla Jr.
- Universally Accessible Cheaper and Quality Medicines Act by Loren Legarda
- Silent Mode Act by Miriam Defensor-Santiago
Again, service, local and education are apparent keywords during the 16th Congress. Some sample bills filed during this Congress are:
- Consumer Protection Act by Miriam Defensor-Santiago
- Anti-Distracted Driving Act by Ejercito-Estrada
- Sustainable Transportation Act by Pia Cayetano, et al.
For the 17th Congress, as with the 14th, no keyword pops out. Well, except for service. This is the current Congress, and here are some representative bills:
- Anti-Hazing Act by Honasan
- Public Safety Act by Sotto
- Transportation Crisis Act of 2016 by Drilon
## Thoughts and What’s Next
The word clouds in the last section are pretty… but not really very informative. At this point, we’re at the visualization phase of our data science pipeline. What comes after is actual modeling.
In the next post in the series, we’ll apply topic modeling, an unsupervised learning technique, to the Philippine Senate bills. This will hopefully help us break down the topic composition of each Congress and give us insight into how it changes over time.
Leave your thoughts and comments below.