From the first time I listened to Radiohead’s The Bends, the band has been my favorite. I was a grad student in England at the time, and I recall listening to “Fake Plastic Trees” on repeat as I made my way to and from the library each day. By the time OK Computer came out, I was hooked. I remain hooked to this day.
As a writer-turned-data-science practitioner, text mining and natural language processing fascinate me. So in this post I’ll combine these two passions—words and Radiohead—to analyze the band’s lyrics.
The first thing to do is create the dataset, using Josiah Parry’s wonderful geniusR
package to obtain the lyrics. I’ve seen other people get the discography from Wikipedia, but it’s a short enough list that it would be more trouble for me to do it that way than to just create the list from scratch.
So the first thing I want to do is just get a sense of overall word frequency, so I’ll use the wordcloud
package.
If you’ve listened to Radiohead for any length of time, you don’t need me to tell you that their songs are often dark. This is evidenced in the word cloud of their lyrics. The word cloud, though, includes all of their albums together, which can’t then give us a real sense of what each album is about or of the progression over time. Even doing a word cloud for each one isn’t going to give us an adequate picture of meaning because often a word that is unimportant to meaning will nonetheless occur frequently. So, for example, the word there’ll appears as a frequent word in the above word cloud, but without context that word is virtually meaningless. And while we could add that to our list of stopwords to remove, that’s a pretty inelegant and unsophisticated way to adjust term frequency.
But term frequency (tf) is only one way of discerning the importance of specific terms. Another way is to look at inverse document frequency (idf), a method of weighting infrequent words higher than frequent words within a collection or corpus of documents. In turn, term frequency-inverse document frequency (tf-idf), which multiplies the two scores together, is a widely regarded statistical measure for identifying terms important to a document within a collection of documents. Importance increases in proportion to the number of occurrences in a document, but it is then offset by its frequency in the collection as a whole. Hence, a word that is rare in a collection, but frequent in a document, will have a higher weight.
So let’s run the model and use ggplot2
to visualize the results.
We can see that the term raindrops, which appeared as the most prominent term in the word cloud, is indeed quite important for the Hail to the Thief record, with the highest weight of any term in the entire corpus. We can also see that the terms hurt and haunt from The King of Limbs are also highly weighted for that album. Lastly, we can see that the most important words on A Moon-shaped Pool are non-English words. So let’s look at one of them to see it’s context.
Next, I’m curious about how the band has evolved, so I want to look at which words the band has used at a higher or lower rate over time. I’m relying on code that Julia Silge and David Robinson developed to analyze changes in their own Twitter output for the book Text Mining with R. As I create the dataset, I’m going to filter()
the results so that I’m only keeping words that occur 15 or more times.
Each row in the dataset corresponds to the use of a given word on a particular album (represented for now by the year
value). Count
represents how many times the word is used on an album. Time_total
represents the total number of words in the album’s lyrics, and word_total
represents the total number of times the word is used in the complete collection of albums.
Now I’ll use nest()
from the tidyr
package to create a new listed data frame, and then I’ll use map()
from the purrr
package to apply a regression model, a family = "binomial"
glm()
model since this is count data.
No I have a models
column that holds the glm
results, and I want to extract from that the slopes for each word/year. I’ll use map()
and tidy()
from the broom
package and apply an adjustment to the p-values to account for multiple comparisons.
So there are four words that have changed significantly over time: broken, light, mess, and arms. Let’s plot the changes to see when they occur.
All four words become more frequent over time. Broken, light, and mess all seem self-explanatory. Personally, I’m curious about arms and how that’s used, i.e., does it signify appendages or weapons?
Well, that’s interesting. Early instances of arms in the lyrics actually comes from alarms.
There’s a lot one can do with lyrics using NLP. I’ve seen several posts that have done sentiment analysis of Radiohead’s dark, dark lyrics, like this one here that figured out Radiohead’s saddest song. (Spoiler: It’s “True Love Waits.”) I actually started this project using lyrics from The Decemberists, thinking that their involuted lyrics would be interesting to look at, but their songs have so many la-la-las that it was more trouble than it was worth.
The post Prophets of gloom: Using NLP to analyze Radiohead lyrics appeared first on my (mis)adventures in R programming.
Related