Do you subscribe to the Data is Plural newsletter from Jeremy Singer-Vine? You probably should, because it is a treasure trove of interesting datasets arriving in your email inbox. In the November 28 edition, Jeremy linked to the Small World of Words project, and I was entranced. I love stuff like that, all about words and how people think of them. I have been mulling over a blog post ever since, and today it is finally done, so let’s see what’s up!
It’s a Small World
The Small World of Words project focuses on word associations. You can try it out for yourself to see how it works, but the general idea is that the participant is presented with a word (from “telephone” to “journalist” to “yoga”) and is then asked to give their immediate association with that word. The project has collected more than 15 million responses to date, and is still collecting data. You can check out some pre-built visualizations the researchers have put together to explore the dataset, or you can download the data for yourself.
The dataset as it existed when I downloaded it includes 1,228,200 word associations (each of which involves four words, i.e. three connections) by 83,864 unique participants. When a participant starts on a word association, this project has them move forward through three hops in a chain, from cue to R1 to R2 to R3, and then start over with a new cue. Participants can go through many cues in any given session.
Participants can also report other information about themselves. For example, what is the age distribution?
There are lots of young folks represented in this project, as is typical for online surveys. What about gender?
In this project, women were more likely to participate than other genders.
This project is international, drawing participants with many different native languages. It also allows folks to specify whether they are a US English speaker, a UK English speaker, etc.
So that’s a little bit of EDA to understand this project and its participants. Now let’s dig into the word associations!
Building forward associations
This is a rich, detailed dataset and there are so many directions we could go with it. In taking a first stab, let’s look at all the forward associations in the whole project. This means we will treat the “hop” from the cue to the first association the same as the “hop” from the first to second association, which certainly isn’t entirely correct. It’s a choice to start from, though.
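Here is a minimal sketch of what stacking the hops could look like, using a tiny made-up table instead of the real data. The columns cue, R1, R2, and R3 follow the chain described earlier; the participant_id, from, to, and hop names are just assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for the real data: one row per chain (values are made up)
swow_toy <- tibble(
  participant_id = c(1, 1, 2),
  cue = c("coffee", "water", "coffee"),
  R1  = c("tea", "sink", "cup"),
  R2  = c("cup", "bath", "tea"),
  R3  = c("saucer", "soap", "milk")
)

# Stack the three hops so that every hop becomes one (from, to) pair
forward_toy <- bind_rows(
  swow_toy %>% select(participant_id, from = cue, to = R1) %>% mutate(hop = 1L),
  swow_toy %>% select(participant_id, from = R1, to = R2) %>% mutate(hop = 2L),
  swow_toy %>% select(participant_id, from = R2, to = R3) %>% mutate(hop = 3L)
) %>%
  # Drop hops where either end is missing
  filter(!is.na(from), !is.na(to))

forward_toy
```

Keeping the hop column around means the "all hops are equal" simplification can be revisited later by filtering on it.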
Now that we have all the forward associations, we can find the most common associations for any individual word with some simple dplyr operations. What about… coffee? ☕
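As a rough sketch, those dplyr operations could be as simple as a filter() followed by a count(). The from and to column names and the toy rows below are assumptions for illustration, not the real data.

```r
library(dplyr)

# Toy forward associations (made up); from/to are hypothetical column names
forward_toy <- tibble(
  from = c("coffee", "coffee", "coffee", "tea", "coffee"),
  to   = c("tea", "cup", "tea", "cup", "bean")
)

# Most common words that people moved to, starting from "coffee"
coffee_counts <- forward_toy %>%
  filter(from == "coffee") %>%
  count(to, sort = TRUE)

coffee_counts
```

Swapping the string in filter() is all it takes to look up a different word.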
Or… maybe you are in a celebratory holiday mood, and want to know what people associate with the word “Christmas”.
Comparing groups
This project recorded information about the participants themselves, so we can dig into how different kinds of people associate words. For example, let’s start with gender and compare folks who identify as men and women. What differences do we see with the word “water”?
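One way to set up a comparison like this is to compute, for each group, the proportion of responses that went to each word, and then take a log ratio of the two proportions. This is only a sketch under stated assumptions: the toy rows are made up, the gender labels and column names are hypothetical, and the add-one smoothing is my own choice here to avoid dividing by zero.

```r
library(dplyr)
library(tidyr)

# Toy associations for the word "water" (made up); labels are hypothetical
water_toy <- tibble(
  gender = c("women", "women", "women", "women", "men", "men", "men", "men"),
  to     = c("sink", "bath", "sink", "steam", "steam", "steam", "sink", "ice")
)

water_compare <- water_toy %>%
  count(gender, to) %>%
  # Make sure every word appears for both groups, with zero counts filled in
  complete(gender, to, fill = list(n = 0L)) %>%
  group_by(gender) %>%
  # Add-one smoothing (an assumption) so no proportion is exactly zero
  mutate(proportion = (n + 1) / sum(n + 1)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = gender, values_from = proportion) %>%
  # Positive values lean toward women, negative values toward men
  mutate(log_ratio = log2(women / men)) %>%
  arrange(desc(log_ratio))

water_compare
```

Words at the top and bottom of the resulting table are the ones that differ most between the two groups.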
Notice the dramatic contrasts between domestic water uses like sinks and baths and more scientific words about water like steam. We see how socialized and differentiated women’s language is, even with something that seems neutral like water.
What about differences between US and UK English?
Well, alrighty then. 😳
Changes with age
We can apply some functional programming and modeling to look at how these word associations change with age. Let’s take the word “money”, and start by calculating, for 5-year age bins, the count and proportion of each associated word within each bin.
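A minimal sketch of that binning step, on made-up rows, might look like the following. The age and to column names are assumptions, as is the choice to label each bin by its lower edge.

```r
library(dplyr)

# Toy associations for the word "money" (made up), one row per response
money_toy <- tibble(
  age = c(18, 19, 22, 23, 24, 31, 33, 34, 47, 48),
  to  = c("job", "green", "job", "cash", "job",
          "dollars", "stocks", "dollars", "stocks", "dollars")
)

binned_toy <- money_toy %>%
  # Label each 5-year bin by its lower edge, e.g. ages 20 to 24 become 20
  mutate(age_bin = 5 * floor(age / 5)) %>%
  count(age_bin, to) %>%
  group_by(age_bin) %>%
  # Total responses in the bin, and each word's share of that total
  mutate(bin_total = sum(n),
         proportion = n / bin_total) %>%
  ungroup()

binned_toy
```

Keeping both the count and the bin total is handy, because a binomial model needs the successes and the failures rather than just the proportion.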
Now let’s fit some models using glm(). Since this is count data, we predict the counts out of the total for each age bin from the age. We can then tidy() the output of the modeling, adjust the p-values for multiple comparisons since we looked at a bunch of words at one time, and make a volcano-style plot to compare the effect size with the p-value.
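To make the modeling step concrete, here is a sketch of fitting one binomial glm() per word with nested data frames, on made-up counts. The column names are hypothetical, and using cbind(successes, failures) as the response with p.adjust() at its default method is one reasonable setup, not necessarily the exact one behind the plot.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Toy counts (made up): per word and age bin, how many of the bin's
# responses were that word, out of bin_total responses in the bin
counts_toy <- tibble(
  to        = rep(c("job", "stocks"), each = 4),
  age_bin   = rep(c(20, 30, 40, 50), times = 2),
  n         = c(30, 24, 15, 10, 5, 9, 16, 22),
  bin_total = 100
)

slopes_toy <- counts_toy %>%
  # One nested data frame per word
  nest(data = c(age_bin, n, bin_total)) %>%
  mutate(
    # Binomial model: successes and failures per bin, predicted from age
    model  = map(data, ~ glm(cbind(n, bin_total - n) ~ age_bin,
                             data = .x, family = "binomial")),
    tidied = map(model, tidy)
  ) %>%
  select(to, tidied) %>%
  unnest(tidied) %>%
  # Keep only the slope for age, dropping the intercept
  filter(term == "age_bin") %>%
  # Adjust for having tested many words at once
  mutate(p.adjusted = p.adjust(p.value))

slopes_toy
```

A negative estimate means the word becomes less common with age, and a positive one means it becomes more common. Plotting estimate against p.adjusted gives the volcano-style view.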
The younger someone is, the more likely they are to associate money with a job, or the color green or gold. The older someone is, the more likely they are to associate money with dollars and stocks. Let’s look at the top terms associated with money that exhibit change with age in terms of a small p-value.
Younger respondents were more likely to associate words like DEBT with money. 😵
The End
There is so much more that could be done with this dataset. You could build a network data structure between the words and do various kinds of network analysis, and I didn’t touch any of the differences in the initial cue vs. the later hops. Notice that this dataset is all about words, but I didn’t ever load the tidytext package for this analysis. The researchers who built this dataset have already done much of the hard work of processing this data, and it is more like structured data that happens to be about language, rather than unstructured text data. Let me know if you have any questions!