A while ago I developed and shared an emoji decoder because I was facing problems when retrieving data from Twitter and Instagram. In a nutshell, the issue is that R encodes emojis in a way that makes identifying them a hassle. This is where the decoder/dictionary comes into play.
Since I put together my decoder, new emojis have been released. For example, certain emojis now come in different skin colours, which adds a little bit of complexity to the analysis. In fact, emojis with skin tone information consist of two Unicode codepoints: the codepoint for the emoji (e.g. “U+1F466” for boy) and the codepoint for the respective skin tone (e.g. “U+1F3FB” for light skin tone). The emoji “boy: light skin tone” thus appears as “U+1F466 U+1F3FB”. The descriptions of the plain emoji and the emoji with skin tone information differ slightly (e.g. “princess” vs “princess: light skin tone”), so simple string matching would fail to identify these as the same emoji. This requires some special attention when cleaning the data. More on that in the code. This list is a more complete version than the one I used for my decoder. Felipe released a new decoder based on the new list, which I will use in this post.
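One way to handle this during cleaning is to drop the skin tone modifier before matching. The sketch below is only an illustration, not the code used for this post: it assumes codepoints are stored as space-separated strings such as “U+1F466 U+1F3FB”, and the helper names strip_skin_tone and strip_tone_label are made up for this example. The five modifiers U+1F3FB to U+1F3FF are the skin tone codepoints defined by Unicode.

```r
# The five Unicode skin tone modifiers (light to dark)
skin_tones <- c("U+1F3FB", "U+1F3FC", "U+1F3FD", "U+1F3FE", "U+1F3FF")

# Hypothetical helper: remove skin tone modifiers from a codepoint string
strip_skin_tone <- function(codepoints) {
  parts <- strsplit(codepoints, " ", fixed = TRUE)
  vapply(parts,
         function(p) paste(p[!p %in% skin_tones], collapse = " "),
         character(1))
}

# Hypothetical helper: remove a trailing ": <tone> skin tone" from a description
strip_tone_label <- function(description) {
  sub(": [a-z-]+ skin tone$", "", description)
}

base_codepoint   <- strip_skin_tone("U+1F466 U+1F3FB")
base_description <- strip_tone_label("princess: light skin tone")
```

After this step, the plain emoji and its skin tone variants share the same codepoint and description, so they can be counted as one emoji.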
Alright, so with the decoder at hand, we’re able to identify the emojis in, say, a tweet retrieved with the twitteR package. What now? Quite a few people have contacted me since I released the article, asking for advice on emoji analysis, and I’d like to cover some of their questions in this post. The whole code I used for the analysis in this article is available here.
Most used emoji
One such question was how to determine a user’s most used emoji.
We’ll start by collecting some sample data to perform our analysis on. In case you didn’t already know, Paris Hilton is my favorite victim when it comes to emoji analysis or social media analysis, for the simple reason that she uses a lot of them and shares a lot of content. The emojis_matching function is the heart of all the analysis performed in this post.
Paris Hilton’s favorite emoji is: SPARKLES! Who would have thought ;) Her all-time favorite MUSICAL NOTES landed in 7th place (as of 2017-03-24).
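The counting idea can be sketched in a few lines of base R. This is not the emojis_matching function from the linked code: count_emoji, the three-row dictionary and the sample tweets are all made up for illustration. The sketch assumes the tweets have already been converted to ASCII with “<xx>” byte placeholders (the form that iconv() produces with sub = "byte"), so an emoji can be found with plain fixed-string matching.

```r
# Hypothetical mini dictionary: description plus "<xx>" byte encoding
emoji_dict <- data.frame(
  description = c("SPARKLES", "MULTIPLE MUSICAL NOTES",
                  "SMILING FACE WITH HEART-SHAPED EYES"),
  r_encoding  = c("<e2><9c><a8>", "<f0><9f><8e><b6>", "<f0><9f><98><8d>"),
  stringsAsFactors = FALSE
)

# Made-up sample tweets, already in byte-encoded form
tweets <- c(
  "Loving it <e2><9c><a8><e2><9c><a8>",
  "New track out now <f0><9f><8e><b6><e2><9c><a8>",
  "No emoji here"
)

# Count how often one encoding occurs in each text
count_emoji <- function(encoding, texts) {
  hits <- gregexpr(encoding, texts, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Rows = tweets, columns = emojis
counts <- vapply(emoji_dict$r_encoding, count_emoji,
                 integer(length(tweets)), texts = tweets)

# Total per emoji, most used first
ranking <- data.frame(
  description = emoji_dict$description,
  count       = as.integer(colSums(counts)),
  stringsAsFactors = FALSE
)
ranking <- ranking[order(-ranking$count), ]
```

The first row of ranking is then the most used emoji in the sample.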
Another possible use case is to determine which tweet contains the most emojis. We can use the emojis_matching function for this question again and then arrange by descending count. Done. Easy peasy lemon squeezy. This is what the output looks like:
From here, it’s easy to calculate the average number of emojis per tweet:
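As a rough sketch of both steps (ranking tweets by emoji count, then averaging), assuming byte-encoded tweets as above. The sample tweets and the two-entry list of encodings are invented for the example, and this is not the original code.

```r
# Made-up sample tweets in "<xx>" byte-encoded form
tweets <- c(
  "New track <f0><9f><8e><b6>",
  "Loving it <e2><9c><a8><e2><9c><a8><f0><9f><8e><b6>",
  "No emoji here"
)

# Encodings to look for (SPARKLES, MULTIPLE MUSICAL NOTES)
encodings <- c("<e2><9c><a8>", "<f0><9f><8e><b6>")

# Total number of known emojis in each tweet
emojis_per_tweet <- vapply(tweets, function(tw) {
  sum(vapply(encodings, function(enc) {
    sum(gregexpr(enc, tw, fixed = TRUE)[[1]] > 0)
  }, integer(1)))
}, integer(1), USE.NAMES = FALSE)

# Tweets arranged by descending emoji count
per_tweet <- data.frame(text = tweets, n_emojis = emojis_per_tweet,
                        stringsAsFactors = FALSE)
per_tweet <- per_tweet[order(-per_tweet$n_emojis), ]

# Average number of emojis per tweet (tweets without emojis included)
avg_emojis <- mean(per_tweet$n_emojis)
```

Whether tweets without any emoji should count towards the average is a design choice; here they do.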
Sentiment analysis with emojis
Doing some research for this article, I came across this paper, which extensively analyzes the valence (positive, negative, neutral) of emojis in a scientific manner. The authors made a csv file available containing all the emojis with their respective valences. I will base the sentiment analysis on this file.
For some reason unknown to me, the available csv file doesn’t contain the sentiment score. The article gives guidance on how to compute the sentiment score of each emoji based on the data in the csv file, but at this point I prefer to scrape the list with the sentiment scores. It’s available here. After merging it with our first emoji list to get the R encoding info, it looks like this:
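The merge itself could look roughly like the sketch below. The column names, the matching on lower-cased descriptions and the scores are assumptions made for illustration; the scores are placeholder values, not the ones from the scraped list.

```r
# Hypothetical emoji dictionary with the R byte encoding
emoji_dict <- data.frame(
  description = c("SPARKLES", "MULTIPLE MUSICAL NOTES", "PENSIVE FACE"),
  r_encoding  = c("<e2><9c><a8>", "<f0><9f><8e><b6>", "<f0><9f><98><94>"),
  stringsAsFactors = FALSE
)

# Hypothetical scraped sentiment list (placeholder scores, not real data)
sentiment_list <- data.frame(
  description     = c("sparkles", "pensive face"),
  sentiment_score = c(0.4, -0.2),
  stringsAsFactors = FALSE
)

# Normalise the join key, since the two lists may differ in letter case
emoji_dict$key     <- tolower(emoji_dict$description)
sentiment_list$key <- tolower(sentiment_list$description)

# Left join: keep every dictionary entry, NA where no score is known
emoji_sentiment <- merge(
  emoji_dict,
  sentiment_list[, c("key", "sentiment_score")],
  by = "key", all.x = TRUE
)
```

Emojis that are missing from the sentiment list keep an NA score, which makes it easy to see how much of the dictionary is covered.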
Ok, so we have a list of emojis with their respective unicode codepoints and sentiment scores. The next step consists of matching sentiments to the tweets.
What we get is a list of the tweets with their respective aggregated sentiment scores:
Note that the score of a single tweet is the sum of the sentiment scores of all emojis in the tweet. The higher the score, the more positive the tweet. One could relate this number to the number of emojis in the tweet, or choose a more binary format like “positive”/“negative”. Most of Paris Hilton’s tweets are positive, which was to be expected, with only two tweets identified as having a negative valence. Some tweets don’t have any sentiment score because they didn’t contain any (identifiable) emoji.
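The aggregation described here can be sketched as follows, including the per-emoji average and a simple label. The scores are again placeholder values, and count_emoji and the column names are hypothetical, not taken from the original code.

```r
# Hypothetical emoji/sentiment table (placeholder scores, not real data)
emoji_sentiment <- data.frame(
  r_encoding      = c("<e2><9c><a8>", "<f0><9f><98><94>"),
  sentiment_score = c(0.4, -0.2),
  stringsAsFactors = FALSE
)

# Made-up sample tweets in "<xx>" byte-encoded form
tweets <- c(
  "Best night ever <e2><9c><a8><e2><9c><a8>",
  "Missed my flight <f0><9f><98><94>",
  "No emoji here"
)

# Count how often one encoding occurs in each text
count_emoji <- function(encoding, texts) {
  hits <- gregexpr(encoding, texts, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Rows = tweets, columns = emojis
counts <- vapply(emoji_sentiment$r_encoding, count_emoji,
                 integer(length(tweets)), texts = tweets)
n_emojis <- rowSums(counts)

# Tweet score = sum over emojis of (count * sentiment score)
score_sum <- as.vector(counts %*% emoji_sentiment$sentiment_score)
score_sum[n_emojis == 0] <- NA          # no identifiable emoji, no score

# Score relative to the number of emojis, and a coarse label
score_mean <- score_sum / n_emojis
label <- c("negative", "neutral", "positive")[sign(score_sum) + 2]
```

Tweets without any identifiable emoji end up with NA for the score and the label, which keeps them distinct from genuinely neutral tweets.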
One question that came to my mind was: what words appear in the same tweets as emojis? What are the top n words associated with each emoji? Before we can perform this kind of text analysis, we need to do the usual housekeeping: clean the texts of links, strange characters, punctuation etc. I recommend having a look at the code to see exactly what I did, especially the cleaning pipe. Furthermore, I wrote the function wordFreqEmojis, which outputs a data frame of emojis with the top 5 words (default value, can be changed) they are most frequently used with.
Browsing through words_emojis, it’s obvious that the data makes a lot of sense. The emojis associated with the word “adios”, for instance, are “sun”, “water wave”, “bikini” and “airplane”. These emojis all indicate that the Twitter user is travelling.
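To give an idea of how such a word/emoji association can be computed, here is a stripped-down sketch in base R. It is not the wordFreqEmojis function from the linked code: top_words_for_emoji, the tiny stopword list and the sample tweets are made up, and the cleaning is far cruder than a real cleaning pipe would be.

```r
# Made-up sample tweets in "<xx>" byte-encoded form
tweets <- c(
  "adios miami <f0><9f><8c><8a>",
  "beach day in miami <f0><9f><8c><8a><e2><9c><a8>",
  "studio night <e2><9c><a8>"
)

# Hypothetical helper: top n words in tweets containing a given emoji
top_words_for_emoji <- function(encoding, texts, n = 5,
                                stopwords = c("in", "the", "a")) {
  with_emoji <- texts[grepl(encoding, texts, fixed = TRUE)]
  cleaned <- gsub("<[0-9a-f]{2}>", " ", with_emoji)   # drop byte-encoded emojis
  cleaned <- gsub("[^a-z ]", " ", tolower(cleaned))   # keep letters and spaces
  words <- unlist(strsplit(cleaned, " +"))
  words <- words[nzchar(words) & !words %in% stopwords]
  if (length(words) == 0) return(integer(0))          # nothing to tabulate
  head(sort(table(words), decreasing = TRUE), n)
}

# Words most often used together with WATER WAVE
wave_words <- top_words_for_emoji("<f0><9f><8c><8a>", tweets)
```

Calling the helper once per dictionary entry and binding the results together would give a table similar in spirit to words_emojis.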
In natural language processing, the possibilities are nearly endless. One could fine-tune the results by using other stopwords, working with word stems, or considering n-grams instead of single words, just to name a few. Besides words, one can also find co-occurring emojis.
List of further emoji analysis ideas
Summing up, here are some ideas for further analysis that I didn’t implement in this article but that can be done with emojis:
- combine traditional text-based sentiment analysis with emoji-based sentiment analysis
- analyse co-occurrence of emojis
- identify topics based on emojis
- track trends in emoji use (weekdays, time of the day, “happy” emojis vs “sad” emojis, etc.)
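As a starting point for the co-occurrence idea, the following sketch counts in how many tweets each pair of emojis appears together. It is an illustration under the same assumptions as before (byte-encoded tweets, a made-up dictionary and sample data), not something taken from the code of this article.

```r
# Hypothetical mini dictionary
descriptions <- c("SPARKLES", "WATER WAVE", "AIRPLANE")
encodings    <- c("<e2><9c><a8>", "<f0><9f><8c><8a>", "<e2><9c><88>")

# Made-up sample tweets in "<xx>" byte-encoded form
tweets <- c(
  "adios <f0><9f><8c><8a><e2><9c><88>",
  "beach day <f0><9f><8c><8a><e2><9c><a8>",
  "takeoff <e2><9c><88><f0><9f><8c><8a>",
  "no emoji"
)

# Presence matrix: rows = tweets, columns = emojis
present <- vapply(encodings,
                  function(enc) grepl(enc, tweets, fixed = TRUE),
                  logical(length(tweets)))
colnames(present) <- descriptions

# Cell [i, j] = number of tweets containing both emoji i and emoji j
cooc <- crossprod(present * 1)
diag(cooc) <- 0   # ignore an emoji co-occurring with itself
```

The resulting symmetric matrix can be fed into a heatmap or a network graph to spot emojis that tend to travel together.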
Final thoughts
Emoji analysis is unlikely to do a good job of replacing natural language processing in a sentiment analysis context. Emojis can help easily identify positive content, but they’re not so good at identifying negative or serious, business-related content as far as I can tell. This makes sense, since most emojis have a positive meaning. Also, not everyone uses emojis in the same way and not every positive tweet contains an emoji, so again, I don’t think it’s a good idea to base your sentiment analysis exclusively on emojis. I’d rather suggest performing traditional sentiment analysis and enriching it with emoji data. Also, emoji analysis can become quite difficult due to the constantly growing number of emojis. Some of them have more than one Unicode codepoint, which can be a challenge in the analysis.
In this article, I only showed how to perform simple positive/negative sentiment analysis and very basic association with words, but one could come up with much more detailed approaches. One could harvest much more information from emojis than just their level of positivity. I’m thinking of activities, different sentiments, patriotism, fondness for children, frequency of travelling, etc.