If you have ever retrieved data from Twitter, Facebook or Instagram with R, you might have noticed a strange phenomenon. While R seems to be able to display some emoticons properly, many other times it doesn’t, making any further analysis impossible unless you get rid of them. With a little hack, I decoded these emoticons and put them all in a dictionary for further use. I’ll explain how I did it and share the decoder with you.
The meaning of Emoticons and why you should analyze them
As far as I remember, all the sentiment analysis codes I came across dealt with emoticons by simply getting rid of them. Now, it might be ok if you’re interested in analyzing someone’s vocabulary or do some fancy wordclouds. But if you want to perform sentiment analysis, than these emoticons are probably the most meaningful part of your data! Sentiments, emotions, emoticons, you know, there’s a link ;) Not only are they full of meaning by themselves, they also have the virtue to change the meaning of the sentences they are appended to. Think of a tweet like “I’m going to bed”. It has a dramatically different meaning, depending on whether a happy smiley or a sad smiley is associated to it. Long story short: If you’re interested in sentiment, you MUST capture emoticons!
Emoticons and R
As mentioned earlier, R seems to be totally capable of properly displaying some emoticons, while it fails displayng others. Try inputing \xE2\x9D\xA4
(heavy black heart) to the console and this is how the output will look like:
[1] “❤”
1 |
|
The display problem seems related to the length of the code. Apparently, if an emoticon’s UTF-8 code is longer then 12 characters, it won’t be displayed properly. Find a list of emoticons with their respective encodings here.
Now, the real problem occurs when you retrieve data from social media. This is how tweets look like when retrieved with the userTimeline()
function and parsed to a data frame with the twListToDF()
function from the twitteR package:
For demonstration purposes, I needed to find a twitter user who integrates a lot of different emoticons in his or her tweets. And who is better suited than emoticons queen Paris Hilton herself? Exactly, nobody :D As you can see, many emoticons aren’t displayed correctly but appear as strange question marks symbols instead, while others are displayed as emoticons.
Printed to the console, the strange question marks symbols aka emoticons look like unicode except that it is not unicode and doesn’t match the actual, right UTF-8 encoding I expected. Take this tweet, for example:
1 |
|
The first emoticon consists of multiple musical notes, it’s corresponding UTF-8 encoding is \xF0\x9F\x8E\xB6
but it is being rendered as \xed\xa0\xbc\xed\xbe\xb6\
.
Also, try to pass the tweet to the str()
function and you will get an error message:
1 |
|
If you don’t need the emoticons, the most reasonable thing to do is to get rid of them. But given that I wanted to include them in my analysis, I was facing two problems: First, it was difficult to handle the tweets because their format messed up with some functions and lead to error messages. I needed to change their encoding. Second, I couldn’t identify the emoticons as the way R encodes them doesn’t correspond to their actual UTF-8 encoding. It’s a real pain in the neck, especially (but not only) if you’re analyzing Paris Hilton’s tweets, because she just LOVES her musical notes emoticons ;) The solution that came to my mind was to decode them.
First things first, I used the iconv()
function with the following parameters to be able to further handle the tweets. As you can see, the multiple musical notes emoticon \xF0\x9F\x8E\xB6
now appears as <ed><a0><bc><ed><be><b6>
. This is an encoding I could work with and this was going to be the basis of the decoder.
1 |
|