How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis?
Getting: Error in sort.list(y) : invalid input
Thanks
This is how the emoticons look after coming from Twitter into R:
\xed��\xed�\u0083\xed��\xed��
\xed��\xed�\u008d\xed��\xed�\u0089
Remove mentions, as they carry no weight in sentiment analysis. Replace emojis with the text they represent, since emojis and emoticons play an important role in conveying sentiment. Replace contractions with their full forms. Remove any URLs present in tweets, as they are not significant for sentiment analysis.
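A minimal base R sketch of the cleaning steps listed above might look like the following. The helper name clean_tweet, the regular expressions, and the tiny contraction table are all illustrative assumptions, not something from the original posts, and the step that maps emojis to descriptive text is left out here.

```r
# Hypothetical helper: strips URLs and @mentions, expands a few contractions.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)  # drop URLs
  x <- gsub("@\\w+", "", x, perl = TRUE)          # drop @mentions
  # Tiny illustrative lookup table; a real one would be much longer
  contractions <- c("can't" = "cannot", "won't" = "will not", "I'm" = "I am")
  for (pat in names(contractions)) {
    x <- gsub(pat, contractions[[pat]], x, fixed = TRUE)
  }
  x <- gsub("\\s+", " ", x, perl = TRUE)          # collapse leftover whitespace
  trimws(x)
}

clean_tweet("@bob I can't wait http://t.co/abc #fun")
```

The order matters a little: URLs are removed before mentions so that an "@" inside a link is not treated as a mention.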
This should get rid of the emoticons, using iconv as suggested by ndoogan.
Some reproducible data:
require(twitteR)
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")
# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))
# inspect, yes there are some odd characters in row five
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
Here's the key line that will remove the emoticons:
# Clean text to remove odd characters (iconv is vectorized, so sapply is not needed)
df$text <- iconv(df$text, "latin1", "ASCII", sub = "")
Now inspect again to confirm that the odd characters are gone (see row 5):
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
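For anyone without Twitter credentials, the same idea can be tried on a couple of made-up strings. This is only a sketch: the sample tweets are invented, and it assumes the input is UTF-8 rather than latin1. Keep in mind that converting to ASCII with sub="" drops every non-ASCII character, accented letters included, not just the emoticons.

```r
# Made-up sample tweets; \U escapes keep the script itself pure ASCII
tweets <- c("I use emoticons too much \U0001F62C haha",
            "plain ascii text")

# Drop every byte that has no ASCII equivalent
cleaned <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# Sorting the cleaned vector, the operation that failed in the question
sort(cleaned)
```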
I recommend the function ji_replace_all(string, replacement) from the emo package, which can be installed with devtools::install_github("hadley/emo").
I needed to remove the emojis from tweets written in Spanish. I tried several options, but some of them mangled the text. This one, however, works perfectly:
library(emo)
text="#VIDEO 😢💔🙏🏻,Alguien sabe si en Afganistán hay cigarro?"
ji_replace_all(text,"")
Result:
"#VIDEO ,Alguien sabe si en Afganistán hay cigarro?"
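If installing a package from GitHub is not an option, a rough base R alternative is to delete every code point above U+FFFF, which is where most emoji live. This is a sketch under that assumption, not a replacement for emo: emoji that sit in the lower Unicode range (for example U+2764, the heavy black heart) would be left in place.

```r
# Sample text from the answer above, written with \U and \u escapes
text <- "#VIDEO \U0001F622\U0001F494\U0001F64F\U0001F3FB,Alguien sabe si en Afganist\u00e1n hay cigarro?"

# Remove every code point above U+FFFF; accented letters are left alone
no_emoji <- gsub("[\\x{10000}-\\x{10FFFF}]", "", text, perl = TRUE)
no_emoji
```

Unlike the iconv approach, this keeps characters such as "á", which matters for Spanish text.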