Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

FUN-error after running 'tolower' while making Twitter wordcloud

Tags:

r

twitter

tm

Trying to create wordcloud from twitter data, but get the following error:

Error in FUN(X[[72L]], ...) : 
  invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs' 

This error appears after running the "mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, tolower)" code

mytwittersearch_list <-sapply(mytwittersearch, function(x) x$getText())

mytwittersearch_corpus <-Corpus(VectorSource(mytwittersearch_corpus_list))
mytwittersearch_corpus<-tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus<-tm_map( mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <-tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))

I read on other pages this may be due to R having difficulty processing symbols, emoticons and letters in non-English languages, but this appears not to be the problem with the "error tweets" that R has issues with. I did run the codes:

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x)    iconv(enc2utf8(x), sub = "bytes")))

These do not help. I also get that it can't find function content_transformer even though the tm-package is checked off and running.

I'm running this on OS X 10.6.8 and using the latest RStudio.

like image 566
Anne Boysen Avatar asked Jan 03 '15 16:01

Anne Boysen


2 Answers

I use this code to get rid of the problem characters:

tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
like image 142
RUser Avatar answered Oct 13 '22 23:10

RUser


A nice example on creating wordcloud from Twitter data is here. Using the example, and the code below, and passing the tolower parameter while creating the TermDocumentMatrix, I could create a Twitter wordcloud.

library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)


#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")

#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())

#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))

#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,control = list(removePunctuation = TRUE,stopwords = c("new", "year", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))

#Convert as matrix
m = as.matrix(tdm)

#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 

#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

enter image description here

like image 2
deb2015 Avatar answered Oct 14 '22 00:10

deb2015