
Twitter Mining using R (twitteR + tm): error using tolower conversion

Tags: r, twitter, tm

I'm having some trouble working with Twitter data I extracted using the CRAN version of the twitteR package. In particular, the tolower conversion applied through tm_map from the tm package fails.

I'm following this example

This is what I'm currently doing:

# OAuth handshake and so on work fine
google_8.10 <- searchTwitter("#Google", n=1500, cainfo="cacert.pem")
google_8.10_text <- sapply(google_8.10, function(x) x$getText())
google_8.10_text_corpus <- Corpus(VectorSource(google_8.10_text))
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, tolower)
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, removePunctuation)
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, function(x) removeWords(x, stopwords()))

The other conversions complete just fine (if tolower isn't run). However the tolower conversion returns:

google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, tolower)
Warning message:
In parallel::mclapply(x, FUN, ...) :
  all scheduled cores encountered errors in user code

I suspect this is caused by some character in one of the tweets, but how can I track the problem down?

Edit: Indeed, certain characters seem to cause this, e.g.:

"#Google #TheInternship THE BEST MOVIE EVER @Jeennyy01 @dylanobrien    I love this part \ud83d\ude1c http://t.co/iok5vm83cP"

Here the "\ud83d\ude1c" part causes the error. Any idea how to automatically strip such sequences (this one is http://www.charbase.com/1f61c-unicode-face-with-stuck-out-tongue-and-winking-eye) from the tweets?

Matthias asked Jan 26 '26 03:01


1 Answer

According to the source, tolower can give an error:

Support for "bytes" marked encoding

nzchar and nchar(, "bytes") are independent of the encoding.

nchar(, "char") nchar(, "width") give NA (if allowed) or error.

substr substr<- work in bytes

abbreviate chartr make.names strtrim tolower toupper give error.

Here is an example where an error is thrown for an invalid code point (U+DC80, an unpaired surrogate):

tolower("\udc80")
Error in tolower("<ed><U+00B2><U+0080>") : 
  invalid input 'í²€' in 'utf8towcs'
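
To track down and strip the offending tweets, something along these lines might work. This is only a sketch: the `tweets` vector is a made-up stand-in for `google_8.10_text`, it assumes R >= 3.3.0 for `validUTF8()`, and the `iconv()` call removes every non-ASCII character (accented letters and emoji included), which may or may not be acceptable for your analysis.

```r
# Hypothetical sample standing in for google_8.10_text; the second element
# contains a raw byte (\xe9) that is not valid UTF-8.
tweets <- c("Plain ASCII Tweet", "Broken caf\xe9 BYTE", "Another FINE one")

# validUTF8() flags elements that are not valid UTF-8, which helps
# locate the tweets that are likely to upset tolower().
bad <- which(!validUTF8(tweets))

# Drop every byte that cannot be converted to ASCII (sub = "" replaces
# non-convertible bytes with nothing). This also discards legitimate
# non-ASCII characters.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# The cleaned vector can now be lower-cased.
lowered <- tolower(clean)
```

Running the same `iconv()` call on `google_8.10_text` before building the corpus should let the `tm_map(..., tolower)` step go through. Calling `tolower()` directly on the character vector, rather than through `tm_map`, also tends to surface the underlying error message instead of the generic `mclapply` warning.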
James answered Jan 28 '26 18:01