Twitter Data Analysis - Error in Term Document Matrix

Tags:

r

I am trying to analyze Twitter data. I downloaded the tweets and created a corpus from their text using the code below:

# Creating a Corpus
wim_corpus = Corpus(VectorSource(wimbledon_text)) 

When I try to create a TermDocumentMatrix as below, I get an error and several warnings.

tdm = TermDocumentMatrix(wim_corpus, 
                       control = list(removePunctuation = TRUE, 
                                      stopwords =  TRUE, 
                                      removeNumbers = TRUE, tolower = TRUE)) 

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),    : 'i, j, v' different lengths


In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
 all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In TermDocumentMatrix.VCorpus(corpus) : invalid document identifiers
4: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
NAs introduced by coercion

Can anyone explain what this error indicates? Could it be related to the tm package?

The tm library has been loaded. I am using R 3.0.1 and RStudio 0.97.

BRZ asked Aug 29 '13 07:08


3 Answers

I had the same problem and it turns out it is an issue with package compatibility. Try installing

install.packages("SnowballC")

and load with

library(SnowballC)

before calling TermDocumentMatrix (or DocumentTermMatrix).

It solved my problem.

Guillaume answered Nov 11 '22 23:11


I think the error is due to some "exotic" characters in the tweet messages that the tm functions cannot handle. I got the same error when using tweets as a corpus source. The following workaround may help:

# Reading some tweet messages (here from a text file) into a vector

rawTweets <- readLines(con = "target_7_sample.txt", ok = TRUE, warn = FALSE, encoding = "UTF-8")

# Convert the tweet text explicitly to UTF-8

convTweets <- iconv(rawTweets, to = "UTF-8")

# Tweets that cannot be converted come back as NA. Remove those entries with the following command:

tweets <- convTweets[!is.na(convTweets)]

If losing a few tweets is not an issue for your use case (e.g. building a word cloud), this approach may work, and you can proceed by calling the Corpus function of the tm package.
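If dropping whole tweets is a problem, a variation on the workaround above is to strip only the offending bytes. The sketch below uses base R only and compares both options. The sample data and the variable names (`sample_tweets`, `kept`, `cleaned`) are made up for illustration, and it assumes the input is meant to be UTF-8. How strictly `iconv` flags invalid bytes can vary by platform, so treat this as a starting point rather than a guaranteed fix.

```r
# Hypothetical sample data: one clean tweet and one containing a stray
# Latin-1 byte (0xE9), which is not valid on its own in UTF-8.
bad_word <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
sample_tweets <- c("plain ascii tweet", paste(bad_word, "au lait"))

# Option 1 (the approach above): tweets that fail conversion become NA
# and are dropped.
converted <- iconv(sample_tweets, from = "UTF-8", to = "UTF-8")
kept <- converted[!is.na(converted)]

# Option 2: keep every tweet and replace only the non-convertible bytes
# with an empty string.
cleaned <- iconv(sample_tweets, from = "UTF-8", to = "UTF-8", sub = "")

length(kept)     # number of tweets that survived option 1
length(cleaned)  # option 2 always keeps the original number of tweets
```

Either `kept` or `cleaned` can then be passed to `Corpus(VectorSource(...))` as in the question.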

Regards--Albert

Albert answered Nov 11 '22 23:11


I found a way to solve this problem in an article about tm.

Here is an example that triggers the warning:

getwd()
require(tm)

# Importing files
files <- DirSource(directory = "texts/",encoding ="latin1" )

# loading files and creating a Corpus
corpus <- VCorpus(x=files)

# Summary

summary(corpus)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)
Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

According to that article, this warning occurs because DocumentTermMatrix expects a corpus built from a proper source (such as a VectorSource), but the preceding transformations turn the documents in your corpus into plain character data, a class the function does not accept.

However, if you add one more command before calling DocumentTermMatrix, you can keep going.

Here is the code with the new command:

getwd()
require(tm)  

files <- DirSource(directory = "texts/",encoding ="latin1" )

# loading files and creating a Corpus
corpus <- VCorpus(x=files)

# Summary 
summary(corpus)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)

# Rebuild the corpus to restore the expected class and avoid the warning
corpus <- Corpus(VectorSource(corpus))
matrix_terms <- DocumentTermMatrix(corpus)

With this change, the warning should no longer appear.

3 revs, 2 users 72% answered Nov 12 '22 01:11