
RTextTools create_matrix returns non-character argument error

Tags: r, text-mining

I am new to text processing with R. I'm trying the simple code below:

library(RTextTools)
texts <- c("This is the first document.",
           "This is the second file.",
           "This is the third text.")
matrix <- create_matrix(texts, ngramLength=3)

which comes from one of the answers to the question "Finding 2 & 3 word Phrases Using R TM Package".

However, it fails with this error instead:

Error in FUN(X[[2L]], ...) : non-character argument

I can generate a document-term matrix when I drop the ngramLength parameter, but I do need to search for phrases of a certain word length. Any suggestions for alternatives or corrections?

Ricky asked Jul 31 '14 08:07


1 Answer

The ngramLength argument does not seem to work. Here is a workaround:

library(RTextTools)
library(tm)
library(RWeka) # this library is needed for NGramTokenizer
texts <- c("This is the first document.", 
           "Is this a text?", 
           "This is the second file.", 
           "This is the third text.", 
           "File is not this.") 
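# Tokenizer that emits only 3-word phrases (trigrams)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# VCorpus is used here because, in newer tm versions, Corpus() may build a
# SimpleCorpus, which does not accept custom tokenizer functions
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                          control = list(weighting = weightTf,
                                         tokenize = TrigramTokenizer))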

as.matrix(dtm)

The tokenizer here is RWeka's NGramTokenizer rather than the tokenizer called by create_matrix. You can now use dtm in the other RTextTools functions, for example to train a classification model as below:

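# One label per document, in the same order as texts
isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
# Documents 1-3 are used for training, documents 4-5 for testing
container <- create_container(dtm, isText, virgin=FALSE, trainSize=1:3, testSize=4:5)

models <- train_models(container, algorithm=c("SVM","BOOSTING"))
classify_models(container, models)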
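
If you just want to see what kind of terms a trigram tokenizer produces, without installing RWeka or Java, here is a minimal base R sketch. It is not RWeka's implementation: the helper name trigrams, the lowercasing, and the punctuation stripping are my own assumptions for illustration.

```r
# Hypothetical helper (not part of RTextTools, tm or RWeka):
# returns every run of three consecutive words in a single string.
trigrams <- function(x) {
  # Assumption: strip punctuation and lowercase before splitting on whitespace
  cleaned <- tolower(gsub("[[:punct:]]", "", x))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Fewer than three words means there is no trigram to return
  if (length(words) < 3) return(character(0))
  vapply(seq_len(length(words) - 2),
         function(i) paste(words[i:(i + 2)], collapse = " "),
         character(1))
}

texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.")

# One character vector of trigrams per document
phrases <- lapply(texts, trigrams)
print(phrases[[1]])

# How often each trigram occurs across all documents
print(table(unlist(phrases)))
```

The columns of the dtm built above play the same role as the names in this table, although the exact terms may differ depending on how tm and RWeka handle case and punctuation.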
user3631991 answered Oct 02 '22 00:10