I am new to text processing with R. I'm trying the simple code below:
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This comes from one of the answers to the question "Finding 2 & 3 word Phrases Using R TM Package".
However, instead of a matrix, it gives this error:
Error in FUN(X[[2L]], ...) : non-character argument
I can generate a document-term matrix when I drop the ngramLength parameter, but I need to search for phrases of a specific word length. Any suggestions for alternatives or corrections?
The ngramLength argument does not seem to work. Here is a workaround:
library(RTextTools)
library(tm)
library(RWeka) # provides NGramTokenizer (RWeka needs a working Java installation)
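texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")
# Tokenizer that emits only three-word phrases (trigrams)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# VCorpus is used because, in newer tm versions, Corpus() may return a
# SimpleCorpus, which ignores custom tokenizers
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                          control = list(
                            weighting = weightTf,
                            tokenize = TrigramTokenizer))
as.matrix(dtm)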
The tokenizer uses RWeka's NGramTokenizer instead of the tokenizer called by create_matrix. You can now use dtm in the other RTextTools functions, like training a classification model below:
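# Labels for the five documents; the first three train the models, the last two test them
isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
container <- create_container(dtm, isText, virgin = FALSE, trainSize = 1:3, testSize = 4:5)
models <- train_models(container, algorithms = c("SVM", "BOOSTING"))
classify_models(container, models)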
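As a rough illustration of what a trigram tokenizer produces, here is a minimal base-R sketch that needs neither RWeka nor Java. The helper word_ngrams is hypothetical (it is not part of RTextTools, tm, or RWeka), and its lower-casing and splitting rules are my own assumptions, so its tokens may differ in detail from what NGramTokenizer returns.

```r
# Hypothetical helper, not from any package: split a string into
# lower-cased words and return every run of n consecutive words.
word_ngrams <- function(x, n = 3) {
  # Assumption: anything that is not a letter or digit separates words
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]  # drop empty tokens from leading punctuation
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

texts <- c("This is the first document.", "Is this a text?")
trigrams <- lapply(texts, word_ngrams, n = 3)
print(trigrams[[1]])
# [1] "this is the"        "is the first"       "the first document"
```

Each element of trigrams holds the three-word phrases of one document, which is the kind of term a trigram-based document-term matrix has as its columns.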