Suppose I have text based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand.
I used tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document term matrix of 1 grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count of greater than 50.
Following this, I train a, say, GLMNET model using the DTM and the dependent variable (which was present in the training data). Everything runs smooth and easy till now.
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is that how do I create the exact DTM on new data?
If the new data set does not have any of the similar words as the original training data then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?
tm
has so many pitfalls... See much more efficient text2vec and vectorization vignette which fully answers to the question.
For tm
here is probably one more simple way to reconstruct DTM matrix for second corpus:
crude2.dtm <- DocumentTermMatrix(crude2, control = list
(dictionary=Terms(crude1.dtm), wordLengths = c(3,10)) )
If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (ie. terms) as the first dtm. If that's the case, then it should be a matter of sub-setting the second dtm by the terms in the first, perhaps something like this:
First set up some reproducible data...
This is your training data...
library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- Corpus(VectorSource(crude[1:10]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
And this is your testing data...
corpus2 <- Corpus(VectorSource(crude[15:20]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the bit that does what you want:
Now we keep only the terms in the testing data that are present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by colnames (ie. terms) or training data
xx <- data.frame(crude2.dtm.mat[,intersect(colnames(crude2.dtm.mat),
colnames(crude1.dtm.mat))])
Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...
# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
colClasses = "integer")
# add incols of NAs for terms absent in the
# testing data but present # in the training data
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
So zz
is a data frame of the testing documents, but has the same structure as the training documents (ie. same columns, though many of them contain NA, as SchaunW notes).
Is that along the lines of what you want?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With