My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)
GetCorpus <- function(textVector)
{
  # build a corpus and apply the usual preprocessing:
  # lower-case, strip numbers/punctuation, drop stopwords, stem, normalize whitespace
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1
My question is: what can I use to create a corpus and DTM faster? It becomes extremely slow once I use more than 300k rows.
I have heard that I could use data.table, but I am not sure how. I have also looked at the qdap package, but it gives me an error when I try to load it, and I don't even know if it will work.
Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf
What is a DTM? It is a matrix in which the rows are the documents in some sample of texts (called a corpus) and the columns are all the unique words (often called types, or the vocabulary) in the corpus.
A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and each entry is the number of occurrences of that term in that document. Alternatively, one can obtain a document-term matrix by swapping rows and columns (i.e. transposing).
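For a concrete picture, here is a minimal sketch using tm's DocumentTermMatrix() and TermDocumentMatrix() on two toy documents; the two matrices are simply transposes of each other (the output shown in comments is approximate):
# minimal sketch, assuming the tm package
library(tm)
toy <- VCorpus(VectorSource(c("dogs hunt dogs", "dogs bark")))
toy_dtm <- DocumentTermMatrix(toy)  # documents as rows, terms as columns
toy_tdm <- TermDocumentMatrix(toy)  # terms as rows, documents as columns
as.matrix(toy_dtm)
##     Terms
## Docs bark dogs hunt
##    1    0    2    1
##    2    1    1    0
# as.matrix(toy_tdm) is the transpose of the matrix above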
This is better than my earlier answer.
The quanteda package has evolved significantly and is now faster and much simpler to use, given its built-in tools for exactly this sort of problem -- which is what we designed it for. Part of the OP's question asked how to prepare the texts for a Bayesian classifier, so I've added an example of that too: quanteda's textmodel_nb() would crunch through 300k documents without breaking a sweat, and it correctly implements the multinomial NB model, which is the most appropriate for text count matrices (see also https://stackoverflow.com/a/54431055/4158274).
Here I demonstrate on the built-in inaugural corpus object, but the functions below would also work with a plain character vector input. I've used this same workflow to process and fit models to 10s of millions of Tweets in minutes, on a laptop, so it's fast.
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
# use a built-in data object
data <- data_corpus_inaugural
data
## Corpus consisting of 58 documents and 3 docvars.
# here we input a corpus, but plain text input works fine too
dtm <- dfm(data, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem(language = "english") %>%
  dfm_remove(stopwords("english"))
dtm
## Document-feature matrix of: 58 documents, 5,346 features (89.0% sparse).
tail(dtm, nf = 5)
## Document-feature matrix of: 6 documents, 5 features (83.3% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs bleed urban sprawl windswept nebraska
## 1997-Clinton 0 0 0 0 0
## 2001-Bush 0 0 0 0 0
## 2005-Bush 0 0 0 0 0
## 2009-Obama 0 0 0 0 0
## 2013-Obama 0 0 0 0 0
## 2017-Trump 1 1 1 1 1
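Since the question starts from a plain character column, here is the same pipeline sketched on a character vector (using the three example sentences from the question; output omitted) -- this is exactly what a 4M-row column would look like to quanteda:
# a sketch: the same pipeline applied to a plain character vector
txt <- c("Let the big dogs hunt",
         "No holds barred",
         "My child is an honor student")
dtm_txt <- dfm(txt, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem(language = "english") %>%
  dfm_remove(stopwords("english"))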
This is a rather trivial example, but for illustration, let's fit a Naive Bayes model, holding out the Trump document. This was the last inaugural speech at the time of this posting ("2017-Trump"), i.e. the document in position ndoc(dtm).
# fit a Bayesian classifier
postwar <- ifelse(docvars(data, "Year") > 1945, "post-war", "pre-war")
textmod <- textmodel_nb(dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)], prior = "docfreq")
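The multinomial assumption mentioned above can also be spelled out explicitly; to my knowledge the multinomial is already the default in this version's distribution argument, so the following sketch is equivalent to the call above (stored under a different name so the fitted object used below is unchanged):
# a sketch: same fit with the multinomial assumption made explicit
textmod_mn <- textmodel_nb(dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)],
                           prior = "docfreq", distribution = "multinomial")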
The same sorts of commands that work with other fitted model objects (e.g. lm(), glm(), etc.) will work with a fitted Naive Bayes textmodel object. So:
summary(textmod)
##
## Call:
## textmodel_nb.dfm(x = dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)],
## prior = "docfreq")
##
## Class Priors:
## (showing first 2 elements)
## post-war pre-war
## 0.2982 0.7018
##
## Estimated Feature Scores:
## fellow-citizen senat hous repres among vicissitud incid
## post-war 0.02495 0.4701 0.2965 0.06968 0.213 0.1276 0.08514
## pre-war 0.97505 0.5299 0.7035 0.93032 0.787 0.8724 0.91486
## life event fill greater anxieti notif transmit order
## post-war 0.3941 0.1587 0.3945 0.3625 0.1201 0.3385 0.1021 0.1864
## pre-war 0.6059 0.8413 0.6055 0.6375 0.8799 0.6615 0.8979 0.8136
## receiv 14th day present month one hand summon countri
## post-war 0.1317 0.3385 0.5107 0.06946 0.4603 0.3242 0.307 0.6524 0.1891
## pre-war 0.8683 0.6615 0.4893 0.93054 0.5397 0.6758 0.693 0.3476 0.8109
## whose voic can never hear vener
## post-war 0.2097 0.482 0.3464 0.2767 0.6418 0.1021
## pre-war 0.7903 0.518 0.6536 0.7233 0.3582 0.8979
predict(textmod, newdata = dtm[ndoc(dtm), ])
## 2017-Trump
## post-war
## Levels: post-war pre-war
predict(textmod, newdata = dtm[ndoc(dtm), ], type = "probability")
## post-war pre-war
## 2017-Trump 1 1.828083e-157
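To classify genuinely new text with the fitted model, one approach (a sketch, using a hypothetical new_txt string) is to build a dfm from the new documents with the same preprocessing and align its features to the training matrix with dfm_match():
# a sketch: score new, unseen text with the fitted model
new_txt <- "We will rebuild our cities and our highways"  # hypothetical example text
new_dfm <- dfm(new_txt, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem(language = "english") %>%
  dfm_remove(stopwords("english")) %>%
  dfm_match(features = featnames(dtm))  # keep only features seen in training
predict(textmod, newdata = new_dfm)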