Using the tm package in R, I create a document-term matrix:
dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))
Which results in something like this:
A document-term matrix (291 documents, 1 terms)
Non-/sparse entries: 48/243
Sparsity : 84%
Maximal term length: 8
Weighting : term frequency (tf)
Terms
Docs someTerm
doc1 0
doc2 0
doc3 7
doc4 22
doc5 0
Now I would like to filter this document-term matrix by the number of occurrences of someTerm in each document, e.g. keep only the documents where someTerm appears at least once (doc3 and doc4 here).
How can I achieve this?
It's very similar to how you would subset a regular R matrix. For example, to create a document term matrix from the example Reuters dataset with only rows where the term "would" appears more than once:
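library(tm)

# Load the sample Reuters "crude" corpus that ships with tm
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578),
                   readerControl = list(reader = readReut21578XMLasPlain))
dtm <- DocumentTermMatrix(reuters)

# Logical vector: TRUE for documents where "would" occurs more than once
v <- as.vector(dtm[, "would"] > 1)
dtm2 <- dtm[v, ]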
> inspect(dtm2[, "would"])
A document-term matrix (3 documents, 1 terms)
Non-/sparse entries: 3/0
Sparsity : 0%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs would
246 2
489 2
502 2
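Applied to the question's case (at least one occurrence), the same row subsetting might look like the sketch below. The toy texts and the term "oil" are made-up stand-ins for the asker's `cor` corpus and `someTerm`, so treat this as an illustration rather than the exact data.

```r
library(tm)

# Hypothetical toy corpus standing in for the question's `cor`
texts <- c("nothing relevant here", "oil oil oil",
           "still nothing", "crude oil exports")
cor <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(cor, control = list(dictionary = c("oil")))

# Dense count vector for the single term, one entry per document
counts <- as.vector(as.matrix(dtm[, "oil"]))

# Keep only the documents where the term occurs at least once
dtm_filtered <- dtm[counts >= 1, ]
inspect(dtm_filtered)
```

Changing the threshold (for example `counts >= 5`) filters on a different minimum count.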
A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in figuring out how to manipulate DTMs.
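To make the triplet structure concrete, here is a small sketch that reads the `i` (document index), `j` (term index) and `v` (count) fields directly, and then does the same filtering with `slam::row_sums`, which avoids converting to a dense matrix. The corpus and the term "oil" are assumptions made up for illustration.

```r
library(tm)
library(slam)

# Hypothetical toy corpus
texts <- c("oil prices rose", "markets closed early",
           "oil oil exports", "weather report")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Triplet fields: i = document index, j = term index, v = count
term_col <- which(Terms(dtm) == "oil")
docs_with_term <- sort(unique(dtm$i[dtm$j == term_col & dtm$v > 0]))

# Per-document counts for the term, computed on the sparse representation
counts <- row_sums(dtm[, "oil"])
dtm_filtered <- dtm[as.vector(counts >= 1), ]
```

Both routes select the same documents; `row_sums` is probably the more convenient one when the matrix is large.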
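Alternatively, you could use the removeSparseTerms function, which removes terms whose share of empty (zero) entries reaches a given threshold (see ?removeSparseTerms). Note that it drops sparse terms (columns), not documents, so it does not filter rows by a term's count.
dtm <- removeSparseTerms(dtm, 0.1) # Keep only terms that are empty in less than 10% of the documents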
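A quick way to see what removeSparseTerms does is to run it on a tiny corpus. In this sketch (texts invented for illustration), the number of documents stays the same while rare terms disappear.

```r
library(tm)

# Hypothetical toy corpus: "oil" appears in 3 of 4 documents,
# every other term in just one
texts <- c("oil prices rose", "oil exports fell",
           "oil demand grew", "markets closed")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Drop terms that are absent from 50% or more of the documents
dtm_small <- removeSparseTerms(dtm, 0.5)

nDocs(dtm_small)   # same number of documents as dtm
Terms(dtm_small)   # only the terms that survived the threshold
```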