Filter rows/documents from Document-Term-Matrix in R

Question

Using the tm-package in R I create a Document-Term-Matrix:

dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))

Which results in something like this:

A document-term matrix (291 documents, 1 terms)

Non-/sparse entries: 48/243
Sparsity           : 84%
Maximal term length: 8 
Weighting          : term frequency (tf) 

                   Terms
Docs                someTerm
doc1                       0
doc2                       0
doc3                       7
doc4                      22
doc5                       0

Now I would like to filter this Document-Term-Matrix by the number of occurrences of someTerm in each document, e.g. keep only the documents where someTerm appears at least once (doc3 and doc4 here).

How can I achieve this?

James King · Accepted Answer

It's very similar to how you would subset a regular R matrix. For example, to create a document-term matrix from the example Reuters dataset that keeps only the rows where the term "would" appears more than once (for "at least once", as in the question, compare with > 0 instead of > 1):

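library(tm)

# Path to the sample Reuters "crude" corpus that ships with tm
reut21578 <- system.file("texts", "crude", package = "tm")

reuters <- VCorpus(DirSource(reut21578),
    readerControl = list(reader = readReut21578XMLasPlain))

dtm <- DocumentTermMatrix(reuters)
# Logical vector: TRUE for documents where "would" occurs more than once
v <- as.vector(dtm[, "would"] > 1)
dtm2 <- dtm[v, ]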

> inspect(dtm2[, "would"])
A document-term matrix (3 documents, 1 terms)

Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 5 
Weighting          : term frequency (tf)

     Terms
Docs  would
  246     2
  489     2
  502     2

A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in figuring out how to manipulate DTMs.
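Applied to the case in the question, the same idea might look like the sketch below. The toy corpus and the names texts, corp, counts and dtm_filtered are made up for illustration (they stand in for the asker's cor object), and the term is written in lower case on the assumption that the default tokenization lower-cases the text.

```r
library(tm)

# Hypothetical toy corpus standing in for `cor` from the question
texts <- c("nothing relevant here",
           "still nothing",
           "someterm appears and someterm repeats",
           "someterm again")
corp <- VCorpus(VectorSource(texts))

# Restrict the matrix to a single dictionary term, as in the question
dtm <- DocumentTermMatrix(corp, control = list(dictionary = c("someterm")))

# Per-document counts of the term as a plain numeric vector
counts <- as.vector(as.matrix(dtm[, "someterm"]))

# Keep only the documents where the term occurs at least once
dtm_filtered <- dtm[counts >= 1, ]

inspect(dtm_filtered)
```

Converting a single column with as.matrix is cheap; for a wide matrix it is probably better to subset the column first, as done here, rather than densifying the whole DTM.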

ElenaZhebel · Answer

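Alternatively, you could use the removeSparseTerms function, which removes sparse terms, i.e. terms that are absent from most documents (see ?removeSparseTerms). Note that it drops terms (columns), not documents (rows).

dtm <- removeSparseTerms(dtm, 0.1) # Keeps only terms that are missing from fewer than 10% of the documents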
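To see what the removeSparseTerms call above actually changes, here is a small sketch on a made-up corpus (texts, corp and dtm_dense are illustrative names, and the 0.6 threshold is chosen only to make the effect visible). It suggests that the number of documents stays the same while rare terms are dropped.

```r
library(tm)

# Hypothetical corpus: "apple" is in 4 of 4 documents,
# "banana" in 2 of 4, "cherry" in 1 of 4
texts <- c("apple banana", "apple cherry", "apple banana", "apple")
corp <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corp)

# Drop terms whose share of empty entries is 60% or more
dtm_dense <- removeSparseTerms(dtm, 0.6)

nDocs(dtm_dense)   # the document count is unchanged
Terms(dtm_dense)   # only the sufficiently common terms remain
```

So this function is a way to thin out columns; to filter rows by a term count, the logical subsetting shown in the accepted answer is still needed.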

Tags: r, matrix, text-mining, tm

user3316599