
Filter rows/documents from Document-Term-Matrix in R

Using the tm package in R, I create a document-term matrix:

dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))

Which results in something like this:

A document-term matrix (291 documents, 1 terms)

Non-/sparse entries: 48/243
Sparsity           : 84%
Maximal term length: 8 
Weighting          : term frequency (tf) 

                   Terms
Docs                someTerm
doc1                       0
doc2                       0
doc3                       7
doc4                      22
doc5                       0

Now I would like to filter this document-term matrix by the number of occurrences of someTerm in each document, e.g. keep only the documents where someTerm appears at least once (doc3 and doc4 here).

How can I achieve this?

user3316599 asked Jun 14 '14 21:06


2 Answers

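It's very similar to how you would subset a regular R matrix: build a logical vector over the rows and index with it. For example, to create a document-term matrix from the example Reuters dataset with only the rows where the term "would" appears more than once (use > 0 instead of > 1 for "at least once"):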

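library(tm)

# Path to the sample Reuters-21578 files shipped with tm
reut21578 <- system.file("texts", "crude", package = "tm")

reuters <- VCorpus(DirSource(reut21578),
    readerControl = list(reader = readReut21578XMLasPlain))

dtm <- DocumentTermMatrix(reuters)
# Logical vector: TRUE for documents where "would" occurs more than once
v <- as.vector(dtm[, "would"] > 1)
# Keep only those documents (rows)
dtm2 <- dtm[v, ]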

> inspect(dtm2[, "would"])
A document-term matrix (3 documents, 1 terms)

Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 5 
Weighting          : term frequency (tf)

     Terms
Docs  would
  246     2
  489     2
  502     2

A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in figuring out how to manipulate DTMs.
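Applied to the original question (keep documents where a term occurs at least once), a minimal sketch might look like the following. The toy corpus and the term "oil" are made up for illustration, and the sketch assumes tm's default lower-casing of terms, which is why the dictionary entry is lower-case.

```r
library(tm)

# Toy corpus (hypothetical data, not from the question)
docs <- c("crude oil prices rose",
          "markets were quiet today",
          "oil and more oil")
corpus <- VCorpus(VectorSource(docs))

# Restrict the matrix to a single term. Terms are lower-cased by
# default, so the dictionary entry is lower-case as well.
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = "oil"))

# Per-document counts of the term, computed on the sparse matrix
counts <- slam::row_sums(dtm[, "oil"])

# Keep only documents where the term occurs at least once
dtm_filtered <- dtm[counts > 0, ]
inspect(dtm_filtered)
```

Using slam::row_sums avoids converting the whole matrix to a dense one, which can matter for large corpora.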

James King answered Oct 26 '22 21:10


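Alternatively, you could use the removeSparseTerms function (see ?removeSparseTerms). Note that it drops sparse terms (columns), not documents (rows), so it helps with pruning the vocabulary rather than with filtering documents.

dtm <- removeSparseTerms(dtm, 0.1) # Keeps only terms with sparsity below 0.1, i.e. terms present in more than 90% of the documents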
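To see what removeSparseTerms affects, here is a small sketch on a made-up three-document corpus (the texts and the 0.5 threshold are assumptions chosen for illustration). The number of documents stays the same; only the rarely used terms are dropped.

```r
library(tm)

# Toy corpus (hypothetical data): "oil" is in every document,
# every other word appears in exactly one document.
docs <- c("oil prices rise", "oil output falls", "oil demand grows")
dtm_all <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Drop terms that are missing from at least half of the documents
dtm_small <- removeSparseTerms(dtm_all, 0.5)

nDocs(dtm_all)    # number of documents before
nDocs(dtm_small)  # number of documents after (rows are not removed)
Terms(dtm_small)  # the terms that survived
```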
ElenaZhebel answered Oct 26 '22 21:10