
Trying to remove words from a DocumentTermMatrix in order to use topicmodels

So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, which are each ~1000 words). The process runs and then dies, I think because it is running out of memory.

So I am trying to shrink the document-term matrix that the LDA() function takes as input. I figured I could do that with the minDocFreq option when I generate the document-term matrix, but when I use it, it doesn't seem to make any difference.

Here is the relevant bit of code:

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1] 6423 4163
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions, and same number of columns (that is, same number of terms).

Any sense what I'm doing wrong? Thanks.

asked Dec 07 '22 08:12 by cforster


1 Answer

The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

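In brief, more recent versions of the tm package do not include minDocFreq. They use bounds instead, where bounds = list(global = c(lower, upper)) keeps only the terms that appear in at least lower and at most upper documents. For example, your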

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now use bounds, as in this reproducible example on the crude corpus that ships with tm:

library(tm)
data("crude")

smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, Inf))))
dim(smaller) # terms that appear in fewer than 5 documents are discarded
# [1] 20 67
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(10, Inf))))
dim(smaller) # terms that appear in fewer than 10 documents are discarded
# [1] 20 17
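
One follow-up that may matter for topicmodels: once rare terms are dropped, some documents can end up with no terms at all, and LDA() refuses a matrix that contains all-zero rows. Below is a minimal sketch of trimming and then dropping empty documents. It uses the crude corpus as a stand-in for the real one, and the names min_docs, trimmed and nonempty are only illustrative, not part of tm.

```r
# Sketch only: `crude` stands in for the asker's corpus.
library(tm)
data("crude")

# Full matrix, for comparison.
dtm <- DocumentTermMatrix(crude)

# Keep terms that occur in at least `min_docs` documents
# (illustrative threshold; tune it for the real corpus).
min_docs <- 5
trimmed <- DocumentTermMatrix(
  crude,
  control = list(bounds = list(global = c(min_docs, Inf)))
)

# Compare the number of columns (terms) before and after trimming.
ncol(dtm)
ncol(trimmed)

# Dropping rare terms can leave a document with no terms at all,
# so remove all-zero rows before fitting a topic model.
nonempty <- slam::row_sums(trimmed) > 0
trimmed <- trimmed[nonempty, ]
dim(trimmed)
```

The slam package is installed as a dependency of tm, so row_sums() should be available without any extra installation. Raising min_docs shrinks the matrix further, at the cost of discarding more vocabulary.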
answered Apr 25 '23 17:04 by Ben