Creating a sparse matrix from a TermDocumentMatrix

I've created a TermDocumentMatrix from the tm library in R. It looks something like this:

> inspect(freq.terms)

A document-term matrix (19 documents, 214 terms)

Non-/sparse entries: 256/3810
Sparsity           : 94%
Maximal term length: 19 
Weighting          : term frequency (tf)

Terms
Docs abundant acid active adhesion aeropyrum alternative
  1         0    0      1        0         0           0
  2         0    0      0        0         0           0
  3         0    0      0        1         0           0
  4         0    0      0        0         0           0
  5         0    0      0        0         0           0
  6         0    1      0        0         0           0
  7         0    0      0        0         0           0
  8         0    0      0        0         0           0
  9         0    0      0        0         0           0
  10        0    0      0        0         1           0
  11        0    0      1        0         0           0
  12        0    0      0        0         0           0
  13        0    0      0        0         0           0
  14        0    0      0        0         0           0
  15        1    0      0        0         0           0
  16        0    0      0        0         0           0
  17        0    0      0        0         0           0
  18        0    0      0        0         0           0
  19        0    0      0        0         0           1

This is just a small sample of the matrix; there are actually 214 terms that I'm working with. On a small scale, this is fine. If I want to convert my TermDocumentMatrix into an ordinary matrix, I'd do:

data.matrix <- as.matrix(freq.terms)

However, the data displayed above is just a subset of my overall data, which contains at least 10,000 terms. When I try to create a TDM from the full data, I run into an error:

Error: cannot allocate vector of size n Kb

So from here, I'm looking into more memory-efficient ways of representing my tdm.

I tried turning my tdm into a sparse matrix from the Matrix library but ran into the same problem.
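
For reference, this is roughly the conversion I attempted (a sketch, assuming my TermDocumentMatrix is stored in tdm; tm keeps it as a slam simple_triplet_matrix, so its triplet slots map directly onto Matrix::sparseMatrix):

library(Matrix)

# tm stores the TDM in triplet form; passing the (i, j, v) slots
# straight to sparseMatrix avoids ever building the dense matrix.
sparse.tdm <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                           dims = c(tdm$nrow, tdm$ncol),
                           dimnames = tdm$dimnames)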

What are my alternatives at this point? I feel like I should be investigating one of:

  • the bigmemory/ff packages, as discussed here (although the bigmemory package doesn't seem to be available for Windows at the moment)
  • the irlba package for computing a partial SVD of my tdm, as mentioned here (see the sketch after this list)
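
For irlba, my experiments looked roughly like this (a sketch, assuming the sparse.tdm object from above; nv = 50 is an arbitrary number of singular vectors to keep):

library(irlba)

# Compute only the leading 50 singular triplets; the dense matrix
# is never materialised.
svd.out <- irlba(sparse.tdm, nv = 50)
head(svd.out$d)  # top singular values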

I've experimented with functions from both libraries but can't seem to arrive at anything substantial. Does anyone know the best way forward? I've spent so long fiddling with this that I thought I'd ask people with far more experience working with large datasets before I waste more time going in the wrong direction.

EDIT: changed 10,00 to 10,000. thanks @nograpes.

asked Nov 11 '22 by user1988898

1 Answer

The qdap package seems able to handle a problem this large. The first part recreates a data set that matches the OP's problem; the solution follows. As of qdap version 1.1.0 there is compatibility with the tm package:

library(qdapDictionaries)

# Build one "document": a random sample of dictionary words
# (somewhere between 100 and 9,100 of them) pasted into one string.
FUN <- function() {
    paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by = 1000), 1, TRUE)),
          collapse = " ")
}

library(qdap)

# A corpus of 15 such documents, comparable in scale to the OP's data.
mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15), function(i) FUN())))

This gives a similar corpus...

Now the qdap approach. First convert the Corpus to a data frame (tm_corpus2df), then use the tdm function to create a TermDocumentMatrix:

# Convert the Corpus to a data frame, then build the TDM from it.
out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)

## A term-document matrix (19914 terms, 15 documents)
## 
## Non-/sparse entries: 80235/218475
## Sparsity           : 73%
## Maximal term length: 19 
## Weighting          : term frequency (tf)
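
Since out is a regular tm TermDocumentMatrix, the usual tm tools still apply to it. As a quick sketch of shrinking it further (the 0.75 sparsity threshold here is an arbitrary choice):

# Drop terms that are absent from more than 75% of the documents.
out.small <- tm::removeSparseTerms(out, 0.75)
tm::inspect(out.small)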

answered Nov 15 '22 by Tyler Rinker