So I have a very large term-document matrix:
> class(ph.DTM)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
> ph.DTM
A term-document matrix (109996 terms, 262811 documents)
Non-/sparse entries: 3705693/28904453063
Sparsity : 100%
Maximal term length: 191
Weighting : term frequency (tf)
How do I get the rowSum (frequency) of each term? I tried:
> apply(ph.DTM, 1, sum)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
Obviously, I know about removeSparseTerms:
ph.DTM2 <- removeSparseTerms(ph.DTM, 0.99999)
Which cuts down the size a bit:
> ph.DTM2
A term-document matrix (28842 terms, 262811 documents)
Non-/sparse entries: 3612620/7576382242
Sparsity : 100%
Maximal term length: 24
Weighting : term frequency (tf)
But I still cannot apply any matrix-related functions to it:
> as.matrix(ph.DTM2)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
How can I just get a simple row sum on this object?? Thanks!!
OK, after some more Googling, I came across the slam package, which enables:
ph.DTM3 <- rollup(ph.DTM, 2, na.rm=TRUE, FUN = sum)
Which works.
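For anyone who wants to try this without a huge corpus, here is a minimal sketch on a toy matrix. The terms, documents, and counts are made up for illustration, and it assumes the slam package is installed. It relies on rollup working on the sparse triplet representation, so it should not need the dense nr * nc vector that apply and as.matrix tried to allocate.

```r
library(slam)

# Hypothetical 3 x 4 sparse matrix standing in for ph.DTM
# (rows = terms, columns = documents); the counts are invented.
m <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),   # row (term) index of each non-zero entry
  j = c(1L, 3L, 2L, 1L, 4L),   # column (document) index of each entry
  v = c(2, 1, 5, 1, 3),        # the counts themselves
  nrow = 3L, ncol = 4L,
  dimnames = list(c("apple", "banana", "cherry"), paste0("doc", 1:4))
)

# Collapse margin 2 (documents) by summing; the result is still a
# sparse matrix, with a single column holding each term's total.
rolled <- rollup(m, 2, na.rm = TRUE, FUN = sum)

# Safe to densify now: only one column is left.
term_freq <- as.matrix(rolled)[, 1]
```

The rolled-up object has one row per term, so converting it to an ordinary matrix is cheap even when the original has hundreds of thousands of documents.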
As alluded to by @badpanda in one of the comments, slam now has the row_sums and col_sums functions for sparse arrays:
slam::row_sums(dtm, na.rm = TRUE)
slam::col_sums(tdm, na.rm = TRUE)
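A small sketch of how these behave, again on a made-up toy matrix (the names tdm, term_freq, and doc_len are just for illustration). It also includes a quick check of why the dense route failed in the question: the cell count of the original matrix is larger than the biggest integer R can hold.

```r
library(slam)

# Hypothetical toy term-document matrix: rows = terms, columns = documents.
tdm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),
  j = c(1L, 3L, 2L, 1L, 4L),
  v = c(2, 1, 5, 1, 3),
  nrow = 3L, ncol = 4L,
  dimnames = list(c("apple", "banana", "cherry"), paste0("doc", 1:4))
)

term_freq <- row_sums(tdm, na.rm = TRUE)  # one total per term (row)
doc_len   <- col_sums(tdm, na.rm = TRUE)  # one total per document (column)

# Why apply() and as.matrix() overflowed in the question: the dense
# matrix would need more cells than an R integer can count.
too_big <- as.numeric(109996) * 262811 > .Machine$integer.max
```

Which function gives term frequencies depends on the orientation: with a TermDocumentMatrix (terms in rows) it is row_sums, with a DocumentTermMatrix (terms in columns) it is col_sums.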