Convert Document Term Matrix (DTM) to Data Frame (R Programming)

Tags:

r

I am a beginner in R and am currently working on a project. I have a huge Document Term Matrix (DTM) that I would like to convert into a data frame. However, the functions I have tried run into a limit, and I am not able to do so.

The method I have been using is to first convert the DTM into a matrix and then convert that matrix into a data frame.

DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)

This worked perfectly with smaller DTMs. However, when the DTM is too large, I am not able to convert it to a matrix, and I get the error shown below:

Error: cannot allocate vector of size 2409.3 Gb

I have been looking online for a few days but have not been able to find a solution. I would be really thankful if anyone could suggest the best way to convert a DTM into a data frame (especially when dealing with a large DTM).

Jeffrey asked May 17 '17 01:05


2 Answers

The tidytext package actually has a function to do just that. Try the tidy function, which returns a tibble (basically a fancy data frame that prints nicely). The nice thing about tidy is that it takes care of the pesky stringsAsFactors=FALSE issue by not converting strings to factors, and it deals nicely with the sparsity of your DTM.

as.matrix tries to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is what causes your memory usage to balloon. tidy instead converts it into a data frame with one row per document-term pair that actually occurs, so each document only has counts for the terms found in it.
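A rough back-of-the-envelope calculation shows why the dense conversion fails while a sparse form stays small. The dimensions and the share of non-zero cells below are made-up assumptions, not the asker's actual numbers, and the estimate assumes 8-byte doubles (it would be half that if the counts are stored as 4-byte integers).

```r
# Hypothetical dimensions, chosen only for illustration.
n_docs  <- 100000   # rows (documents)
n_terms <- 50000    # columns (distinct terms)

# A dense matrix stores every cell, zeros included: 8 bytes per double.
dense_gb <- n_docs * n_terms * 8 / 1024^3

# A sparse triplet form stores only the non-zero cells. Assume 0.1% of cells
# are non-zero, each held as two 4-byte integer indices plus one 8-byte count.
nonzero   <- n_docs * n_terms * 0.001
sparse_gb <- nonzero * (4 + 4 + 8) / 1024^3

cat(sprintf("dense:  %.1f GB\nsparse: %.3f GB\n", dense_gb, sparse_gb))
```

The dense size grows with documents times terms, whereas the sparse size grows only with the number of term occurrences that actually exist.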

In your example you'd run:

library(tidytext)
DF <- tidy(DTM)
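To get a feel for the long format that tidy returns (columns document, term, and count), the same shape can be sketched by hand in base R. The list below only imitates the triplet layout (i, j, v, dimnames) that a DocumentTermMatrix uses. The documents, terms, and counts are invented, and this is an illustration of the idea rather than tidytext's actual implementation.

```r
# Hypothetical 3-document, 4-term DTM in triplet form: only non-zero cells.
dtm <- list(
  i = c(1L, 1L, 2L, 3L, 3L),   # row (document) index of each non-zero cell
  j = c(1L, 3L, 2L, 1L, 4L),   # column (term) index of each non-zero cell
  v = c(2, 1, 5, 1, 3),        # count stored in each non-zero cell
  nrow = 3L, ncol = 4L,
  dimnames = list(Docs  = c("doc1", "doc2", "doc3"),
                  Terms = c("data", "frame", "matrix", "sparse"))
)

# One row per non-zero (document, term) pair; zero cells never appear.
long_df <- data.frame(
  document = dtm$dimnames$Docs[dtm$i],
  term     = dtm$dimnames$Terms[dtm$j],
  count    = dtm$v,
  stringsAsFactors = FALSE
)

print(long_df)
```

The result has 5 rows instead of the 12 cells a dense 3 x 4 matrix would need. With a real DTM built by tm, the same fields should be reachable as DTM$i, DTM$j, DTM$v, and DTM$dimnames, though calling tidy(DTM) is the simpler route.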

The tidytext package (meant to work in the tidyverse) also ships with a vignette on how to use it.

beigel answered Sep 28 '22 00:09


It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.

The API documentation notes that as.data.frame() simply coerces a matrix into a data frame, whereas data.frame() creates a new data frame from the input.

as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html

data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
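One visible difference between the two calls can be checked on a tiny matrix. The matrix and its term-like column names below are made up for illustration. Note that in both calls as.matrix(DTM) is evaluated first, so the dense matrix still has to fit in memory either way.

```r
# Small dense matrix with term-like column names (invented for illustration).
m <- matrix(c(1, 0, 0, 2), nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("e-mail", "2017")))

# data.frame() builds a new data frame and, with the default
# check.names = TRUE, rewrites column names into syntactically valid ones.
df_new <- data.frame(m, stringsAsFactors = FALSE)

# as.data.frame() coerces the matrix and keeps the column names as given.
df_coerced <- as.data.frame(m, stringsAsFactors = FALSE)

print(names(df_new))
print(names(df_coerced))
```

For a DTM this matters mostly for terms that are not valid R names, such as numbers or hyphenated words. Passing check.names = FALSE to data.frame() keeps the original names there as well.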

Glitch253 answered Sep 28 '22 01:09