
Use tm's Corpus function with big data in R

I'm trying to do text mining on big data in R with tm.

I frequently run into memory errors (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting them, such as

  • using 64-bit R
  • trying different operating systems (Windows, Linux, Solaris, etc.)
  • setting memory.limit() to its maximum
  • making sure that sufficient RAM and compute are available on the server (which they are)
  • making liberal use of gc()
  • profiling the code for bottlenecks
  • breaking up big operations into multiple smaller operations

However, when I try to run Corpus on a vector of a million or so text fields, I get a slightly different memory error than usual, and I'm not sure how to work around it. The error is:

> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)

Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe then combine the results? Is there a more efficient way to run this?

The amount of data needed to produce this error depends on the computer running it, but you can reproduce the error by taking the built-in crude dataset and replicating its documents until the corpus is large enough.
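
A minimal sketch of that reproduction, assuming the tm package is installed. The names `n_copies` and `big_texts` are hypothetical, and VectorSource is used here instead of the question's DataframeSource because the columns DataframeSource expects differ between tm versions. Raise `n_copies` until memory becomes the limiting factor on your machine.

```r
library(tm)

# Load the 20-document "crude" corpus that ships with tm.
data("crude")

# Flatten each document into a single string.
texts <- vapply(crude,
                function(doc) paste(content(doc), collapse = "\n"),
                character(1))

# Hypothetical scale factor: increase it to stress memory.
n_copies <- 100L
big_texts <- rep(unname(texts), times = n_copies)

# Build one large in-memory corpus from the replicated texts.
big <- VCorpus(VectorSource(big_texts))
length(big)
```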

UPDATE

I've been experimenting with building smaller corpora and combining them, e.g.

test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

While I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...

> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
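
One way to investigate, sketched here with base R introspection only: check whether the installed tm exports anything called tm_combine and which S3 methods it registers for VCorpus objects. The variable names are hypothetical. A possible explanation (an assumption, not confirmed by this post) is that tm_combine was only the name of a help topic describing the c() methods, not a callable function.

```r
library(tm)

# Which tm version is installed?
packageVersion("tm")

# TRUE only if the installed tm exports a function with this name.
has_tm_combine <- "tm_combine" %in% getNamespaceExports("tm")
has_tm_combine

# S3 methods registered for VCorpus objects; a c() method in this list
# means plain c() can be used to combine corpora.
vcorpus_methods <- methods(class = "VCorpus")
print(vcorpus_methods)
```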
Asked Aug 27 '14 at 17:08 by Hack-R



1 Answer

I don't know whether tm_combine was deprecated or why it isn't found in the tm namespace, but I did find a solution: run Corpus on smaller chunks of the data frame, then combine the results.

A Stack Overflow post showed a simple way to do that without tm_combine:

test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)

which gives you:

ds.12

<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
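
The same idea can be generalized to any number of chunks. The following is only a sketch under assumptions: `build_corpus_in_chunks` and `chunk_size` are hypothetical names, and it takes a plain character vector through VectorSource rather than a data frame. The combined corpus still has to fit in memory, so this spreads out the construction work without lowering the final footprint.

```r
library(tm)

# Hypothetical helper: build a VCorpus piece by piece and combine with c().
build_corpus_in_chunks <- function(texts, chunk_size = 10000L) {
  # Assign each text to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(texts) / chunk_size)

  # Build one small corpus per chunk.
  pieces <- lapply(split(texts, chunk_id),
                   function(chunk) VCorpus(VectorSource(chunk)))

  # c() dispatches to tm's method for VCorpus objects; unname() keeps the
  # chunk labels from being passed as argument names.
  do.call(c, unname(pieces))
}

# Usage with a small illustrative input.
texts <- sprintf("document number %d", 1:25)
corp <- build_corpus_in_chunks(texts, chunk_size = 10L)
length(corp)
```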

Sorry for not figuring this out on my own before asking. I had tried and failed with other ways of combining the objects.

Answered Oct 01 '22 at 06:10 by Hack-R