I'm trying to do text mining on big data in R with tm.
I run into memory issues frequently (such as "cannot allocate vector of size...") and use the established methods of troubleshooting those issues, such as raising memory.limit() to its maximum and calling gc().
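For reference, those first-aid steps look something like this (note that memory.limit() is Windows-only, and the size shown is just an illustrative value):
memory.limit(size = 4095)  # Windows-only: raise the memory ceiling (size in MB)
gc()                       # force garbage collection and report memory usage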
However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus incrementally on blocks of rows from the source dataframe and then combine the results? Is there a more efficient way to run this?
The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, you can reproduce the error.
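A minimal sketch of that reproduction, assuming a tm version whose DataframeSource expects doc_id and text columns (the replication factor is arbitrary; raise it until memory runs out):
library(tm)
data("crude")  # 20 Reuters articles bundled with tm

# Flatten each document to a single string
txt <- vapply(crude, function(d) paste(content(d), collapse = " "), character(1))

# Replicate the 20 documents into a very large data frame
n <- 50000  # arbitrary; increase until memory is exhausted
dfs <- data.frame(doc_id = seq_len(20 * n),
                  text = rep(txt, n),
                  stringsAsFactors = FALSE)

ds <- Corpus(DataframeSource(dfs))  # Error: memory exhausted (limit reached?)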
UPDATE
I've been experimenting with trying to combine smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...
> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
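As a quick sanity check, base R can list what the installed tm actually exports:
grep("combine", getNamespaceExports("tm"), value = TRUE)
# character(0) would confirm tm_combine is no longer an exported function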
I don't know if tm_combine was deprecated or why it's not found in the tm namespace, but I did find a solution: run Corpus on smaller chunks of the dataframe, then combine the results. This StackOverflow post had a simple way to do that without tm_combine:
test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)
which gives you:
ds.12
<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
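To scale past two chunks, the same trick generalizes; a sketch assuming dfs is the full dataframe and an arbitrary 100k-row chunk size:
# Build one corpus per chunk, then concatenate them all at once with c()
chunk_size <- 100000
starts <- seq(1, nrow(dfs), by = chunk_size)
corpora <- lapply(starts, function(i) {
  chunk <- dfs[i:min(i + chunk_size - 1, nrow(dfs)), , drop = FALSE]
  Corpus(DataframeSource(chunk))
})
ds.all <- do.call(c, corpora)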
Sorry I didn't figure this out on my own before asking; I tried and failed with other ways of combining the objects.