I'm trying to do text mining on big data in R with tm.
I run into memory issues frequently (such as "cannot allocate vector of size...") and use the established methods of troubleshooting those issues, such as raising memory.limit() to its maximum and calling gc().
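For reference, those first-aid steps look something like this (note that memory.limit() is Windows-only, and the size shown is just an illustrative value):
memory.limit(size = 4095)  # Windows-only: raise the memory ceiling (size in MB)
gc()                       # force garbage collection and report memory usage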
However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus incrementally on blocks of rows from the source dataframe and then combine the results? Is there a more efficient way to run this?
The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, you can reproduce the error.
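A minimal sketch of that reproduction, assuming a tm version whose DataframeSource expects doc_id and text columns (the replication factor is arbitrary; raise it until memory runs out):
library(tm)
data("crude")  # 20 Reuters articles bundled with tm

# Flatten each document to a single string
txt <- vapply(crude, function(d) paste(content(d), collapse = " "), character(1))

# Replicate the 20 documents into a very large data frame
n <- 50000  # arbitrary; increase until memory is exhausted
dfs <- data.frame(doc_id = seq_len(20 * n),
                  text = rep(txt, n),
                  stringsAsFactors = FALSE)

ds <- Corpus(DataframeSource(dfs))  # Error: memory exhausted (limit reached?)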
UPDATE
I've been experimenting with trying to combine smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...
> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
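As a quick sanity check, base R can list what the installed tm actually exports:
grep("combine", getNamespaceExports("tm"), value = TRUE)
# character(0) would confirm tm_combine is no longer an exported function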
I don't know if tm_combine was deprecated or why it's not found in the tm namespace, but I did find a solution: run Corpus on smaller chunks of the dataframe, then combine the results. This StackOverflow post had a simple way to do that without tm_combine:
test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)
which gives you:
ds.12
<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
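To scale past two chunks, the same trick generalizes; a sketch assuming dfs is the full dataframe and an arbitrary 100k-row chunk size:
# Build one corpus per chunk, then concatenate them all at once with c()
chunk_size <- 100000
starts <- seq(1, nrow(dfs), by = chunk_size)
corpora <- lapply(starts, function(i) {
  chunk <- dfs[i:min(i + chunk_size - 1, nrow(dfs)), , drop = FALSE]
  Corpus(DataframeSource(chunk))
})
ds.all <- do.call(c, corpora)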
Sorry I didn't figure this out on my own before asking; I tried and failed with other ways of combining the objects.