R: how to rbind two huge data-frames without running out of memory

Tags:

I have two data-frames df1 and df2 that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them, I get that most dreaded of R error messages: cannot allocate memory. There have got to be more efficient ways to do an rbind more efficiently -- anyone have their favorite tricks on this they want to share? For instance I found this example in the doc for sqldf:

# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")

Is that the best/recommended way to do it?

UPDATE I got it to work using the crucial dbname = tempfile() argument in the sqldf call above, as JD Long suggests in his answer to this question

443

asked Jan 21 '11 16:01

Prasad Chalasani

1 Answers

Rather than reading them into R at the beginning and then combining them you could have SQLite read them and combine them before sending them to R. That way the files are never individually loaded into R.

# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)

DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)

# now we do the real work
library(sqldf)

data1 <- file("data1.dat")
data2 <- file("data2.dat")

sqldf(c("select * from data1", 
 "insert into data1 select * from data2", 
 "select * from data1"), 
 dbname = tempfile())

This gives:

>  sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
   A  B
1  1  2
2  2  3
3 10 12
4 11 13

This shorter version also works if row order is unimportant:

sqldf("select * from data1 union select * from data2", dbname = tempfile())

See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments since they are close but not identical to read.table. Here we have used the defaults so it was less of an issue.

answered Sep 29 '22 12:09

G. Grothendieck

Related questions
                            
                                How to show processing animation / spinner during ajax request?
                            
                                When is the constructor called?
                            
                                Is php's 'include' a function or a statement?
                            
                                Select lines whose date-field is in a given month and year
                            
                                Streaming a CSV file in Django
                            
                                How do I tell what type my shell is
                            
                                How can get a substring from a string in C#?
                            
                                CUDA __threadfence()
                            
                                Multiple $inc updates in MongoDB
                            
                                Can i use VS2010 PrivateObject to access a static field inside a static class?
                            
                                Use of Timer in Windows Service
                            
                                Garbage collector in java - set an object null

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With