
Big Merge / Memory management

I've hit a wall trying to merge a large file with a smaller one. I have read many other posts about memory management in R, and haven't been able to find a non-extreme way of resolving it (extreme meaning: go 64-bit, upload to a cluster, etc.). I've experimented a bit with the bigmemory package, but haven't found a solution there either. I thought I'd try here before I throw my hands up in frustration.

The code I'm running looks like this:

#rm(list=ls())
localtempdir<- "F:/Temp/"
memory.limit(size=4095)
[1] 4095
memory.size(max=TRUE)
[1] 487.56
gc()
         used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 170485  4.6     350000   9.4   350000   9.4
Vcells 102975  0.8   52633376 401.6 62529185 477.1

client_daily<-read.csv(paste(localtempdir,"client_daily.csv",sep=""),header=TRUE)
object.size(client_daily)
>130MB

sbp_demos<-read.csv(paste(localtempdir,"sbp_demos",sep=""))
object.size(sbp_demos)
>0.16MB
client_daily<-merge(client_daily,sbp_demos,by.x="OBID",by.y="OBID",all.x=TRUE)
Error: cannot allocate vector of size 5.0 MB

I guess I'm asking: are there any clever ways around this that don't involve buying new hardware?

  1. I need to be able to merge to create a bigger object.
  2. I'll then need to be doing regressions etc with that bigger object.

Should I give up? Should bigmemory be able to help solve this?

Any guidance greatly appreciated.

Details: R version 2.13.1 (2011-07-08), Platform: i386-pc-mingw32/i386 (32-bit), Intel Core 2 Duo @ 2.33GHz, 3.48GB RAM

Daniel Egan asked Dec 21 '11 19:12



1 Answer

As Chase already mentioned, you can try data.table or perhaps sqldf.

For either one, you will likely get better performance if you set the indexes (keys, in data.table terms) appropriately.

With data.table you would:

library(data.table)

## Key both tables on OBID so the join can use the sorted key
dt1 <- data.table(sbp_demos, key='OBID')
dt2 <- data.table(client_daily, key='OBID')

## INNER JOIN-like operation: rows of dt2 with no match in dt1 are dropped
mi <- dt1[dt2, nomatch=0]

## RIGHT JOIN-like operation (with dt1 on the left): all rows in dt2 are
## returned. If there is no matching row in dt1, the values in the dt1
## columns for the merged row will be NA. This is the counterpart of the
## all.x=TRUE merge in the question.
mr <- dt1[dt2]
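
To see what the two join forms above return, here is a minimal sketch on toy data. The tables `demos` and `daily` and their `region` and `amount` columns are hypothetical stand-ins for `sbp_demos` and `client_daily`; only `OBID` comes from the question. The last step, an update join with `:=`, is not part of the approach above. It is an additional data.table idiom (it assumes a data.table version that supports `on=`) that adds the lookup column to the large table by reference instead of building a new merged copy, which may help when memory is tight.

```r
library(data.table)

# Hypothetical toy data; only the OBID column name comes from the question
demos <- data.table(OBID = c(1L, 2L, 3L),
                    region = c("N", "S", "E"),
                    key = "OBID")
daily <- data.table(OBID = c(1L, 1L, 2L, 4L),
                    amount = c(10, 20, 30, 40),
                    key = "OBID")

# Inner-join-like: the daily row with OBID 4 has no match in demos and is dropped
matched <- demos[daily, nomatch = 0]

# Keeps every row of daily; region is NA where demos has no matching OBID
all_daily <- demos[daily]

# Update join (an assumption beyond the answer): add region to daily in place,
# without allocating a second merged table
daily[demos, region := i.region, on = "OBID"]
```

After the last line, `daily` itself carries the `region` column, so no separate merged object needs to be held in memory alongside it.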

If you go the sqldf route, look at example 4i on its website ... again, make sure you use indexes correctly.
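
As a rough sketch of what the indexed sqldf route might look like, here is the same left join on toy data. The data frames `demos` and `daily`, the `region` and `amount` columns, and the index names are all hypothetical; only `OBID` comes from the question. The pattern assumes, as I understand the sqldf documentation, that calling `sqldf()` with no arguments opens (and later closes) a persistent connection, and that the `main.` prefix refers to the already uploaded, indexed copy of a table rather than re-uploading the data frame.

```r
library(sqldf)

# Hypothetical toy data; only the OBID column name comes from the question
demos <- data.frame(OBID = c(1L, 2L, 3L), region = c("N", "S", "E"))
daily <- data.frame(OBID = c(1L, 1L, 2L, 4L), amount = c(10, 20, 30, 40))

sqldf()  # open a persistent connection so the indexes survive between calls

# Naming a data frame in a statement uploads it; the index is built on that copy
sqldf("create index daily_obid on daily(OBID)")
sqldf("create index demos_obid on demos(OBID)")

# main.<name> refers to the uploaded, indexed tables
res <- sqldf("select * from main.daily left join main.demos using (OBID)")

sqldf()  # close the connection
```

Depending on the sqldf and RSQLite versions installed, the `create index` calls may emit a warning about statements that return no result; the join result in `res` should be unaffected.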

Steve Lianoglou answered Sep 30 '22 16:09
