I am trying to merge two fairly large - but not ridiculously so (360,000 x 4 and 57,000 x 4) - datasets by one common ID. I have tried a regular merge(), merge.data.table(), and sqldf(). Every time I run out of memory (cannot allocate vector of size...). Is there any solution to this? Or is R a bad tool for merging data? head() output is given below (I am trying to merge on STUDENT.NAME):
ID10 STUDENT.NAME FATHER.NAME MOTHER.NAME
1 1 DEEKSHITH J JAYANNA SWARNA
2 4 MANIKANTHA D DEVARAJ MANJULA
3 5 NAGESH T THIMMAIAH N SHIVAMMA
4 6 NIZAMUDDIN R NOOR MOHAMMED BIBI
5 7 PRABHU YELLAPPA YELLAPPA MALLAMMA
6 8 SADDAM PASHA NISAR AHMED ZAREENA
From the nature of your problem, you are bound to be doing a many-to-many merge, where each student occurs many times in each data frame. You might want to check how many times. If each student occurs twice in each data frame, one student will produce 4 rows in the result; if a student occurs 10 times, the merge adds 100 rows. So first check how many rows you'll get. This is the function I use for that:
count.rows <- function(x, y, v, all = FALSE){
  # Count how often each key value occurs in each data frame
  tx <- table(x[[v]])
  ty <- table(y[[v]])
  # Keep only the key values that occur in both data frames
  val <- names(tx)[match(names(tx), names(ty), 0L) > 0L]
  cts <- rbind(tx[match(val, names(tx))], ty[match(val, names(ty))])
  colnames(cts) <- val
  # Each key contributes the product of its two counts to the merge result
  sum(apply(cts, 2, prod, na.rm = all), na.rm = TRUE)
}
count.rows(DF1,DF2,"STUDENT.NAME")
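As a quick illustration on two toy data frames (DFa and DFb are made-up names for this example): the key "a" occurs 2 and 3 times and "b" occurs 1 and 2 times, so the merge produces 2*3 + 1*2 = 8 rows, which count.rows predicts without building the result.
DFa <- data.frame(STUDENT.NAME = c("a", "a", "b"))
DFb <- data.frame(STUDENT.NAME = c("a", "a", "a", "b", "b"))
count.rows(DFa, DFb, "STUDENT.NAME")        # 8
nrow(merge(DFa, DFb, by = "STUDENT.NAME"))  # also 8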
If you read up the R documentation, you'll see that the complexity depends on the length of the answer. This is not due to the merge algorithm itself, but to the binding of all the results together. If you really want a less memory-hungry solution, you especially need to get rid of that binding. The following algorithm does that for you. I wrote it out explicitly so you can follow the logic, and it can be optimized. Mind you, it does not give the same result: it copies all columns of both data frames, so you might want to adapt it a little.
mymerge <- function(x, y, v){
  # Locate the merge key column in both data frames
  ix <- match(v, names(x))
  iy <- match(v, names(y))
  # Sort both key vectors so equal values form contiguous blocks
  xx <- x[, ix]
  yy <- y[, iy]
  ox <- order(xx)
  oy <- order(yy)
  xx <- xx[ox]
  yy <- yy[oy]
  nx <- length(xx)
  ny <- length(yy)
  # Key values present in both data frames
  val <- unique(xx)
  val <- val[match(val, yy, 0L) > 0L]
  # Size of the result: sum over keys of the product of the two counts
  cts <- cbind(table(xx)[val], table(yy)[val])
  dimr <- sum(apply(cts, 1, prod), na.rm = TRUE)
  # Preallocate the row indices instead of binding partial results together
  idx <- vector("numeric", dimr)
  idy <- vector("numeric", dimr)
  # Start positions of each key's block in the sorted key vectors
  ndx <- embed(c(which(!duplicated(xx)), nx + 1), 2)[unique(xx) %in% val, , drop = FALSE]
  ndy <- embed(c(which(!duplicated(yy)), ny + 1), 2)[unique(yy) %in% val, , drop = FALSE]
  count <- 1
  for(i in 1:nrow(ndx)){
    # Block sizes for the current key value
    nx <- abs(diff(ndx[i, ]))
    ny <- abs(diff(ndy[i, ]))
    ll <- nx * ny
    # Write the Cartesian product of the two blocks straight into idx/idy
    idx[count:(count + ll - 1)] <- rep(ndx[i, 2]:(ndx[i, 1] - 1), ny)
    idy[count:(count + ll - 1)] <- rep(ndy[i, 2]:(ndy[i, 1] - 1), each = nx)
    count <- count + ll
  }
  # Assemble the result: all columns of x, then all columns of y (prefixed "y.")
  x <- x[ox[idx], ]
  names(y) <- paste("y.", names(y), sep = "")
  x[names(y)] <- y[oy[idy], ]
  rownames(x) <- 1:nrow(x)
  x
}
Some testing code so you can see it works:
DF1 <- data.frame(
  ID = 1:10,
  STUDENT.NAME = letters[1:10],
  SCORE = 1:10
)
id <- c(3, 11, 4, 6, 6, 12, 1, 4, 7, 10, 5, 3)
DF2 <- data.frame(
  ID = id,
  STUDENT.NAME = letters[id],
  SCORE = 1:12
)
mymerge(DF1, DF2, "STUDENT.NAME")
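To convince yourself it matches base merge() on this toy data (the results differ only in column naming and the duplicated key column), you can compare for instance (base and mine are made-up names here):
base <- merge(DF1, DF2, by = "STUDENT.NAME")
mine <- mymerge(DF1, DF2, "STUDENT.NAME")
nrow(base) == nrow(mine)                                 # same number of rows
all(sort(base$STUDENT.NAME) == sort(mine$STUDENT.NAME))  # same key values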
Doing the same with two data frames of 0.5 million rows and 4 columns, with up to 10 matches per student name, it returns a data frame with 5.8 million rows and 8 columns, and gives the following picture of the memory use:
The yellow box is the merge call, the green box is the mymerge call. Memory ranges from 2.3 GB to 3.74 GB, so the merge call uses 1.45 GB and mymerge a bit over 0.8 GB. Still no "out of memory" errors... The testing code for this is below:
Names <- sapply(
  replicate(120000, sample(letters, 4, TRUE), simplify = FALSE),
  paste, collapse = "")
DF1 <- data.frame(
  ID10 = 1:500000,
  STUDENT.NAME = sample(Names[1:50000], 500000, TRUE),
  FATHER.NAME = sample(letters, 500000, TRUE),
  SCORE1 = rnorm(500000),
  stringsAsFactors = FALSE
)
id <- sample(500000, replace = TRUE)
DF2 <- data.frame(
  ID20 = DF1$ID10,
  STUDENT.NAME = DF1$STUDENT.NAME[id],
  SCORE = rnorm(500000),
  SCORE2 = rnorm(500000),
  stringsAsFactors = FALSE
)
id2 <- sample(500000, 20000)
DF2$STUDENT.NAME[id2] <- sample(Names[100001:120000], 20000, TRUE)
gc()
system.time(X <- merge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
Sys.sleep(1)
rm(X)
gc()
Sys.sleep(3)
system.time(X <- mymerge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
rm(X)
gc()
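If you prefer to read the peak memory from within R rather than from an external monitor, one option is gc(reset = TRUE), which resets the "max used" counters so the next gc() call reports the high-water mark of whatever ran in between. A small sketch:
gc(reset = TRUE)                        # reset the "max used" counters
X <- mymerge(DF1, DF2, "STUDENT.NAME")
gc()                                    # "max used" now shows the peak during the call
rm(X)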
Have you tried the data.table package? It is more memory-efficient and can be many times faster. But, as others have noted, the question provides no code, so it is possible you are simply using merge incorrectly.
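A minimal sketch of a keyed data.table join, assuming the data frames from the question (dt1, dt2, and res are made-up names; allow.cartesian = TRUE is required when a many-to-many join returns more rows than the inputs):
library(data.table)
dt1 <- as.data.table(DF1)
dt2 <- as.data.table(DF2)
setkey(dt1, STUDENT.NAME)  # sort and index both tables on the merge key
setkey(dt2, STUDENT.NAME)
# Keyed join; allow.cartesian permits the many-to-many row expansion
res <- dt1[dt2, allow.cartesian = TRUE]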