
Efficient alternatives to merge for larger data.frames R

I am looking for an efficient method (in terms of both computing resources and ease of learning/implementation) to merge two large data frames (size > 1 million; about 300 KB as an RData file).

"merge" in base R and "join" in plyr appear to use up all my memory, effectively crashing my system.

Example
load test data frame

and try

test.merged<-merge(test, test) 

or

test.merged<-join(test, test, type="full")   # plyr::join; valid types are "left", "right", "inner", "full"

The following post provides a list of merge and alternatives:
How to join (merge) data frames (inner, outer, left, right)?

The following allows object size inspection:
https://heuristically.wordpress.com/2010/01/04/r-memory-usage-statistics-variable/
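
As a minimal sketch of object size inspection using only base R (the small data frame here is a hypothetical stand-in for the `test` object above, not the original data), `object.size()` reports an estimate of the memory used by a single object:

```r
# Hypothetical small data frame standing in for `test` from the question.
test <- data.frame(id = 1:1000, value = rnorm(1000))

# object.size() returns an estimate of the memory used by one object, in bytes.
size_bytes <- object.size(test)
print(size_bytes)

# format() converts the estimate to a human-readable unit.
print(format(size_bytes, units = "auto"))

# Sizes (in bytes) of every object in the current environment, largest first.
env <- environment()
sizes <- sapply(ls(env), function(nm) object.size(get(nm, envir = env)))
print(sort(sizes, decreasing = TRUE))
```

Checking sizes before and after a merge gives a rough idea of how much the result grows, although it does not capture the temporary memory used while the merge is running.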

Data produced by anonym

Etienne Low-Décarie asked Jun 21 '12 21:06



People also ask

What is faster than merge in R?

For large tables, dplyr's join functions (such as inner_join() and left_join()) are generally faster than merge().

Can you merge more than 2 Dataframes in R?

The merge() function in R combines two data frames, much like a JOIN in SQL combines tables. However, merge() accepts only two data frames per call, so joining more than two requires chaining several calls.
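
One common way to chain the calls is base R's `Reduce()`. The following is only a sketch: the three data frames and their column names are hypothetical, and it assumes they all share an `id` key column.

```r
# Three hypothetical data frames sharing an "id" key column.
df1 <- data.frame(id = 1:3, a = c("x", "y", "z"))
df2 <- data.frame(id = 1:3, b = c(10, 20, 30))
df3 <- data.frame(id = c(1L, 3L), c = c(TRUE, FALSE))

# Reduce() applies merge() pairwise, left to right, which is equivalent to
# merge(merge(df1, df2, by = "id"), df3, by = "id").
merged <- Reduce(function(x, y) merge(x, y, by = "id"), list(df1, df2, df3))
print(merged)
```

Because `merge()` performs an inner join by default, only ids present in all three inputs survive. Each intermediate result is a full data frame, so this approach does nothing to reduce memory use on large inputs.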

Can you merge data frames in R?

In R, the merge() function from base R merges two data frames. The dplyr package offers comparable join functions (inner_join(), left_join(), right_join(), full_join()). The columns used as the merge key should have the same type in both data frames. merge() works much like a join in a DBMS.
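
To illustrate the parallel with database joins, here is a small sketch using base R's `merge()` and its `all`, `all.x`, and `all.y` arguments. The data frames `x` and `y` and their columns are hypothetical.

```r
# Two hypothetical data frames with a partially overlapping integer key.
x <- data.frame(id = c(1L, 2L, 3L), left_val = c("a", "b", "c"))
y <- data.frame(id = c(2L, 3L, 4L), right_val = c(20, 30, 40))

inner <- merge(x, y, by = "id")                # inner join: ids in both
left  <- merge(x, y, by = "id", all.x = TRUE)  # left join: every row of x
right <- merge(x, y, by = "id", all.y = TRUE)  # right join: every row of y
full  <- merge(x, y, by = "id", all = TRUE)    # full outer join: all ids

print(inner)
print(left)
print(right)
print(full)
```

Rows without a match on the other side are filled with `NA` in the columns that come from that side.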


1 Answer

Here are some timings for the data.table vs. data.frame methods.
Using data.table is much faster: about 0.8 seconds elapsed versus about 18 seconds for the same merge of 1 million rows. Regarding memory, I can informally report that the two methods are very similar (within 20%) in RAM use.

library(data.table)

set.seed(1234)
n = 1e6

data_frame_1 = data.frame(id=paste("id_", 1:n, sep=""),
                          factor1=sample(c("A", "B", "C"), n, replace=TRUE))
data_frame_2 = data.frame(id=sample(data_frame_1$id),
                          value1=rnorm(n))

data_table_1 = data.table(data_frame_1, key="id")
data_table_2 = data.table(data_frame_2, key="id")

system.time(df.merged <- merge(data_frame_1, data_frame_2))
#   user  system elapsed
# 17.983   0.189  18.063

system.time(dt.merged <- merge(data_table_1, data_table_2))
#   user  system elapsed
#  0.729   0.099   0.821
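
As a side note that is not part of the timings above, data.table also supports joins through its bracket syntax. The sketch below uses tiny hypothetical tables in place of the 1-million-row ones and assumes the data.table package is installed.

```r
library(data.table)

# Small hypothetical tables, keyed on "id" like the ones in the benchmark.
dt1 <- data.table(id = c("id_1", "id_2", "id_3"),
                  factor1 = c("A", "B", "C"), key = "id")
dt2 <- data.table(id = c("id_3", "id_1", "id_2"),
                  value1 = c(0.3, 0.1, 0.2), key = "id")

# Keyed join: for each row of dt2, look up the matching row of dt1.
joined <- dt1[dt2]
print(joined)

# The same join with the join column named explicitly via `on`.
joined_on <- dt1[dt2, on = "id"]
print(joined_on)
```

For tables whose keys match one to one, as here, this produces the same rows as `merge(dt1, dt2)`. Whether it is faster on a given data set is something to measure rather than assume.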
bdemarest answered Sep 22 '22 07:09
