 

What's the fastest way to merge/join data.frames in R?

For example (though I'm not sure this is the most representative case):

N <- 1e6
d1 <- data.frame(x = sample(N, N), y1 = rnorm(N))
d2 <- data.frame(x = sample(N, N), y2 = rnorm(N))

This is what I've got so far:

d <- merge(d1, d2)   # 7.6 sec

library(plyr)
d <- join(d1, d2)    # 2.9 sec

library(data.table)
dt1 <- data.table(d1, key = "x")
dt2 <- data.table(d2, key = "x")
d <- data.frame( dt1[dt2, list(x, y1, y2 = dt2$y2)] )   # 4.9 sec

library(sqldf)
sqldf()
sqldf("create index ix1 on d1(x)")
sqldf("create index ix2 on d2(x)")
d <- sqldf("select * from d1 inner join d2 on d1.x = d2.x")
sqldf()              # 17.4 sec
asked Dec 01 '10 by datasmurf

People also ask

Is merge or join faster in R?

For large tables, the dplyr join functions are much faster than merge(); speed is one of the main advantages of using dplyr for merging data frames.
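As a point of reference (a minimal sketch, not part of the question or answers), the equivalent dplyr join on the question's data would look like this:

library(dplyr)

N  <- 1e5
d1 <- data.frame(x = sample(N, N), y1 = rnorm(N))
d2 <- data.frame(x = sample(N, N), y2 = rnorm(N))

# inner_join keeps the rows whose x appears in both tables,
# like merge(d1, d2) on the common column x
d <- inner_join(d1, d2, by = "x")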

How do I combine multiple data frames into one in R?

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frame A has variables that data frame B does not, then either delete the extra variables in data frame A, or create them in data frame B and fill them with NA. See the sketch below.
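A minimal sketch (the data frames here are made up for illustration):

# Same variables, different column order
dfA <- data.frame(id = 1:2, score = c(10, 20))
dfB <- data.frame(score = c(30, 40), id = 3:4)

# rbind on data frames matches columns by name, so the order difference is fine
rbind(dfA, dfB)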

How do I combine three data frames in R?

Join Multiple R DataFrames To join more than two R data frames, use reduce(), available through the tidyverse (it comes from the purrr package): put the data frames in a list and fold over it with a join on the common column.
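A sketch of that pattern, assuming purrr and dplyr are available (both ship with the tidyverse); the example data frames are made up for illustration:

library(purrr)
library(dplyr)

df1 <- data.frame(id = 1:3, a = c("x", "y", "z"))
df2 <- data.frame(id = 1:3, b = c(10, 20, 30))
df3 <- data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))

# Fold the list into one data frame by joining pairwise on "id"
reduce(list(df1, df2, df3), inner_join, by = "id")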

How do I merge data frames in a list in R?

If the data frames are stored in a list, they can be merged pairwise by folding the list with Reduce() and merge() (or purrr::reduce() with a join), as sketched below.
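A base-R sketch of the same fold (example data made up for illustration):

dfs <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = 1:3, b = c(10, 20, 30)),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)

# Reduce applies merge pairwise until one data frame remains
Reduce(function(x, y) merge(x, y, by = "id"), dfs)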


2 Answers

The match approach works when there is a unique key in the second data frame for each key value in the first. If there are duplicates in the second data frame then the match and merge approaches are not the same. Match is, of course, faster since it is not doing as much. In particular it never looks for duplicate keys. (continued after code)

DF1 = data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])

merge(DF1, DF2)
  b a c
1 1 1 a
2 2 1 b
3 3 2 c
4 3 2 d
5 4 2 e

DF1$c = DF2$c[match(DF1$b, DF2$b)]
DF1$c
[1] a b c e
Levels: a b c d e

DF1
  a b c
1 1 1 a
2 1 2 b
3 2 3 c
4 2 4 e

In the sqldf code posted in the question, it might appear that indexes were used on the two tables but, in fact, they are placed on tables that are overwritten before the SQL select ever runs, and that, in part, accounts for why it's so slow. The idea of sqldf is that the data frames in your R session constitute the database, not the tables in sqlite. Thus each time the code refers to an unqualified table name, sqldf looks for it in your R workspace, not in sqlite's main database. The select statement shown therefore reads d1 and d2 from the workspace into sqlite's main database, clobbering the ones that were there with the indexes, so the join runs with no indexes. If you wanted to use the versions of d1 and d2 that were in sqlite's main database, you would have to refer to them as main.d1 and main.d2 rather than d1 and d2. Also, if you are trying to make it run as fast as possible, note that a simple join can't make use of indexes on both tables, so you can save the time of creating one of the indexes. The code below illustrates these points.

It's worthwhile to notice that the precise computation can make a huge difference in which package is fastest. For example, we do a merge and an aggregate below, and the results are nearly reversed for the two. In the first example, from fastest to slowest, we get: data.table, plyr, merge and sqldf, whereas in the second we get: sqldf, aggregate, data.table and plyr -- nearly the reverse of the first. In the first example sqldf is roughly 10x slower than data.table, and in the second it is roughly 75x faster than plyr and 47x faster than data.table. Below we show the input code, the output timings for the merge and the output timings for the aggregate. It's also worth noting that sqldf is based on a database and can therefore handle objects larger than R can hold in main memory (if you use the dbname argument of sqldf), while the other approaches are limited to in-memory processing. Also, we have illustrated sqldf with sqlite, but it supports the H2 and PostgreSQL databases as well.

library(plyr)
library(data.table)
library(sqldf)

set.seed(123)
N <- 1e5
d1 <- data.frame(x = sample(N, N), y1 = rnorm(N))
d2 <- data.frame(x = sample(N, N), y2 = rnorm(N))
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(d1, g1, g2)

library(rbenchmark)

benchmark(replications = 1, order = "elapsed",
   merge = merge(d1, d2),
   plyr = join(d1, d2),
   data.table = {
      dt1 <- data.table(d1, key = "x")
      dt2 <- data.table(d2, key = "x")
      data.frame( dt1[dt2, list(x, y1, y2 = dt2$y2)] )
   },
   sqldf = sqldf(c("create index ix1 on d1(x)",
      "select * from main.d1 join d2 using(x)"))
)

set.seed(123)
N <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(x = sample(N, N), y = rnorm(N), g1, g2)

benchmark(replications = 1, order = "elapsed",
   aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),
   data.table = {
      dt <- data.table(d, key = "g1,g2")
      dt[, colMeans(cbind(x, y)), by = "g1,g2"]
   },
   plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy = mean(y)),
   sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

The output from the first benchmark call, comparing the merge calculations, is:

Joining by: x
        test replications elapsed relative user.self sys.self user.child sys.child
3 data.table            1    0.34 1.000000      0.31     0.01         NA        NA
2       plyr            1    0.44 1.294118      0.39     0.02         NA        NA
1      merge            1    1.17 3.441176      1.10     0.04         NA        NA
4      sqldf            1    3.34 9.823529      3.24     0.04         NA        NA

The output from the benchmark call comparing the aggregate calculations is:

        test replications elapsed  relative user.self sys.self user.child sys.child
4      sqldf            1    2.81  1.000000      2.73     0.02         NA        NA
1  aggregate            1   14.89  5.298932     14.89     0.00         NA        NA
2 data.table            1  132.46 47.138790    131.70     0.08         NA        NA
3       plyr            1  212.69 75.690391    211.57     0.56         NA        NA
answered by G. Grothendieck

The 132 seconds reported in Gabor's results for data.table is actually timing base functions colMeans and cbind (the memory allocation and copying induced by using those functions). There are good and bad ways of using data.table, too.

benchmark(replications = 1, order = "elapsed",
   aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),
   data.tableBad = {
      dt <- data.table(d, key = "g1,g2")
      dt[, colMeans(cbind(x, y)), by = "g1,g2"]
   },
   data.tableGood = {
      dt <- data.table(d, key = "g1,g2")
      dt[, list(mean(x), mean(y)), by = "g1,g2"]
   },
   plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy = mean(y)),
   sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

            test replications elapsed relative user.self sys.self
3 data.tableGood            1    0.15    1.000      0.16     0.00
5          sqldf            1    1.01    6.733      1.01     0.00
2  data.tableBad            1    1.63   10.867      1.61     0.01
1      aggregate            1    6.40   42.667      6.38     0.00
4           plyr            1  317.97 2119.800    265.12    51.05

packageVersion("data.table")
# [1] ‘1.8.2’
packageVersion("plyr")
# [1] ‘1.7.1’
packageVersion("sqldf")
# [1] ‘0.4.6.4’
R.version.string
# R version 2.15.1 (2012-06-22)

Please note that I don't know plyr well, so please do check with Hadley before relying on the plyr timings here. Also note that the data.table timings do include the time to convert to data.table and set the key, for fairness.
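If you want to see how much of that is conversion and keying overhead, one way (a sketch, not from the original answer) is to do the conversion once, outside the timed expression:

library(data.table)
library(rbenchmark)

# Same data as in the benchmarks above
set.seed(123)
N  <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d  <- data.frame(x = sample(N, N), y = rnorm(N), g1, g2)

dt <- data.table(d, key = c("g1", "g2"))  # convert and key once, untimed

benchmark(replications = 1,
   grouping.only = dt[, list(mean(x), mean(y)), by = c("g1", "g2")])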


This answer has been updated since it was originally written in Dec 2010. The previous benchmark results are below; see the revision history of this answer for what changed.

              test replications elapsed   relative user.self sys.self
4   data.tableBest            1   0.532   1.000000     0.488    0.020
7            sqldf            1   2.059   3.870301     2.041    0.008
3 data.tableBetter            1   9.580  18.007519     9.213    0.220
1        aggregate            1  14.864  27.939850    13.937    0.316
2  data.tableWorst            1 152.046 285.800752   150.173    0.556
6 plyrwithInternal            1 198.283 372.712406   189.391    7.665
5             plyr            1 225.726 424.296992   208.013    8.004
answered by Matt Dowle