Can someone please help me evaluate at which data frame size data.table becomes faster for searches? In my use case the data frames will have 24,000 rows and 560,000 rows. Blocks of 40 rows are always singled out for further use.
Example: DF is a data frame with 120 rows, 7 columns (x1 to x7); "string" occupies the first 40 rows of x1.
DF2 is 1000 times DF => 120,000 rows
For the size of DF data.table is slower, for the size of DF2 it is faster.
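For reference, here is one minimal way DF and DF2 could be built to match that description (the filler value "other" and this particular construction are my assumptions; the original post doesn't show them):

library(data.table)
library(microbenchmark)

# 120 rows, 7 character columns x1..x7; "string" fills the first 40 rows of x1
DF <- data.frame(matrix("other", nrow = 120, ncol = 7), stringsAsFactors = FALSE)
names(DF) <- paste0("x", 1:7)
DF$x1[1:40] <- "string"

# DF2 stacks 1000 copies of DF => 120,000 rows
DF2 <- do.call(rbind, replicate(1000, DF, simplify = FALSE))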
Code:
> DT <- data.table(DF)
> setkey(DT, x1)
>
> DT2 <- data.table(DF2)
> setkey(DT2, x1)
>
> microbenchmark(DF[DF$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF[DF$x1 == "string", ] 282.578 290.8895 297.0005 304.5785 2394.09 100
> microbenchmark(DT[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT[.("string")] 1473.512 1500.889 1536.09 1709.89 6727.113 100
>
>
> microbenchmark(DF2[DF2$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF2[DF2$x1 == "string", ] 31090.4 34694.74 35537.58 36567.18 61230.41 100
> microbenchmark(DT2[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT2[.("string")] 1327.334 1350.801 1391.134 1457.378 8440.668 100
Comparing the timings above: the data.table lookup code is also shorter than the data.frame code, and at the larger size it returns the result far more quickly; this speed is a large part of why data.table is so widely used.
There are a number of reasons why data.table is fast, but a key one is that, unlike many other tools, it allows you to modify your table by reference, so it is changed in-situ rather than requiring the object to be recreated with your modifications.
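As a small illustration of that point (the toy table and the column new_col are just for demonstration): := adds a column to the existing object in place, whereas the usual data.frame idiom produces a modified copy that is re-assigned to the name.

library(data.table)

DT <- data.table(x1 = c("string", "other"), x2 = 1:2)
addr_before <- address(DT)

# := modifies DT by reference: the column is added in-situ, no copy is made
DT[, new_col := x2 * 2L]
identical(addr_before, address(DT))  # TRUE: the same (over-allocated) object was modified in place

# the data.frame idiom instead builds a modified copy and rebinds the name DF
DF <- data.frame(x1 = c("string", "other"), x2 = 1:2)
DF$new_col <- DF$x2 * 2L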
Data frames are lists of vectors of equal length, while a data.table inherits from data.frame. Therefore every data.table is a data.frame, but not every data.frame is a data.table.
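A quick way to see this relationship:

library(data.table)

DT <- data.table(a = 1:3)
class(DT)          # "data.table" "data.frame" -- every data.table is a data.frame
is.data.frame(DT)  # TRUE

DF <- data.frame(a = 1:3)
is.data.table(DF)  # FALSE -- a plain data.frame is not a data.table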
Although R is notorious for being a slow language, data.table generally runs faster than Python's pandas (benchmark) and even gives Spark a run for its money, as long as R can process data of that size.
library(microbenchmark)
library(data.table)
timings <- sapply(1:10, function(n) {
  # data set with 2^n unique ids, 40 rows per id (blocks of 40, as in the question)
  DF <- data.frame(id  = rep(as.character(seq_len(2^n)), each = 40),
                   val = rnorm(40 * 2^n),
                   stringsAsFactors = FALSE)
  DT <- data.table(DF, key = "id")
  # pick an id to look up (guard the n = 1 case, where n - 1 would index nothing)
  tofind <- unique(DF$id)[max(n - 1L, 1L)]
  print(microbenchmark(DF[DF$id == tofind, ],
                       DT[DT$id == tofind, ],
                       DT[id == tofind],
                       `[.data.frame`(DT, DT$id == tofind, ),
                       DT[tofind]),
        unit = "ns")$median
})
matplot(1:10, log10(t(timings)), type = "l", lty = 1, col = 1:5,
        xlab = "log2(n)", ylab = "log10(median (ns))")
legend("topleft",
       legend = c("DF[DF$id == tofind, ]",
                  "DT[DT$id == tofind, ]",
                  "DT[id == tofind]",
                  "`[.data.frame`(DT,DT$id==tofind,)",
                  "DT[tofind]"),
       col = 1:5, lty = 1)
data.table has made a few updates since this was written (a bit more overhead was added to [.data.table as a few more arguments / robustness checks were built in, but auto-indexing was also introduced). Here's an updated version as of the January 13, 2016 version of 1.9.7 from GitHub:

[Plot: updated benchmark timings under data.table 1.9.7]

The main innovation is that the third option now leverages auto-indexing. The main conclusion remains the same -- if your table is of any nontrivial size (roughly larger than 500 observations), data.table's within-frame calling is faster.
(Notes about the updated plot: some minor things changed (un-logging the y-axis, expressing times in microseconds, changing the x-axis labels, adding a title), but one non-trivial change is that I updated the microbenchmark calls to add some stability to the estimates -- namely, I set the times argument to as.integer(1e5/2^n).)
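Since auto-indexing is the main change highlighted above, here is a small sketch of how it behaves (the table and lookup value are illustrative; indices() and the datatable.auto.index option are part of data.table): the first DT[id == ...] subset on an unkeyed column builds a secondary index, and later subsets on that column reuse it for a binary search instead of a full vector scan.

library(data.table)

DT <- data.table(id = as.character(seq_len(1e5)), val = rnorm(1e5))

indices(DT)        # NULL -- no secondary index yet
DT[id == "12345"]  # first == subset on id builds an index automatically
indices(DT)        # "id" -- subsequent id == ... subsets reuse the index

# auto-indexing can be turned off to force a plain vector scan for comparison:
# options(datatable.auto.index = FALSE)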