
How do I evaluate columns inside data.table with different conditions

Tags:

r

data.table

Given a data.table constructed as follows:

library(data.table)
set.seed(100)
dt <- data.table(a=c(1:3, 1), b = c(1,0,1, 3), c = c(1,2,1,3), x = rnorm(4), y = rnorm(4), d = c(4, 6, 6, 7)) 

dt returns:

   a b c           x          y d
1: 1 1 1 -0.50219235  0.1169713 4
2: 2 0 2  0.13153117  0.3186301 6
3: 3 1 1 -0.07891709 -0.5817907 6
4: 1 3 3  0.88678481  0.7145327 7

Any value in columns "a", "b", and "c" that is equal to 3 should become TRUE.

Likewise, any value in column "d" that is equal to 6 should become TRUE.

How do I evaluate these conditions inside dt, referring to the columns by name ("a", "b", "c", and "d"), so that the result is:

       a     b     c           x          y     d
1: FALSE FALSE FALSE -0.50219235  0.1169713 FALSE
2: FALSE FALSE FALSE  0.13153117  0.3186301  TRUE
3:  TRUE FALSE FALSE -0.07891709 -0.5817907  TRUE
4: FALSE  TRUE  TRUE  0.88678481  0.7145327 FALSE

Thank you

asked Jul 10 '14 by newbie


1 Answer

The approach I came up with looks like the following:

dt[, c("a", "b", "c") := lapply(.SD, `==`, 3), 
   .SDcols = c("a", "b", "c")][, d := (d == 6)][]
#        a     b     c           x          y     d
# 1: FALSE FALSE FALSE -0.50219235  0.1169713 FALSE
# 2: FALSE FALSE FALSE  0.13153117  0.3186301  TRUE
# 3:  TRUE FALSE FALSE -0.07891709 -0.5817907  TRUE
# 4: FALSE  TRUE  TRUE  0.88678481  0.7145327 FALSE

It doesn't win any points in terms of readability, but seems to be OK in terms of performance.
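
To see what the right-hand side of := evaluates to, here's the j expression on its own, run against the original four-row dt from the question (before applying the conversion above); this is just an illustrative sketch, not part of the answer itself. .SDcols restricts .SD to columns "a", "b" and "c", and lapply() applies `==` with 3 to each of them, giving the list of logical vectors that := then writes back into those columns:

dt[, lapply(.SD, `==`, 3), .SDcols = c("a", "b", "c")]
#        a     b     c
# 1: FALSE FALSE FALSE
# 2: FALSE FALSE FALSE
# 3:  TRUE FALSE FALSE
# 4: FALSE  TRUE  TRUE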

Here's some sample data to test:

library(data.table)
set.seed(100)
Nrow = 3000000
dt <- data.table(a = sample(10, Nrow, TRUE), 
                 b = sample(10, Nrow, TRUE), 
                 c = sample(10, Nrow, TRUE), 
                 x = rnorm(Nrow), 
                 y = rnorm(Nrow),
                 d = sample(10, Nrow, TRUE)) 

... some functions to test...

fun1 <- function(indt) {
  # the lapply()/.SD approach shown above
  indt[, c("a", "b", "c") := lapply(.SD, `==`, 3), 
     .SDcols = c("a", "b", "c")][, d := (d == 6)][]
}

fun2 <- function(indt) {
  # loop over the columns by name; get(i) fetches column i inside j
  for (i in c("a","b","c")) indt[, (i):=get(i)==3]
  for (i in c("d"))         indt[, (i):=get(i)==6]
  indt
}

fun3 <- function(indt) {
  # .SDcols = col makes .SD a one-column data.table, which is compared to x
  f <- function(col,x) indt[,(col):=(.SD==x),.SDcols=col]
  lapply(list("a","b","c"), f, 3)
  lapply(list("d"), f, 6)
  indt
}

... and some timings...

microbenchmark(fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), times = 10)
# Unit: milliseconds
#            expr      min        lq    median        uq       max neval
#  fun1(copy(dt)) 518.6034  535.0848  550.3178  643.2968  695.5819    10
#  fun2(copy(dt)) 830.5808 1037.8790 1172.6684 1272.6236 1608.9753    10
#  fun3(copy(dt)) 922.6474 1029.8510 1097.7520 1145.1848 1340.2009    10

identical(fun1(copy(dt)), fun2(copy(dt)))
# [1] TRUE
identical(fun2(copy(dt)), fun3(copy(dt)))
# [1] TRUE

At this scale, I would go for whatever is most readable to you (unless those milliseconds really count), but if your data are larger, you might want to experiment a little more with the different options.
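
If readability is the priority, one option (a sketch of mine rather than something from the thread; the helper name flag_equal is made up) is to wrap the column/value pairs in a small helper that loops with set(), one column at a time, in the same spirit as fun4 in the addition below:

flag_equal <- function(indt, cols, value) {
  # set() assigns by reference, so each column is replaced in place
  for (j in cols) set(indt, j = j, value = indt[[j]] == value)
  indt
}

flag_equal(dt, c("a", "b", "c"), 3)
flag_equal(dt, "d", 6)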


Addition from Matt

Agreed. To follow up on the comment, here's fun4, but it's only a smidgen faster at this size (3e6 rows, 90 MB):

fun4 <- function(indt) {
  # set() assigns by reference, one column at a time, skipping [.data.table overhead
  for (i in c("a","b","c")) set(indt,NULL,i,indt[[i]]==3)
  for (i in c("d"))         set(indt,NULL,i,indt[[i]]==6)
  indt
}

microbenchmark(copy(dt), fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), 
               fun4(copy(dt)), times = 10)
# Unit: milliseconds
#            expr        min         lq     median         uq       max neval
#        copy(dt)   64.13398   65.94222   68.32217   82.39942  110.3293    10
#  fun1(copy(dt))  601.84611  618.69288  690.47179  713.56760  766.1534    10
#  fun2(copy(dt))  887.99727  950.33821  978.98988 1071.31253 1180.1281    10
#  fun3(copy(dt)) 1566.90858 1574.30635 1603.55467 1673.38625 1771.4054    10
#  fun4(copy(dt))  566.43528  568.91103  575.06881  672.44021  692.9839    10

identical(fun1(copy(dt)), fun4(copy(dt)))
# [1] TRUE

Next I increased the data size by 10 times to 30 million rows, 915MB.

Note these timings are now in seconds, and on my slow netbook.

set.seed(100)
Nrow = 30000000
dt <- data.table(a = sample(10, Nrow, TRUE), 
              b = sample(10, Nrow, TRUE), 
              c = sample(10, Nrow, TRUE), 
              x = rnorm(Nrow), 
              y = rnorm(Nrow),
              d = sample(10, Nrow, TRUE)) 
object.size(dt)/1024^2
# 915 MB
microbenchmark(copy(dt),fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), 
                 fun4(copy(dt)), times = 3)
# Unit: seconds
#            expr       min        lq    median       uq      max neval
#        copy(dt)   8.04262  53.68556  99.32849 269.4414 439.5544     3
#  fun1(copy(dt)) 207.70646 260.16710 312.62775 317.8966 323.1654     3
#  fun2(copy(dt)) 421.78934 502.03503 582.28073 658.0680 733.8553     3
#  fun3(copy(dt)) 104.30914 187.49875 270.68836 384.7804 498.8724     3
#  fun4(copy(dt)) 158.17239 165.35898 172.54557 183.4851 194.4246     3

Here, fun4 is on average fastest by quite a bit, I guess due to the memory efficiency of looping over one column at a time. In fun1 and fun3, the RHS of := is three columns wide before it is assigned to the three target columns. Having said that, why is my earlier fun2 the slowest then? It goes column by column after all. Maybe get() copies the column before it goes into ==.
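
For a rough sense of how much temporary data that wide RHS involves (my own illustration, not part of the original answer), you can materialise the list of logical vectors that fun1's := receives and check its size: with 30 million rows, that's three full-length logical columns held in memory before the assignment happens, whereas set() in fun4 only ever holds one such vector at a time.

rhs <- dt[, lapply(.SD, `==`, 3), .SDcols = c("a", "b", "c")]
print(object.size(rhs), units = "Mb")
# ~343 Mb: 3 columns x 30e6 rows x 4 bytes per logical, all alive before :=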

There was one run where fun3 was fastest (104 vs 158), and I'm not sure I trust microbenchmark on that. I seem to remember some criticism of microbenchmark by Radford Neal, but I don't recall the outcome.

Those timings were on my really slow netbook:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              1995.06
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2     bit64_0.9-3          bit_1.1-11

answered Dec 08 '22 by A5C1D2H2I1M1N2O1R2T1