
How do I evaluate columns inside data.table with different conditions

Tags:

r

data.table

Given a data.table constructed as follows:

library(data.table)
set.seed(100)
dt <- data.table(a=c(1:3, 1), b = c(1,0,1, 3), c = c(1,2,1,3), x = rnorm(4), y = rnorm(4), d = c(4, 6, 6, 7)) 

dt returns:

   a b c           x          y d
1: 1 1 1 -0.50219235  0.1169713 4
2: 2 0 2  0.13153117  0.3186301 6
3: 3 1 1 -0.07891709 -0.5817907 6
4: 1 3 3  0.88678481  0.7145327 7

Any value in columns "a", "b", and "c" that is equal to 3 should become TRUE.

Likewise, any value in column "d" that is equal to 6 should become TRUE.

How do I evaluate these conditions inside dt, referring to the columns by name ("a", "b", "c", and "d"), so that the result is:

       a     b     c           x          y     d
1: FALSE FALSE FALSE -0.50219235  0.1169713 FALSE
2: FALSE FALSE FALSE  0.13153117  0.3186301  TRUE
3:  TRUE FALSE FALSE -0.07891709 -0.5817907  TRUE
4: FALSE  TRUE  TRUE  0.88678481  0.7145327 FALSE

Thank you

asked Jul 10 '14 by newbie


1 Answer

The approach I came up with looks like the following:

dt[, c("a", "b", "c") := lapply(.SD, `==`, 3), 
   .SDcols = c("a", "b", "c")][, d := (d == 6)][]
#        a     b     c           x          y     d
# 1: FALSE FALSE FALSE -0.50219235  0.1169713 FALSE
# 2: FALSE FALSE FALSE  0.13153117  0.3186301  TRUE
# 3:  TRUE FALSE FALSE -0.07891709 -0.5817907  TRUE
# 4: FALSE  TRUE  TRUE  0.88678481  0.7145327 FALSE

It doesn't win any points in terms of readability, but seems to be OK in terms of performance.
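
To see what the right-hand side of := evaluates to, here's the j expression on its own, run against the original four-row dt from the question (before applying the conversion above); this is just an illustrative sketch, not part of the answer itself. .SDcols restricts .SD to columns "a", "b" and "c", and lapply() applies `==` with 3 to each of them, giving the list of logical vectors that := then writes back into those columns:

dt[, lapply(.SD, `==`, 3), .SDcols = c("a", "b", "c")]
#        a     b     c
# 1: FALSE FALSE FALSE
# 2: FALSE FALSE FALSE
# 3:  TRUE FALSE FALSE
# 4: FALSE  TRUE  TRUE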

Here's some sample data to test:

library(data.table)
set.seed(100)
Nrow = 3000000
dt <- data.table(a = sample(10, Nrow, TRUE), 
                 b = sample(10, Nrow, TRUE), 
                 c = sample(10, Nrow, TRUE), 
                 x = rnorm(Nrow), 
                 y = rnorm(Nrow),
                 d = sample(10, Nrow, TRUE)) 

... some functions to test...

fun1 <- function(indt) {
  # the lapply()/.SD approach shown above
  indt[, c("a", "b", "c") := lapply(.SD, `==`, 3), 
     .SDcols = c("a", "b", "c")][, d := (d == 6)][]
}

fun2 <- function(indt) {
  # loop over the columns by name; get(i) fetches column i inside j
  for (i in c("a","b","c")) indt[, (i):=get(i)==3]
  for (i in c("d"))         indt[, (i):=get(i)==6]
  indt
}

fun3 <- function(indt) {
  # .SDcols = col makes .SD a one-column data.table, which is compared to x
  f <- function(col,x) indt[,(col):=(.SD==x),.SDcols=col]
  lapply(list("a","b","c"), f, 3)
  lapply(list("d"), f, 6)
  indt
}

... and some timings...

microbenchmark(fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), times = 10)
# Unit: milliseconds
#            expr      min        lq    median        uq       max neval
#  fun1(copy(dt)) 518.6034  535.0848  550.3178  643.2968  695.5819    10
#  fun2(copy(dt)) 830.5808 1037.8790 1172.6684 1272.6236 1608.9753    10
#  fun3(copy(dt)) 922.6474 1029.8510 1097.7520 1145.1848 1340.2009    10

identical(fun1(copy(dt)), fun2(copy(dt)))
# [1] TRUE
identical(fun2(copy(dt)), fun3(copy(dt)))
# [1] TRUE

At this scale, I would go for whatever is most readable to you (unless those milliseconds really count), but if your data are larger, you might want to experiment a little more with the different options.
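
If readability is the priority, one option (a sketch of mine rather than something from the thread; the helper name flag_equal is made up) is to wrap the column/value pairs in a small helper that loops with set(), one column at a time, in the same spirit as fun4 in the addition below:

flag_equal <- function(indt, cols, value) {
  # set() assigns by reference, so each column is replaced in place
  for (j in cols) set(indt, j = j, value = indt[[j]] == value)
  indt
}

flag_equal(dt, c("a", "b", "c"), 3)
flag_equal(dt, "d", 6)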


Addition from Matt

Agreed. To follow up on the comment, here's fun4, but it's only a smidgen faster at this size (3e6 rows, 90 MB):

fun4 <- function(indt) {
  # set() assigns by reference, one column at a time, skipping [.data.table overhead
  for (i in c("a","b","c")) set(indt,NULL,i,indt[[i]]==3)
  for (i in c("d"))         set(indt,NULL,i,indt[[i]]==6)
  indt
}

microbenchmark(copy(dt), fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), 
               fun4(copy(dt)), times = 10)
# Unit: milliseconds
#            expr        min         lq     median         uq       max neval
#        copy(dt)   64.13398   65.94222   68.32217   82.39942  110.3293    10
#  fun1(copy(dt))  601.84611  618.69288  690.47179  713.56760  766.1534    10
#  fun2(copy(dt))  887.99727  950.33821  978.98988 1071.31253 1180.1281    10
#  fun3(copy(dt)) 1566.90858 1574.30635 1603.55467 1673.38625 1771.4054    10
#  fun4(copy(dt))  566.43528  568.91103  575.06881  672.44021  692.9839    10

identical(fun1(copy(dt)), fun4(copy(dt)))
# [1] TRUE

Next I increased the data size by 10 times to 30 million rows, 915MB.

Note these timings are now in seconds, and on my slow netbook.

set.seed(100)
Nrow = 30000000
dt <- data.table(a = sample(10, Nrow, TRUE), 
              b = sample(10, Nrow, TRUE), 
              c = sample(10, Nrow, TRUE), 
              x = rnorm(Nrow), 
              y = rnorm(Nrow),
              d = sample(10, Nrow, TRUE)) 
object.size(dt)/1024^2
# 915 MB
microbenchmark(copy(dt),fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), 
                 fun4(copy(dt)), times = 3)
# Unit: seconds
#            expr       min        lq    median       uq      max neval
#        copy(dt)   8.04262  53.68556  99.32849 269.4414 439.5544     3
#  fun1(copy(dt)) 207.70646 260.16710 312.62775 317.8966 323.1654     3
#  fun2(copy(dt)) 421.78934 502.03503 582.28073 658.0680 733.8553     3
#  fun3(copy(dt)) 104.30914 187.49875 270.68836 384.7804 498.8724     3
#  fun4(copy(dt)) 158.17239 165.35898 172.54557 183.4851 194.4246     3

Here, fun4 is on average fastest by quite a bit, I guess due to the memory efficiency of looping over one column at a time. In fun1 and fun3, the RHS of := is three columns wide before it is assigned to the three target columns. Having said that, why is my earlier fun2 the slowest then? It goes column by column after all. Maybe get() copies the column before it goes into ==.
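
For a rough sense of how much temporary data that wide RHS involves (my own illustration, not part of the original answer), you can materialise the list of logical vectors that fun1's := receives and check its size: with 30 million rows, that's three full-length logical columns held in memory before the assignment happens, whereas set() in fun4 only ever holds one such vector at a time.

rhs <- dt[, lapply(.SD, `==`, 3), .SDcols = c("a", "b", "c")]
print(object.size(rhs), units = "Mb")
# ~343 Mb: 3 columns x 30e6 rows x 4 bytes per logical, all alive before :=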

There was one run where fun3 was fastest (104 vs 158), and I'm not sure I trust microbenchmark on that. I seem to remember some criticism of microbenchmark by Radford Neal, but I don't recall the outcome.

Those timings were on my really slow netbook:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              1995.06
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2     bit64_0.9-3          bit_1.1-11

answered Dec 08 '22 by A5C1D2H2I1M1N2O1R2T1