Given the following data.table:
library(data.table)
set.seed(100)
dt <- data.table(a = c(1:3, 1), b = c(1, 0, 1, 3), c = c(1, 2, 1, 3),
                 x = rnorm(4), y = rnorm(4), d = c(4, 6, 6, 7))
dt
which returns:
a b c x y d
1: 1 1 1 -0.50219235 0.1169713 4
2: 2 0 2 0.13153117 0.3186301 6
3: 3 1 1 -0.07891709 -0.5817907 6
4: 1 3 3 0.88678481 0.7145327 7
Any values in columns "a", "b", and "c" that are equal to 3 should become TRUE, and any values in column "d" that are equal to 6 should become TRUE (all other values FALSE).
How do I evaluate this inside dt, referring to the columns by name ("a", "b", "c", and "d"), so that the result is:
a b c x y d
1: FALSE FALSE FALSE -0.50219235 0.1169713 FALSE
2: FALSE FALSE FALSE 0.13153117 0.3186301 TRUE
3: TRUE FALSE FALSE -0.07891709 -0.5817907 TRUE
4: FALSE TRUE TRUE 0.88678481 0.7145327 FALSE
Thank you
The approach I came up with looks like the following:
dt[, c("a", "b", "c") := lapply(.SD, `==`, 3),
.SDcols = c("a", "b", "c")][, d := (d == 6)][]
# a b c x y d
# 1: FALSE FALSE FALSE -0.50219235 0.1169713 FALSE
# 2: FALSE FALSE FALSE 0.13153117 0.3186301 TRUE
# 3: TRUE FALSE FALSE -0.07891709 -0.5817907 TRUE
# 4: FALSE TRUE TRUE 0.88678481 0.7145327 FALSE
It doesn't win any points in terms of readability, but seems to be OK in terms of performance.
Here's some sample data to test:
library(data.table)
set.seed(100)
Nrow = 3000000
dt <- data.table(a = sample(10, Nrow, TRUE),
                 b = sample(10, Nrow, TRUE),
                 c = sample(10, Nrow, TRUE),
                 x = rnorm(Nrow),
                 y = rnorm(Nrow),
                 d = sample(10, Nrow, TRUE))
... some functions to test...
# fun1: := with lapply over .SD for a/b/c, then a separate := for d
fun1 <- function(indt) {
  indt[, c("a", "b", "c") := lapply(.SD, `==`, 3),
       .SDcols = c("a", "b", "c")][, d := (d == 6)][]
}
# fun2: loop over the column names, using get() on the RHS of :=
fun2 <- function(indt) {
  for (i in c("a", "b", "c")) indt[, (i) := get(i) == 3]
  for (i in c("d")) indt[, (i) := get(i) == 6]
  indt
}
# fun3: helper that compares .SD to x, applied to each column in turn
fun3 <- function(indt) {
  f <- function(col, x) indt[, (col) := (.SD == x), .SDcols = col]
  lapply(list("a", "b", "c"), f, 3)
  lapply(list("d"), f, 6)
  indt
}
... and some timings...
library(microbenchmark)
microbenchmark(fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1(copy(dt)) 518.6034 535.0848 550.3178 643.2968 695.5819 10
# fun2(copy(dt)) 830.5808 1037.8790 1172.6684 1272.6236 1608.9753 10
# fun3(copy(dt)) 922.6474 1029.8510 1097.7520 1145.1848 1340.2009 10
identical(fun1(copy(dt)), fun2(copy(dt)))
# [1] TRUE
identical(fun2(copy(dt)), fun3(copy(dt)))
# [1] TRUE
At this scale, I would go for whatever is most readable to you (unless those milliseconds really count), but if your data are larger, you might want to experiment a little more with the different options.
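For example, one arguably more readable variant of the same idea (just a sketch; the named vector thresholds and the loop are my own illustration, not one of the benchmarked functions) pairs each column name with the value it should be compared against:
thresholds <- c(a = 3, b = 3, c = 3, d = 6)  # illustrative name: maps column -> comparison value
for (col in names(thresholds)) {
  dt[, (col) := get(col) == thresholds[[col]]]  # same get()-based update as fun2
}
It goes column by column like fun2, so I'd expect similar timings, but I haven't measured it.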
Addition from Matt
Agreed. To follow up on the comment, here's fun4, but it's only a smidgen faster on this size (3e6 rows, 90MB):
# fun4: set() by reference, one column at a time
fun4 <- function(indt) {
  for (i in c("a", "b", "c")) set(indt, NULL, i, indt[[i]] == 3)
  for (i in c("d")) set(indt, NULL, i, indt[[i]] == 6)
  indt
}
microbenchmark(copy(dt), fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)),
               fun4(copy(dt)), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# copy(dt) 64.13398 65.94222 68.32217 82.39942 110.3293 10
# fun1(copy(dt)) 601.84611 618.69288 690.47179 713.56760 766.1534 10
# fun2(copy(dt)) 887.99727 950.33821 978.98988 1071.31253 1180.1281 10
# fun3(copy(dt)) 1566.90858 1574.30635 1603.55467 1673.38625 1771.4054 10
# fun4(copy(dt)) 566.43528 568.91103 575.06881 672.44021 692.9839 10
identical(fun1(copy(dt)), fun4(copy(dt)))
# [1] TRUE
Next I increased the data size by 10 times to 30 million rows, 915MB.
Note these timings are now in seconds, and on my slow netbook.
set.seed(100)
Nrow = 30000000
dt <- data.table(a = sample(10, Nrow, TRUE),
                 b = sample(10, Nrow, TRUE),
                 c = sample(10, Nrow, TRUE),
                 x = rnorm(Nrow),
                 y = rnorm(Nrow),
                 d = sample(10, Nrow, TRUE))
object.size(dt)/1024^2
# 915 MB
microbenchmark(copy(dt), fun1(copy(dt)), fun2(copy(dt)), fun3(copy(dt)),
               fun4(copy(dt)), times = 3)
# Unit: seconds
# expr min lq median uq max neval
# copy(dt) 8.04262 53.68556 99.32849 269.4414 439.5544 3
# fun1(copy(dt)) 207.70646 260.16710 312.62775 317.8966 323.1654 3
# fun2(copy(dt)) 421.78934 502.03503 582.28073 658.0680 733.8553 3
# fun3(copy(dt)) 104.30914 187.49875 270.68836 384.7804 498.8724 3
# fun4(copy(dt)) 158.17239 165.35898 172.54557 183.4851 194.4246 3
Here, fun4 is on average fastest by quite a bit, I guess due to the memory efficiency of a for loop over one column at a time. In fun1 and fun3, the RHS of := is three columns wide before it's assigned to the three target columns. Having said that, why is my previous fun2 slowest then? It goes column by column after all. Maybe get() copies the column before going into ==.
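One way to probe that guess (a sketch I haven't run as part of these timings; tracemem() is base R and prints a message whenever the traced vector gets duplicated, provided R was built with memory profiling, as the standard binaries are):
small <- data.table(a = sample(10, 10, TRUE))  # toy table just for the check
tracemem(small$a)                  # start tracing the underlying column vector
small[, a := get("a") == 3]        # a "tracemem[...]" message here would suggest get() copied the column
untracemem(small$a)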
There was one run where fun3 was fastest (104 vs 158). I'm not sure I trust microbenchmark on that. I seem to remember some criticism of microbenchmark by Radford Neal, but don't recall the outcome.
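As a rough cross-check that doesn't depend on microbenchmark at all, single runs under base R's system.time() could be compared (not something I did for the numbers above):
system.time(fun3(copy(dt)))  # one-off timing, elapsed seconds
system.time(fun4(copy(dt)))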
Those timings were on my really slow netbook:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 1995.06
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.3-0 data.table_1.9.2 bit64_0.9-3 bit_1.1-11