I don't know how to take full advantage of GForce in data.table 1.9.2. The NEWS item says:
New optimization: GForce. Rather than grouping the data, the group locations are passed into grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups in a single sequential pass through the column for cache efficiency. Further, since the g* function is called just once, we don't need to find ways to speed up calling sum or mean repetitively for each group.
When I run the following code:
DT <- data.table(A=c(NA,NA,1:3), B=c("a",NA,letters[1:3]))
DT[,sum(A,na.rm=TRUE),by= B]
I got this
    B V1
1:  a  1
2: NA  0
3:  b  2
4:  c  3
and when trying DT[,sum(A,na.rm=FALSE),by= B], I got
    B V1
1:  a NA
2: NA NA
3:  b  2
4:  c  3
Do these results show what GForce does, i.e. that it adds the na.rm = TRUE/FALSE option?
Thanks a lot!
It's nothing to do with na.rm. What you show worked fine before as well. However, I can see why you might have thought that. Here is the rest of the same NEWS item :
Examples where GForce applies now :
DT[,sum(x,na.rm=),by=...] # yes
DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...] # yes
DT[,lapply(.SD,sum,na.rm=),by=...] # yes
DT[,list(sum(x),min(y)),by=...] # no. gmin not yet available
GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
Reminder: to see the optimizations and other info, set verbose=TRUE
You don't need to do anything to benefit, it's an automatic optimization.
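To make the idea in the NEWS item more concrete, here is a rough base R sketch of what "one sequential pass using the group locations" means. It is only an illustration: the real gsum is implemented in C inside data.table, and the names gsum_sketch and grp_id are hypothetical, made up for this example.

```r
# Conceptual sketch only; NOT data.table's actual implementation.
# 'grp_id' is a hypothetical integer vector holding each row's group number.
gsum_sketch <- function(x, grp_id, n_groups) {
  ans <- numeric(n_groups)        # one accumulator per group
  for (i in seq_along(x)) {       # a single sequential pass through the column
    g <- grp_id[i]
    ans[g] <- ans[g] + x[i]       # add the value to its group's running total
  }
  ans
}

x      <- c(1, 2, 3, 4, 5)
grp    <- c("a", "b", "a", "c", "b")
grp_id <- match(grp, unique(grp)) # groups numbered by first appearance
sums   <- gsum_sketch(x, grp_id, length(unique(grp)))
names(sums) <- unique(grp)
print(sums)
```

The point is that the summing function is called once for the whole column, rather than once per group on a separate subset of the data.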
Here's an example on 500 million rows and 4 columns (13GB). First create and illustrate the data :
$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(data.table)
Loading required package: data.table
data.table 1.9.2 For help type: help("data.table")
> DT = data.table( grp = sample(1e6,5e8,replace=TRUE),
a = rnorm(1e6),
b = rnorm(1e6),
c = rnorm(1e6))
> tables()
NAME NROW MB COLS KEY
[1,] DT 500,000,000 13352 grp,a,b,c
Total: 13,352MB
> print(DT)
grp a b c
1e+00: 695059 -1.4055192 1.587540028 1.7104991
2e+00: 915263 -0.8239298 -0.513575696 -0.3429516
3e+00: 139937 -0.2202024 0.971816721 1.0597421
4e+00: 651525 1.0026858 -1.157824780 0.3100616
5e+00: 438180 1.1074729 -2.513939427 0.8357155
---
5e+08: 705823 -1.4773420 0.004369457 -0.2867529
5e+08: 716694 -0.6826147 -0.357086020 -0.4044164
5e+08: 217509 0.4939808 -0.012797093 -1.1084564
5e+08: 501760 1.7081212 -1.772721799 -0.7119432
5e+08: 765653 -1.1141456 -1.569578263 0.4947304
Now time with GForce optimization on (the default). Notice that there is no setkey first. This is what's known as a cold by or ad hoc by, which is common practice when you want to group in lots of different ways.
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
user system elapsed
47.520 5.651 53.173
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
user system elapsed
47.372 5.676 53.049 # immediate repeat to confirm timing
Now turn off GForce optimization (as per NEWS item) to see the difference it makes :
> options(datatable.optimize=1)
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
user system elapsed
97.274 3.383 100.659
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
user system elapsed
97.199 3.423 100.624 # immediate repeat to confirm timing
Finally, confirm the results are the same :
> identical(ans1,ans2)
[1] TRUE
> print(ans1)
grp a b c
1: 695059 16.791281 13.269647 -10.663118
2: 915263 43.312584 -33.587933 4.490842
3: 139937 3.967393 -10.386636 -3.766019
4: 651525 -4.152362 9.339594 7.740136
5: 438180 4.725874 26.328877 9.063309
---
999996: 372601 -2.087248 -19.936420 21.172860
999997: 13912 18.414226 -1.744378 -7.951381
999998: 150074 -4.031619 8.433173 -22.041731
999999: 385718 11.527876 6.807802 7.405016
1000000: 906246 -13.857315 -23.702011 6.605254
Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
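A small sketch of the difference between by= and keyby=, on a toy table invented for this illustration (the column names grp, a and total are not from the timing example above):

```r
library(data.table)

# Toy data: groups first appear in the order 3, 1, 2
DT_small <- data.table(grp = c(3L, 1L, 3L, 2L, 1L),
                       a   = c(10, 20, 30, 40, 50))

# by= keeps groups in order of first appearance
by_ans <- DT_small[, list(total = sum(a)), by = grp]

# keyby= returns the groups sorted and sets the key on the result
keyby_ans <- DT_small[, list(total = sum(a)), keyby = grp]

print(by_ans)
print(keyby_ans)
print(key(keyby_ans))
```

Both calls compute the same totals; only the row order (and the key on the result) differs.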
To turn GForce optimization back on (the default is Inf, to benefit from all optimizations) :
> options(datatable.optimize=Inf)
Aside : if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function to each column by group. For example, these two lines are equivalent :
DT[, lapply(.SD,sum), by=grp] # (1)
DT[, list(sum(a),sum(b),sum(c)), by=grp] # (2) same values, but the result columns are named V1, V2, V3
The first (1) is more useful as you have more columns, especially in combination with .SDcols to control which subset of columns to apply the function to.
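As a minimal sketch of .SDcols, here is a toy table (made up for this illustration) where only columns a and b are summed. Naming the list elements in the second form gives the same column names as the lapply(.SD, ...) form:

```r
library(data.table)

# Toy data with three value columns
DT_cols <- data.table(grp = c("x", "y", "x", "y"),
                      a = 1:4, b = 5:8, c = 9:12)

# Apply sum to columns a and b only, by group
ans_sd <- DT_cols[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]

# The same computation written out explicitly, with named list elements
ans_list <- DT_cols[, list(a = sum(a), b = sum(b)), by = grp]

print(ans_sd)
print(all.equal(ans_sd, ans_list))
```

With many columns, the .SDcols form avoids typing one sum() per column.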
The NEWS item was just trying to convey that it doesn't matter which of these syntaxes is used, or whether you pass na.rm or not; GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since min isn't done yet; only mean and sum have GForce optimizations currently. You can use verbose=TRUE to see if GForce is being applied.
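A short sketch of checking this yourself, on a toy table invented for the illustration. The exact wording of the verbose messages varies between data.table versions, so none is reproduced here; look for a line that mentions GForce. Later releases may optimize more functions than the 1.9.2 NEWS item lists, so the second call may or may not be optimized on your version.

```r
library(data.table)

DT_v <- data.table(grp = c("a", "b", "a", "b"),
                   x = c(1, 2, 3, 4),
                   y = c(10, 20, 30, 40))

# sum() and mean() mixed in one call: per the NEWS item, GForce applies
ans_mixed <- DT_v[, list(sx = sum(x), my = mean(y)), by = grp, verbose = TRUE]

# Adding min(): per the 1.9.2 NEWS item, GForce does not apply here
# (assumption: newer versions may behave differently; check the verbose output)
ans_min <- DT_v[, list(sx = sum(x), mn = min(y)), by = grp, verbose = TRUE]

print(ans_mixed)
print(ans_min)
```

Either way the results are the same; the optimization only affects how fast they are computed.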
Details of the machine used for this timing :
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2494.022
BogoMIPS: 4988.04
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7