
About GForce in data.table 1.9.2

Tags: r, data.table

I don't know how to take full advantage of GForce in data.table 1.9.2. The NEWS item says:

New optimization: GForce. Rather than grouping the data, the group locations are passed into grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups in a single sequential pass through the column for cache efficiency. Further, since the g*function is called just once, we don't need to find ways to speed up calling sum or mean repetitively for each group.

When I run the following code:

DT <- data.table(A=c(NA,NA,1:3), B=c("a",NA,letters[1:3]))
DT[,sum(A,na.rm=TRUE),by= B]

I got this

    B V1
1:  a  1
2: NA  0
3:  b  2
4:  c  3

And when I try DT[,sum(A,na.rm=FALSE),by= B], I get:

    B V1
1:  a NA
2: NA NA
3:  b  2
4:  c  3

Do these results show what GForce does? Is it what adds the na.rm = TRUE/FALSE option?

Thanks a lot!

asked Mar 03 '14 02:03 by Bigchao


1 Answer

It's nothing to do with na.rm. What you show worked fine before as well. However, I can see why you might have thought that. Here is the rest of the same NEWS item :

Examples where GForce applies now :
    DT[,sum(x,na.rm=),by=...]                       # yes
    DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
    DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
    DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available
GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
Reminder: to see the optimizations and other info, set verbose=TRUE

You don't need to do anything to benefit, it's an automatic optimization.
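To make the "single sequential pass" idea from the NEWS item concrete, here is a conceptual sketch in plain R. It is not data.table's implementation (the real gsum is internal compiled code); `gsum_sketch` and its arguments are hypothetical names chosen for illustration. The point is only that each row's value is added to its group's accumulator in one sweep through the column, instead of calling sum() once per group.

```r
# Conceptual sketch only: NOT data.table's actual gsum.
# gsum_sketch, grp_id and ngrp are hypothetical names for illustration.
gsum_sketch <- function(x, grp_id, ngrp) {
  ans <- numeric(ngrp)            # one accumulator per group
  for (i in seq_along(x)) {       # single sequential pass through the column
    g <- grp_id[i]
    ans[g] <- ans[g] + x[i]
  }
  ans
}

x      <- c(10, 20, 30, 40, 50)
grp    <- c("a", "b", "a", "c", "b")
grp_id <- match(grp, unique(grp)) # group number, in order of first appearance
res    <- gsum_sketch(x, grp_id, length(unique(grp)))
print(res)                        # sums for groups a, b, c
```

A loop like this is slow in interpreted R; the sketch shows the access pattern, not the speed.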

Here's an example on 500 million rows and 4 columns (13GB). First create and illustrate the data :

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.9.2  For help type: help("data.table")

> DT = data.table( grp = sample(1e6,5e8,replace=TRUE), 
                   a = rnorm(1e6),
                   b = rnorm(1e6),
                   c = rnorm(1e6))
> tables()
     NAME        NROW    MB COLS      KEY
[1,] DT   500,000,000 13352 grp,a,b,c    
Total: 13,352MB
> print(DT)
          grp          a            b          c
1e+00: 695059 -1.4055192  1.587540028  1.7104991
2e+00: 915263 -0.8239298 -0.513575696 -0.3429516
3e+00: 139937 -0.2202024  0.971816721  1.0597421
4e+00: 651525  1.0026858 -1.157824780  0.3100616
5e+00: 438180  1.1074729 -2.513939427  0.8357155
   ---                                          
5e+08: 705823 -1.4773420  0.004369457 -0.2867529
5e+08: 716694 -0.6826147 -0.357086020 -0.4044164
5e+08: 217509  0.4939808 -0.012797093 -1.1084564
5e+08: 501760  1.7081212 -1.772721799 -0.7119432
5e+08: 765653 -1.1141456 -1.569578263  0.4947304

Now time it with GForce optimization on (the default). Notice that there is no setkey first. This is known as a cold by or ad hoc by, which is common practice when you want to group in lots of different ways.

> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.520   5.651  53.173 
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.372   5.676  53.049      # immediate repeat to confirm timing

Now turn off GForce optimization (as per NEWS item) to see the difference it makes :

> options(datatable.optimize=1)

> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.274   3.383 100.659 
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.199   3.423 100.624      # immediate repeat to confirm timing

Finally, confirm the results are the same :

> identical(ans1,ans2)
[1] TRUE
> print(ans1)
            grp          a          b          c
      1: 695059  16.791281  13.269647 -10.663118
      2: 915263  43.312584 -33.587933   4.490842
      3: 139937   3.967393 -10.386636  -3.766019
      4: 651525  -4.152362   9.339594   7.740136
      5: 438180   4.725874  26.328877   9.063309
     ---                                        
 999996: 372601  -2.087248 -19.936420  21.172860
 999997:  13912  18.414226  -1.744378  -7.951381
 999998: 150074  -4.031619   8.433173 -22.041731
 999999: 385718  11.527876   6.807802   7.405016
1000000: 906246 -13.857315 -23.702011   6.605254

Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
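A small illustration of the difference between by= and keyby=, on a toy table (the name `small` and its values are made up for this example):

```r
library(data.table)

# Toy data, invented for illustration
small <- data.table(grp = c(3L, 1L, 3L, 2L, 1L), a = 1:5)

res_by    <- small[, sum(a), by = grp]     # groups in order of first appearance
res_keyby <- small[, sum(a), keyby = grp]  # groups sorted, and result keyed by grp

print(res_by)
print(res_keyby)
```

The sums are the same in both results; only the row order and the key differ.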

To turn GForce optimization back on (default is Inf to benefit from all optimizations) :

> options(datatable.optimize=Inf)
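If you want to try the on/off comparison without 13GB of RAM, a scaled-down sketch follows. The table `small` and its sizes are arbitrary choices for illustration, integer columns are used so the sums are exact, and the timings on data this small will not be meaningful; it only shows the mechanics of switching the option and checking that the answers agree.

```r
library(data.table)

set.seed(1)
# Small made-up data; sizes chosen arbitrarily for illustration
small <- data.table(grp = sample(100L, 1e4, replace = TRUE),
                    a   = sample(10L,  1e4, replace = TRUE),
                    b   = sample(10L,  1e4, replace = TRUE))

old <- options(datatable.optimize = 1)     # level 1: GForce off
ans_off <- small[, lapply(.SD, sum), by = grp]

options(datatable.optimize = Inf)          # all optimizations, GForce included
ans_on <- small[, lapply(.SD, sum), by = grp]

options(old)                               # restore whatever was set before

same <- isTRUE(all.equal(ans_on, ans_off))
print(same)
```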

Aside : if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function to each column, by group. For example, these two lines are equivalent :

 DT[, lapply(.SD,sum), by=grp]               # (1)
 DT[, list(sum(a),sum(b),sum(c)), by=grp]    # (2) exactly the same

The first (1) is more useful as you have more columns, especially in combination with .SDcols to control which subset of columns to apply the function through.
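For instance, a minimal sketch of .SDcols on a toy table (the table `small` and its column values are invented here), summing only columns a and b and leaving c out:

```r
library(data.table)

# Toy data, invented for illustration
small <- data.table(grp = c("x", "y", "x", "y"),
                    a = 1:4, b = 5:8, c = 9:12)

# .SDcols restricts .SD to the named columns, so c is not summed
res <- small[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]
print(res)
```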

The NEWS item was just trying to convey that it doesn't matter which of these syntaxes is used, or whether you pass na.rm or not: GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since gmin isn't done yet; only sum and mean have GForce optimizations currently. You can use verbose=TRUE to see if GForce is being applied.
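A minimal sketch of checking with verbose=TRUE, on a toy table (`small` is made up for this example). The exact wording of the verbose messages depends on the data.table version, so no output is shown here; look for a line that mentions GForce.

```r
library(data.table)

# Toy data, invented for illustration
small <- data.table(grp = c("x", "y", "x"),
                    a = c(1, 2, 3),
                    b = c(4, 5, 6))

# verbose = TRUE prints information about the optimizations applied to j
res <- small[, list(sum(a), mean(b)), by = grp, verbose = TRUE]
print(res)
```

The result itself is the same with or without verbose=TRUE; the argument only adds diagnostic messages.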

Details of the machine used for this timing :

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.022
BogoMIPS:              4988.04
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7

answered Oct 20 '22 06:10 by Matt Dowle