unique rows in dplyr: row_number() from tbl_dt inconsistent with tbl_df

In brief:

I am wondering how to get unique rows from a data.table somewhere along a dplyr workflow. Since v0.2 I can use row_number() == 1 (see: Remove duplicated rows using dplyr).

BUT!

tbl_df(data) %>% group_by(Var1,Var2) %>% filter(row_number() == 1) works.

tbl_dt(data) %>% group_by(Var1,Var2) %>% filter(row_number() == 1) doesn't. Is this a bug?

setup:

library(dplyr)
library(data.table)
library(microbenchmark)

little <- expand.grid(rep(letters,121),rep(letters,121)) # my ~10M row dataset (676 unique Var1/Var2 pairs)
tbl_dt(little) %>% group_by(Var1,Var2) %>% filter(row_number() == 1)

result:

> Error in rank(x, ties.method = "first") : 
> argument "x" is missing, with no default

This is how I actually found it to be broken. I was asking:

this way or that way?

I can use the unique.data.table method:

dt_u <- function() {
          tbl_dt(little) %>%
          group_by(Var1,Var2) %>%
          unique(.) %>%
          tbl_dt(.) }

I can use summarise and then select away the new column:

dt_ss <- function() {
           tbl_dt(little) %>% 
           group_by(Var1,Var2) %>% 
           summarise( n = n() ) %>% 
           select( -(n) ) }

I can use row_number() == 1 (which, as shown above, DOESN'T WORK for tbl_dt!):

dt_rn <- function() {
          tbl_dt(little) %>%
          group_by(Var1,Var2) %>%
          filter( row_number() == 1 ) }

And so on with the tbl_df() equivalents, sketched below.
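For completeness, here is a sketch of those equivalents (assuming they simply mirror the tbl_dt() versions):

df_u <- function() {
          tbl_df(little) %>%
          group_by(Var1,Var2) %>%
          unique(.) }

df_ss <- function() {
          tbl_df(little) %>%
          group_by(Var1,Var2) %>%
          summarise( n = n() ) %>%
          select( -(n) ) }

df_rn <- function() {
          tbl_df(little) %>%
          group_by(Var1,Var2) %>%
          filter( row_number() == 1 ) }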

Benchmarking the equivalent data.table / data.frame methods with microbenchmark(..., times=20):

> Unit: milliseconds
>     expr       min        lq    median        uq       max neval
>  dt_ss()  579.0385  618.0002  661.9056  694.0705  764.2221    20
>  dt_u()   690.1284  729.8723  756.5505  783.7379  897.4799    20
>  df_ss()  419.7841  436.9871  448.1717  461.7023  523.2798    20
>  df_u()  3971.1699 4044.3663 4097.9848 4168.3468 4245.8346    20
>  df_rn()  646.1497  687.3472  711.3924  724.6235  754.3166    20
asked May 25 '14 by npjc

1 Answer

Interesting. Your benchmarks piqued my interest. I find it a bit odd that you don't compare against data.table's unique.data.table directly, so here are the results with that included as well, on my system.

# extra function with which the benchmark shown below was run
dt_direct <- function() unique(dt) # where dt = as.data.table(little)

# Unit: milliseconds
#         expr       min        lq    median        uq       max neval
#       dt_u() 1472.2460 1571.0871 1664.0476 1742.5184 2647.2118    20
#       df_u() 6084.2877 6303.9058 6490.1686 6844.8767 7370.3322    20
#      dt_ss() 1340.8479 1485.4064 1552.8756 1586.6706 1810.2979    20
#      df_ss()  799.5289  835.8599  884.6501  957.2208 1251.5994    20
#      df_rn() 1410.0145 1576.2033 1660.1124 1770.2645 2442.7578    20
#  dt_direct()  452.6010  463.6116  486.5015  568.0451  670.3673    20

It's 1.8x faster than the fastest solution from all your runs.
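For reference, these timings come from a call along the following lines (the exact call is an assumption; it isn't shown above):

microbenchmark(dt_u(), df_u(), dt_ss(), df_ss(), df_rn(), dt_direct(),
               times = 20L)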

Now, let's increase the number of unique groups from 676 to 10,000 and see what happens.

val = paste0("V", 1:100)
little <- data.frame(Var1=sample(val, 1e7, TRUE), Var2=sample(val, 1e7, TRUE))
dt <- as.data.table(little)

# Unit: milliseconds
#         expr      min        lq    median        uq       max neval
#       dt_u() 1709.458 1776.3510 1892.7761 1991.6339 2562.9171    20
#       df_u() 7541.364 7735.4725 7981.3483 8462.9093 9552.8629    20
#      dt_ss() 1555.110 1627.6519 1791.5219 1911.3594 2299.2864    20
#      df_ss() 1436.355 1500.1043 1528.1319 1649.3043 1961.9945    20
#      df_rn() 2001.396 2189.5164 2393.8861 2550.2198 3047.7019    20
#  dt_direct()  508.596  525.7299  577.6982  674.2288  893.2116    20

And here, it's 2.6x faster.

Note: I don't time the creation of dt here because, in real use cases, you can either use fread to get a data.table directly, use setDT to convert a data.frame to a data.table by reference, or use data.table(.) directly instead of data.frame(.), whose creation isn't timed either.
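For example (a sketch; the file name is hypothetical):

dt <- fread("little.csv")  # read straight into a data.table

setDT(little)              # or convert an existing data.frame by reference, no copy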


But why are both dt_u and dt_ss slower, then?

Looking at the files grouped-dt.r and manip-grouped-dt.r, this is happening because of (1) copies and (2) setting keys, where (1) is mostly a consequence of having to do (2). If you do a summarise operation using dplyr, it's equivalent to:

DT <- copy(DT)
setkey(DT, <group_cols>)  ## these two are in grouped_dt
DT[, j, by=<group_cols>]  ## this is in summarise.grouped_dt
DT <- copy(DT)            ## because it calls grouped_dt AGAIN!
## and sets the key again, which is O(n) now, as DT is first checked for sortedness..

I'm not sure why ad-hoc grouping was not implemented, after this discussion under Hadley's answer.

## equivalent ad-hoc by
DT[, j, by=<group_cols>]  ## no copy, no setkey

It avoids both the copies and the key setting.
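Concretely, the ad-hoc equivalent of the grouped summarise on this data would be a single expression (a sketch using the question's columns):

dt[, .(n = .N), by = .(Var1, Var2)]  # no copy of dt, no key set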


It is even worse if you mutate. It's effectively doing:

DT <- copy(DT)
setkey(DT, <group_cols>)          ## these two are in grouped_dt
DT <- copy(DT)                    ## mutate.grouped_dt copies the copied data again
DT[, `:=`(...), by=<group_cols>]  ## this is in mutate.grouped_dt
DT <- copy(DT)                    ## because of another call to grouped_dt!!!
## and sets the key again, which is O(n) now, as DT is first checked for sortedness..

Here again, the ad-hoc solution is simply:

DT <- copy(DT)
DT[, `:=`(...), by=<group_cols>]

It avoids two copies and setting the key. The only remaining copy is there to satisfy dplyr's philosophy of not modifying objects in place. So, in dplyr, this will always be slower and take up twice the memory.
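As a concrete sketch, an ad-hoc grouped mutate, for example adding a per-group row number, looks like this:

dt2 <- copy(dt)  # the single copy, only to leave dt untouched
dt2[, rn := seq_len(.N), by = .(Var1, Var2)]  # column added by reference; no key set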


Similarly, copies in some joins can be avoided, as I've commented here.


The NEWS item from dplyr v0.2 says:

  • dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).

But clearly quite a few of the cases discussed here haven't made it in.


So far, I've addressed the performance tag under your question. That is, if you're looking for performance, you should avoid all the cases that make (unnecessary) copies (and set keys), until this is fixed.

In essence, for this particular case, the best answer I could come up with is to just call unique.data.table directly, in a dplyr-ish way:

tbl_dt(little) %>% unique(.)
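And if you only need distinct combinations of a subset of columns, unique.data.table also takes a by argument in sufficiently recent versions (check that your installed data.table supports it):

unique(as.data.table(little), by = c("Var1", "Var2"))  # distinct (Var1, Var2) pairs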
answered by Arun