In brief:
I am wondering how to get unique rows from a data.table somewhere along a dplyr workflow. Since v0.2 I can use row_number() == 1 (see: Remove duplicated rows using dplyr).
BUT!
tbl_df(data) %>% group_by(Var1, Var2) %>% filter(row_number() == 1)
works, while
tbl_dt(data) %>% group_by(Var1, Var2) %>% filter(row_number() == 1)
doesn't. Is this a bug?
library(dplyr)
library(data.table)
library(microbenchmark)
little <- expand.grid(rep(letters,121),rep(letters,121)) # my 10M row dataset.
tbl_dt(little) %>% group_by(Var1,Var2) %>% filter(row_number() == 1)
> Error in rank(x, ties.method = "first") :
> argument "x" is missing, with no default
This is how I actually found it to be broken. I was asking:
I can use the unique.data.table method:
dt_u <- function() {
  tbl_dt(little) %>%
    group_by(Var1, Var2) %>%
    unique(.) %>%
    tbl_dt(.) }
I can use summarise then select away the new column:
dt_ss <- function() {
  tbl_dt(little) %>%
    group_by(Var1, Var2) %>%
    summarise( n = n() ) %>%
    select( -(n) ) }
I can use row_number() == 1
# DOESN'T WORK for tbl_dt!
dt_rn <- function() {
  tbl_dt(little) %>%
    group_by(Var1, Var2) %>%
    filter( row_number() == 1 ) }
And so on with the tbl_df() equivalents, sketched below.
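They are the same functions with tbl_df() in place of tbl_dt(); a minimal sketch (the exact definitions may have differed slightly):
df_u <- function() {
  tbl_df(little) %>%
    group_by(Var1, Var2) %>%
    unique(.) %>%
    tbl_df(.) }

df_ss <- function() {
  tbl_df(little) %>%
    group_by(Var1, Var2) %>%
    summarise( n = n() ) %>%
    select( -(n) ) }

df_rn <- function() {
  tbl_df(little) %>%
    group_by(Var1, Var2) %>%
    filter( row_number() == 1 ) }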
microbenchmark(..., times = 20)
> Unit: milliseconds
> expr min lq median uq max neval
> dt_ss() 579.0385 618.0002 661.9056 694.0705 764.2221 20
> dt_u() 690.1284 729.8723 756.5505 783.7379 897.4799 20
> df_ss() 419.7841 436.9871 448.1717 461.7023 523.2798 20
> df_u() 3971.1699 4044.3663 4097.9848 4168.3468 4245.8346 20
> df_rn() 646.1497 687.3472 711.3924 724.6235 754.3166 20
Interesting. Your benchmarks piqued my interest. I find it a bit odd that you don't compare against data.table's unique.data.table directly. So here are the results from including that as well, on my system.
# extra function with which the benchmark shown below was run
dt_direct <- function() unique(dt) # where dt = as.data.table(little)
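For reference, the benchmark call is roughly the following, reconstructed from the output below:
microbenchmark(dt_u(), df_u(), dt_ss(), df_ss(), df_rn(), dt_direct(), times = 20)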
# Unit: milliseconds
# expr min lq median uq max neval
# dt_u() 1472.2460 1571.0871 1664.0476 1742.5184 2647.2118 20
# df_u() 6084.2877 6303.9058 6490.1686 6844.8767 7370.3322 20
# dt_ss() 1340.8479 1485.4064 1552.8756 1586.6706 1810.2979 20
# df_ss() 799.5289 835.8599 884.6501 957.2208 1251.5994 20
# df_rn() 1410.0145 1576.2033 1660.1124 1770.2645 2442.7578 20
# dt_direct() 452.6010 463.6116 486.5015 568.0451 670.3673 20
That's about 1.8x faster than the fastest of your solutions.
Now, let's increase the number of unique values from 676 to about 10,000 and see what happens.
val = paste0("V", 1:100)
little <- data.frame(Var1=sample(val, 1e7, TRUE), Var2=sample(val, 1e7, TRUE))
dt <- as.data.table(little)
# Unit: milliseconds
# expr min lq median uq max neval
# dt_u() 1709.458 1776.3510 1892.7761 1991.6339 2562.9171 20
# df_u() 7541.364 7735.4725 7981.3483 8462.9093 9552.8629 20
# dt_ss() 1555.110 1627.6519 1791.5219 1911.3594 2299.2864 20
# df_ss() 1436.355 1500.1043 1528.1319 1649.3043 1961.9945 20
# df_rn() 2001.396 2189.5164 2393.8861 2550.2198 3047.7019 20
# dt_direct() 508.596 525.7299 577.6982 674.2288 893.2116 20
And here, it's 2.6x faster.
Note: I don't time the creation of dt here because, in real use cases, you can either use fread to get a data.table directly, use setDT to convert a data.frame to a data.table by reference, or use data.table(.) directly instead of data.frame(.) - and that creation is not timed here either.
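A quick sketch of those alternatives (the file name is hypothetical):
dt <- fread("little.csv")   ## fread() returns a data.table directly
setDT(little)               ## or convert the existing data.frame to a data.table by reference
dt <- data.table(Var1 = sample(val, 1e7, TRUE), Var2 = sample(val, 1e7, TRUE))  ## or construct one directly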
But why are both dt_u and dt_ss slower, then?
By looking at the files grouped-dt.r and manip-grouped-dt.r, this is happening because of 1) copies and 2) setting keys. (1) is basically a consequence of having to do (2). If you do a summarise operation using dplyr, it's equivalent to:
DT <- copy(DT);
setkey(DT, <group_cols>) ## these two are in grouped_dt
DT[, j, by=<group_cols>] ## this is in summarise.grouped_dt
DT <- copy(DT) ## because it calls grouped_dt AGAIN!
## and sets key again - which is O(n) now as DT is checked if sorted first..
I'm not sure why ad-hoc grouping was not implemented, after this discussion under Hadley's answer.
## equivalent ad-hoc by
DT[, j, by=<group_cols>] ## no copy, no setkey
It avoids both copies and setting key.
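For example, an ad-hoc grouped version of the unique-rows task itself (my own illustration, not what dplyr generates) would look like:
dt[, .SD[1L], by = list(Var1, Var2)]  ## first row per group; no copy, no setkey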
It is even worse if you mutate. It's effectively doing:
DT <- copy(DT)
setkey(DT, <group_cols>) ## these two are in grouped_dt
DT <- copy(DT) ## mutate.grouped_dt copies copied data again
DT[, `:=`(...), by=<group_cols>] ## this is in mutate.grouped_dt
DT = copy(DT) ## because of another call to grouped_dt!!!
## and sets key again - which is O(n) now as DT is checked if sorted first..
Here again, the ad-hoc solution is simply:
DT = copy(DT)
DT[, `:=`(...), by=<group_cols>]
It avoids two copies and the key setting. The only copy is there to satisfy dplyr's philosophy of not modifying objects in-place. So, this'll always be slower and take up twice the memory in dplyr.
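As a concrete sketch of that ad-hoc form, adding a per-group count (the column name grp_n is mine):
DT2 <- copy(dt)                            ## the one copy, to avoid modifying dt in place
DT2[, grp_n := .N, by = list(Var1, Var2)]  ## grouped update by reference, no setkey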
Similarly, copies on some joins can be avoided as I've commented here.
The NEWS item from dplyr v0.2 says:
- dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).
But clearly quite a few of the discussed cases haven't made it in.
So far, I've been writing about the performance tag under your question. That is, if you're looking for performance, you should avoid all the cases that make (unnecessary) copies (and set keys), until this is fixed.
In that spirit, for this particular case, the best answer I could come up with is to just call unique.data.table directly, in a dplyr-ish way:
tbl_dt(little) %>% unique(.)
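For example, a de-duplication step in the middle of a pipeline (the follow-on verbs are only for illustration) could look like:
tbl_dt(little) %>%
  unique(.) %>%
  group_by(Var1) %>%
  summarise(n = n())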