Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to order data within subgroups in data.table R

Tags:

r

data.table

Consider the following:

DT = data.table(a=sample(1:2), b=sample(1:1000,20))

How to display b, say the n highest values, by each a?

I am stucked in DT[,b,by=a][order(a,-b)].

Thanks!

like image 815
unmark1 Avatar asked Feb 23 '15 21:02

unmark1


People also ask

How do I order a data table in R?

To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the sorting variable by a minus sign to indicate DESCENDING order.

What does .SD do in data table?

SD stands for "Subset of Data. table". The dot before SD has no significance but doesn't let it clash with a user-defined column name.


1 Answers

The most elegant would be:

DT[order(-b),head(b,5),by=a]

In terms of pure performance:

DT[order(-b), indx := seq_len(.N), "a"][indx <= 5][,indx:=NULL][]

Or the one suggested by @Frank:

DT[DT[order(-b),.I[1:.N<=5],"a"]$V1]

Below the benchmark of all three above:

# devtools::install_github("jangorecki/dwtools")
library(dwtools) # to populate complex dataset
N <- 5e6
DT <- dw.populate(N, scenario="fact")
str(DT)
#Classes ‘data.table’ and 'data.frame': 5000000 obs. of  8 variables:
# $ cust_code: chr  "id010" "id076" "id024" "id081" ...
# $ prod_code: int  8234 5689 31198 35479 39140 37589 8184 39489 35266 3596 ...
# $ geog_code: chr  "OH" "NH" "TN" "MI" ...
# $ time_code: Date, format: "2012-03-11" "2014-02-10" "2012-11-05" "2013-01-30" ...
# $ curr_code: chr  "XRP" "HRK" "CAD" "BRL" ...
# $ amount   : num  486 382 695 470 749 ...
# $ value    : num  193454 33694 351418 84888 20673 ...

By cust_code column, uniqueN equal to 100:

system.time(DT[order(-time_code),head(.SD,5),"cust_code"])
#   user  system elapsed 
#  1.804   0.084   1.890 
system.time(DT[order(-time_code), indx := seq_len(.N),"cust_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.414   0.092   1.508 
system.time(DT[DT[order(-time_code),.I[1:.N<=5],"cust_code"]$V1])
#   user  system elapsed 
#  1.405   0.096   1.502 

If there are much more groups (prod_code column, uniqueN equal to 50000), then we can see the impact on the performance:

system.time(DT[order(time_code),head(.SD,5),"prod_code"])
#   user  system elapsed 
# 10.177   0.109  10.322
system.time(DT[order(time_code), indx := seq_len(.N),"prod_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.555   0.099   1.665 
system.time(DT[DT[order(time_code),.I[1:.N<=5],"prod_code"]$V1])
#   user  system elapsed 
#  1.697   0.064   1.764

Update on 2015-11-09:

With today's Arun commit e615532 the head and tail should be optimized under the hood.

like image 78
jangorecki Avatar answered Oct 13 '22 18:10

jangorecki