Consider the following: <code>DT = data.table(a=sample(1:2), b=sample(1:1000,20))</code> How can I display, for each value of a, the n highest values of b? I am stuck at <code>DT[,b,by=a][order(a,-b)]</code>. Thanks!

How to order data within subgroups in data.table R

1 Answer

The most elegant solution would be:

DT[order(-b),head(b,5),by=a]
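
To make the behaviour of this one-liner concrete, here is a minimal sketch on a small hand-made table. The data values and the variable `n` are assumptions chosen for illustration; they are not part of the original question.

```r
library(data.table)

# Toy data (illustrative values): two groups of four rows each
DT <- data.table(a = rep(1:2, each = 4),
                 b = c(10, 40, 20, 30, 5, 25, 15, 35))
n <- 2  # how many of the highest values to keep per group

# Sort all rows by descending b, then take the first n values of b in each group
top <- DT[order(-b), head(b, n), by = a]
print(top)
#    a V1
# 1: 1 40
# 2: 1 30
# 3: 2 35
# 4: 2 25

# The result column is called V1 by default; naming it in j keeps the name b
top_named <- DT[order(-b), .(b = head(b, n)), by = a]
```

Groups appear in the order in which they are first met in the sorted data, so the group holding the overall maximum comes first.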

In terms of pure performance:

DT[order(-b), indx := seq_len(.N), "a"][indx <= 5][,indx:=NULL][]
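
The chained call above can be read as three steps. The sketch below spells them out on a small hand-made table (the data values and `n` are assumptions for illustration). One thing worth noticing is that `:=` modifies `DT` by reference, so the helper column `indx` is only removed from the filtered result and stays in the original table unless it is dropped there too.

```r
library(data.table)

# Toy data (illustrative values)
DT <- data.table(a = rep(1:2, each = 4),
                 b = c(10, 40, 20, 30, 5, 25, 15, 35))
n <- 2

# Step 1: number the rows within each group, counting from the largest b.
# This adds indx to DT itself (by reference).
DT[order(-b), indx := seq_len(.N), by = a]
"indx" %in% names(DT)  # TRUE: the helper column now lives in DT

# Step 2: keep ranks 1..n. The subset is a new table in DT's original row order.
top <- DT[indx <= n]

# Step 3: drop the helper column from the result ...
top[, indx := NULL]
# ... and, if DT should stay unchanged, from the original as well
DT[, indx := NULL]

print(top)
#    a  b
# 1: 1 40
# 2: 1 30
# 3: 2 25
# 4: 2 35
```

Unlike the `head()` version, the rows come back in their original order rather than sorted by `b` within each group.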

Or the one suggested by @Frank:

DT[DT[order(-b),.I[1:.N<=5],"a"]$V1]
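
As a rough check that the three approaches select the same rows, the following sketch runs them side by side on random data. The seed, the table, `n`, and the helper `canon()` are assumptions introduced here for illustration. The third approach is split in two so the intermediate vector of row numbers is visible.

```r
library(data.table)

set.seed(1)
DT <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))
n <- 5

# 1) head() per group on the sorted rows (whole rows via .SD)
r1 <- DT[order(-b), head(.SD, n), by = a]

# 2) rank within group, filter, drop the helper column;
#    run on a copy so that := does not add indx to DT
r2 <- copy(DT)[order(-b), indx := seq_len(.N), by = a][indx <= n][, indx := NULL][]

# 3) collect the row numbers (.I) of the top n rows per group, then subset once
idx <- DT[order(-b), .I[seq_len(.N) <= n], by = a]$V1
r3 <- DT[idx]

# The row order differs between approaches, so compare after sorting
canon <- function(x) x[order(a, -b)]
same_rows <- identical(canon(r1)$b, canon(r2)$b) &&
  identical(canon(r1)$b, canon(r3)$b)
print(same_rows)
```

Because `sample()` draws without replacement here, there are no ties in `b`; with ties, which of the tied rows is kept depends on the sort order.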

Below is a benchmark of all three approaches. Note that it uses head(.SD,5) to return whole rows rather than head(b,5):

# devtools::install_github("jangorecki/dwtools")
library(dwtools) # to populate complex dataset
N <- 5e6
DT <- dw.populate(N, scenario="fact")
str(DT)
#Classes ‘data.table’ and 'data.frame': 5000000 obs. of  8 variables:
# $ cust_code: chr  "id010" "id076" "id024" "id081" ...
# $ prod_code: int  8234 5689 31198 35479 39140 37589 8184 39489 35266 3596 ...
# $ geog_code: chr  "OH" "NH" "TN" "MI" ...
# $ time_code: Date, format: "2012-03-11" "2014-02-10" "2012-11-05" "2013-01-30" ...
# $ curr_code: chr  "XRP" "HRK" "CAD" "BRL" ...
# $ amount   : num  486 382 695 470 749 ...
# $ value    : num  193454 33694 351418 84888 20673 ...

Grouping by the cust_code column (100 unique values):

system.time(DT[order(-time_code),head(.SD,5),"cust_code"])
#   user  system elapsed 
#  1.804   0.084   1.890 
system.time(DT[order(-time_code), indx := seq_len(.N),"cust_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.414   0.092   1.508 
system.time(DT[DT[order(-time_code),.I[1:.N<=5],"cust_code"]$V1])
#   user  system elapsed 
#  1.405   0.096   1.502

With many more groups (prod_code column, 50000 unique values), the performance cost of head(.SD,5) becomes visible:

system.time(DT[order(time_code),head(.SD,5),"prod_code"])
#   user  system elapsed 
# 10.177   0.109  10.322
system.time(DT[order(time_code), indx := seq_len(.N),"prod_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.555   0.099   1.665 
system.time(DT[DT[order(time_code),.I[1:.N<=5],"prod_code"]$V1])
#   user  system elapsed 
#  1.697   0.064   1.764
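
The benchmark above depends on the dwtools package from GitHub. For readers who only want to try the comparison, here is a smaller sketch on synthetic data built with base R functions. The column names `few`, `many`, `x` and the helpers `top_head()` and `top_idx()` are hypothetical names made up for this example, and the table is far smaller than the 5e6 rows used above. Timings will vary by machine and data.table version, and given the update note below the gap may not reproduce on newer versions.

```r
library(data.table)

set.seed(42)
N <- 1e5  # deliberately small so the sketch runs in seconds
DT <- data.table(
  few  = sample(100L, N, replace = TRUE),       # about 100 groups
  many = sample(N %/% 10L, N, replace = TRUE),  # about 10,000 groups
  x    = runif(N)
)

# Top n rows per group via head(.SD, n)
top_head <- function(dt, grp_col, n = 5) {
  dt[order(-x), head(.SD, n), by = grp_col]
}

# Top n rows per group via row numbers (.I) and a single subset
top_idx <- function(dt, grp_col, n = 5) {
  dt[dt[order(-x), .I[seq_len(.N) <= n], by = grp_col]$V1]
}

for (g in c("few", "many")) {
  cat("grouping column:", g, "\n")
  print(system.time(top_head(DT, g)))
  print(system.time(top_idx(DT, g)))
}
```

A single `system.time()` call is noisy; repeating each call several times gives a steadier picture.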

Update on 2015-11-09:

As of Arun's commit e615532, made the same day, head and tail should be optimized internally by data.table.

answered Oct 13 '22 18:10

jangorecki
