I have a data.table dt
. This data.table is sorted first by column date
(my grouping variable), then by column age
:
library(data.table) setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age" > dt date age name 1: 2000-01-01 3 Andrew 2: 2000-01-01 4 Ben 3: 2000-01-01 5 Charlie 4: 2000-01-02 6 Adam 5: 2000-01-02 7 Bob 6: 2000-01-02 8 Campbell
My question is: I am wondering if it's possible to extract the first 2 rows for each unique date? Or phrased more generally:
How to extract the first n rows within each group?
In this example, the result in dt.f
would be:
> dt.f = ???????? # function of dt to extract the first 2 rows per unique date > dt.f date age name 1: 2000-01-01 3 Andrew 2: 2000-01-01 4 Ben 3: 2000-01-02 6 Adam 4: 2000-01-02 7 Bob
p.s. Here is the code to create the aforementioned data.table:
install.packages("data.table") library(data.table) date <- c("2000-01-01","2000-01-01","2000-01-01", "2000-01-02","2000-01-02","2000-01-02") age <- c(3,4,5,6,7,8) name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell") dt <- data.table(date, age, name) setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age"
To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ). You assign the row numbers within each group (i.e., year).
To set the row numbers, use the handy row_number() function. This assigns sequential integers starting at one to each row, according to the sort you specify. You define this order in the over clause. This is also where you can get the row numbers to start at one for each group.
Typically, these are accomplished using the TOP or LIMIT clause. Problem is, Top N result sets are limited to the highest values in the table, without any grouping. The GROUP BY clause can help with that, but it is limited to the single top result for each group.
Think of .N as a variable for the number of instances. For example: dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7) dt[.N] # returns the last row # a b # 1: C 7.
yep, just use .SD
and index it as needed.
DT[, .SD[1:2], by=date] date age name 1: 2000-01-01 3 Andrew 2: 2000-01-01 4 Ben 3: 2000-01-02 6 Adam 4: 2000-01-02 7 Bob
@eddi's suggestion is spot on:
Use this instead, for speed:
DT[DT[, .I[1:2], by = date]$V1] # using a slightly larger data set > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L) Unit: milliseconds expr min lq median uq max neval SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719 200 IStyle 1.675185 2.018773 2.168818 2.269292 11.31072 200
Probably not the fastest method, but it provides some flexibility if you don't use keyed variables and need some more flexibility. By changing the selected Row.ID
the number of first objects can be adjusted as needed.
dt[, .( age , name , Row.ID = rank(age) ) , by = list(date)][Row.ID %in% (1:2), .(date , age , name )]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With