Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract the first n rows per group?

Tags:

r

data.table

I have a data.table dt. This data.table is sorted first by column date (my grouping variable), then by column age:

library(data.table) setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age" > dt          date age     name 1: 2000-01-01   3   Andrew 2: 2000-01-01   4      Ben 3: 2000-01-01   5  Charlie 4: 2000-01-02   6     Adam 5: 2000-01-02   7      Bob 6: 2000-01-02   8 Campbell 

My question is: I am wondering if it's possible to extract the first 2 rows for each unique date? Or phrased more generally:

How to extract the first n rows within each group?

In this example, the result in dt.f would be:

> dt.f = ???????? # function of dt to extract the first 2 rows per unique date > dt.f          date age   name 1: 2000-01-01   3 Andrew 2: 2000-01-01   4    Ben 3: 2000-01-02   6   Adam 4: 2000-01-02   7    Bob 

p.s. Here is the code to create the aforementioned data.table:

install.packages("data.table") library(data.table) date <- c("2000-01-01","2000-01-01","2000-01-01",     "2000-01-02","2000-01-02","2000-01-02") age <- c(3,4,5,6,7,8) name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell") dt <- data.table(date, age, name) setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age" 
like image 716
Contango Avatar asked May 01 '13 20:05

Contango


People also ask

How to select the first row per group in SQL?

To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ). You assign the row numbers within each group (i.e., year).

How to select the top n rows per group with SQL in Oracle Database?

To set the row numbers, use the handy row_number() function. This assigns sequential integers starting at one to each row, according to the sort you specify. You define this order in the over clause. This is also where you can get the row numbers to start at one for each group.

Can we use top with Group By clause?

Typically, these are accomplished using the TOP or LIMIT clause. Problem is, Top N result sets are limited to the highest values in the table, without any grouping. The GROUP BY clause can help with that, but it is limited to the single top result for each group.

What is .n in data table?

Think of .N as a variable for the number of instances. For example: dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7) dt[.N] # returns the last row # a b # 1: C 7.


2 Answers

yep, just use .SD and index it as needed.

  DT[, .SD[1:2], by=date]             date age   name   1: 2000-01-01   3 Andrew   2: 2000-01-01   4    Ben   3: 2000-01-02   6   Adam   4: 2000-01-02   7    Bob 

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead, for speed:

  DT[DT[, .I[1:2], by = date]$V1]    # using a slightly larger data set   > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)   Unit: milliseconds       expr       min        lq    median        uq      max neval    SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200     IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200 
like image 130
Ricardo Saporta Avatar answered Sep 25 '22 15:09

Ricardo Saporta


Probably not the fastest method, but it provides some flexibility if you don't use keyed variables and need some more flexibility. By changing the selected Row.ID the number of first objects can be adjusted as needed.

dt[, .( age         , name         , Row.ID = rank(age)         )    , by = list(date)][Row.ID %in% (1:2), .(date                                            , age                                            , name                                            )] 
like image 45
hannes101 Avatar answered Sep 21 '22 15:09

hannes101