How to extract the first n rows per group?

Tags:

r

data.table

I have a data.table dt. This data.table is sorted first by column date (my grouping variable), then by column age:

    library(data.table)
    setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age"

    > dt
             date age     name
    1: 2000-01-01   3   Andrew
    2: 2000-01-01   4      Ben
    3: 2000-01-01   5  Charlie
    4: 2000-01-02   6     Adam
    5: 2000-01-02   7      Bob
    6: 2000-01-02   8 Campbell

My question is: is it possible to extract the first 2 rows for each unique date? Or, phrased more generally:

How to extract the first n rows within each group?

In this example, the result in dt.f would be:

    > dt.f = ???????? # function of dt to extract the first 2 rows per unique date
    > dt.f
             date age   name
    1: 2000-01-01   3 Andrew
    2: 2000-01-01   4    Ben
    3: 2000-01-02   6   Adam
    4: 2000-01-02   7    Bob

p.s. Here is the code to create the aforementioned data.table:

    install.packages("data.table")
    library(data.table)

    date <- c("2000-01-01","2000-01-01","2000-01-01",
              "2000-01-02","2000-01-02","2000-01-02")
    age  <- c(3,4,5,6,7,8)
    name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")

    dt <- data.table(date, age, name)
    setkeyv(dt, c("date","age")) # Sorts table first by column "date" then by "age"

716

asked May 01 '13 20:05

Contango

2 Answers

Yes, just use .SD and index it as needed (DT below stands for the question's dt):

    DT[, .SD[1:2], by=date]

             date age   name
    1: 2000-01-01   3 Andrew
    2: 2000-01-01   4    Ben
    3: 2000-01-02   6   Adam
    4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead, for speed:

    DT[DT[, .I[1:2], by = date]$V1]

    # using a slightly larger data set
    > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
    Unit: milliseconds
        expr       min        lq    median        uq      max neval
     SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
      IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200
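One caveat, which is not part of the original answer: indexing with 1:2 assumes every group has at least 2 rows. For a shorter group, both .SD[1:2] and .I[1:2] pad the result with NA rows. Below is a minimal sketch of how this could be avoided with head() or seq_len(min(.N, n)). The table dt_small and the variable n are made up for illustration.

```r
library(data.table)

# Hypothetical example data: group "b" has only one row.
dt_small <- data.table(
  date = c("a", "a", "a", "b"),
  age  = c(3, 4, 5, 6)
)
n <- 2L

# .SD[1:n] pads groups shorter than n with NA rows.
padded <- dt_small[, .SD[1:n], by = date]

# head() returns at most n rows per group, so no padding occurs.
first_n <- dt_small[, head(.SD, n), by = date]

# Index-based variant: collect the row numbers per group, then subset once.
idx <- dt_small[, .I[seq_len(min(.N, n))], by = date]$V1
first_n_idx <- dt_small[idx]
```

The index-based variant follows the same idea as the faster .I approach above. Whether it is also faster on a given data set would need to be benchmarked.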

130

answered Sep 25 '22 15:09

Ricardo Saporta

This is probably not the fastest method, but it is more flexible because it does not require the table to be keyed. Changing the range of Row.ID values selected adjusts how many leading rows are returned per group.

    dt[, .( age
          , name
          , Row.ID = rank(age)
          )
       , by = list(date)
       ][Row.ID %in% (1:2), .( date
                             , age
                             , name
                             )]
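A note on ties, offered as a sketch rather than as part of the original answer: rank() defaults to ties.method = "average", so tied ages get fractional ranks such as 1.5, and a filter like Row.ID %in% (1:2) would then drop them. One possible workaround is data.table's frank() with ties.method = "first", or sorting in i and taking head() per group. The table dt_ties and the variable n are hypothetical.

```r
library(data.table)

# Hypothetical unsorted, unkeyed data with a tie in age within the first date.
dt_ties <- data.table(
  date = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"),
  age  = c(5, 3, 3, 8, 6),
  name = c("Charlie", "Andrew", "Ben", "Campbell", "Adam")
)
n <- 2L

# ties.method = "first" assigns distinct integer ranks, breaking ties by position.
ranked <- dt_ties[, .(age, name, Row.ID = frank(age, ties.method = "first")),
                  by = date]
first_n_rank <- ranked[Row.ID <= n, .(date, age, name)]

# Alternative: order the rows in i, then keep the first n rows of each group.
first_n_order <- dt_ties[order(date, age), head(.SD, n), by = date]
```

The rank-based result keeps rows in their original order within each group, while the order()-based result is sorted by age.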

answered Sep 21 '22 15:09

hannes101
