Consider a data set consisting of a grouping variable (here id) and an ordered variable (here date):
(df <- data.frame(
  id = rep(1:2,2),
  date = 4:1
))
#   id date
# 1  1    4
# 2  2    3
# 3  1    2
# 4  2    1
I'm wondering what the easiest way is in data.table to do the equivalent of this dplyr code:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
# 
#   id date
# 1  1    2
# 2  2    1
i.e., for each id, get the first row according to date (the one with the smallest date).
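As an aside, in more recent dplyr versions (roughly 1.0.0 onward) the same filter can be written with slice_min(); a hedged sketch, equivalent to min_rank(date) == 1 since both keep ties by default:
library(dplyr)
# keep the row(s) with the smallest date within each id;
# with_ties = TRUE (the default) matches min_rank()'s handling of ties
df %>%
  group_by(id) %>%
  slice_min(date, n = 1)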
Based on a similar Stack Overflow question (Create an "index" for each element of a group with data.table), I came up with this:
library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
  dt[id==k, index := 1:.N]
}
dt[index==1,]
But it seems like there should be a one-liner for this. Being unfamiliar with data.table, I thought something like this
dt[,,mult="first", by=id]
should work, but alas, it doesn't. That last bit of code looks like it should group by id and then take the first row (which, within each id, would be determined by date, since I've set the keys that way).
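As it turns out, mult = "first" does exist, but it applies to the rows matched by a join in i, not to by groups. A hedged sketch of using it for this problem, relying on the (id, date) key set above:
dt <- data.table(df)
setkey(dt, id, date)
# join on the distinct ids; mult = "first" keeps only the first matching
# row per id, which is the smallest date because of the key
dt[J(unique(dt$id)), mult = "first"]
#    id date
# 1:  1    2
# 2:  2    1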
EDIT
Thanks to Ananda Mahto, this one-liner will now be in my data.table repertoire
dt[,.SD[1], by=id]
#    id date
# 1:  1    2
# 2:  2    1
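Two other ways to spell the same idea, assuming a reasonably recent data.table and the (id, date) key from above (a sketch, not from the original answers):
unique(dt, by = "id")        # keeps the first row for each distinct id
dt[, head(.SD, 1), by = id]  # explicit head() form, compared in the benchmarks below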
Working directly with your source data.frame, you can try:
setkey(as.data.table(df), id, date)[, .SD[1], by = id]
#    id date
# 1:  1    2
# 2:  2    1
Extending your original idea, you can just do:
dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
#    id date index
# 1:  1    2     1
# 2:  2    1     1
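If the helper column shouldn't appear in the result, it can be dropped in the same chain (a minor variation, not from the original answer):
# as above, then remove the temporary column; the trailing [] prints the result
dt[, index := sequence(.N), by = id][index == 1][, index := NULL][]
#    id date
# 1:  1    2
# 2:  2    1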
It might be that at a certain scale David is correct about head(.SD, 1) vs .SD[1], but I'm not sure what scale that would be.
set.seed(1)
nrow <- 10000
ncol <- 20
df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#  fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425   100
#  fun2()  4.474345  4.554527  4.948255  4.620596  4.965912  8.17852   100
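For completeness, data.table also ships a first() helper that returns the first row of a data.frame or data.table; a sketch of adding it to the comparison (timings not shown, and no claim about where it would land):
fun3 <- function() setkey(as.data.table(df), X1, X2)[, first(.SD), by = X1]
microbenchmark(fun1(), fun2(), fun3())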
Here's another option, using data.table's binary search:
setkey(dt[, indx := seq_len(.N), by = id], indx)[J(1)]
#    id date indx
# 1:  1    2    1
# 2:  2    1    1
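A related idiom skips the helper column altogether by collecting each group's first row number with .I and subsetting once. A sketch on a fresh copy keyed by (id, date) (not part of the original answer):
dt2 <- setkey(data.table(df), id, date)
# .I[1] is the global row number of the first (smallest-date) row per id
dt2[dt2[, .I[1], by = id]$V1]
#    id date
# 1:  1    2
# 2:  2    1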
Some benchmarks:
It seems that all the methods perform more or less the same, but on a huge data set (1e6 rows by 1e2 columns) the binary search approach wins:
set.seed(1)
nrow <- 1e6
ncol <- 1e2
df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
library(data.table)
funAM1 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
funAM2 <- function() setkey(as.data.table(df), X1, X2)[, index := sequence(.N), by = X1][index == 1]
funDA1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
funDA2 <- function() setkey(as.data.table(df)[, indx := seq_len(.N), by = X1], indx)[J(1)]
library(microbenchmark)
Res <- microbenchmark(funAM1(), funAM2(), funDA1(), funDA2())
Res
# Unit: milliseconds
#     expr      min       lq   median       uq      max neval
# funAM1() 737.5690 758.3015 771.9344 794.1417 910.1019   100
# funAM2() 631.7822 693.8286 704.6912 729.6960 806.5556   100
# funDA1() 757.0327 772.4353 784.3107 810.0759 938.6344   100
# funDA2() 564.7291 578.1089 587.6470 611.7269 740.4077   100
boxplot(Res)
