Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use `data.table` to get first of subgroup based on a variable

Tags:

r

data.table

Consider a data set consisting of a grouping variable (here id) and an ordered variable (here date)

(df <- data.frame(
  id = rep(1:2,2),
  date = 4:1
))
#   id date
# 1  1    4
# 2  2    3
# 3  1    2
# 4  2    1

I'm wondering what the easiest way is in data.table to do the equivalent of this dplyr code:

library(dplyr)
df %>%
  group_by(id) %>%
  filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
# 
#   id date
# 1  1    2
# 2  2    1

i.e. for each id get the first according to date.

Based on a similar stackoverflow question (Create an "index" for each element of a group with data.table), I came up with this

library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
  dt[id==k, index := 1:.N]
}
dt[index==1,]

But it seems like there should be a one-liner for this. Being unfamiliar with data.table I thought something like this

dt[,,mult="first", by=id]

should work, but alas! The last bit of code seems like it should group by id and then take the first (which within id would be determined by date since I've set the keys in this way.)


EDIT

Thanks to Ananda Mahto, this one-liner will now be in my data.table repertoire

dt[,.SD[1], by=id]
#    id date
# 1:  1    2
# 2:  2    1
like image 321
Henrik Renlund Avatar asked Feb 12 '23 15:02

Henrik Renlund


2 Answers

Working directly with your source data.frame, you can try:

setkey(as.data.table(df), id, date)[, .SD[1], by = id]
#    id date
# 1:  1    2
# 2:  2    1

Extending your original idea, you can just do:

dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
#    id date index
# 1:  1    2     1
# 2:  2    1     1

It might be that at a certain scale, David is correct about head vs [1], but I'm not sure what scale that would be.

set.seed(1)
nrow <- 10000
ncol <- 20

df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))

fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]

library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#  fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425   100
#  fun2()  4.474345  4.554527  4.948255  4.620596  4.965912  8.17852   100
like image 169
A5C1D2H2I1M1N2O1R2T1 Avatar answered Feb 14 '23 11:02

A5C1D2H2I1M1N2O1R2T1


Here's another option using data.tables binary search

setkey(dt[, indx := seq_len(.N), by = id], indx)[J(1)]
#    id date indx
# 1:  1    2    1
# 2:  2    1    1

Some benchmarks: It seems that all the methods perform more or less the same, but on huge data set (1e+06*1e+2) binrary search wins

set.seed(1)
nrow <- 1e6
ncol <- 1e2

df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
library(data.table)

funAM1 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
funAM2 <- function() setkey(as.data.table(df), X1, X2)[, index := sequence(.N), by = X1][index == 1]
funDA1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
funDA2 <- function() setkey(as.data.table(df)[, indx := seq_len(.N), by = X1], X1)[J(1)]

library(microbenchmark)
Res <- microbenchmark(funAM1(), funAM2(), funDA1(), funDA2())
Res
# Unit: milliseconds
#     expr      min       lq   median       uq      max neval
# funAM1() 737.5690 758.3015 771.9344 794.1417 910.1019   100
# funAM2() 631.7822 693.8286 704.6912 729.6960 806.5556   100
# funDA1() 757.0327 772.4353 784.3107 810.0759 938.6344   100
# funDA2() 564.7291 578.1089 587.6470 611.7269 740.4077   100
boxplot(Res)

enter image description here

like image 25
David Arenburg Avatar answered Feb 14 '23 10:02

David Arenburg