I have panel data (subject/year) for which I would like to only keep subjects who appear the maximum number of times per year. The data set is large so I am using the data.table package. Is there a more elegant solution than what I have tried below?
library(data.table)
DT <- data.table(SUBJECT=c(rep('John',3), rep('Paul',2),
rep('George',3), rep('Ringo',2),
rep('John',2), rep('Paul',4),
rep('George',2), rep('Ringo',4)),
YEAR=c(rep(2011,10), rep(2012,12)),
HEIGHT=rnorm(22),
WEIGHT=rnorm(22))
DT
DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, MAXCOUNT := max(COUNT), by='YEAR']
DT <- DT[COUNT==MAXCOUNT]
DT <- DT[, c('COUNT','MAXCOUNT') := NULL]
DT
I'm not sure you'll view this as elegant but how about :
DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, .SD[COUNT == max(COUNT)], by='YEAR']
That's essentially how to apply by
to the i
expression as @SenorO commented. You'll still need [,COUNT:=NULL]
afterwards but for one temporary column rather than two.
We do discourage .SD
though for speed reasons, but hopefully we'll get to this feature request soon so that advice can be dropped: FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged..
A different approach is as follows. It's faster and idiomatic but may be considered less elegant.
# Create a small aggregate table first. No need to use := on the big table.
i = DT[, .N, by='SUBJECT,YEAR']
# Find the even smaller subset. (Do as much as we can on the small aggregate.)
i = i[, .SD[N==max(N)], by=YEAR]
# Finally join the small subset of key values to the big table
setkey(DT, YEAR, SUBJECT)
DT[i]
Something similar is here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With