R data.table count panel data

I have panel data (subject/year) for which I would like to only keep subjects who appear the maximum number of times per year. The data set is large so I am using the data.table package. Is there a more elegant solution than what I have tried below?

library(data.table)

DT <- data.table(SUBJECT=c(rep('John',3), rep('Paul',2), 
                           rep('George',3), rep('Ringo',2), 
                           rep('John',2), rep('Paul',4), 
                           rep('George',2), rep('Ringo',4)), 
                 YEAR=c(rep(2011,10), rep(2012,12)), 
                 HEIGHT=rnorm(22), 
                 WEIGHT=rnorm(22))
DT

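# Rows per subject within each year
DT[, COUNT := .N, by='SUBJECT,YEAR']
# Largest per-subject row count within each year
DT[, MAXCOUNT := max(COUNT), by='YEAR']

# Keep only subjects that reach the yearly maximum, then drop the helper columns
DT <- DT[COUNT==MAXCOUNT]
DT[, c('COUNT','MAXCOUNT') := NULL]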
DT
asked Sep 05 '13 20:09 by user1491868


1 Answer

I'm not sure you'll view this as elegant, but how about:

DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, .SD[COUNT == max(COUNT)], by='YEAR']

That's essentially how to apply by to the i expression, as @SenorO commented. You'll still need DT[, COUNT := NULL] afterwards, but only for one temporary column rather than two.
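For reference, here is a minimal end-to-end sketch of the two lines above, including the cleanup step. The compact rep(..., times = ...) construction, the set.seed() call and the name res are assumptions added only to make the example self-contained and reproducible; they are not part of the original question or answer.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
DT <- data.table(
  SUBJECT = rep(c('John', 'Paul', 'George', 'Ringo',
                  'John', 'Paul', 'George', 'Ringo'),
                times = c(3, 2, 3, 2, 2, 4, 2, 4)),
  YEAR    = c(rep(2011, 10), rep(2012, 12)),
  HEIGHT  = rnorm(22),
  WEIGHT  = rnorm(22)
)

# Temporary per-(SUBJECT, YEAR) row count, added by reference
DT[, COUNT := .N, by = 'SUBJECT,YEAR']

# Within each YEAR, keep the rows whose count equals that year's maximum
res <- DT[, .SD[COUNT == max(COUNT)], by = 'YEAR']

# Drop the helper column from the result and from the original table
res[, COUNT := NULL]
DT[, COUNT := NULL]
```

With this sample data the 2011 rows kept should belong to John and George, and the 2012 rows to Paul and Ringo.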

We do discourage .SD for speed reasons, though. Hopefully we'll get to this feature request soon so that advice can be dropped: FR#2330, "Optimize .SD[i] query to keep the elegance but make it faster unchanged".
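Until then, one commonly used way to avoid materialising .SD for every group is to collect row numbers with .I and subset once. This is a different technique from the one shown above, not something the feature request itself describes. A sketch, assuming the same sample layout as the question (HEIGHT and WEIGHT are replaced with deterministic placeholder values, and idx and res are hypothetical names):

```r
library(data.table)

DT <- data.table(
  SUBJECT = rep(c('John', 'Paul', 'George', 'Ringo',
                  'John', 'Paul', 'George', 'Ringo'),
                times = c(3, 2, 3, 2, 2, 4, 2, 4)),
  YEAR    = c(rep(2011, 10), rep(2012, 12)),
  HEIGHT  = seq_len(22),        # placeholder values, not real measurements
  WEIGHT  = rev(seq_len(22))    # placeholder values, not real measurements
)

DT[, COUNT := .N, by = 'SUBJECT,YEAR']

# .I holds the row numbers of DT for the current group, so this returns
# the positions of the rows to keep; the unnamed j result is called V1.
idx <- DT[, .I[COUNT == max(COUNT)], by = 'YEAR']$V1

# Subset the big table once, using plain integer row numbers
res <- DT[idx]
res[, COUNT := NULL]
DT[, COUNT := NULL]
```

Only the integer index vector is built per group here, which is why this form is often suggested when .SD[i] is too slow on large tables.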

A different approach is as follows. It's faster and idiomatic but may be considered less elegant.

# Create a small aggregate table first. No need to use := on the big table.
i = DT[, .N, by='SUBJECT,YEAR']

# Find the even smaller subset. (Do as much as we can on the small aggregate.)
i = i[, .SD[N==max(N)], by=YEAR]

# Finally join the small subset of key values to the big table
setkey(DT, YEAR, SUBJECT)
DT[i]
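If you would rather not set a key on the big table, the same join can be sketched with the on= argument, assuming a data.table version recent enough to support it. The names agg, keep and res are hypothetical, and HEIGHT and WEIGHT hold placeholder values so the example is deterministic.

```r
library(data.table)

DT <- data.table(
  SUBJECT = rep(c('John', 'Paul', 'George', 'Ringo',
                  'John', 'Paul', 'George', 'Ringo'),
                times = c(3, 2, 3, 2, 2, 4, 2, 4)),
  YEAR    = c(rep(2011, 10), rep(2012, 12)),
  HEIGHT  = seq_len(22),        # placeholder values, not real measurements
  WEIGHT  = rev(seq_len(22))    # placeholder values, not real measurements
)

# Small aggregate: one row per (SUBJECT, YEAR) with its row count N
agg <- DT[, .N, by = 'SUBJECT,YEAR']

# Even smaller subset: the subjects that reach each year's maximum count
keep <- agg[, .SD[N == max(N)], by = 'YEAR']

# Join the small subset to the big table by name; DT is left unkeyed and unsorted
res <- DT[keep, on = c('YEAR', 'SUBJECT')]

# The join carries N over from keep; drop it if it is not wanted
res[, N := NULL]
```

Using .SD on the aggregate is cheap because it has only one row per subject and year; the large table is touched only by the grouping and the final join.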

Something similar is here.

answered Nov 05 '22 07:11 by Matt Dowle