Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R group select conditioned on value in data.table

Hi I want to select a group of values conditioned on a value in a data table.

Specifically I would like to select all columns grouped by date and id for all Positive values where e == 1

   id   date     e       logret 
   7 2011-07-29  1   -0.0272275211      
   7 2011-07-29  2    0.0034229025      
   7 2011-07-29  3    0.0042622177      
   8 2011-07-29  1    0.0035662770      
   8 2011-07-29  2   -0.0015268474 
   8 2011-07-29  3    0.0013333333
   7 2011-07-30  1    0.0044444444      
   7 2011-07-30  2   -0.0001111111 
   7 2011-07-30  3    0.0013333333

here all elements for id 8 and date 2011-07-29 and all elements of id 7 for date 2011-07-30 will be selected since the logret for e == 1 is > 0 where as all elements of id 7 on 2011-07-29 are ignored since the first logret (where e==1) is < 0

Ans:

   8 2011-07-29  1    0.0035662770      
   8 2011-07-29  2   -0.0015268474 
   8 2011-07-29  3    0.0013333333
   7 2011-07-30  1    0.0044444444      
   7 2011-07-30  2   -0.0001111111 
   7 2011-07-30  3    0.0013333333    

in sql I'd use some kind of subselect to achieve this. I would :

1) Select the id and date where e=1 and logret > 0
2) Select * join on results of subselect

I think data.table can do this as well, but I'm finding it tricky to express it in data.table terms. Specifically, I can replicate step 1, but can't do the join part in step 2.

pos <- DT[e==1][logret > 0]

But can't join the pos values back into my DT

like image 319
user1480926 Avatar asked Jul 26 '12 11:07

user1480926


2 Answers

It's not pretty, and it's not in data.table, but this seems like it would work:

# Recreate your data
df = read.table(header=TRUE, text="id   date    e       logret 
    7 2011-07-29 1   -0.0272275211      
    7 2011-07-29 2    0.0034229025      
    7 2011-07-29 2    0.0042622177      
    8 2011-07-29 1    0.0035662770      
    8 2011-07-29 2   -0.0015268474 
    8 2011-07-29 3    0.0013333333")
df[which(df$id != df$id[which(df$e == 1 & df$logret < 0)]),]
#   id       date e       logret
# 4  8 2011-07-29 1  0.003566277
# 5  8 2011-07-29 2 -0.001526847
# 6  8 2011-07-29 3  0.001333333
#
## Or the equivalent in "positive" terms
#
# df[which(df$id == df$id[which(df$e == 1 & df$logret > 0)]),]

Update based on comments and new sample data

Just off the top of my head (I have not had any experience with the data.table package; it's on my "to learn" list). Here's a possible solution:

temp = split(df, df$date)
lapply(temp, 
       function(x) 
         x[which(x$id == x$id[which(x$e == 1 & x$logret > 0)]),])
# $`2011-07-29`
#   id       date e       logret
# 4  8 2011-07-29 1  0.003566277
# 5  8 2011-07-29 2 -0.001526847
# 6  8 2011-07-29 3  0.001333333
# 
# $`2011-07-30`
#   id       date e        logret
# 7  7 2011-07-30 1  0.0044444444
# 8  7 2011-07-30 2 -0.0001111111
# 9  7 2011-07-30 3  0.0013333333

Update 2

It is also worth trying merge:

merge(df, df[which(df$e == 1 & df$logret > 0), c(1, 2)])
#   id       date e        logret
# 1  7 2011-07-30 1  0.0044444444
# 2  7 2011-07-30 2 -0.0001111111
# 3  7 2011-07-30 3  0.0013333333
# 4  8 2011-07-29 1  0.0035662770
# 5  8 2011-07-29 2 -0.0015268474
# 6  8 2011-07-29 3  0.0013333333
like image 101
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 11 '22 15:10

A5C1D2H2I1M1N2O1R2T1


I have solved it in a round about way :

pos <- DT[e==1][logret > 0, list(id,date)]
ans <- DT[J(pos$id,pos$date)];

would be interested to hear any more elegant 1 line ways to do it in data.table.


EDIT from Matthew :

If key(DT) is already (id,date) then a one liner would be :

DT[DT[e==1 & logret>0, list(id,date)]]

and that should be faster, too. If you can rely on id and date being the first 2 columns of DT, then it can be shortened to :

DT[DT[e==1 & logret>0]]
like image 29
user1480926 Avatar answered Oct 11 '22 15:10

user1480926