Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

indexing a dataframe based on a vector subset

I have a very large dataframe that I'd like to save a subset of based on a certain subset of a vector. In short I have something like this:

> id<-c("ID1","ID2","ID2","ID3","ID4","ID4","ID4","ID4","ID4")
> status<-c("flag","flag","none","none","flag","flag","flag","none","flag")
> misc1ofmany<-c("etc1","etc2","etc3","etc4","etc5","etc6","etc7","etc8","etc9")
> df = data.frame(id, status, misc1ofmany) ; df
   id status misc1ofmany
1 ID1   flag        etc1
2 ID2   flag        etc2
3 ID2   none        etc3
4 ID3   none        etc4
5 ID4   flag        etc5
6 ID4   flag        etc6
7 ID4   flag        etc7
8 ID4   none        etc8
9 ID4   flag        etc9

I'd like to have all the rows of IDs that have been flagged, inclusive of their non-flagged sessions. Right now I'm trying to get the index of the other IDs by grep and plugging that back into a new df. Actually as I wrote this out I figured out that grepl might be easier to work with:

> flaggedIDs <- unique(as.vector(df$id[grep("flag",df$status)]))
> flaggedIDs.allStats.Index <- mapply(grepl,df$id,MoreArgs=list(x=flaggedIDs)) 
> flaggedIDs.allStats.Index
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
[1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

however I just want to get to here:

> flaggedIDsdf <- df[flaggedIDs.allStats.Index] ; flaggedIDsdf
   id status misc1ofmany
1 ID1   flag        etc1
2 ID2   flag        etc2
3 ID2   none        etc3
4 ID4   flag        etc5
5 ID4   flag        etc6
6 ID4   flag        etc7
7 ID4   none        etc8
8 ID4   flag        etc9

I feel like this should be simpler than I am making this out to be, however I've tried a number of possibilities to figure this out to no avail. writing out the question helped this become a clearer/simpler problem in my mind (it looks like I'm just missing one step now), but now I'm also wondering if there is a more efficient way of going about this.

like image 977
stites Avatar asked Nov 29 '12 00:11

stites


3 Answers

data.table is really useful here for it's elegant syntax and memory efficiency

library(data.table)

DT <- data.table(df)

setkey(DT, 'id')

DT[DT[status=='flag', list(id = unique(id))]]

    id status misc1ofmany
1: ID1   flag        etc1
2: ID2   flag        etc2
3: ID2   none        etc3
4: ID4   flag        etc5
5: ID4   flag        etc6
6: ID4   flag        etc7
7: ID4   none        etc8
8: ID4   flag        etc9

Or even more compactly

DT[J(unique(id[status=='flag']))]

These two approaches above use the fact that the i component is evaluated first by data.table. Keying by id means we can do a self-join to extract only those ids where status=='flag.


Or, using by

DT[,if(any(status=='flag')){.SD} ,by=id]

This goes through the subsets of DT by id, and returns .SD (the data.table of the subset) if any(status=='flag') (within that subset).

like image 118
mnel Avatar answered Oct 17 '22 20:10

mnel


This seems to work:

df[df$id %in% df$id[df$status == "flag"],]

Plain english: from the data frame, select each row whose id element is in the vector of id elements whose status is flagged in any row.

like image 2
James Porter Avatar answered Oct 17 '22 21:10

James Porter


I believe what you are trying to do may be handled in one line

df[which(df$id %in% df$id[df$status=="flag"]), ]

result:

   id status misc1ofmany
1 ID1   flag        etc1
2 ID2   flag        etc2
3 ID2   none        etc3
5 ID4   flag        etc5
6 ID4   flag        etc6
7 ID4   flag        etc7
8 ID4   none        etc8
9 ID4   flag        etc9
like image 1
Ricardo Saporta Avatar answered Oct 17 '22 22:10

Ricardo Saporta