I have a very large dataframe that I'd like to save a subset of based on a certain subset of a vector. In short I have something like this:
> id<-c("ID1","ID2","ID2","ID3","ID4","ID4","ID4","ID4","ID4")
> status<-c("flag","flag","none","none","flag","flag","flag","none","flag")
> misc1ofmany<-c("etc1","etc2","etc3","etc4","etc5","etc6","etc7","etc8","etc9")
> df = data.frame(id, status, misc1ofmany) ; df
id status misc1ofmany
1 ID1 flag etc1
2 ID2 flag etc2
3 ID2 none etc3
4 ID3 none etc4
5 ID4 flag etc5
6 ID4 flag etc6
7 ID4 flag etc7
8 ID4 none etc8
9 ID4 flag etc9
I'd like to have all the rows of IDs that have been flagged, inclusive of their non-flagged sessions. Right now I'm trying to get the index of the other IDs by grep and plugging that back into a new df. Actually as I wrote this out I figured out that grepl might be easier to work with:
> flaggedIDs <- unique(as.vector(df$id[grep("flag",df$status)]))
> flaggedIDs.allStats.Index <- mapply(grepl,df$id,MoreArgs=list(x=flaggedIDs))
> flaggedIDs.allStats.Index
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
however I just want to get to here:
> flaggedIDsdf <- df[flaggedIDs.allStats.Index] ; flaggedIDsdf
id status misc1ofmany
1 ID1 flag etc1
2 ID2 flag etc2
3 ID2 none etc3
4 ID4 flag etc5
5 ID4 flag etc6
6 ID4 flag etc7
7 ID4 none etc8
8 ID4 flag etc9
I feel like this should be simpler than I am making this out to be, however I've tried a number of possibilities to figure this out to no avail. writing out the question helped this become a clearer/simpler problem in my mind (it looks like I'm just missing one step now), but now I'm also wondering if there is a more efficient way of going about this.
data.table
is really useful here for it's elegant syntax and memory efficiency
library(data.table)
DT <- data.table(df)
setkey(DT, 'id')
DT[DT[status=='flag', list(id = unique(id))]]
id status misc1ofmany
1: ID1 flag etc1
2: ID2 flag etc2
3: ID2 none etc3
4: ID4 flag etc5
5: ID4 flag etc6
6: ID4 flag etc7
7: ID4 none etc8
8: ID4 flag etc9
Or even more compactly
DT[J(unique(id[status=='flag']))]
These two approaches above use the fact that the i
component is evaluated first by data.table
.
Keying by id
means we can do a self-join to extract only those ids
where status=='flag
.
Or, using by
DT[,if(any(status=='flag')){.SD} ,by=id]
This goes through the subsets of DT
by id
, and returns .SD
(the data.table of the subset) if any(status=='flag')
(within that subset).
This seems to work:
df[df$id %in% df$id[df$status == "flag"],]
Plain english: from the data frame, select each row whose id element is in the vector of id elements whose status is flagged in any row.
I believe what you are trying to do may be handled in one line
df[which(df$id %in% df$id[df$status=="flag"]), ]
result:
id status misc1ofmany
1 ID1 flag etc1
2 ID2 flag etc2
3 ID2 none etc3
5 ID4 flag etc5
6 ID4 flag etc6
7 ID4 flag etc7
8 ID4 none etc8
9 ID4 flag etc9
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With