I have a dataframe with 3 values relevant to this question, :ID, :Position, :Probability. Each row is unique, but multiple rows may have the same ID. What I'd like to do is get all of the rows for a certain value of Position, that share an ID with any row that has a Probability higher than some value in a different position.
For example, say I have the following DataFrame (df):
1020692×8 DataFrames.DataFrame
│ Row │ ID │ Position │ Probability │
├─────────┼─────┼───────────────┼─────────────┤
│ 1 │ 425 │ "first" │ 0.02 │
│ 2 │ 425 │ "last" │ 0.03 │
│ 3 │ 425 │ "penultimate" │ 0.02 │
│ 4 │ 425 │ "other" │ 0.04 │
│ 5 │ 421 │ "first" │ 0.44 │
│ 6 │ 421 │ "last" │ 0.85 │
│ 7 │ 421 │ "second" │ 0.59 │
│ 8 │ 421 │ "other" │ 1.0 │
⋮
If I set a threshold of 0.8, I want to end up with all of the rows where :Position == "first" if that :ID has :Position == "last" && :Probability > 0.8. In other words, I'd want row 5, since row 6 has a :Probability > 0.8, but not row 1, since row 2 does not.
The row to check the threshold will not always follow the row I want to keep. Not all rows where :Position == "first" will have a "last" row to check, but there will be at most one.
The way I tried to solve this was to make a vector of all the IDs in last position with Probability > 0.8, and then tried to subset the dataframe using in(). So...
firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = lasts[lasts[:Probability] .> 0.8, :ID]
final = firsts[[in(i, meetsthreshold) for i in firsts[:ID]], :]
I tested this with a super short vector of IDs and it works, but it lags super hard on my actual data (where length(meetsthreshold) is > 100k). I think what I want is basically a set intersection, and if I do that with the IDs(eg intersect(Set(firsts[:ID]), Set(meetsthreshold))) it's basically instantaneous. Is there a way to do the set intersection with a dataframe so I can actually get the rows?
I kinda feel like an idiot - the solution is just use a set instead of a vector to search in. Eg:
firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = Set(lasts[lasts[:Probability] .> 0.8, :ID])
final = firsts[Vector{Bool}([in(i, meetsthreshold) for i in firsts[:ID]]), :]
Ran in ~ 1 second.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With