Efficient set intersection to get rows in DataFrame

Question

I have a dataframe with 3 values relevant to this question, :ID, :Position, :Probability. Each row is unique, but multiple rows may have the same ID. What I'd like to do is get all of the rows for a certain value of Position, that share an ID with any row that has a Probability higher than some value in a different position.

For example, say I have the following DataFrame (df):

1020692×8 DataFrames.DataFrame
│ Row     │ ID  │ Position      │ Probability │
├─────────┼─────┼───────────────┼─────────────┤
│ 1       │ 425 │ "first"       │ 0.02        │
│ 2       │ 425 │ "last"        │ 0.03        │
│ 3       │ 425 │ "penultimate" │ 0.02        │
│ 4       │ 425 │ "other"       │ 0.04        │
│ 5       │ 421 │ "first"       │ 0.44        │
│ 6       │ 421 │ "last"        │ 0.85        │
│ 7       │ 421 │ "second"      │ 0.59        │
│ 8       │ 421 │ "other"       │ 1.0         │
⋮

If I set a threshold of 0.8, I want to end up with all of the rows where :Position == "first" if that :ID has :Position == "last" && :Probability > 0.8. In other words, I'd want row 5, since row 6 has a :Probability > 0.8, but not row 1, since row 2 does not.

The row to check the threshold will not always follow the row I want to keep. Not all rows where :Position == "first" will have a "last" row to check, but there will be at most one.

The way I tried to solve this was to make a vector of all the IDs in last position with Probability > 0.8, and then tried to subset the dataframe using in(). So...

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = lasts[lasts[:Probability] .> 0.8, :ID]

final = firsts[[in(i, meetsthreshold) for i in firsts[:ID]], :]

I tested this with a super short vector of IDs and it works, but it lags super hard on my actual data (where length(meetsthreshold) is > 100k). I think what I want is basically a set intersection, and if I do that with the IDs(eg intersect(Set(firsts[:ID]), Set(meetsthreshold))) it's basically instantaneous. Is there a way to do the set intersection with a dataframe so I can actually get the rows?

kevbonham · Accepted Answer

I kinda feel like an idiot - the solution is just use a set instead of a vector to search in. Eg:

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = Set(lasts[lasts[:Probability] .> 0.8, :ID])

final = firsts[Vector{Bool}([in(i, meetsthreshold) for i in firsts[:ID]]), :]

Ran in ~ 1 second.

Efficient set intersection to get rows in DataFrame

Tags:

dataframe

julia

kevbonham

1 Answers

kevbonham

Recent Activity

Donate For Us

Efficient set intersection to get rows in DataFrame

Tags:

dataframe

julia

kevbonham

1 Answers

kevbonham

Related questions

Recent Activity

Donate For Us