df is a DataFrame with 1.2 million rows.
valid is an Array of 16,000 valid values to filter for.
I tried using a list comprehension for the filter, but it is extremely slow, because every row of df triggers a linear search through valid.
df[[i in valid for i in df[:match]], :]
What is a faster way to do this? Using where? The 'filter' function?
Searching over a Set will be quite fast, because membership tests are hash lookups instead of linear scans:
const validset = Set(valid)
filter(x -> x.match in validset, df)
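As a rough illustration of the idea above, here is a minimal sketch on toy data. The values in df and valid are made up for the example; only the column name match comes from the question. It also shows a boolean-mask form that stays close to the original attempt.

```julia
using DataFrames

# Hypothetical toy data standing in for the 1.2-million-row table
df = DataFrame(match = [5, 12, 7, 42, 12, 99])
valid = [12, 42, 100]

# Build the lookup set once; each membership test is then a hash lookup
validset = Set(valid)

# Row-wise predicate, as in the answer
kept_rows = filter(row -> row.match in validset, df)

# Boolean-mask indexing, closest in shape to the original comprehension.
# Ref(validset) stops broadcasting from iterating over the set itself.
kept_mask = df[in.(df.match, Ref(validset)), :]
```

Both forms should keep the same rows; the mask version is mainly useful if you want to reuse the boolean vector elsewhere.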
Some benchmarks (@btime comes from BenchmarkTools):
julia> df=DataFrame(match=rand(1:(10^8),10^6));
julia> valid = collect(1:1_000_000); validset=Set(valid)
julia> @btime filter((x)-> x.match in $validset,$df)
173.341 ms (3999506 allocations: 61.30 MiB)
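Or the faster version recommended by Bogumil, which names the column so the predicate receives the column values directly instead of a row object: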
julia> @btime filter(:match => in($validset),$df)
37.500 ms (23 allocations: 282.44 KiB)
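For reference, a small sketch of the column-selector form next to two related variants that I believe exist in current DataFrames.jl releases: subset with ByRow, and the in-place filter!. The data and the id column are hypothetical, added only so the kept rows are easy to identify.

```julia
using DataFrames

# Hypothetical toy data; id is only there to identify which rows survive
df = DataFrame(match = [5, 12, 7, 42, 12, 99], id = collect(1:6))
validset = Set([12, 42, 100])

# Column-selector form from the answer: in(validset) is a one-argument
# predicate applied to each value of the match column
by_column = filter(:match => in(validset), df)

# subset works on whole columns, so the predicate is wrapped in ByRow
by_subset = subset(df, :match => ByRow(in(validset)))

# In-place variant: drops the non-matching rows from df itself
filter!(:match => in(validset), df)
```

The in-place version avoids allocating a second table, which may matter at 1.2 million rows, but it discards the original data.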