
Julia - Fastest way to filter based on array of values?

Tags:

julia

df is a DataFrame with 1.2 million rows.

valid is an Array of 16,000 valid values to filter for.

I tried building the row mask with an array comprehension, but it is extremely slow, because every row of df triggers a linear search through valid.

df[[i in valid for i in df.match], :]

What is a faster way to do this? Using where? The 'filter' function?

ndw asked Oct 29 '20 20:10


1 Answer

Testing membership in a Set is a hash lookup (constant time on average), so it is much faster than scanning the valid array once per row:

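# Build the lookup set once, outside the filter
const validset = Set(valid)
# Keep only the rows whose :match value is in the set
filter(x -> x.match in validset, df)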

Some benchmarks (using BenchmarkTools):

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(match=rand(1:(10^8), 10^6));

julia> valid = collect(1:1_000_000); validset = Set(valid);

julia> @btime filter((x)-> x.match in $validset,$df)
  173.341 ms (3999506 allocations: 61.30 MiB)

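Or the faster version recommended by Bogumil, which passes the column name so that filter works on the column directly instead of creating a row object for every row: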

julia> @btime filter(:match => in($validset),$df)
  37.500 ms (23 allocations: 282.44 KiB)
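
For comparison, here is a minimal sketch of three equivalent ways to apply the set lookup, on a small made-up table (the data and the id column are illustrative assumptions, not part of the question). It assumes DataFrames.jl 1.0 or later, where subset and ByRow are available; no timings are claimed for the last two forms.

```julia
using DataFrames

# Small made-up data standing in for the question's df and valid
df = DataFrame(match = [3, 7, 10, 42, 7, 99], id = 1:6)
valid = [7, 42, 1000]
validset = Set(valid)   # hashed lookup instead of a linear scan per row

# 1. Column-selector form from the answer above
a = filter(:match => in(validset), df)

# 2. Boolean mask via broadcasting; Ref stops the set itself from being broadcast over
b = df[in.(df.match, Ref(validset)), :]

# 3. subset, applying the predicate to each element of the column
c = subset(df, :match => ByRow(in(validset)))

# All three keep the same rows (ids 2, 4 and 5 here)
println(a == b == c)
```

Which form to use is mostly a matter of taste; it is worth running @btime on your own data before picking one.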
Przemyslaw Szufel answered Oct 03 '22 06:10