Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if two sets have a non-empty intersection

Tags:

julia

I wrote a function intersects that efficiently checks wether two sets have a non-empty intersection (at least, more efficient than checking the size of their intersection).

It works well, but now I would like to specialize this function for the type DataStructures.IntSet from the DataStructures library. I wrote the function below, which works, but is a bit messy.

As you can see, when the attribute inverse is true, I have to negate the current chunk. In an attempt to simplify the code, I wrote the function intersects2 which does the same work but without the need of multiples if/else.

But this code appears to be less efficient than the first one. I'm not sure, but I think the problem is that each call to op_u or op_v copy the parameter, as shown in the output below.

How can I rewrite this function such that it does not make any copy (ie. no allocation) and without several imbricated if/else? The full code, benchmarks and results can be found below.

using Random
using DataStructures
using BenchmarkTools

elems = [randstring(8) for i in 1:10000]
ii = rand(1:10000, 3000)
jj = rand(1:10000, 3000)


x1 = Set(elems[ii])
y1 = Set(elems[jj])

x2 = Set(ii)
y2 = Set(jj)

x3 = DataStructures.IntSet(ii)
y3 = DataStructures.IntSet(jj)

function intersects(u, v)
    for x in u
        if x in v
            return true
        end
    end
    false
end

function intersects(u::DataStructures.IntSet, v::DataStructures.IntSet)
    ch_u, ch_v = u.bits.chunks, v.bits.chunks
    for i in 1:length(ch_u)
        if u.inverse
            if v.inverse
                if ~ch_u[i] & ~ch_v[i] > 0
                    return true
                end
            else
                if ~ch_u[i] & ch_v[i] > 0
                    return true
                end
            end
        else
            if v.inverse
                if ch_u[i] & ~ch_v[i] > 0
                    return true
                end
            else
                if ch_u[i] & ch_v[i] > 0
                    return true
                end
            end
        end
    end
    false
end

function intersects2(u::DataStructures.IntSet, v::DataStructures.IntSet)
    op_u = if u.inverse x->~x else x->x end
    op_v = if v.inverse x->~x else x->x end

    ch_u, ch_v = u.bits.chunks, v.bits.chunks
    for i in 1:length(ch_u)
        if op_u(ch_u[i]) & op_v(ch_v[i]) > 0
            return true
        end
    end
    false
end


println("Set{String}")
@btime intersects($x1, $y1)

println("Set{Int}")
@btime intersects($x2, $y2)

println("IntSet")
@btime intersects($x3, $y3)
@btime intersects2($x3, $y3)
Set{String}
  190.163 ns (0 allocations: 0 bytes)
Set{Int}
  17.935 ns (0 allocations: 0 bytes)
IntSet
  7.099 ns (0 allocations: 0 bytes)
  90.000 ns (5 allocations: 80 bytes)
like image 806
GaaH Avatar asked Feb 15 '20 23:02

GaaH


People also ask

What is non empty intersection?

In general topology, a branch of mathematics, a non-empty family A of subsets of a set is said to have the finite intersection property (FIP) if the intersection over any finite subcollection of is non-empty.

Do sets always have an intersection that is not the empty set?

A∩∅=∅ because, as there are no elements in the empty set, none of the elements in A are also in the empty set, so the intersection is empty. Hence the intersection of any set and an empty set is an empty set.

What is the intersection of two empty sets?

We define two sets to be “disjoint” if their intersection is the empty set (this means the two sets have no elements in common).

How do you prove an intersection of a set is nonempty?

Let X be a set with n elements and S be a set of 2n−1 subsets of X such that for any A,B,C∈S we have A∩B∩C≠∅. So S contains half of the subsets of X and if A⊆X then S cannot contain both A and X∖A, so exactly one of A and X∖A is in S.


2 Answers

The overhead you are seeing is likely due to function call overhead: op_u is not being inlined.

This version inlines correctly and has the same performance as intersects:

julia> function intersects2(u::DataStructures.IntSet, v::DataStructures.IntSet)
           op_u(x) = u.inverse ? ~x : x
           op_v(x) = v.inverse ? ~x : x

           ch_u, ch_v = u.bits.chunks, v.bits.chunks
           for i in 1:length(ch_u)
               if op_u(ch_u[i]) & op_v(ch_v[i]) > 0
                   return true
               end
           end
           false
       end
like image 179
David Varela Avatar answered Nov 04 '22 05:11

David Varela


You could have avoided the nesting also with something like this, this is also a bit faster than your intersects(u::DataStructures.IntSet, v::DataStructures.IntSet) method, because of @inbounds:

function intersects4(u::DataStructures.IntSet, v::DataStructures.IntSet)
    ch_u, ch_v = u.bits.chunks, v.bits.chunks
    for i in eachindex(length(ch_u) > length(ch_v) ? ch_u : ch_v)
        @inbounds if (u.inverse && v.inverse && ~ch_u[i] & ~ch_v[i] > 0) ||
           (u.inverse && ~ch_u[i] & ch_v[i] > 0) ||
           (v.inverse && ch_u[i] & ~ch_v[i] > 0) ||
           (ch_u[i] & ch_v[i] > 0) 
            return true
        end
    end
    return false
end

Are you sure there can not be a BoundsError if ch_u's lenght is larger than ch_v's and the function happens to run the whole loop? I did for i in eachindex(length(ch_u) > length(ch_v) ? ch_u : ch_v) just in case, because I haven't tested your function thoroughly.

Great answer by @David Varela (I'll call his function intersects3), you can get even a bit faster than intersect4 above if you make sure there will be no BoundsError and use @inbounds, like this:

function intersects5(u::DataStructures.IntSet, v::DataStructures.IntSet)
    op_u(x) = u.inverse ? ~x : x
    op_v(x) = v.inverse ? ~x : x

    ch_u, ch_v = u.bits.chunks, v.bits.chunks
    for i in eachindex(length(ch_u) > length(ch_v) ? ch_u : ch_v)
        @inbounds if op_u(ch_u[i]) & op_v(ch_v[i]) > 0
            return true
        end
    end
    return false
end

julia> begin
           @btime intersects($x3, $y3)   # nested ifs and bound checks
           @btime intersects2($x3, $y3)  # with non inline-able ops
           @btime intersects3($x3, $y3)  # David's improvement
           @btime intersects4($x3, $y3)  # without ifs and with @inbounds
           @btime intersects5($x3, $y3)  # David's improvement with @inbounds
       end
7.184 ns (0 allocations: 0 bytes)
87.633 ns (5 allocations: 80 bytes)
6.500 ns (0 allocations: 0 bytes)
3.763 ns (0 allocations: 0 bytes)
3.079 ns (0 allocations: 0 bytes)

Also in your code:

op_u = if u.inverse x -> ~x else x -> x end

You could have also done:

op_u = u.inverse ? (~) : identity

As ~ operator is just a function and an identity function already exists, anyway David's answer is better (because this won't inline either) and intersects5 is the fastest and more succinct of all, just wanted to mention it though.

like image 23
HarmonicaMuse Avatar answered Nov 04 '22 05:11

HarmonicaMuse