Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Julia function to return non-unique elements of an array

Tags:

julia

Julia base has the unique function that returns a vector containing only the unique elements of an array (or any iterable). I was looking for a nonunique function to return an array containing all the elements that appear at least twice in its input. As far as I can tell Julia does not have such a function, which I found a bit surprising.

My first attempt was as follows:

function nonunique(x::AbstractArray)
    uniqueindexes = indexin(unique(x),x)
    nonuniqueindexes = setdiff(1:length(x),uniqueindexes)
    unique(x[nonuniqueindexes])
end

But inspired by Bogumił Kamiński's answer to indices of unique elements of vector in Julia I wrote a second version:

function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    duplicatedvector = Vector{T}()
    for i in x
        if(i in uniqueset)
            if !(i in duplicatedset)
                push!(duplicatedset, i)
                push!(duplicatedvector, i)
            end
        else
            push!(uniqueset, i)
        end
    end
    duplicatedvector
end

In my tests, this version is about 4 times faster. It has the nice property that the return is ordered in the order that the second (first repeat) of each set of equivalent elements originally appear. I believe that in is faster when checking for membership of a Set than an Array, which accounts for having the two variables duplicatedset and duplicatedvector.

Is it really necessary for me to "roll my own" nonunique function and can the second version be improved?

like image 939
Philip Swannell Avatar asked Feb 12 '19 14:02

Philip Swannell


3 Answers

You can get higher performance by sorting the list and then searching for duplicates:

function nonunique2(x::AbstractArray{T}) where T
    xs = sort(x)
    duplicatedvector = T[]
    for i=2:length(xs)
        if (isequal(xs[i],xs[i-1]) && (length(duplicatedvector)==0 || !isequal(duplicatedvector[end], xs[i])))
            push!(duplicatedvector,xs[i])
        end
    end
    duplicatedvector
end

Here are sample results:

julia> x = rand(1:1000,1000);

julia> using BenchmarkTools

julia> nn = @btime nonunique($x);
  42.240 μs (39 allocations: 71.23 KiB)

julia> nn2s = @btime nonunique2($x);
  26.453 μs (10 allocations: 16.33 KiB)

julia> sort(nn) == sort(nn2s)
true

It will be much better if you can do in-place sorting:

function nonunique2!(x::AbstractArray{T}) where T
    sort!(x)
    duplicatedvector = T[]
    for i=2:length(x)
        if (isequal(x[i],x[i-1]) && (length(duplicatedvector)==0 || !isequal(duplicatedvector[end], x[i])))
            push!(duplicatedvector,x[i])
        end
    end
    duplicatedvector
end

Here are the results (the same data)

julia> nn2 = @btime nonunique2!($x)
  9.813 μs (9 allocations: 8.39 KiB)

julia> sort(nn) == sort(nns)
true
like image 104
Przemyslaw Szufel Avatar answered Oct 09 '22 00:10

Przemyslaw Szufel


To add to the answer above, as its limitation is that the type T must be sortable and it is not order-preserving I have two possible solutions.

Here is another non-order preserving solution that uses StatsBase.jl. It can be faster than the sorting solution or slower depending on the density of the duplicates (also it does more work, but in some applications this information might be useful):

nonunique3(x) = [k for (k, v) in countmap(x) if v > 1]

If you want to speed up the order preserving approach you could do something like:

function nonunique4(x::AbstractArray{T}) where T
    status = Dict{T, Bool}()
    duplicatedvector = Vector{T}()
    for i in x
        if haskey(status, i)
            if status[i]
                push!(duplicatedvector, i)
                status[i] = false
            end
        else
            status[i] = true
        end
    end
    duplicatedvector
end

In general benchmarking them is tricky as performance will depend on:

  • density of duplicates and over double duplicates in x
  • the size of type T (e.g. if it were a very large immutable type things might change vs. standard situation)
like image 31
Bogumił Kamiński Avatar answered Oct 08 '22 22:10

Bogumił Kamiński


Not really an answer (excellent answers are above) but a comment that the original implementation can be cleaned a little to:

function nonunique1(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if(i in uniqueset)
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end

i.e. you don't need to check for existence before pushing to a set, and you don't need to fill a vector and set separately. It's still not as fast as the sorting implementation.

like image 27
Michael K. Borregaard Avatar answered Oct 09 '22 00:10

Michael K. Borregaard