Julia Base has the unique function, which returns a vector containing only the unique elements of an array (or any iterable). I was looking for a nonunique function that returns an array containing all the elements that appear at least twice in its input. As far as I can tell, Julia does not have such a function, which I found a bit surprising.
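For example, unique keeps one copy of every value, while the nonunique I am after would keep only the values that occur more than once (output shown as in a recent Julia REPL; the nonunique call illustrates the desired behaviour, not an existing function):

julia> unique([1, 2, 2, 3, 3, 3])
3-element Vector{Int64}:
 1
 2
 3

julia> nonunique([1, 2, 2, 3, 3, 3])   # desired result
2-element Vector{Int64}:
 2
 3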
My first attempt was as follows:
function nonunique(x::AbstractArray)
    # indices of the first occurrence of each distinct value
    uniqueindexes = indexin(unique(x), x)
    # the remaining indices hold second and later occurrences
    nonuniqueindexes = setdiff(1:length(x), uniqueindexes)
    unique(x[nonuniqueindexes])
end
But, inspired by Bogumił Kamiński's answer to "indices of unique elements of vector in Julia", I wrote a second version:
function nonunique(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    duplicatedvector = Vector{T}()
    for i in x
        if i in uniqueset
            if !(i in duplicatedset)
                push!(duplicatedset, i)
                push!(duplicatedvector, i)
            end
        else
            push!(uniqueset, i)
        end
    end
    duplicatedvector
end
In my tests, this version is about 4 times faster. It has the nice property that the result is ordered by where the second occurrence (i.e. the first repeat) of each duplicated element appears in the input. I believe that in is faster for checking membership of a Set than of an Array, which is why there are two variables, duplicatedset and duplicatedvector.
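For example, with this second version the duplicates come out in the order of their first repetition (illustrative call, output as in a recent Julia REPL):

julia> nonunique([3, 1, 3, 2, 1, 1])
2-element Vector{Int64}:
 3
 1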
Is it really necessary for me to "roll my own" nonunique function, and can the second version be improved?
You can get higher performance by sorting the list and then searching for duplicates:
function nonunique2(x::AbstractArray{T}) where T
    xs = sort(x)
    duplicatedvector = T[]
    for i in 2:length(xs)
        if isequal(xs[i], xs[i-1]) && (length(duplicatedvector) == 0 || !isequal(duplicatedvector[end], xs[i]))
            push!(duplicatedvector, xs[i])
        end
    end
    duplicatedvector
end
Here are sample results:
julia> x = rand(1:1000,1000);
julia> using BenchmarkTools
julia> nn = @btime nonunique($x);
42.240 μs (39 allocations: 71.23 KiB)
julia> nn2s = @btime nonunique2($x);
26.453 μs (10 allocations: 16.33 KiB)
julia> sort(nn) == sort(nn2s)
true
It will be much better if you can do in-place sorting:
function nonunique2!(x::AbstractArray{T}) where T
    sort!(x)
    duplicatedvector = T[]
    for i in 2:length(x)
        if isequal(x[i], x[i-1]) && (length(duplicatedvector) == 0 || !isequal(duplicatedvector[end], x[i]))
            push!(duplicatedvector, x[i])
        end
    end
    duplicatedvector
end
Here are the results (on the same data):
julia> nn2 = @btime nonunique2!($x);
9.813 μs (9 allocations: 8.39 KiB)
julia> sort(nn) == sort(nn2)
true
To add to the answer above: its limitations are that the type T must be sortable and that it is not order-preserving, so I have two possible alternatives.

Here is another non-order-preserving solution that uses StatsBase.jl. It can be faster or slower than the sorting solution depending on the density of the duplicates (it also does more work by counting every element, but in some applications those counts might themselves be useful):
using StatsBase

nonunique3(x) = [k for (k, v) in countmap(x) if v > 1]
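A quick sanity check (sorted before display, since countmap does not preserve input order; output as in a recent Julia REPL):

julia> sort(nonunique3([1, 2, 2, 3, 3, 3]))
2-element Vector{Int64}:
 2
 3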
If you want to speed up the order-preserving approach, you could do something like:
function nonunique4(x::AbstractArray{T}) where T
    status = Dict{T, Bool}()
    duplicatedvector = Vector{T}()
    for i in x
        if haskey(status, i)
            if status[i]
                push!(duplicatedvector, i)
                status[i] = false
            end
        else
            status[i] = true
        end
    end
    duplicatedvector
end
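It keeps the same ordering property as the Set-based version in the question, e.g. (worked out by hand, output as in a recent Julia REPL):

julia> nonunique4(["b", "a", "b", "a", "a"])
2-element Vector{String}:
 "b"
 "a"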
In general, benchmarking them is tricky, as performance will depend on:
- the input x
- the element type T (e.g. if it were a very large immutable type things might change vs. the standard situation)

Not really an answer (the excellent answers are above), but a comment: the original implementation can be cleaned up a little to:
function nonunique1(x::AbstractArray{T}) where T
    uniqueset = Set{T}()
    duplicatedset = Set{T}()
    for i in x
        if i in uniqueset
            push!(duplicatedset, i)
        else
            push!(uniqueset, i)
        end
    end
    collect(duplicatedset)
end
i.e. you don't need to check for existence before pushing to a set, and you don't need to fill a vector and set separately. It's still not as fast as the sorting implementation.
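For reference, push! on a Set simply ignores a value that is already present, which is why the extra membership check is not needed (illustrative snippet):

julia> s = Set{Int}();

julia> push!(s, 1); push!(s, 1);

julia> length(s)
1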