Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find and return a duplicate value in array

Tags:

arrays

ruby

People also ask

How do you find duplicate values in an array?

function checkIfArrayIsUnique(myArray) { for (var i = 0; i < myArray. length; i++) { for (var j = 0; j < myArray. length; j++) { if (i != j) { if (myArray[i] == myArray[j]) { return true; // means there are duplicate values } } } } return false; // means there are no duplicate values. }


a = ["A", "B", "C", "B", "A"]
a.detect{ |e| a.count(e) > 1 }

I know this isn't very elegant answer, but I love it. It's beautiful one liner code. And works perfectly fine unless you need to process huge data set.

Looking for faster solution? Here you go!

def find_one_using_hash_map(array)
  map = {}
  dup = nil
  array.each do |v|
    map[v] = (map[v] || 0 ) + 1

    if map[v] > 1
      dup = v
      break
    end
  end

  return dup
end

It's linear, O(n), but now needs to manage multiple lines-of-code, needs test cases, etc.

If you need an even faster solution, maybe try C instead.

And here is the gist comparing different solutions: https://gist.github.com/naveed-ahmad/8f0b926ffccf5fbd206a1cc58ce9743e


You can do this in a few ways, with the first option being the fastest:

ary = ["A", "B", "C", "B", "A"]

ary.group_by{ |e| e }.select { |k, v| v.size > 1 }.map(&:first)

ary.sort.chunk{ |e| e }.select { |e, chunk| chunk.size > 1 }.map(&:first)

And a O(N^2) option (i.e. less efficient):

ary.select{ |e| ary.count(e) > 1 }.uniq

Simply find the first instance where the index of the object (counting from the left) does not equal the index of the object (counting from the right).

arr.detect {|e| arr.rindex(e) != arr.index(e) }

If there are no duplicates, the return value will be nil.

I believe this is the fastest solution posted in the thread so far, as well, since it doesn't rely on the creation of additional objects, and #index and #rindex are implemented in C. The big-O runtime is N^2 and thus slower than Sergio's, but the wall time could be much faster due to the the fact that the "slow" parts run in C.


detect only finds one duplicate. find_all will find them all:

a = ["A", "B", "C", "B", "A"]
a.find_all { |e| a.count(e) > 1 }

Here are two more ways of finding a duplicate.

Use a set

require 'set'

def find_a_dup_using_set(arr)
  s = Set.new
  arr.find { |e| !s.add?(e) }
end

find_a_dup_using_set arr
  #=> "hello" 

Use select in place of find to return an array of all duplicates.

Use Array#difference

class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end

def find_a_dup_using_difference(arr)
  arr.difference(arr.uniq).first
end

find_a_dup_using_difference arr
  #=> "hello" 

Drop .first to return an array of all duplicates.

Both methods return nil if there are no duplicates.

I proposed that Array#difference be added to the Ruby core. More information is in my answer here.

Benchmark

Let's compare suggested methods. First, we need an array for testing:

CAPS = ('AAA'..'ZZZ').to_a.first(10_000)
def test_array(nelements, ndups)
  arr = CAPS[0, nelements-ndups]
  arr = arr.concat(arr[0,ndups]).shuffle
end

and a method to run the benchmarks for different test arrays:

require 'fruity'

def benchmark(nelements, ndups)
  arr = test_array nelements, ndups
  puts "\n#{ndups} duplicates\n"    
  compare(
    Naveed:    -> {arr.detect{|e| arr.count(e) > 1}},
    Sergio:    -> {(arr.inject(Hash.new(0)) {|h,e| h[e] += 1; h}.find {|k,v| v > 1} ||
                     [nil]).first },
    Ryan:      -> {(arr.group_by{|e| e}.find {|k,v| v.size > 1} ||
                     [nil]).first},
    Chris:     -> {arr.detect {|e| arr.rindex(e) != arr.index(e)} },
    Cary_set:  -> {find_a_dup_using_set(arr)},
    Cary_diff: -> {find_a_dup_using_difference(arr)}
  )
end

I did not include @JjP's answer because only one duplicate is to be returned, and when his/her answer is modified to do that it is the same as @Naveed's earlier answer. Nor did I include @Marin's answer, which, while posted before @Naveed's answer, returned all duplicates rather than just one (a minor point but there's no point evaluating both, as they are identical when return just one duplicate).

I also modified other answers that returned all duplicates to return just the first one found, but that should have essentially no effect on performance, as they computed all duplicates before selecting one.

The results for each benchmark are listed from fastest to slowest:

First suppose the array contains 100 elements:

benchmark(100, 0)
0 duplicates
Running each test 64 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is similar to Ryan
Ryan is similar to Sergio
Sergio is faster than Chris by 4x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(100, 1)
1 duplicates
Running each test 128 times. Test will take about 2 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Ryan by 2x ± 1.0
Ryan is similar to Sergio
Sergio is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(100, 10)
10 duplicates
Running each test 1024 times. Test will take about 3 seconds.
Chris is faster than Naveed by 2x ± 1.0
Naveed is faster than Cary_diff by 2x ± 1.0 (results differ: AAC vs AAF)
Cary_diff is similar to Cary_set
Cary_set is faster than Sergio by 3x ± 1.0 (results differ: AAF vs AAC)
Sergio is similar to Ryan

Now consider an array with 10,000 elements:

benchmark(10000, 0)
0 duplicates
Running each test once. Test will take about 4 minutes.
Ryan is similar to Sergio
Sergio is similar to Cary_set
Cary_set is similar to Cary_diff
Cary_diff is faster than Chris by 400x ± 100.0
Chris is faster than Naveed by 3x ± 0.1

benchmark(10000, 1)
1 duplicates
Running each test once. Test will take about 1 second.
Cary_set is similar to Cary_diff
Cary_diff is similar to Sergio
Sergio is similar to Ryan
Ryan is faster than Chris by 2x ± 1.0
Chris is faster than Naveed by 2x ± 1.0

benchmark(10000, 10)
10 duplicates
Running each test once. Test will take about 11 seconds.
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 3x ± 1.0 (results differ: AAE vs AAA)
Sergio is similar to Ryan
Ryan is faster than Chris by 20x ± 10.0
Chris is faster than Naveed by 3x ± 1.0

benchmark(10000, 100)
100 duplicates
Cary_set is similar to Cary_diff
Cary_diff is faster than Sergio by 11x ± 10.0 (results differ: ADG vs ACL)
Sergio is similar to Ryan
Ryan is similar to Chris
Chris is faster than Naveed by 3x ± 1.0

Note that find_a_dup_using_difference(arr) would be much more efficient if Array#difference were implemented in C, which would be the case if it were added to the Ruby core.

Conclusion

Many of the answers are reasonable but using a Set is the clear best choice. It is fastest in the medium-hard cases, joint fastest in the hardest and only in computationally trivial cases - when your choice won't matter anyway - can it be beaten.

The one very special case in which you might pick Chris' solution would be if you want to use the method to separately de-duplicate thousands of small arrays and expect to find a duplicate typically less than 10 items in. This will be a bit faster as it avoids the small additional overhead of creating the Set.