Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to find duplicate in large array

Tags:

ruby

I have an array with 35k elements. How may I efficiently find duplicates and return those duplicates?

all = [*1..35000, 1]

This solution works:

all.select { |v| all.count(v) > 1 }

But it takes a long time.

like image 655
Juanito Fatas Avatar asked Aug 16 '18 02:08

Juanito Fatas


People also ask

How do you find multiple duplicates in array?

Duplicate elements can be found using two loops. The outer loop will iterate through the array from 0 to length of the array. The outer loop will select an element. The inner loop will be used to compare the selected element with the rest of the elements of the array.

How do you check if there is a duplicate in an array?

To check if an array contains duplicates: Use the Array. some() method to iterate over the array. Check if the index of the first occurrence of the current value is NOT equal to the index of its last occurrence. If the condition is met, then the array contains duplicates.

How do I find large duplicate files?

You can use external merge sort for this: Partition your file into multiple smaller chunks that fit into memory. Sort each chunk, eliminate the duplicates (now neighboring elements). Merge the chunks and again eliminate the duplicates when merging.


1 Answers

Your code is taking an eon to execute because it is executing count for each element, resulting in it having a computational complexity of O(n2).

arr = [*1..35000, 1, 34999]

If you want to know which values appear in the array at least twice...

require 'set'

uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
  #=> [1, 34999]

Set lookups (implemented with a hash under the covers) are extremely fast.

See Set#add? and Set#add.

If you want to know the numbers of times values appear in the array that appear at least twice...

arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
  #=> {1=>2, 34999=>2}

This uses a counting hash1. See Hash::new when it takes a default value as an argument.

If you want to know the indices of values that appear in the array at least twice...

arr.each_with_index.
    with_object({}) { |(x,i),h| (h[x] ||= []) << i }.
    select { |_,v| v.size > 1 }
  #=> {1=>[0, 35000], 34999=>[34998, 35001]}

When the hash h does not already have a key x,

(h[x] ||= []) << i
   #=> (h[x] = h[x] || []) << i
   #=> (h[x] = nil || []) << i
   #=> (h[x] = []) << i
   #=> [] << i where [] is now h[x]

1. Ruby v2.7 gave us the method Enumerable#tally, allowing us to write arr.tally.select { |_,v| v > 1 }.

like image 58
Cary Swoveland Avatar answered Sep 30 '22 18:09

Cary Swoveland