Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algo to find duplicates in a very large array

During one of technical interview got this question. I know way to solve this problem using (in java) HashSet.

But i could not understand when interviewer forced on the word "a very large array let's say 10 million elements in the given array".

Do i need to change the approach? If not, what should be the efficient to achieve this?

PS: Algo or implementation is language agnostic.

Thank you.

like image 467
Sam Avatar asked Oct 17 '25 08:10

Sam


2 Answers

There were some key things, which interviewer expected you to ask back like: if you can not load the array in memory, then how much I can load. These are the steps to solve the problem:

  1. you need to divide the array in how much memory is available to you.
  2. Let's say you can load 1M number at a time. You have split the data in k parts. You load the first 1M and build Min Heap of it. Then remove the top and apply the Heapify on Min Heap.
  3. Repeat the same for other parts of the data.
  4. Now you will have K sorted splits.
  5. Now fetch a first number from each K split and again build a Min Heap.
  6. Now remove the top from the Min Heap and store the value in temporary variable as well for comparing with the next coming number for finding the duplicates.
  7. Now fetch the next number from the same split (part) whose number got removed last time. Put that number on top of Min Heap and apply Heapify.
  8. Now the top of the Min Heap is your next sorted number and compare it with the temporary variable for finding the duplicates. Update thetemporary variable` if number is not duplicate.
like image 95
YoungHobbit Avatar answered Oct 19 '25 20:10

YoungHobbit


You can do it in O(nlog(n)):

  • Sort the array
  • find the duplicates (they will be next to each other) in one pass.

I think that is what the interviewer wanted to hear.

if you did a merge sort or a quick sort, finding the duplicates could be done at merging in hidden time. These can be implemented "in-place", or "by-part" if the array is too large to fit in memory.

like image 31
Reblochon Masque Avatar answered Oct 19 '25 22:10

Reblochon Masque



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!