 

removing duplicate strings from a massive array in java efficiently?

Tags:

java

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings. The array contains millions or tens of millions of strings. It is already prepopulated, so the optimization goal is only removing duplicates, not preventing duplicates from being inserted in the first place.

I was thinking along the lines of doing a sort and then a binary search, to get a log(n) search instead of an n (linear) search. That would give me n log n + n operations which, although better than the unsorted n^2 approach, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)

Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!

Preator Darmatheon asked Apr 06 '12 15:04

2 Answers

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:

HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));

If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
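If rolling your own hash set is the route you take, a minimal open-addressing set with linear probing is one way to do it. The sketch below is illustrative, not code from this answer; the class and method names are mine, and it is sized up front (no resizing) on the assumption that you know the array length in advance:

```java
// Minimal open-addressing hash set for strings, avoiding the Collections API.
// Sized once at construction; assumes the expected element count is known,
// which holds here since the array is already populated.
public class SimpleStringSet {
    private final String[] table;
    private int size;

    public SimpleStringSet(int expected) {
        int cap = 16;
        while (cap < expected * 2) cap <<= 1; // power of two, load factor <= 0.5
        table = new String[cap];
    }

    // Returns true if s was newly added, false if it was already present.
    public boolean add(String s) {
        int mask = table.length - 1;
        int i = (s.hashCode() & 0x7fffffff) & mask;
        while (table[i] != null) {
            if (table[i].equals(s)) return false; // duplicate found
            i = (i + 1) & mask;                   // linear probing
        }
        table[i] = s;
        size++;
        return true;
    }

    public int size() {
        return size;
    }
}
```

To dedupe, you would iterate the big array once and keep only the strings for which `add` returns true.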

Jon Skeet answered Oct 19 '22 17:10

ANALYSIS

Let's perform some analysis:

  1. Using HashSet. Time complexity O(n). Space complexity O(n). Note that it requires roughly 8 * array.length extra bytes for the references alone (4-8 bytes per reference), plus per-entry overhead for the set's internal nodes.

  2. Quick Sort. Time O(n log n). Space O(log n) (worst case O(n^2) time and O(n) space, respectively).

  3. Merge Sort (or a binary tree such as TreeSet). Time O(n log n). Space O(n).

  4. Heap Sort. Time O(n log n). Space O(1) (but in practice it is slower than 2 and 3).

In the case of Heap Sort you can throw away duplicates on the fly, so you save the final pass after sorting.
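The heap-sort-with-on-the-fly-dedup idea can be sketched as follows. This is my illustration of the technique, not code from the answer: a max-heap is built in place, and as each maximum is extracted it is only written to the sorted region if it differs from the previously extracted value, so equal runs collapse during extraction rather than in a separate pass:

```java
// In-place heapsort that drops duplicates while extracting, so no extra
// array and no separate dedup pass are needed. Class name is illustrative.
public class HeapDedup {
    // Sorts distinct values of a into a[0..count-1] (ascending); returns count.
    public static int sortDistinct(String[] a) {
        int n = a.length;
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, i, n); // build max-heap
        int heapSize = n;
        int write = n;      // next distinct element goes to a[write - 1]
        String last = null; // last distinct value extracted
        while (heapSize > 0) {
            String max = a[0];
            heapSize--;
            a[0] = a[heapSize]; // move last leaf to the root
            if (last == null || !max.equals(last)) {
                write--;
                a[write] = max; // emit only the first copy of each value
                last = max;
            } // equal copies pop consecutively from a max-heap, so skipping works
            siftDown(a, 0, heapSize);
        }
        int count = n - write;
        for (int i = 0; i < count; i++) a[i] = a[write + i]; // shift to front
        return count;
    }

    private static void siftDown(String[] a, int i, int size) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, largest = i;
            if (l < size && a[l].compareTo(a[largest]) > 0) largest = l;
            if (r < size && a[r].compareTo(a[largest]) > 0) largest = r;
            if (largest == i) return;
            String tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
            i = largest;
        }
    }
}
```

Since emitted values only overwrite slots the shrinking heap has already vacated, the whole thing runs in O(1) extra space.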

CONCLUSION

  1. If time is your concern, and you don't mind allocating roughly 8 * array.length extra bytes for a HashSet, this solution seems optimal.

  2. If space is a concern, then QuickSort plus one dedup pass.

  3. If space is a big concern, implement a heap that throws away duplicates on the fly. It's still O(n log n), but without the additional space.
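Conclusion 2 (sort, then one linear pass) can be sketched as below. One caveat worth flagging: `Arrays.sort` on an `Object[]` is actually a merge sort (TimSort), so this as written matches row 3's O(n) space bound; you would need a hand-written quicksort to get row 2's O(log n) space. The class name is illustrative:

```java
import java.util.Arrays;

// Sort the array, then compact distinct values to the front in one pass.
// Note: Arrays.sort on Object[] is a merge sort (TimSort), not quicksort.
public class SortDedup {
    // Sorts a and moves distinct values to a[0..count-1]; returns count.
    public static int dedup(String[] a) {
        if (a.length == 0) return 0;
        Arrays.sort(a);
        int write = 1; // a[0] is always kept
        for (int read = 1; read < a.length; read++) {
            if (!a[read].equals(a[write - 1])) {
                a[write++] = a[read]; // first occurrence of a new value
            }
        }
        return write;
    }
}
```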

Eugene Retunsky answered Oct 19 '22 19:10