 

removing duplicate strings from a massive array in java efficiently?

Tags:

java

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings. The array contains millions or tens of millions of strings. It is already prepopulated, so the optimization goal is only removing duplicates, not preventing duplicates from being inserted in the first place.

I was thinking along the lines of doing a sort and then a binary search, to get a log(n) search instead of an n (linear) search. That would give me n log n + n operations which, although better than the unsorted n^2 approach, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)

Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!

Preator Darmatheon asked Apr 06 '12 15:04

2 Answers

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:

HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));

If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
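If rolling your own hash set is the route you take, a minimal open-addressing set with linear probing is one way to do it. The sketch below is illustrative, not code from this answer; the class and method names are mine, and it is sized up front (no resizing) on the assumption that you know the array length in advance:

```java
// Minimal open-addressing hash set for strings, avoiding the Collections API.
// Sized once at construction; assumes the expected element count is known,
// which holds here since the array is already populated.
public class SimpleStringSet {
    private final String[] table;
    private int size;

    public SimpleStringSet(int expected) {
        int cap = 16;
        while (cap < expected * 2) cap <<= 1; // power of two, load factor <= 0.5
        table = new String[cap];
    }

    // Returns true if s was newly added, false if it was already present.
    public boolean add(String s) {
        int mask = table.length - 1;
        int i = (s.hashCode() & 0x7fffffff) & mask;
        while (table[i] != null) {
            if (table[i].equals(s)) return false; // duplicate found
            i = (i + 1) & mask;                   // linear probing
        }
        table[i] = s;
        size++;
        return true;
    }

    public int size() {
        return size;
    }
}
```

To dedupe, you would iterate the big array once and keep only the strings for which `add` returns true.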

Jon Skeet answered Oct 19 '22 17:10

ANALYSIS

Let's perform some analysis:

  1. Using HashSet. Time complexity O(n). Space complexity O(n). Note that it requires roughly 8 * array.length extra bytes for the references alone (4-8 bytes per reference), plus per-entry overhead for the set's internal nodes.

  2. Quick Sort. Time O(n log n). Space O(log n) (worst case O(n^2) time and O(n) space, respectively).

  3. Merge Sort (or a binary tree such as TreeSet). Time O(n log n). Space O(n).

  4. Heap Sort. Time O(n log n). Space O(1) (but in practice it is slower than 2 and 3).

In the case of Heap Sort you can throw away duplicates on the fly, so you save the final pass after sorting.
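The heap-sort-with-on-the-fly-dedup idea can be sketched as follows. This is my illustration of the technique, not code from the answer: a max-heap is built in place, and as each maximum is extracted it is only written to the sorted region if it differs from the previously extracted value, so equal runs collapse during extraction rather than in a separate pass:

```java
// In-place heapsort that drops duplicates while extracting, so no extra
// array and no separate dedup pass are needed. Class name is illustrative.
public class HeapDedup {
    // Sorts distinct values of a into a[0..count-1] (ascending); returns count.
    public static int sortDistinct(String[] a) {
        int n = a.length;
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, i, n); // build max-heap
        int heapSize = n;
        int write = n;      // next distinct element goes to a[write - 1]
        String last = null; // last distinct value extracted
        while (heapSize > 0) {
            String max = a[0];
            heapSize--;
            a[0] = a[heapSize]; // move last leaf to the root
            if (last == null || !max.equals(last)) {
                write--;
                a[write] = max; // emit only the first copy of each value
                last = max;
            } // equal copies pop consecutively from a max-heap, so skipping works
            siftDown(a, 0, heapSize);
        }
        int count = n - write;
        for (int i = 0; i < count; i++) a[i] = a[write + i]; // shift to front
        return count;
    }

    private static void siftDown(String[] a, int i, int size) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, largest = i;
            if (l < size && a[l].compareTo(a[largest]) > 0) largest = l;
            if (r < size && a[r].compareTo(a[largest]) > 0) largest = r;
            if (largest == i) return;
            String tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
            i = largest;
        }
    }
}
```

Since emitted values only overwrite slots the shrinking heap has already vacated, the whole thing runs in O(1) extra space.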

CONCLUSION

  1. If time is your concern, and you don't mind allocating roughly 8 * array.length extra bytes for a HashSet, this solution seems optimal.

  2. If space is a concern, then QuickSort plus one dedup pass.

  3. If space is a big concern, implement a heap that throws away duplicates on the fly. It's still O(n log n), but without the additional space.
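Conclusion 2 (sort, then one linear pass) can be sketched as below. One caveat worth flagging: `Arrays.sort` on an `Object[]` is actually a merge sort (TimSort), so this as written matches row 3's O(n) space bound; you would need a hand-written quicksort to get row 2's O(log n) space. The class name is illustrative:

```java
import java.util.Arrays;

// Sort the array, then compact distinct values to the front in one pass.
// Note: Arrays.sort on Object[] is a merge sort (TimSort), not quicksort.
public class SortDedup {
    // Sorts a and moves distinct values to a[0..count-1]; returns count.
    public static int dedup(String[] a) {
        if (a.length == 0) return 0;
        Arrays.sort(a);
        int write = 1; // a[0] is always kept
        for (int read = 1; read < a.length; read++) {
            if (!a[read].equals(a[write - 1])) {
                a[write++] = a[read]; // first occurrence of a new value
            }
        }
        return write;
    }
}
```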

Eugene Retunsky answered Oct 19 '22 19:10