Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove duplicates from a large integer array using Java

Do you know of any time efficient way to remove duplicated values from a very big integer array using Java? The size of the array depends on the logged in user, but will always exceed 1500000 unsorted values with some duplicates. Every integer contains a number between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow this amount of data(my ISP has restricted it). And a regular for loop within a for loop takes over 5 minutes to calculate.

The size of the array without the duplicates is the one I will store in my database.

Help would be appreciated!

like image 610
dirbacke Avatar asked Sep 08 '10 12:09

dirbacke


People also ask

How do you remove duplicates from an integer array in Java?

We can remove duplicate element in an array by 2 ways: using temporary array or using separate index. To remove the duplicate element from array, the array must be in sorted order. If array is not sorted, you can sort it by calling Arrays. sort(arr) method.

How do you remove duplicates from an array in Java 8?

Remove duplicate elements from Arrays :Use distinct() method of Stream API to remove duplicate String elements and then invoke toArray(); method to convert/result into Object[] Array. Finally, iterate/print Array with unique elements using forEach() method of Stream API.


2 Answers

You could perhaps use a bit set? I don't know how efficient Java's BitSet is. But 9999999 possible values would only take 9999999 / 8 = 1250000 bytes = just over 1Mb. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

1Mb will fit in a CPU cache, so this could be quite efficient depending on the bit set implementation.

This also has the side-effect of sorting the data too.

And... this is an O(n) algorithm since it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass is also O(m) where m is the number of unique values and, by definition, must be <= n.

like image 113
dty Avatar answered Oct 10 '22 06:10

dty


I would make a hashset where I store all values contained in the list, before i start adding items to the list. Then just check so that the hashset doesn't contain the value you want to add.

like image 45
Marcus Johansson Avatar answered Oct 10 '22 08:10

Marcus Johansson