 

How to remove duplicate words using Java when there are more than 200 million words?


I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. It contains duplicates, roughly 1 duplicate word for every 100 words.

In my second program, I want to read the file, and I can successfully read it line by line using a BufferedReader.

Now to remove duplicates, we could use a Set (and its implementations), but a Set runs into problems, as described in the following three scenarios (a minimal sketch of this naive approach follows the list):

  1. With the default JVM heap size, the Set can hold about 0.7-0.8 million words before an OutOfMemoryError.
  2. With a 512M JVM heap, the Set can hold about 5-6 million words before an OOM error.
  3. With a 1024M JVM heap, the Set can hold about 12-13 million words before an OOM error. After about 10 million records have been added to the Set, operations become extremely slow: for example, adding the next ~4,000 records took 60 seconds.
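
For reference, here is a minimal sketch of the naive approach described above (an illustrative reconstruction, not the asker's actual code): read the file line by line with a BufferedReader and collect the words in a HashSet. The file name `words.txt` and the one-word-per-line format are assumptions. This is the version that runs out of memory once the Set holds a few million entries.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class NaiveSetDedup {
    public static void main(String[] args) throws IOException {
        Set<String> unique = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("words.txt"), StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                unique.add(word);   // every distinct word stays on the heap -> OOM at scale
            }
        }
        System.out.println("Distinct words: " + unique.size());
    }
}
```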

I am restricted from increasing the JVM heap size any further, and I want to remove the duplicate words from the file.

Please let me know if you have any ideas about other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)

Additional information: my words are basically alphanumeric IDs which are unique in our system, so they are not plain English words.

Asked Sep 19 '12 by Ketan



2 Answers

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging: just keep the last word written to the output in RAM and compare each candidate against it.
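
A minimal sketch of that idea as an external (on-disk) merge sort, assuming one word per line; the file names, the chunk size, and the class/method names are illustrative, not taken from the question. Pass 1 sorts chunks that fit in memory and writes them to temp files; pass 2 does a k-way merge and skips a word whenever it equals the last word written.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalDedupSort {

    private static final int CHUNK_SIZE = 1_000_000; // words per in-memory chunk; tune to your heap

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get("words.txt"));
        mergeDroppingDuplicates(chunks, Paths.get("words-dedup.txt"));
        // Temp chunk files are left behind in this sketch; delete them afterwards.
    }

    // Pass 1: read the big file in chunks that fit in memory, sort each chunk,
    // and write it to a temporary file.
    static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String word;
            while ((word = in.readLine()) != null) {
                buffer.add(word);
                if (buffer.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunkFiles.add(writeSortedChunk(buffer));
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> words) throws IOException {
        Collections.sort(words);
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, words, StandardCharsets.UTF_8);
        return chunk;
    }

    // Pass 2: k-way merge of the sorted chunks. Only the head word of each chunk
    // is in memory; a word is written only if it differs from the last word written.
    static void mergeDroppingDuplicates(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        for (Path chunk : chunks) {
            ChunkCursor cursor = new ChunkCursor(chunk);
            if (cursor.current != null) heap.add(cursor); else cursor.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String lastWritten = null;
            while (!heap.isEmpty()) {
                ChunkCursor cursor = heap.poll();
                if (!cursor.current.equals(lastWritten)) {   // duplicates are dropped here
                    out.write(cursor.current);
                    out.newLine();
                    lastWritten = cursor.current;
                }
                if (cursor.advance()) heap.add(cursor);
            }
        }
    }

    // Wraps a reader over one sorted chunk and remembers its current word.
    static final class ChunkCursor {
        final BufferedReader reader;
        String current;

        ChunkCursor(Path chunk) throws IOException {
            this.reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            this.current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            if (current == null) reader.close();
            return current != null;
        }
    }
}
```

Memory stays bounded by one chunk during pass 1 and by one word per chunk during pass 2, so the heap limits from the question no longer apply.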

Answered Sep 21 '22 by Tobias Ritzau


Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.

Process each of the letter files separately using a Set to remove duplicates.
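
A minimal sketch of this partition-then-dedupe approach, assuming one word per line; the file names and class name are illustrative. Since the asker's IDs are alphanumeric rather than plain English words, this version partitions on whatever the first character is instead of assuming 26 letters; if a partition is still too large, the same trick can be repeated on the second character (or on a hash of the word).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class PartitionDedup {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("words.txt");
        Path partitionDir = Files.createTempDirectory("partitions");

        // Pass 1: route every word into a partition file keyed by its first character.
        // All duplicates of a word necessarily land in the same partition.
        Map<Character, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char key = word.charAt(0);
                BufferedWriter w = writers.get(key);
                if (w == null) {
                    w = Files.newBufferedWriter(
                            partitionDir.resolve("part-" + (int) key + ".txt"),
                            StandardCharsets.UTF_8);
                    writers.put(key, w);
                }
                w.write(word);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) w.close();

        // Pass 2: each partition is small enough to dedupe in memory with a Set.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("words-dedup.txt"), StandardCharsets.UTF_8);
             DirectoryStream<Path> parts = Files.newDirectoryStream(partitionDir)) {
            for (Path part : parts) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String word;
                    while ((word = in.readLine()) != null) {
                        if (seen.add(word)) {   // add() returns false for duplicates
                            out.write(word);
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}
```

The Set now only ever holds one partition's worth of distinct words, which is what keeps it inside the heap limits described in the question.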

Answered Sep 25 '22 by Gilbert Le Blanc