How to remove duplicate words using Java when words are more than 200 million?

I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. They have duplication, almost 1 duplicate word every 100 words.

In my second program, I want to read the file. I am successful to read the file by lines using BufferedReader.

Now to remove duplicates, we can use Set (and it's implementations), but Set has problems, as described following in 3 different scenarios:

With default JVM size, Set can contain up to 0.7-0.8 million words, and then OutOfMemoryError.
With 512M JVM size, Set can contain up to 5-6 million words, and then OOM error.
With 1024M JVM size, Set can contain up to 12-13 million words, and then OOM error. Here after 10 million records addition into Set, operations become extremely slow. For example, addition of next ~4000 records, it took 60 seconds.

I have restrictions that I can't increase the JVM size further, and I want to remove duplicate words from the file.

Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)

Addition of info to question: My words are basically alpha-numeric and they are IDs which are unique in our system. Hence they are not plain English words.

How do I remove duplicates from an array in Word?

We can remove duplicate element in an array by 2 ways: using temporary array or using separate index. To remove the duplicate element from array, the array must be in sorted order. If array is not sorted, you can sort it by calling Arrays. sort(arr) method.

How do you remove duplicates in a sentence?

1) Split input sentence separated by space into words. 2) So to get all those strings together first we will join each string in given list of strings. 3) Now create a dictionary using Counter method having strings as keys and their frequencies as values. 4) Join each words are unique to form single string.

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).

Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.

Process each of the letter files separately using a Set to remove duplicates.

How to remove duplicate words using Java when words are more than 200 million?

Tags:

Ketan

People also ask

2 Answers

Tobias Ritzau

Gilbert Le Blanc

Recent Activity

Donate For Us

How to remove duplicate words using Java when words are more than 200 million?

Tags:

Ketan

People also ask

2 Answers

Tobias Ritzau

Gilbert Le Blanc

Related questions

Recent Activity

Donate For Us