Find duplicates in a large file

I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it a key).

I need to find the duplicate entries in the file using Java. I tried using a HashMap to detect the duplicates, but that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.

How can I solve this problem?

I could increase the heap space and try again, but I wanted to know whether there are more efficient solutions that don't require tweaking the heap space.

asked Feb 09 '12 by Maximus


3 Answers

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition the file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).

Merge the chunks, again eliminating duplicates as you merge. Since this is an n-way merge, you only need to keep the next k elements from each chunk in memory; once a chunk's buffered items are depleted (they have already been merged), read more from disk.
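A minimal Java sketch of this idea follows; the class name, chunk size, and the way duplicates are reported are illustrative assumptions, not the poster's code. It sorts chunks in memory, spills them to temporary files, then does an n-way merge with a priority queue. Because the merged stream is globally sorted, duplicates show up as consecutive equal lines, and the sketch simply reports them instead of writing a deduplicated output file:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch: sort chunks in memory, spill them to temp files,
// then n-way merge and report lines that appear more than once.
public class ExternalDuplicateFinder {

    static final int CHUNK_SIZE = 1_000_000; // lines per chunk; tune to your heap

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitAndSortChunks(Paths.get(args[0]));
        mergeAndReportDuplicates(chunks);
    }

    // Phase 1: read the input in chunks that fit in memory, sort each chunk,
    // and write it to a temporary file.
    static List<Path> splitAndSortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: n-way merge of the sorted chunks via a priority queue. The merged
    // stream is globally sorted, so duplicates appear as consecutive equal lines.
    static void mergeAndReportDuplicates(List<Path> chunkFiles) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path p : chunkFiles) {
                BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new ChunkCursor(first, r));
                }
            }
            String previous = null;
            while (!heap.isEmpty()) {
                ChunkCursor cursor = heap.poll();
                if (cursor.current.equals(previous)) {
                    System.out.println("duplicate: " + cursor.current);
                }
                previous = cursor.current;
                String next = cursor.reader.readLine();
                if (next != null) {
                    heap.add(new ChunkCursor(next, cursor.reader));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }

    static final class ChunkCursor {
        final String current;
        final BufferedReader reader;

        ChunkCursor(String current, BufferedReader reader) {
            this.current = current;
            this.reader = reader;
        }
    }
}

Only CHUNK_SIZE lines plus one buffered line per chunk are in memory at any time, so the heap requirement is bounded regardless of the input size. (The temporary chunk files are left behind in this sketch; a real implementation would delete them.)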

answered Oct 20 '22 by BrokenGlass


I'm not sure if you'd consider doing this outside of Java, but if so, it's very simple in a shell. This prints only the lines that occur more than once:

sort file | uniq -d
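As a side note, and assuming GNU coreutils: sort itself performs an external merge sort with temporary files, so this pipeline works even when the file is far larger than RAM. You can cap its memory use and choose where the temporary files go, for example:

sort -S 512M -T /tmp file | uniq -d > duplicates.txt

Add -c to uniq if you also want a count of how many times each duplicate occurs.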
answered Oct 21 '22 by Michael


You probably can't load the entire file at once, but you can store each line's hash code together with its line number in a HashMap (mapping hash code to the line numbers where it occurs) without a problem.

Pseudo code...

for each line in file
    entries.get(line.hashCode).add(lineNumber)    // entries maps hash code -> list of line numbers
for each (hashCode, lineNumbers) in entries
    if lineNumbers.size > 1
        fetch those lines by line number and compare them
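A minimal runnable sketch of this approach (the class name and output format are my own illustration; it assumes the colliding candidate lines, not the whole file, fit comfortably in memory):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch: pass 1 records which line numbers share a hash code;
// pass 2 re-reads only those candidate lines and compares them exactly,
// which rules out false positives from hash collisions.
public class HashCandidateDuplicateFinder {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: hash code -> line numbers on which it occurred.
        Map<Integer, List<Integer>> byHash = new HashMap<>();
        int lineNo = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                byHash.computeIfAbsent(line.hashCode(), h -> new ArrayList<>()).add(lineNo);
            }
        }

        // Keep only the line numbers whose hash code occurred more than once.
        Set<Integer> candidates = new HashSet<>();
        for (List<Integer> lines : byHash.values()) {
            if (lines.size() > 1) {
                candidates.addAll(lines);
            }
        }
        byHash = null; // no longer needed

        // Pass 2: re-read the file, keep only the candidate lines, and group them
        // by their actual string value.
        Map<String, List<Integer>> byValue = new HashMap<>();
        lineNo = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (candidates.contains(lineNo)) {
                    byValue.computeIfAbsent(line, k -> new ArrayList<>()).add(lineNo);
                }
            }
        }
        for (Map.Entry<String, List<Integer>> e : byValue.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println("duplicate \"" + e.getKey() + "\" on lines " + e.getValue());
            }
        }
    }
}

Note that the first pass still keeps roughly one boxed map entry per line in memory, which is usually far smaller than the strings themselves; if even that is too much for 15 million entries, the external-sort answer above is the safer route.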
answered Oct 21 '22 by Andrew White