Find duplicates in a large file

I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it a key).

I need to find the duplicate entries in the file using Java. I tried using a HashMap to detect the duplicates, but that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.

How can I solve this problem?

I could increase the heap space and try again, but I wanted to know whether there are more efficient solutions that don't require tweaking the heap space.

asked Feb 09 '12 by Maximus


3 Answers

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition the file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).

Merge the chunks, again eliminating duplicates as you merge. Since this is an n-way merge, you only need to keep the next k elements from each chunk in memory; once a chunk's buffered items are depleted (they have already been merged), read more from disk.
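A minimal Java sketch of this idea follows; the class name, chunk size, and the way duplicates are reported are illustrative assumptions, not the poster's code. It sorts chunks in memory, spills them to temporary files, then does an n-way merge with a priority queue. Because the merged stream is globally sorted, duplicates show up as consecutive equal lines, and the sketch simply reports them instead of writing a deduplicated output file:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch: sort chunks in memory, spill them to temp files,
// then n-way merge and report lines that appear more than once.
public class ExternalDuplicateFinder {

    static final int CHUNK_SIZE = 1_000_000; // lines per chunk; tune to your heap

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitAndSortChunks(Paths.get(args[0]));
        mergeAndReportDuplicates(chunks);
    }

    // Phase 1: read the input in chunks that fit in memory, sort each chunk,
    // and write it to a temporary file.
    static List<Path> splitAndSortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: n-way merge of the sorted chunks via a priority queue. The merged
    // stream is globally sorted, so duplicates appear as consecutive equal lines.
    static void mergeAndReportDuplicates(List<Path> chunkFiles) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path p : chunkFiles) {
                BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new ChunkCursor(first, r));
                }
            }
            String previous = null;
            while (!heap.isEmpty()) {
                ChunkCursor cursor = heap.poll();
                if (cursor.current.equals(previous)) {
                    System.out.println("duplicate: " + cursor.current);
                }
                previous = cursor.current;
                String next = cursor.reader.readLine();
                if (next != null) {
                    heap.add(new ChunkCursor(next, cursor.reader));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }

    static final class ChunkCursor {
        final String current;
        final BufferedReader reader;

        ChunkCursor(String current, BufferedReader reader) {
            this.current = current;
            this.reader = reader;
        }
    }
}

Only CHUNK_SIZE lines plus one buffered line per chunk are in memory at any time, so the heap requirement is bounded regardless of the input size. (The temporary chunk files are left behind in this sketch; a real implementation would delete them.)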

answered Oct 20 '22 by BrokenGlass


I'm not sure if you'd consider doing this outside of Java, but if so, it's very simple in a shell. This prints only the lines that occur more than once:

sort file | uniq -d
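As a side note, and assuming GNU coreutils: sort itself performs an external merge sort with temporary files, so this pipeline works even when the file is far larger than RAM. You can cap its memory use and choose where the temporary files go, for example:

sort -S 512M -T /tmp file | uniq -d > duplicates.txt

Add -c to uniq if you also want a count of how many times each duplicate occurs.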
answered Oct 21 '22 by Michael


You probably can't load the entire file at once, but you can store each line's hash code together with its line number in a HashMap (mapping hash code to the line numbers where it occurs) without a problem.

Pseudo code...

for each line in file
    entries.get(line.hashCode).add(lineNumber)    // entries maps hash code -> list of line numbers
for each (hashCode, lineNumbers) in entries
    if lineNumbers.size > 1
        fetch those lines by line number and compare them
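A minimal runnable sketch of this approach (the class name and output format are my own illustration; it assumes the colliding candidate lines, not the whole file, fit comfortably in memory):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch: pass 1 records which line numbers share a hash code;
// pass 2 re-reads only those candidate lines and compares them exactly,
// which rules out false positives from hash collisions.
public class HashCandidateDuplicateFinder {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: hash code -> line numbers on which it occurred.
        Map<Integer, List<Integer>> byHash = new HashMap<>();
        int lineNo = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                byHash.computeIfAbsent(line.hashCode(), h -> new ArrayList<>()).add(lineNo);
            }
        }

        // Keep only the line numbers whose hash code occurred more than once.
        Set<Integer> candidates = new HashSet<>();
        for (List<Integer> lines : byHash.values()) {
            if (lines.size() > 1) {
                candidates.addAll(lines);
            }
        }
        byHash = null; // no longer needed

        // Pass 2: re-read the file, keep only the candidate lines, and group them
        // by their actual string value.
        Map<String, List<Integer>> byValue = new HashMap<>();
        lineNo = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (candidates.contains(lineNo)) {
                    byValue.computeIfAbsent(line, k -> new ArrayList<>()).add(lineNo);
                }
            }
        }
        for (Map.Entry<String, List<Integer>> e : byValue.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println("duplicate \"" + e.getKey() + "\" on lines " + e.getValue());
            }
        }
    }
}

Note that the first pass still keeps roughly one boxed map entry per line in memory, which is usually far smaller than the strings themselves; if even that is too much for 15 million entries, the external-sort answer above is the safer route.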
answered Oct 21 '22 by Andrew White