Removing duplicates in Java on large-scale data

Tags:

java

I have the following issue. I'm connecting to some place using an API and getting the data as an InputStream. The goal is to save the data after removing duplicate lines, where a duplicate is defined by columns 10, 15 and 22.

I'm getting the data using several threads. Currently I first save the data into a CSV file and then remove duplicates. I want to do it while I'm reading the data. The volume of the data is about 10 million records. I have limited memory that I can use; the machine has 32 GB of memory, but I am limited since there are other applications using it.

I read here about using hash maps, but I'm not sure I have enough memory to use one.

Does anyone have a suggestion for how to solve this issue?

asked Nov 21 '16 by mikeP


2 Answers

A HashMap will use up at least as much memory as your raw data. Therefore, it is probably not feasible for the size of your data set (however, you should check that, because if it is feasible, it's the easiest option).

What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.

When you hit a hash match, look up the original value and check whether it is really identical, because different values can produce the same hash.

The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.

If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
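
A rough sketch of the hash-plus-file-offset idea, assuming comma-separated records and 1-based column numbers; the class name, method signatures and layout below are illustrative, not taken from the question:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class HashedDeduplicator {
    // hash of the key columns -> byte offsets of already-written records with that hash
    private final Map<Integer, List<Long>> seen = new HashMap<>();
    private final RandomAccessFile out;

    public HashedDeduplicator(String path) throws IOException {
        this.out = new RandomAccessFile(path, "rw");
    }

    /** Writes the record unless an identical key was already written; returns true if written. */
    public synchronized boolean writeIfNew(String record, String c10, String c15, String c22)
            throws IOException {
        int hash = Objects.hash(c10, c15, c22);
        List<Long> offsets = seen.computeIfAbsent(hash, h -> new ArrayList<>(1));
        for (long offset : offsets) {
            if (sameKey(offset, c10, c15, c22)) {
                return false; // hash matched and the actual columns match: real duplicate
            }
        }
        long pos = out.length();
        out.seek(pos);
        out.write((record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        offsets.add(pos); // only a 4-byte hash and an 8-byte offset stay in memory per record
        return true;
    }

    // On a hash match, re-read the stored line and compare the real key columns,
    // because different keys can produce the same hash.
    private boolean sameKey(long offset, String c10, String c15, String c22) throws IOException {
        out.seek(offset);
        String line = out.readLine(); // fine for ASCII CSV; readLine decodes bytes as Latin-1
        if (line == null) return false;
        String[] cols = line.split(",", -1); // assumes a plain comma-separated layout
        return cols.length >= 22
                && cols[9].equals(c10) && cols[14].equals(c15) && cols[21].equals(c22);
    }
}
```

This keeps only a few bytes per record in memory and pays an extra disk read only when two records share a hash value.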

answered Oct 26 '22 by Markus Fischer

The solution depends on how big your data in columns 10, 15 and 22 is.

Assuming it's not too big (say, around 1 kB per key), you can actually implement an in-memory solution.

  • Implement a Key class to store the values from columns 10, 15 and 22. Carefully implement the equals and hashCode methods. (You could also use a normal ArrayList of the three values as the key instead, since lists already implement equals and hashCode.)
  • Create a Set which will contain the keys of all records you have read.
  • For each record you read, check if its key is already in that set. If yes, skip the record. If not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe manner; see the sketch after this list.
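
A minimal sketch of this in-memory approach, assuming the three columns arrive as strings; the class and method names below are illustrative, not part of the original answer:

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Key over the three deduplication columns, with value-based equals and hashCode.
final class Key {
    private final String col10, col15, col22;

    Key(String col10, String col15, String col22) {
        this.col10 = col10;
        this.col15 = col15;
        this.col22 = col22;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return col10.equals(k.col10) && col15.equals(k.col15) && col22.equals(k.col22);
    }

    @Override
    public int hashCode() {
        return Objects.hash(col10, col15, col22);
    }
}

class DedupFilter {
    // Thread-safe set shared by all reader threads.
    private final Set<Key> seen = ConcurrentHashMap.newKeySet();

    /** Returns true exactly once per distinct key, so only the first thread writes the record. */
    boolean shouldWrite(String c10, String c15, String c22) {
        return seen.add(new Key(c10, c15, c22)); // add() is atomic: false if the key was already present
    }
}
```

Using ConcurrentHashMap.newKeySet() avoids explicit locking around the check-then-add step, because add() already decides atomically whether the key is new.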

In the worst case you'll need about (number of records × size of key) memory. For 10,000,000 records and the assumed <1 kB per key, this comes to around 10 GB.

If the key size is still too large, you'll probably need a database to store the set of keys.

Another option would be storing hashes of keys instead of the full keys. This requires much less memory, but you may get hash collisions. These may lead to "false positives", i.e. records treated as duplicates which aren't actually duplicates. To completely avoid this you'll need a database.
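
One way this hashes-only variant could look (hypothetical names; a 64-bit hash is chosen here to keep collisions unlikely, though not impossible):

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class HashOnlyDedup {
    // Stores only an 8-byte hash per key instead of the full column values.
    private final Set<Long> seenHashes = ConcurrentHashMap.newKeySet();

    /** May very rarely skip a non-duplicate if two different keys hash to the same value. */
    boolean shouldWrite(String c10, String c15, String c22) {
        // Join the columns with a separator that cannot appear in the data by accident.
        return seenHashes.add(hash64(c10 + '\u0000' + c15 + '\u0000' + c22));
    }

    // Simple FNV-1a 64-bit hash; any decent 64-bit hash would do here.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

At 8 bytes per key this is on the order of 80 MB of raw hash data for 10 million records (plus set overhead), but as noted above, a collision silently drops a record, so it is only acceptable if rare false duplicates are tolerable.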

answered Oct 26 '22 by lexicore