Removing duplicates in Java on large-scale data

Tags:

java

I have the following issue. I'm connecting to some place using an API and getting the data as an InputStream. The goal is to save the data after removing duplicate lines, where a duplicate is defined by columns 10, 15 and 22.

I'm getting the data using several threads. Currently I first save the data into a CSV file and then remove duplicates. I want to do it while I'm reading the data. The volume of the data is about 10 million records. I have limited memory that I can use; the machine has 32 GB of memory, but I am limited since there are other applications using it.

I read here about using hash maps, but I'm not sure I have enough memory to use one.

Does anyone have a suggestion for how to solve this issue?

asked Nov 21 '16 by mikeP


2 Answers

A HashMap will use up at least as much memory as your raw data. Therefore, it is probably not feasible for the size of your data set (however, you should check that, because if it is feasible, it's the easiest option).

What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.

When you hit a hash match, look up the original value and check whether it is really identical, because different values can produce the same hash.

The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.

If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
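
A rough sketch of the hash-plus-file-offset idea, assuming comma-separated records and 1-based column numbers; the class name, method signatures and layout below are illustrative, not taken from the question:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class HashedDeduplicator {
    // hash of the key columns -> byte offsets of already-written records with that hash
    private final Map<Integer, List<Long>> seen = new HashMap<>();
    private final RandomAccessFile out;

    public HashedDeduplicator(String path) throws IOException {
        this.out = new RandomAccessFile(path, "rw");
    }

    /** Writes the record unless an identical key was already written; returns true if written. */
    public synchronized boolean writeIfNew(String record, String c10, String c15, String c22)
            throws IOException {
        int hash = Objects.hash(c10, c15, c22);
        List<Long> offsets = seen.computeIfAbsent(hash, h -> new ArrayList<>(1));
        for (long offset : offsets) {
            if (sameKey(offset, c10, c15, c22)) {
                return false; // hash matched and the actual columns match: real duplicate
            }
        }
        long pos = out.length();
        out.seek(pos);
        out.write((record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        offsets.add(pos); // only a 4-byte hash and an 8-byte offset stay in memory per record
        return true;
    }

    // On a hash match, re-read the stored line and compare the real key columns,
    // because different keys can produce the same hash.
    private boolean sameKey(long offset, String c10, String c15, String c22) throws IOException {
        out.seek(offset);
        String line = out.readLine(); // fine for ASCII CSV; readLine decodes bytes as Latin-1
        if (line == null) return false;
        String[] cols = line.split(",", -1); // assumes a plain comma-separated layout
        return cols.length >= 22
                && cols[9].equals(c10) && cols[14].equals(c15) && cols[21].equals(c22);
    }
}
```

This keeps only a few bytes per record in memory and pays an extra disk read only when two records share a hash value.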

answered Oct 26 '22 by Markus Fischer

The solution depends on how big your data in columns 10, 15 and 22 is.

Assuming it's not too big (say, around 1 kB per key), you can actually implement an in-memory solution.

  • Implement a Key class to store the values from columns 10, 15 and 22. Carefully implement the equals and hashCode methods. (You could also use a normal ArrayList of the three values as the key instead, since lists already implement equals and hashCode.)
  • Create a Set which will contain the keys of all records you have read.
  • For each record you read, check if its key is already in that set. If yes, skip the record. If not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe manner; see the sketch after this list.
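
A minimal sketch of this in-memory approach, assuming the three columns arrive as strings; the class and method names below are illustrative, not part of the original answer:

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Key over the three deduplication columns, with value-based equals and hashCode.
final class Key {
    private final String col10, col15, col22;

    Key(String col10, String col15, String col22) {
        this.col10 = col10;
        this.col15 = col15;
        this.col22 = col22;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return col10.equals(k.col10) && col15.equals(k.col15) && col22.equals(k.col22);
    }

    @Override
    public int hashCode() {
        return Objects.hash(col10, col15, col22);
    }
}

class DedupFilter {
    // Thread-safe set shared by all reader threads.
    private final Set<Key> seen = ConcurrentHashMap.newKeySet();

    /** Returns true exactly once per distinct key, so only the first thread writes the record. */
    boolean shouldWrite(String c10, String c15, String c22) {
        return seen.add(new Key(c10, c15, c22)); // add() is atomic: false if the key was already present
    }
}
```

Using ConcurrentHashMap.newKeySet() avoids explicit locking around the check-then-add step, because add() already decides atomically whether the key is new.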

In the worst case you'll need about (number of records × size of key) memory. For 10,000,000 records and the assumed <1 kB per key, this comes to around 10 GB.

If the key size is still too large, you'll probably need a database to store the set of keys.

Another option would be storing hashes of keys instead of the full keys. This requires much less memory, but you may get hash collisions. These may lead to "false positives", i.e. records treated as duplicates which aren't actually duplicates. To completely avoid this you'll need a database.
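
One way this hashes-only variant could look (hypothetical names; a 64-bit hash is chosen here to keep collisions unlikely, though not impossible):

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class HashOnlyDedup {
    // Stores only an 8-byte hash per key instead of the full column values.
    private final Set<Long> seenHashes = ConcurrentHashMap.newKeySet();

    /** May very rarely skip a non-duplicate if two different keys hash to the same value. */
    boolean shouldWrite(String c10, String c15, String c22) {
        // Join the columns with a separator that cannot appear in the data by accident.
        return seenHashes.add(hash64(c10 + '\u0000' + c15 + '\u0000' + c22));
    }

    // Simple FNV-1a 64-bit hash; any decent 64-bit hash would do here.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

At 8 bytes per key this is on the order of 80 MB of raw hash data for 10 million records (plus set overhead), but as noted above, a collision silently drops a record, so it is only acceptable if rare false duplicates are tolerable.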

answered Oct 26 '22 by lexicore