I have the following issue. I'm connecting to a service through an API and getting the data as an InputStream. The goal is to save the data after removing duplicate lines, where duplication is defined by columns 10, 15 and 22.
I'm reading the data using several threads. Currently I first save the data into a CSV file and then remove duplicates, but I'd like to do the deduplication while I'm reading the data. The volume of data is about 10 million records. The memory I can use is limited: the machine has 32 GB, but other applications are using part of it.
I read here about using hash maps, but I'm not sure I have enough memory for one.
Does anyone have a suggestion for how to solve this?
A HashMap will use at least as much memory as your raw data, so it is probably not feasible for a data set of this size (you should verify that, though, because if it does fit, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is really identical, since different values can hash to the same value.
The question now is how many duplicates you expect. If you expect few matches, I would choose a cheap-write, expensive-read solution: dump everything linearly into a flat file and read back from that file whenever a hash matches.
If you expect many matches, it's probably the other way round: use an indexed file or set of files, or even a database (but make sure it's one where write operations are not too expensive).
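For the flat-file variant, a minimal sketch of the idea could look like the following. It assumes a single consumer thread and comma-separated lines, and that the deduplication key is the concatenation of columns 10, 15 and 22; the class and helper names are made up for illustration, and a primitive-keyed map (e.g. from a collections library) would keep the in-memory index leaner than the boxed HashMap used here.

```java
import java.io.*;
import java.util.*;

public class OffsetDedup {
    // Maps a hash of the key (columns 10, 15, 22) to the byte offsets of the
    // records that produced that hash; almost every list has a single entry.
    private final Map<Integer, List<Long>> seen = new HashMap<>();

    // Extracts the deduplication key from a CSV line (columns are 1-based in
    // the question, hence indices 9, 14 and 21 here).
    private static String key(String line) {
        String[] cols = line.split(",", -1);
        return cols[9] + "\u0001" + cols[14] + "\u0001" + cols[21];
    }

    public void dedup(BufferedReader in, File out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(out, "rw")) {
            String line;
            while ((line = in.readLine()) != null) {
                String key = key(line);
                int hash = key.hashCode();            // a 64-bit hash would give even fewer collisions
                List<Long> offsets = seen.computeIfAbsent(hash, h -> new ArrayList<>(1));
                boolean duplicate = false;
                for (long offset : offsets) {         // re-read each candidate and compare the real keys
                    long pos = raf.getFilePointer();
                    raf.seek(offset);
                    String stored = raf.readLine();   // byte-per-char round trip; fine for ASCII CSV
                    raf.seek(pos);
                    if (stored != null && key(stored).equals(key)) {
                        duplicate = true;
                        break;
                    }
                }
                if (!duplicate) {
                    offsets.add(raf.getFilePointer());
                    raf.writeBytes(line + "\n");
                }
            }
        }
    }
}
```

The map only stores hashes and offsets, so its footprint stays small; the occasional seek back into the file pays for resolving hash matches, which is exactly the cheap-write, expensive-read trade-off described above.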
The solution depends on how big your data in columns 10, 15 and 22 is.
Assuming it's not too big (say, around 1 KB per key), you can actually implement an in-memory solution:
- Implement a Key class that stores the values from columns 10, 15 and 22, and carefully implement its equals and hashCode methods. (You may also use a normal ArrayList instead.)
- Keep a Set containing the keys of all records you have read so far; if a record's key is already in the set, it is a duplicate and can be skipped (see the sketch below).

In the worst case you'll need number of records * size of key amount of memory. For 10,000,000 records and the assumed <1 KB per key this should work with around 10 GB.
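A minimal sketch of that approach, assuming comma-separated lines (the class and method names are only illustrative; with several reader threads you could replace the HashSet with ConcurrentHashMap.newKeySet()):

```java
import java.util.*;

public class KeyDedup {
    // Holds the values of columns 10, 15 and 22 for one record.
    static final class Key {
        final String c10, c15, c22;

        Key(String c10, String c15, String c22) {
            this.c10 = c10;
            this.c15 = c15;
            this.c22 = c22;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return c10.equals(k.c10) && c15.equals(k.c15) && c22.equals(k.c22);
        }

        @Override
        public int hashCode() {
            return Objects.hash(c10, c15, c22);
        }
    }

    // Keys of all records seen so far. For several reader threads, use
    // ConcurrentHashMap.newKeySet() instead of a plain HashSet.
    private final Set<Key> seen = new HashSet<>();

    // Returns true if the line should be written, false if it is a duplicate.
    public boolean accept(String line) {
        String[] cols = line.split(",", -1);
        // columns are 1-based in the question, hence indices 9, 14 and 21 here
        Key key = new Key(cols[9], cols[14], cols[21]);
        return seen.add(key); // add() returns false if the key was already present
    }
}
```

The add() call doubles as the membership check, so each record costs a single hash lookup.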
If the key size is still too large, you'll probably need a database to store the set of keys.
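A sketch of the database option, assuming an embedded SQLite database with the xerial sqlite-jdbc driver on the classpath (INSERT OR IGNORE is SQLite syntax; any database with a unique constraint on the three columns works the same way):

```java
import java.sql.*;

public class DbKeyDedup implements AutoCloseable {
    private final Connection conn;
    private final PreparedStatement insert;

    public DbKeyDedup(String path) throws SQLException {
        conn = DriverManager.getConnection("jdbc:sqlite:" + path);
        try (Statement st = conn.createStatement()) {
            // The primary key on the three columns is what enforces uniqueness.
            st.execute("CREATE TABLE IF NOT EXISTS seen_keys (" +
                       "c10 TEXT, c15 TEXT, c22 TEXT, " +
                       "PRIMARY KEY (c10, c15, c22))");
        }
        // INSERT OR IGNORE affects 0 rows when the key already exists.
        insert = conn.prepareStatement(
                "INSERT OR IGNORE INTO seen_keys (c10, c15, c22) VALUES (?, ?, ?)");
    }

    // Returns true if the key was new (keep the record), false if it was a duplicate.
    public boolean accept(String c10, String c15, String c22) throws SQLException {
        insert.setString(1, c10);
        insert.setString(2, c15);
        insert.setString(3, c22);
        return insert.executeUpdate() == 1;
    }

    @Override
    public void close() throws SQLException {
        insert.close();
        conn.close();
    }
}
```

For 10 million records you would batch the inserts inside transactions, since committing per row would make the writes very expensive.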
Another option is to store hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions, which lead to "false positives", i.e. records treated as duplicates that aren't actually duplicates. To completely avoid this you'd need a database.
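A hash-only sketch, accepting the small risk of false positives mentioned above (FNV-1a is just an illustrative choice of 64-bit hash; a primitive-long set from a collections library would avoid the boxing overhead of HashSet<Long>):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class HashOnlyDedup {
    // 64-bit hashes of the keys seen so far.
    private final Set<Long> seen = new HashSet<>();

    // FNV-1a, a simple self-contained 64-bit hash.
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Returns true if the line should be kept; false if its key hash was seen
    // before (which, very rarely, may be a collision rather than a true duplicate).
    public boolean accept(String line) {
        String[] cols = line.split(",", -1);
        String key = cols[9] + "\u0001" + cols[14] + "\u0001" + cols[21];
        return seen.add(fnv1a64(key));
    }
}
```

At 8 bytes per hash plus set overhead, 10 million entries need on the order of a few hundred megabytes, far less than storing the full keys.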