I have a file which might be 30 GB or larger. Each line in the file is called a record and consists of two columns, like this:
id1 id2
Both ids are 32-bit integers. My job is to write a program that removes all duplicate records, keeping each record only once, and finally outputs the unique id2 values to a file.
There are some constraints: at most 30 GB of memory may be used, and the job should preferably be done efficiently by a single-threaded, single-process program.
Initially I came up with an idea: because of the memory constraint, I decided to read the file n times, each time keeping in memory only those records with id1 % n == i (i = 0, 1, 2, ..., n-1). The data structure I use is a std::map<int, std::set<int> >: it takes id1 as the key and puts each id2 into id1's std::set. A sketch of this approach is shown below.
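For reference, this is roughly what the approach looks like (a minimal sketch only; the input file name and the choice of n are placeholders):

// Minimal sketch of the n-pass approach described above.
// Assumes each input line is "id1 id2" as decimal integers.
#include <cstdio>
#include <map>
#include <set>

int main() {
    const int n = 10;                       // number of passes over the input
    std::map<int, std::set<int> > buckets;  // id1 -> set of id2 values seen with it

    for (int i = 0; i < n; ++i) {
        std::FILE* in = std::fopen("inputfile", "r");   // placeholder file name
        if (!in) return 1;

        int id1, id2;
        while (std::fscanf(in, "%d %d", &id1, &id2) == 2) {
            // Keep only the slice of records that belongs to this pass.
            if (((id1 % n) + n) % n == i)   // works for negative id1 too
                buckets[id1].insert(id2);
        }
        std::fclose(in);

        // ... write the id2 values of this slice to the output file here ...

        buckets.clear();                    // free memory before the next pass
    }
    return 0;
}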
This way the memory constraint is not violated, but it is quite slow. I think this is because insertion slows down as the std::map and std::set grow larger. Moreover, I need to read the file n times, and when each round is done I have to clear the std::map for the next round, which also costs some time.
I also tried hashing, but that does not satisfy me either; I suspect there would be too many collisions even with 3 million buckets.
So I am posting my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS: Scripts (shell, Python) are also welcome, if they can do the job efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might temporarily use a lot of space in /tmp. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as a temporary directory.
(Edit:) A few remarks about efficiency:

- Reading and writing a file of this size is largely bound by disk I/O, which sort is highly optimised for.
- Sorting more data than fits in memory requires an external sort (merging sorted runs on disk). Again, optimising this means a lot of work, while for sort all of that work has been done.
- One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them; a sketch follows below. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get I/O and integer conversions to be really fast. I suspect it may not be worth the effort, as sort is easy to use and – I think – fast enough.