I have a file which might be 30 GB or larger. Each line in the file is called a record and consists of two columns, like this:
id1 id2
Both ids are 32-bit integers. My job is to write a program that removes all duplicate records, keeping each record only once, and finally outputs the unique id2 values to a file.
There are some constraints: at most 30 GB of memory may be used, and the job should preferably be done efficiently by a single-threaded, single-process program.
Initially I came up with an idea: because of the memory constraint, I decided to read the file n times, each time keeping in memory only those records with id1 % n == i (i = 0, 1, 2, ..., n-1). The data structure I use is a std::map<int, std::set<int> >: it takes id1 as the key and puts each id2 into id1's std::set. A sketch of this approach is shown below.
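For reference, this is roughly what the approach looks like (a minimal sketch only; the input file name and the choice of n are placeholders):

// Minimal sketch of the n-pass approach described above.
// Assumes each input line is "id1 id2" as decimal integers.
#include <cstdio>
#include <map>
#include <set>

int main() {
    const int n = 10;                       // number of passes over the input
    std::map<int, std::set<int> > buckets;  // id1 -> set of id2 values seen with it

    for (int i = 0; i < n; ++i) {
        std::FILE* in = std::fopen("inputfile", "r");   // placeholder file name
        if (!in) return 1;

        int id1, id2;
        while (std::fscanf(in, "%d %d", &id1, &id2) == 2) {
            // Keep only the slice of records that belongs to this pass.
            if (((id1 % n) + n) % n == i)   // works for negative id1 too
                buckets[id1].insert(id2);
        }
        std::fclose(in);

        // ... write the id2 values of this slice to the output file here ...

        buckets.clear();                    // free memory before the next pass
    }
    return 0;
}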
This way the memory constraint is not violated, but it is quite slow. I think this is because insertion slows down as the std::map and std::set grow larger. Moreover, I need to read the file n times, and when each round is done I have to clear the std::map for the next round, which also costs some time.
I also tried hashing, but that does not satisfy me either; I suspect there would be too many collisions even with 3 million buckets.
So I am posting my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS: Scripts (shell, Python) are also welcome, if they can do the job efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might temporarily use a lot of space in /tmp. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as a temporary directory.
(Edit:) A few remarks about efficiency:

- Reading and writing a file of this size is largely bound by disk I/O, which sort is highly optimised for.
- Sorting more data than fits in memory requires an external sort (merging sorted runs on disk). Again, optimising this means a lot of work, while for sort all of that work has been done.
- One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them; a sketch follows below. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get I/O and integer conversions to be really fast. I suspect it may not be worth the effort, as sort is easy to use and – I think – fast enough.