Imagine that we have some file, called, for example, "A.txt". We know that it contains some duplicate lines. "A.txt" is very big, more than ten times bigger than available memory, maybe around 50 GB. The deduplicated output (call it "B.txt") will sometimes be approximately the same size as A, and sometimes many times smaller. Let A have a structure like this:
a 1
b 2
c 445
a 1
We need to produce a file "B.txt" that has no such duplicates. For the example above, it should be:
a 1
b 2
c 445
I thought of an algorithm that copies A to B, then takes the first line of B and compares it against every other line, deleting any duplicates it finds; then takes the second line, and so on.
But I think that is far too slow. What can I use instead?
A is not a database! No SQL, please.
Sorry, I didn't say: sorting is OK.
Although in this case it can be sorted, what if it could not be sorted?
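If sorting is off the table (say, the original line order must be preserved), one alternative is hash partitioning: scatter the lines into bucket files by hash so that each bucket fits in memory, deduplicate each bucket on its own, and then filter the original file by the surviving line numbers. A rough sketch, where the function name and bucket count are my own illustrative choices, not anything standard:

```python
import hashlib
import os
import tempfile

def dedup_preserving_order(src, dst, num_buckets=16):
    """Remove duplicate lines from src, keeping first occurrences in
    their original order, without sorting the whole file.

    num_buckets is a tunable assumption: pick it so that one bucket's
    distinct lines fit comfortably in memory.
    """
    tmpdir = tempfile.mkdtemp()
    buckets = [open(os.path.join(tmpdir, "b%d" % i), "w+")
               for i in range(num_buckets)]
    try:
        # Pass 1: scatter (line_number, line) records into buckets,
        # choosing the bucket by a hash of the line's content.
        with open(src) as f:
            for lineno, line in enumerate(f):
                if not line.endswith("\n"):
                    line += "\n"  # normalize a final line without newline
                h = int(hashlib.md5(line.encode()).hexdigest(), 16)
                buckets[h % num_buckets].write("%d\t%s" % (lineno, line))
        # Dedup each bucket independently: keep the smallest line
        # number seen for each distinct line.
        keep = set()
        for b in buckets:
            b.seek(0)
            first_seen = {}
            for rec in b:
                lineno, line = rec.split("\t", 1)
                first_seen.setdefault(line, int(lineno))
            keep.update(first_seen.values())
        # Pass 2: stream the original file again, emitting only the
        # line numbers that survived deduplication.
        with open(src) as f, open(dst, "w") as out:
            for lineno, line in enumerate(f):
                if lineno in keep:
                    out.write(line)
    finally:
        for b in buckets:
            b.close()
```

Only one bucket's distinct lines plus the set of kept line numbers ever have to fit in memory, so this scales to files much larger than RAM at the cost of reading the input twice.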
One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
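The filtering step, assuming the file has already been sorted, only needs to remember the previous line, so it runs in constant memory. A minimal sketch (the function name is my own):

```python
def filter_consecutive_duplicates(sorted_src, dst):
    # Stream a sorted file line by line, writing a line only when it
    # differs from the previous one. Because the file is sorted, all
    # duplicates are adjacent, so this removes every duplicate while
    # using O(1) memory.
    prev = None
    with open(sorted_src) as f, open(dst, "w") as out:
        for line in f:
            if line != prev:
                out.write(line)
            prev = line
```

On a Unix system, `sort -u A.txt > B.txt` performs both the sort and this filtering in one command.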
Then the question becomes: how do you sort a file that is too big to fit in memory?
Here's how Unix sort does it.
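In short, it is an external merge sort: sort memory-sized chunks, spill each to a temporary run file, then merge the sorted runs. A sketch under the assumption that a memory budget can be approximated by a line count (the names and the default budget are illustrative):

```python
import heapq
import itertools
import os
import tempfile

def external_sort(src, dst, max_lines_in_memory=100_000):
    """External merge sort sketch: sort chunks that fit in memory,
    write each sorted chunk to a temporary run file, then perform a
    k-way merge of the runs into the destination file."""
    runs = []
    with open(src) as f:
        while True:
            # Read at most max_lines_in_memory lines as one chunk.
            chunk = list(itertools.islice(f, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.close()
            runs.append(run.name)
    files = [open(r) for r in runs]
    try:
        with open(dst, "w") as out:
            # heapq.merge lazily merges already-sorted iterators, so
            # only one line per run is held in memory at a time.
            out.writelines(heapq.merge(*files))
    finally:
        for fh in files:
            fh.close()
        for r in runs:
            os.remove(r)
```

A real implementation would size chunks by bytes rather than line count and cap the number of runs merged at once, but the structure is the same.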
See also this question.