 

What algorithm to use to delete duplicates?

Imagine that we have some file called, for example, "A.txt". We know that it contains some duplicate lines. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of B will be approximately equal to the size of A, and sometimes it will be many times smaller than A. Let it have a structure like this:

a 1
b 2
c 445
a 1

We need to produce a file "B.txt" that does not contain such duplicates. For the example above, it should be:

a 1
b 2
c 445

I thought about an algorithm that copies A to B, then takes the first line of B and compares it against every other line, deleting any duplicates it finds; then it takes the second line, and so on.

But I think it is way too slow. What can I use?

A is not a database! No SQL, please.

Sorry, I forgot to say: sorting is OK.

Although it can be sorted, what about the case where it cannot be sorted?

Monory asked Dec 17 '22

1 Answer

One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
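A minimal sketch of that second step, assuming the sorted input is in a file named "A_sorted.txt" (the file names are illustrative): stream the sorted file line by line and keep only the first line of each run of identical lines, so memory use stays constant regardless of file size.

```python
def dedupe_sorted(src_path: str, dst_path: str) -> None:
    """Copy src to dst, dropping consecutive duplicate lines (src must be sorted)."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        prev = None
        for line in src:
            if line != prev:   # keep the line only if it differs from the previous one
                dst.write(line)
                prev = line

dedupe_sorted("A_sorted.txt", "B.txt")
```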

Then the question becomes: how do you sort a file that is too big to fit in memory?

Here's how Unix sort does it.
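Roughly, the idea is an external merge sort: split the big file into chunks that fit in memory, sort each chunk, write it to a temporary file, then merge all the sorted chunks. The sketch below is only an illustration of that idea (not the actual sort(1) implementation); the chunk size and file names are assumptions.

```python
import heapq
import os
import tempfile

def external_sort(src_path: str, dst_path: str, lines_per_chunk: int = 1_000_000) -> None:
    """Sort a file too large for memory by sorting chunks and merging them."""
    chunk_paths = []
    with open(src_path, "r") as src:
        while True:
            # read up to lines_per_chunk lines into memory
            chunk = [line for _, line in zip(range(lines_per_chunk), src)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort of one chunk
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # k-way merge of the sorted chunk files; heapq.merge reads them lazily
    files = [open(p, "r") for p in chunk_paths]
    try:
        with open(dst_path, "w") as dst:
            for line in heapq.merge(*files):
                dst.write(line)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)

external_sort("A.txt", "A_sorted.txt")
```

With A.txt sorted this way, the dedupe step above can then produce B.txt while only ever holding one chunk (plus one line per chunk during the merge) in memory.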

See also this question.

mbeckish answered Dec 28 '22