Imagine that we have some file, called, for example, "A.txt". We know that it contains some duplicate lines. "A.txt" is very big, more than ten times bigger than available memory, maybe around 50 GB. The deduplicated output (call it "B.txt") will sometimes be approximately the same size as A, and sometimes many times smaller. Let A have a structure like this:
a 1
b 2
c 445
a 1
We need to produce a file "B.txt" that has no such duplicates. For the example above, it should be:
a 1
b 2
c 445
I thought of an algorithm that copies A to B, then takes the first line of B and compares it against every other line, deleting any duplicates it finds; then takes the second line, and so on.
But I think that is far too slow. What can I use instead?
A is not a database! No SQL, please.
Sorry, I didn't say: sorting is OK.
Although in this case it can be sorted, what if it could not be sorted?
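If sorting is off the table (say, the original line order must be preserved), one alternative is hash partitioning: scatter the lines into bucket files by hash so that each bucket fits in memory, deduplicate each bucket on its own, and then filter the original file by the surviving line numbers. A rough sketch, where the function name and bucket count are my own illustrative choices, not anything standard:

```python
import hashlib
import os
import tempfile

def dedup_preserving_order(src, dst, num_buckets=16):
    """Remove duplicate lines from src, keeping first occurrences in
    their original order, without sorting the whole file.

    num_buckets is a tunable assumption: pick it so that one bucket's
    distinct lines fit comfortably in memory.
    """
    tmpdir = tempfile.mkdtemp()
    buckets = [open(os.path.join(tmpdir, "b%d" % i), "w+")
               for i in range(num_buckets)]
    try:
        # Pass 1: scatter (line_number, line) records into buckets,
        # choosing the bucket by a hash of the line's content.
        with open(src) as f:
            for lineno, line in enumerate(f):
                if not line.endswith("\n"):
                    line += "\n"  # normalize a final line without newline
                h = int(hashlib.md5(line.encode()).hexdigest(), 16)
                buckets[h % num_buckets].write("%d\t%s" % (lineno, line))
        # Dedup each bucket independently: keep the smallest line
        # number seen for each distinct line.
        keep = set()
        for b in buckets:
            b.seek(0)
            first_seen = {}
            for rec in b:
                lineno, line = rec.split("\t", 1)
                first_seen.setdefault(line, int(lineno))
            keep.update(first_seen.values())
        # Pass 2: stream the original file again, emitting only the
        # line numbers that survived deduplication.
        with open(src) as f, open(dst, "w") as out:
            for lineno, line in enumerate(f):
                if lineno in keep:
                    out.write(line)
    finally:
        for b in buckets:
            b.close()
```

Only one bucket's distinct lines plus the set of kept line numbers ever have to fit in memory, so this scales to files much larger than RAM at the cost of reading the input twice.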
One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
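The filtering step, assuming the file has already been sorted, only needs to remember the previous line, so it runs in constant memory. A minimal sketch (the function name is my own):

```python
def filter_consecutive_duplicates(sorted_src, dst):
    # Stream a sorted file line by line, writing a line only when it
    # differs from the previous one. Because the file is sorted, all
    # duplicates are adjacent, so this removes every duplicate while
    # using O(1) memory.
    prev = None
    with open(sorted_src) as f, open(dst, "w") as out:
        for line in f:
            if line != prev:
                out.write(line)
            prev = line
```

On a Unix system, `sort -u A.txt > B.txt` performs both the sort and this filtering in one command.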
Then the question becomes: how do you sort a file that is too big to fit in memory?
Here's how Unix sort does it.
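In short, it is an external merge sort: sort memory-sized chunks, spill each to a temporary run file, then merge the sorted runs. A sketch under the assumption that a memory budget can be approximated by a line count (the names and the default budget are illustrative):

```python
import heapq
import itertools
import os
import tempfile

def external_sort(src, dst, max_lines_in_memory=100_000):
    """External merge sort sketch: sort chunks that fit in memory,
    write each sorted chunk to a temporary run file, then perform a
    k-way merge of the runs into the destination file."""
    runs = []
    with open(src) as f:
        while True:
            # Read at most max_lines_in_memory lines as one chunk.
            chunk = list(itertools.islice(f, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.close()
            runs.append(run.name)
    files = [open(r) for r in runs]
    try:
        with open(dst, "w") as out:
            # heapq.merge lazily merges already-sorted iterators, so
            # only one line per run is held in memory at a time.
            out.writelines(heapq.merge(*files))
    finally:
        for fh in files:
            fh.close()
        for r in runs:
            os.remove(r)
```

A real implementation would size chunks by bytes rather than line count and cap the number of runs merged at once, but the structure is the same.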
See also this question.