Is there an optimal solution to this problem?
Describe an algorithm for finding duplicates in a file of one million phone numbers. The algorithm, when running, would only have two megabytes of memory available to it, which means you cannot load all the phone numbers into memory at once.
My 'naive' solution would be an O(n^2) solution which iterates over the values and just loads the file in chunks instead of all at once.
For i = 0 to 999,999
    string currentVal = get the item at index i
    for j = i+1 to 999,999
        if ((j - i) mod fileChunkSize == 0)
            load file chunk into array
        if data[j] == currentVal
            add currentVal to duplicateList and exit for
There must be another approach where you can process the dataset in a smarter way and verify whether a number is duplicated. Does anyone have one?
Divide the file into M chunks, each of which is small enough to be sorted in memory. Sort them in memory.
For each pair of chunks, we then carry out the last step of mergesort on the two chunks to make one larger chunk: (c_1 + c_2), (c_3 + c_4), ..., (c_{m-1} + c_m).
Point at the first element on c_1 and c_2 on disk, and make a new file (we'll call it c_1+2).
If c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1.
Otherwise, copy c_2's pointed-to element into c_1+2 and point to the next element of c_2.
Repeat the previous step until both arrays are empty. You only need to use the space in memory needed to hold the two pointed-to numbers. During this process, if you encounter c_1 and c_2's pointed-to elements being equal, you have found a duplicate - you can copy it in twice and increment both pointers.
The resulting m/2 chunks can be recursively merged in the same manner; it will take log(m) of these merge steps to generate the correct array. Each number will be compared against each other number in a way that will find the duplicates.
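Here is a minimal C++ sketch of one such merge pass, assuming the chunks were already sorted and written out as binary files of 32-bit numbers (the file handling and the way duplicates are reported are illustrative choices, not part of the answer above):

#include <cstdint>
#include <fstream>
#include <iostream>

// One merge pass over two sorted chunk files of 32-bit phone numbers,
// writing the merged chunk and reporting duplicates as it goes.
void mergeChunks(const char* chunkA, const char* chunkB, const char* mergedOut)
{
    std::ifstream a(chunkA, std::ios::binary), b(chunkB, std::ios::binary);
    std::ofstream out(mergedOut, std::ios::binary);

    std::uint32_t x, y;
    bool haveX = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
    bool haveY = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));

    while (haveX && haveY) {
        if (x == y) {
            // Duplicate across the two chunks: copy it in twice and advance both pointers.
            std::cout << "duplicate: " << x << '\n';
            out.write(reinterpret_cast<const char*>(&x), sizeof x);
            out.write(reinterpret_cast<const char*>(&y), sizeof y);
            haveX = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
            haveY = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));
        } else if (x < y) {
            out.write(reinterpret_cast<const char*>(&x), sizeof x);
            haveX = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
        } else {
            out.write(reinterpret_cast<const char*>(&y), sizeof y);
            haveY = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));
        }
    }
    // Drain whichever chunk still has data. (Duplicates that fall within a single
    // chunk are easiest to report while sorting that chunk in memory.)
    for (; haveX; haveX = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x)))
        out.write(reinterpret_cast<const char*>(&x), sizeof x);
    for (; haveY; haveY = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y)))
        out.write(reinterpret_cast<const char*>(&y), sizeof y);
}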
Alternatively, a quick and dirty solution, as alluded to by @Evgeny Kluev, is to make a Bloom filter as large as you can reasonably fit in memory. You can then make a list of the index of each element that fails the Bloom filter and loop through the file a second time to test those members for duplication.
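A rough C++ sketch of that idea, assuming 32-bit phone numbers (the filter size and the two hash computations are arbitrary illustrations, not tuned values):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal Bloom filter over 32-bit phone numbers; two hash probes per value.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits) : bits_(bits, false) {}

    // Returns true if the value *might* have been seen before (candidate duplicate),
    // false if it definitely has not been seen. Marks the value as seen either way.
    bool testAndSet(std::uint32_t value)
    {
        std::size_t h1 = std::hash<std::uint32_t>{}(value) % bits_.size();
        std::size_t h2 = std::hash<std::uint32_t>{}(value * 2654435761u + 1) % bits_.size();
        bool maybeSeen = bits_[h1] && bits_[h2];
        bits_[h1] = true;
        bits_[h2] = true;
        return maybeSeen;
    }

private:
    std::vector<bool> bits_; // packed bits: roughly 8 million of them fit in 1MB
};

Anything testAndSet flags is only a candidate duplicate (Bloom filters give false positives), which is why the second pass over the file is still needed.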
I think Airza's solution is heading in a good direction, but since sorting is not what you want and it is more expensive, you can do the following by combining it with angelatlarge's approach:
Take a chunk C that fits in memory, of size M/2.

1. Get the chunk Ci.
2. Iterate through Ci and hash each element into a hash table. If the element already exists, you know it is a duplicate, and you can mark it as one (e.g., add its index to an array).
3. Get the next chunk Ci+1 and check whether any of its keys already exist in the hash table. If an element exists, mark it for deletion.
4. Repeat with all remaining chunks, so that none of them still contains a duplicate of anything in chunk Ci.
5. Repeat steps 1-2 with chunk Ci+1.
6. Delete all elements marked for deletion (this could also be done as you go, whichever is more appropriate; deleting one at a time might be more expensive if you have to shift everything else around).
This runs in O((N/M) * |C|), where |C| is the chunk size. Notice that if M > 2N, we only have one chunk, and this runs in O(N), which is optimal for deleting duplicates: we simply hash everything and make sure all duplicates are deleted.
Edit: As requested, I'm providing details:
* N is the number of phone numbers.
* The size of a chunk will depend on the memory; it should be M/2. This is the amount of memory used to load a chunk of the file, since the whole file is too big to be loaded into memory.
* This leaves another M/2 bytes to keep the hash table², and/or a duplicate list¹.
* Hence, there should be N/(M/2) chunks, each of size |C| = M/2.
* The run time will be the number of chunks (N/(M/2)) times the size of each chunk |C| (i.e., M/2). Overall, this should be linear (plus or minus the overhead of switching from one chunk to the next), which is why the best way to describe it is O((N/M) * |C|):
a. Loading a chunk Ci: O(|C|).
b. Iterating through each element, testing whether it is already in the hash table and setting it if not: O(1), which is the time hashed insertion and lookup should take.
c. If the element is already there, you can delete it.¹
d. Getting the next chunk, rinse and repeat (2N/M chunks, so O(N/M)).
¹ Removing an element might cost O(N), unless we keep a list and remove them all in one go, avoiding shifting all the remaining elements whenever an element is removed.
² If the phone numbers can all be represented as integers < 2^32 - 1, we can avoid having a full hash table and just use a flag map, saving piles of memory (we'll only need N bits of memory).
Here's somewhat detailed pseudo-code:
void DeleteDuplicates(File file, int numberOfPhones, int maxMemory)
{
    // Assume 1,000,000 phone numbers, each of which fits in 32 bits
    // (and, for the flag array below, each value is < numberOfPhones).
    // Assume 2MB of memory.
    // Assume that arrays of bool are coalesced into 8 bools per byte instead of 1 bool per byte.
    int chunkSize = maxMemory / 2; // 2MB / 2 = 1MB, i.e. 256K 4-byte integers

    // numberOfPhones bits; a C++ vector<bool>, for example, would be space-efficient.
    // Coalesced size ~= 122KB | non-coalesced size (worst case) ~= 977KB
    bool[] exists = new bool[numberOfPhones];
    byte[] numberData = new byte[chunkSize];
    int fileIndex = 0;
    int bytesLoaded;
    do // O(number of chunks)
    {
        bytesLoaded = file.GetNextBytes(chunkSize, /*out*/ numberData);
        // We still have some 30KB to spare, enough for some six thousand duplicates.
        List<int> toRemove = new List<int>();
        for (int ii = 0; ii < bytesLoaded; ii += 4) // O(chunkSize)
        {
            int phone = BytesToInt(numberData, ii);
            if (exists[phone])
                toRemove.Add(ii);     // duplicate: remember its byte offset
            else
                exists[phone] = true; // first occurrence of this number
        }
        // Remove duplicates back to front so earlier offsets stay valid.
        for (int ii = toRemove.Count - 1; ii >= 0; --ii)
            numberData.RemoveAt(toRemove[ii], 4);
        int bytesKept = bytesLoaded - 4 * toRemove.Count;
        File.Write(fileIndex, numberData, bytesKept);
        fileIndex += bytesKept; // advance only past the bytes actually written
    } while (bytesLoaded > 0); // while there is still data to load
}
If you can store temporary files, you can load the file in chunks, sort each chunk, write it to a file, and then iterate through the chunks looking for duplicates. You can easily tell whether a number is duplicated by comparing it to the next number in the file and to the next number in each of the chunks. Then move to the next lowest number across all of the chunks and repeat until you run out of numbers.
Your runtime is O(n log n) due to the sorting.
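Here is a C++ sketch of that scanning step, assuming the sorted chunks were written as binary files of 32-bit numbers during the first pass (the file handling is illustrative and error checking is omitted):

#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Scans already-sorted chunk files of 32-bit numbers with a k-way merge,
// reporting any value that comes out of the heap twice in a row.
void findDuplicates(const std::vector<std::string>& sortedChunkFiles)
{
    std::vector<std::ifstream> chunks;
    for (const auto& name : sortedChunkFiles)
        chunks.emplace_back(name, std::ios::binary);

    // Min-heap of (value, chunk index): always pop the smallest remaining number.
    using Entry = std::pair<std::uint32_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    auto refill = [&](std::size_t i) {
        std::uint32_t v;
        if (chunks[i].read(reinterpret_cast<char*>(&v), sizeof v))
            heap.emplace(v, i);
    };
    for (std::size_t i = 0; i < chunks.size(); ++i)
        refill(i);

    bool havePrev = false;
    std::uint32_t prev = 0;
    while (!heap.empty()) {
        auto [value, chunkIndex] = heap.top();
        heap.pop();
        if (havePrev && value == prev)    // the same number surfaced twice in a row
            std::cout << "duplicate: " << value << '\n';
        prev = value;
        havePrev = true;
        refill(chunkIndex);               // pull the next number from that chunk
    }
}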
I like @airza's solution, but perhaps there is another algorithm to consider: maybe one million phone numbers cannot be loaded into memory at once because they are expressed inefficiently, i.e. using more bytes per phone number than necessary. In that case, you might be able to get an efficient solution by hashing the phone numbers and storing the hashes in a (hash) table. Hash tables support dictionary operations (such as insert and lookup) that let you find dupes easily.
To be more concrete about it: if each phone number is 13 bytes (such as a string in the format (NNN)NNN-NNNN), the string represents one of a billion numbers. As an integer, it can be stored in 4 bytes (instead of 13 in the string format). We might then be able to store this 4-byte "hash" in a hash table, because our one billion hashed numbers now take up as much space as 308 million unhashed numbers, not one billion. Ruling out impossible numbers (everything in area codes 000, 555, etc.) might let us reduce the hash size further.
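For illustration, a small C++ sketch of that packing step, assuming the input is exactly in the (NNN)NNN-NNNN format (the function name and the 64-bit intermediate are my own choices, not part of the answer):

#include <cctype>
#include <cstdint>
#include <string>

// Packs a formatted phone number such as "(123)456-7890" into a single integer
// by keeping only its digits. A raw ten-digit value can exceed 32 bits, so
// actually fitting into 4 bytes relies on ruling out impossible numbers (or
// otherwise remapping the valid values), as suggested above.
std::uint64_t packPhoneNumber(const std::string& formatted)
{
    std::uint64_t value = 0;
    for (char c : formatted)
        if (std::isdigit(static_cast<unsigned char>(c)))
            value = value * 10 + static_cast<std::uint64_t>(c - '0');
    return value; // still far smaller than the 13-byte string representation
}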