Fastest algorithm to figure out if an array has at least one duplicate

Tags: c, algorithm

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.

Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
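For illustration, here is a rough sketch (untested, purely to show the idea) of the split comparison I have in mind: check the numeric halves as ints first, and only compare the 3-character prefix if they match:

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: are two XXXNNN keys equal?
   Checks the NNN part as an int first, then the XXX prefix. */
int keysEqual(const char *a, const char *b) {
    if (atoi(a + 3) != atoi(b + 3))   /* integer comparison of the numeric half */
        return 0;
    return memcmp(a, b, 3) == 0;      /* then the 3-character prefix */
}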

asked Oct 14 '16 by Phroggyy




4 Answers

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).

If X ∈ {A..Z} and N ∈ {0..9}, that gives 26³ × 10³ = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.


Here you go: a Python script to generate all possible 17 million keys:

import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print "%s%03d" % (''.join(prefix), numeric)   

and a simple C bitset filter:

#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);

    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;

    return byte & mask;
}

which for your purposes you'd use like so:

#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha   = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}

#include <stdio.h>
int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}

This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:

useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt

real    0m0.474s
user    0m0.436s
sys 0m0.036s

useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000

real    0m0.467s
user    0m0.452s
sys 0m0.016s

admittedly the input file is cached in memory for these runs.
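As noted above, there's scope to optimise the file input. A minimal sketch of one such optimisation (my assumption, not benchmarked): since every record is exactly 7 bytes ("XXXNNN" plus newline), you can read the file in large blocks with fread and walk fixed-size records instead of calling fgets once per line:

#include <stdio.h>

/* scan fixed 7-byte records in bulk; returns 1 on the first collision */
int scanBlocks(unsigned char *filter, FILE *in) {
    char buf[7 * 4096];                       /* 4096 records per read */
    size_t got;
    while ((got = fread(buf, 7, 4096, in)) > 0) {
        for (size_t i = 0; i < got; i++) {
            if (testAndSetMyKey(filter, buf + 7 * i)) {
                printf("collision: %.6s\n", buf + 7 * i);
                return 1;
            }
        }
    }
    return 0;
}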

answered Sep 25 '22 by Useless


The reasonable approach is to keep the algorithm complexity as small as possible. I encourage you to use a hash table to keep track of inserted elements; the overall complexity is O(n), because a hash-table lookup is O(1) on average. In your case, I suggest running the check while reading the file.

// Hashtable lives in System.Collections
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}
answered Sep 23 '22 by clcastro87


A fast but memory-inefficient solution would use:

#include <stdio.h>
#include <stdlib.h>

// Entries are AAA####
static char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // static (or calloc() this): far too big for the stack
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {  // istream: the already-open input FILE *
  unsigned long index = strtoul(buffer, NULL, 36);
  if (found[index]++) {
    Dupe_found();
    break;
  }
}

The trouble with the question is that it asks for the "fastest algorithm" but does not detail memory constraints or their importance relative to speed. So speed must be king, and the above wastes little time. It also meets the "stop looking as soon as one duplicate is found" requirement.

answered Sep 24 '22 by chux - Reinstate Monica


Depending on how many distinct values there can be, you have a couple of options:

  • Sort the whole array and then scan for adjacent repeated elements; complexity is O(n log n), but it can be done in place, so memory is O(1) (a sketch follows below).
  • Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it will cost you some extra memory.
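A minimal sketch of the first option (my addition, assuming the keys have already been read into an array of fixed 7-byte records, "XXXNNN" plus a terminating NUL):

#include <stdlib.h>
#include <string.h>

/* comparator for qsort: keys are exactly 6 characters */
static int cmpKey(const void *a, const void *b) {
    return memcmp(a, b, 6);
}

/* returns 1 if at least one duplicate exists among n keys */
int hasDuplicate(char (*keys)[7], size_t n) {
    qsort(keys, n, sizeof keys[0], cmpKey);       /* O(n log n), in place */
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys[i - 1], keys[i], 6) == 0) /* adjacent equal keys */
            return 1;
    return 0;
}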
answered Sep 27 '22 by Hauleth