Fastest algorithm to figure out if an array has at least one duplicate

I have a somewhat peculiar case. I have a file containing several million entries and want to find out whether it contains at least one duplicate. The language isn't of great importance, but C seems like a reasonable choice for speed. What kind of approach should I take? Speed is the primary goal. Naturally, we want to stop looking as soon as one duplicate is found, but I don't know anything about how the incoming data is ordered. I just know it's a file of strings, separated by newlines. Keep in mind that all I want to find out is whether a duplicate exists. I have found a lot of SO questions about finding all duplicates in an array, but most of them go the easy and comprehensive way rather than the fastest.

Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...

asked Oct 14 '16 13:10

Phroggyy

4 Answers

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).

If X ∈ {A..Z} and N ∈ {0..9}, that gives 26³ × 10³ = 17,576,000 possible values, so a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take about 2.2 MB for this (17,576,000 bits / 8 = 2,197,000 bytes).

Here you go: a Python script to generate all 17,576,000 possible keys:

import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print("%s%03d" % (''.join(prefix), numeric))

and a simple C bitset filter:

#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);

    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;

    return byte & mask;
}

which for your purposes you'd use like so:

#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha   = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}

#include <stdio.h>
int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}

This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:

useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt

real    0m0.474s
user    0m0.436s
sys     0m0.036s

useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000

real    0m0.467s
user    0m0.452s
sys     0m0.016s

Admittedly, the input file is cached in memory for these runs.
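As one possible take on the file-input optimisation mentioned above, here is a minimal sketch that reads the input in large blocks with fread instead of calling fgets once per line. It assumes every record is exactly six key characters followed by '\n' (only the final newline may be missing) and an ASCII-compatible character set. The names keyIndex, streamHasDuplicate, KEY_LEN, RECORD_LEN and KEY_SPACE are invented for this illustration, the key mapping differs slightly from the one above (any one-to-one mapping works), and the sketch has not been benchmarked against the fgets version.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

enum { KEY_LEN = 6, RECORD_LEN = 7, KEY_SPACE = 26 * 26 * 26 * 1000 };

/* Map "XXXNNN" to a unique index in [0, KEY_SPACE), or -1 if malformed.
   Assumes an ASCII-compatible character set. */
static long keyIndex(const char *s) {
    long alpha = 0, numeric = 0;
    for (int i = 0; i < 3; i++) {
        if (s[i] < 'A' || s[i] > 'Z') return -1;
        alpha = alpha * 26 + (s[i] - 'A');
    }
    for (int i = 3; i < KEY_LEN; i++) {
        if (s[i] < '0' || s[i] > '9') return -1;
        numeric = numeric * 10 + (s[i] - '0');
    }
    return alpha * 1000 + numeric;
}

/* Returns 1 at the first duplicate, 0 if every key is unique,
   -1 on malformed input or allocation failure. */
int streamHasDuplicate(FILE *in) {
    assert(in != NULL);
    unsigned char *filter = calloc((KEY_SPACE + CHAR_BIT - 1) / CHAR_BIT, 1);
    if (filter == NULL) return -1;

    char block[RECORD_LEN * 4096];   /* a whole number of records per read */
    int result = 0;
    size_t got;
    /* fread only returns a short count at end of input or on error,
       so records stay aligned to the start of each block. */
    while (result == 0 && (got = fread(block, 1, sizeof block, in)) > 0) {
        for (size_t off = 0; off < got && result == 0; off += RECORD_LEN) {
            size_t left = got - off;
            long key = (left >= KEY_LEN) ? keyIndex(block + off) : -1;
            if (key < 0 || (left > KEY_LEN && block[off + KEY_LEN] != '\n')) {
                result = -1;                       /* malformed record */
            } else {
                unsigned char mask = (unsigned char)(1u << (key % CHAR_BIT));
                if (filter[key / CHAR_BIT] & mask) result = 1;
                filter[key / CHAR_BIT] |= mask;
            }
        }
    }
    free(filter);
    return result;
}
```

Calling streamHasDuplicate(stdin) from main would slot into the same pipeline as the program above. Unlike the fgets version, this sketch rejects malformed lines rather than indexing outside the filter.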

answered Sep 25 '22 01:09

Useless

The reasonable answer is to pick the algorithm with the lowest complexity. I encourage you to use a hash table to keep track of the elements seen so far; the overall complexity is O(n), because a hash table lookup is O(1) in theory (on average). In your case, I suggest running the check while reading the file, so that you can stop at the first duplicate.

// Requires: using System.Collections;
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        // A non-null entry means this string was already inserted.
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}
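Since the question leans towards C, which has no built-in hash table, the same idea could be sketched there roughly as follows. This is only an illustration under stated assumptions: it uses open addressing with linear probing and the FNV-1a string hash, sizes the table up front for n strings, and stores pointers to the caller's strings rather than copies. The names hashString and hasDuplicate are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, a simple non-cryptographic string hash. */
static uint64_t hashString(const char *s) {
    uint64_t h = 14695981039346656037ULL;   /* 64-bit FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;               /* 64-bit FNV prime */
    }
    return h;
}

/* Returns 1 if any string occurs twice, 0 if all are distinct,
   -1 if the table could not be allocated. */
int hasDuplicate(const char *const *items, size_t n) {
    assert(items != NULL || n == 0);

    size_t capacity = 16;
    while (capacity < 2 * n) capacity *= 2;  /* power of two, load factor <= 0.5 */

    const char **slots = calloc(capacity, sizeof *slots);
    if (slots == NULL) return -1;

    int found = 0;
    for (size_t i = 0; i < n && !found; i++) {
        size_t pos = (size_t)hashString(items[i]) & (capacity - 1);
        /* Linear probing: step past slots holding a different string. */
        while (slots[pos] != NULL && strcmp(slots[pos], items[i]) != 0)
            pos = (pos + 1) & (capacity - 1);
        if (slots[pos] != NULL) found = 1;   /* same string already stored */
        else slots[pos] = items[i];
    }
    free(slots);
    return found;
}
```

The loop stops at the first repeated string, which matches the early-exit requirement in the question. For input read line by line from a file, the strings would need to stay alive (for example in one large buffer) for as long as the table is in use.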

answered Sep 23 '22 01:09

clcastro87

A fast but memory-inefficient solution would use:

// Entries are XXXNNN, read here as 6-digit base-36 numbers
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {
  unsigned long index = strtoul(buffer, NULL, 36);
  if (found[index]++) {
    Dupe_found();
    break;
  }
}

The trouble with the post is that it wants the "Fastest algorithm", but does not detail memory concerns and their importance relative to speed. So speed must be king, and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
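One caveat with indexing straight from strtoul is that a malformed line (too long, too short, or signed) could produce an index outside the table. A hedged sketch of a validating wrapper is shown below; parseBase36Key is a hypothetical helper name, not part of the answer's code. Note that strtoul in base 36 treats letters case-insensitively, so "aaa000" and "AAA000" map to the same index.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parses one line as a base-36 number. Returns 1 and stores the index when the
   line starts with exactly six base-36 digits followed by '\n' or the end of
   the string; returns 0 otherwise. A six-digit value is always below 36^6. */
int parseBase36Key(const char *line, unsigned long *index) {
    assert(line != NULL && index != NULL);

    /* strtoul would skip leading whitespace and accept a sign; reject both. */
    if (!isalnum((unsigned char)line[0])) return 0;

    char *end;
    unsigned long value = strtoul(line, &end, 36);
    if (end - line != 6 || (*end != '\n' && *end != '\0')) return 0;

    *index = value;
    return 1;
}
```

In the loop above, the lookup would then only run when parseBase36Key(buffer, &index) returns 1, at the cost of a few extra comparisons per line.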

answered Sep 24 '22 01:09

chux - Reinstate Monica

Depending on how many distinct values there can be, you have some options:

Sort the whole array and then scan for adjacent equal elements. Complexity is O(n log n), but it can be done in place, so extra memory is O(1).
Build a set of all elements, checking each one before inserting it. Depending on the chosen set implementation this is O(n) (hash set) or O(n log n) (binary tree), but it costs some extra memory.
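The first option could be sketched in C as follows, assuming the entries are already in memory as an array of string pointers. The names compareStrings and hasDuplicateSorted are illustrative. Note that the C standard does not promise that qsort runs in O(n log n) or in place, so the bounds above hold only if the library's implementation (or a hand-written heapsort) provides them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C string pointers. */
static int compareStrings(const void *a, const void *b) {
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sorts the pointer array in place, then compares neighbours.
   Returns 1 if any string occurs twice, 0 otherwise. */
int hasDuplicateSorted(const char **items, size_t n) {
    if (n < 2) return 0;                 /* nothing to compare */
    assert(items != NULL);

    qsort(items, n, sizeof *items, compareStrings);

    /* After sorting, equal strings are adjacent. */
    for (size_t i = 1; i < n; i++)
        if (strcmp(items[i - 1], items[i]) == 0) return 1;
    return 0;
}
```

Unlike the set-based option, this cannot stop before the whole input has been read and sorted, so it gives up the early exit the question asks for in exchange for the lower memory use.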

answered Sep 27 '22 01:09

Hauleth
