Find the duplicate number in an array which has no duplicates except for one number

Suppose there is an array of elements that has no duplicates except for one number,

e.g. 1, 2, 13, 4, 7, 11, 2, 6

How can the duplicate number be found efficiently? We can do it using a hash table (HT) in O(n) time and with O(n) space, like below.

if (HT.contains(item)) -> this is the duplicate
else
    HT.add(item)
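In runnable Java, that baseline might look like the sketch below, using `java.util.HashSet` as the hash table (the class and method names are illustrative, not from the question):

```java
import java.util.HashSet;

public class HashDupFinder {
    // Returns the duplicated value, or null if every element is unique.
    static Integer findDuplicate(int[] a) {
        HashSet<Integer> seen = new HashSet<>();
        for (int x : a) {
            // HashSet.add() returns false if x was already present
            if (!seen.add(x)) {
                return x;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicate(new int[]{1, 2, 13, 4, 7, 11, 2, 6})); // prints 2
    }
}
```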

Is there a better way in terms of both space and time complexity?

Note: this problem is not a duplicate of the two problems below, which are different.

If the integers are consecutive, the solution in this link can be used: how-to-find-a-duplicate-element-in-an-array-of-shuffled-consecutive-integers

If the array of n elements contains elements from 0 to n-1 only, this link has a solution: Finding duplicates in O(n) time and O(1) space

asked Nov 01 '22 by CRM

1 Answer

I don't think you can do better than O(n) time complexity - in the worst case you're going to have to touch every element of the dataset to find the duplicate.

One way to improve the space consumption (at the cost of requiring some bit twiddling as well as two passes over the dataset) is to use a Bloom Filter. The idea is to make a first pass over the dataset: if you find a possible duplicate then you remove it from the dataset and add it to a hash table (if the bloom filter functions correctly then only around 1% of the elements will be flagged as possible duplicates). Then make a second pass over the filtered dataset, testing elements against the small hash table of possible duplicates.

My code will be in Java since it's the language that I'm the most familiar with.

import java.util.HashSet;

class DupFinder {
  // BloomFilter is not a standard library class; it's assumed to provide
  // add() and contains(), where contains() may report false positives
  // but never false negatives
  BloomFilter filter = new BloomFilter();
  HashSet<Integer> hashTable = new HashSet<>();
  int start = 0;

  Integer run(int[] dataset) {
    // first pass: flag possible duplicates and swap them to the front
    for (int i = 0; i < dataset.length; i++) {
      if (filter.contains(dataset[i])) {
        // check if the element is in hashTable, else add it
        if (hashTable.contains(dataset[i])) {
          return dataset[i]; // duplicate found
        } else {
          hashTable.add(dataset[i]);
        }

        // remove the element from the unfiltered portion of the dataset
        int temp = dataset[start];
        dataset[start] = dataset[i];
        dataset[i] = temp;
        start++;
      } else {
        filter.add(dataset[i]);
      }
    }

    // second pass: scan the filtered dataset against the small hash table
    for (int i = start; i < dataset.length; i++) {
      if (hashTable.contains(dataset[i])) {
        return dataset[i]; // duplicate found
      }
    }
    return null; // no duplicate found
  }
}
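The BloomFilter class above is assumed rather than taken from any standard library. As a minimal sketch of what it could look like, here is a bit-array filter over ints using `java.util.BitSet` and two multiplicative hash functions (the constants are arbitrary odd multipliers chosen for illustration only):

```java
import java.util.BitSet;

public class IntBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the filter

    public IntBloomFilter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    // Two cheap integer hash functions; floorMod keeps the
    // index in [0, m) even when the multiplication overflows.
    private int h1(int x) { return Math.floorMod(x * 0x9E3779B9, m); }
    private int h2(int x) { return Math.floorMod(x * 0x85EBCA6B, m); }

    public void add(int x) {
        bits.set(h1(x));
        bits.set(h2(x));
    }

    // May return a false positive, never a false negative.
    public boolean contains(int x) {
        return bits.get(h1(x)) && bits.get(h2(x));
    }
}
```

Sizing m at roughly 10 bits per expected element keeps the false-positive rate in the low single digits, which is what makes the second-pass hash table small.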

An alternative to your hash table is to use a Radix Sort, a linear time sorting algorithm. Radix sort will have better worst-case performance (O(n) for radix sort, compared to O(n^2) for the hash table in the unlikely event that you run into a ridiculous number of collisions) but worse average-case performance (the hash table implementation will usually find the duplicate after scanning only half of the dataset, whereas the radix sort will always require multiple passes over the dataset). Radix Sort will also be marginally more efficient in terms of space consumption if you use a space efficient data structure for the buckets, e.g. a chunked list.
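As a sketch of that alternative, the snippet below runs an LSD radix sort on 8-bit digits and then scans for an adjacent equal pair. It sorts by raw bit pattern, so negative values end up ordered after positive ones, but equal values still land adjacent to each other, which is all the duplicate scan needs. (For brevity it uses flat arrays for the buckets rather than the chunked lists mentioned above.)

```java
import java.util.Arrays;

public class RadixDupFinder {
    // LSD radix sort on four 8-bit digits, then a linear scan
    // for two equal neighbours. Returns null if no duplicate exists.
    static Integer findDuplicate(int[] a) {
        int[] src = Arrays.copyOf(a, a.length);
        int[] dst = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            // histogram of the current digit
            for (int x : src) count[((x >>> shift) & 0xFF) + 1]++;
            // prefix sums give each digit's starting offset
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];
            // stable scatter into dst
            for (int x : src) dst[count[(x >>> shift) & 0xFF]++] = x;
            int[] tmp = src; src = dst; dst = tmp;
        }
        for (int i = 1; i < src.length; i++) {
            if (src[i] == src[i - 1]) return src[i];
        }
        return null;
    }
}
```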

You can parallelize the hash table implementation without incurring too much synchronization overhead. Using t threads, each thread will process n/t elements of the dataset (e.g. if you have 32 elements in the dataset and 2 threads, then thread0 processes elements 0-15 and thread1 processes elements 16-31), placing each element in a bucket with index absoluteValue(x modulo t). Following this, each thread will be responsible for processing all elements with a given bucket index, e.g. thread0 will process all buckets with index 0.

I'm using a BlockingQueue for synchronization - this allows a thread to call take() on the queue, removing the first element of the queue or else blocking until an element becomes available; all other data structures are thread-local.

You'll need to initialize the dupFinders variable so that a DupFinder instance appears at the same index of every DupFinder's dupFinders variable (e.g. thread0 always appears at the 0th index, thus ensuring that all elements in its BlockingQueue have absoluteValue(x modulo t) == 0).

import java.util.HashSet;
import java.util.Stack;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;

class DupFinder implements Callable<Integer> {
  private static class Chunk {
    int size = 0;
    int[] chunk = new int[64];

    boolean add(int x) {
      if (size < 64) {
        chunk[size] = x;
        size++;
        return true;
      } else return false;
    }
  }

  private final int t; // number of threads
  private final int[] partialDataset; // this thread's n/t slice of the dataset
  private final BlockingQueue<Stack<Chunk>> queue = new LinkedBlockingQueue<>();
  DupFinder[] dupFinders; // shared by all threads, same ordering in each
  private final Stack<Chunk>[] stacks;

  @SuppressWarnings("unchecked")
  DupFinder(int t, int[] partialDataset) {
    this.t = t;
    this.partialDataset = partialDataset;
    this.stacks = new Stack[t];
    for (int i = 0; i < t; i++) {
      stacks[i] = new Stack<>();
      stacks[i].push(new Chunk()); // seed each stack so peek() never fails
    }
  }

  void add(Stack<Chunk> stack) {
    queue.add(stack);
  }

  public Integer call() throws InterruptedException {
    // partition this thread's elements by their bucket index, abs(x % t)
    for (int i = 0; i < partialDataset.length; i++) {
      Stack<Chunk> tempStack = stacks[Math.abs(partialDataset[i] % t)];
      if (!tempStack.peek().add(partialDataset[i])) {
        Chunk chunk = new Chunk();
        chunk.add(partialDataset[i]);
        tempStack.push(chunk);
      }
    }

    // distribute chunk stacks to the appropriate threads
    for (int i = 0; i < t; i++) {
      dupFinders[i].add(stacks[i]);
    }

    // scan the t chunk stacks destined for this thread's bucket index
    HashSet<Integer> hashTable = new HashSet<>();
    for (int i = 0; i < t; i++) {
      // wait for a chunk stack to become available
      Stack<Chunk> tempStack = queue.take();
      while (!tempStack.isEmpty()) {
        Chunk tempChunk = tempStack.pop();
        for (int j = 0; j < tempChunk.size; j++) {
          if (hashTable.contains(tempChunk.chunk[j])) {
            return tempChunk.chunk[j]; // duplicate found
          } else {
            hashTable.add(tempChunk.chunk[j]);
          }
        }
      }
    }
    return null; // no duplicate found
  }
}
answered Nov 15 '22 by Zim-Zam O'Pootertoot