Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find an integer not in a set of size 40, 400, or 4000

Tags:

c++

algorithm

Related to the classic problem find an integer not among four billion given ones but not exactly the same.

To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].

Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?

If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.

I am more in favor of simple, non-random approaches. Is there any?

Thank you!

Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.

like image 261
aafulei Avatar asked Dec 18 '18 14:12

aafulei


3 Answers

You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.

 int find_not_contained(container<int> const&data)
 {
     const int N=data.size();
     std::vector<char> known(N+1, 0);   // one more candidates than data
     for(int i=0; i< N; ++i)
         if(data[i]>=0 && data[i]<=N)
             known[data[i]]=1;
     for(int i=0; i<=N; ++i)
         if(!known[i])
             return i;
     assert(false);                     // should never be reached.
 }

Random methods can be more space efficient, but may require more passes over the data in the worst case.

like image 121
Walter Avatar answered Nov 15 '22 00:11

Walter


Random methods are indeed very efficient here.

If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.

For each i in the range, we just set x[array[i] modulo m] = 1.

Then a simple O(n) search in x will provide a value which is not in array

Note: the modulo operation is not exactly the "%" operation

Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation

like image 10
Damien Avatar answered Nov 15 '22 00:11

Damien


You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)

size_t smallest_unused(std::vector<unsigned>& data) {
  size_t N = data.size(), scan = 0;
  while (scan < N) {
    auto other = data[scan];
    if (other < scan && data[other] != other) {
      data[scan] = data[other];
      data[other] = other;
    }
    else
      ++scan;
  }
  for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
  return scan;
}

The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.

That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.

To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).


Notes:

  1. I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
like image 6
rici Avatar answered Nov 15 '22 00:11

rici