Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the Smallest Integer Not in a List

An interesting interview question that a colleague of mine uses:

Suppose that you are given a very long, unsorted list of unsigned 64-bit integers. How would you find the smallest non-negative integer that does not occur in the list?

FOLLOW-UP: Now that the obvious solution by sorting has been proposed, can you do it faster than O(n log n)?

FOLLOW-UP: Your algorithm has to run on a computer with, say, 1GB of memory

CLARIFICATION: The list is in RAM, though it might consume a large amount of it. You are given the size of the list, say N, in advance.

like image 253
PeterAllenWebb Avatar asked Oct 19 '09 03:10

PeterAllenWebb


People also ask

How do you find the smallest positive integer in an array?

First sort the array. Then initialize a variable to 1 and using a loop scan through the array. Check the value of the variable if it matches the current array element, increment it if that is the case. The value of the variable after the loop is the smallest positive missing integer.


2 Answers

If the datastructure can be mutated in place and supports random access then you can do it in O(N) time and O(1) additional space. Just go through the array sequentially and for every index write the value at the index to the index specified by value, recursively placing any value at that location to its place and throwing away values > N. Then go again through the array looking for the spot where value doesn't match the index - that's the smallest value not in the array. This results in at most 3N comparisons and only uses a few values worth of temporary space.

# Pass 1, move every value to the position of its value for cursor in range(N):     target = array[cursor]     while target < N and target != array[target]:         new_target = array[target]         array[target] = target         target = new_target  # Pass 2, find first location where the index doesn't match the value for cursor in range(N):     if array[cursor] != cursor:         return cursor return N 
like image 195
Ants Aasma Avatar answered Dec 03 '22 04:12

Ants Aasma


Here's a simple O(N) solution that uses O(N) space. I'm assuming that we are restricting the input list to non-negative numbers and that we want to find the first non-negative number that is not in the list.

  1. Find the length of the list; lets say it is N.
  2. Allocate an array of N booleans, initialized to all false.
  3. For each number X in the list, if X is less than N, set the X'th element of the array to true.
  4. Scan the array starting from index 0, looking for the first element that is false. If you find the first false at index I, then I is the answer. Otherwise (i.e. when all elements are true) the answer is N.

In practice, the "array of N booleans" would probably be encoded as a "bitmap" or "bitset" represented as a byte or int array. This typically uses less space (depending on the programming language) and allows the scan for the first false to be done more quickly.


This is how / why the algorithm works.

Suppose that the N numbers in the list are not distinct, or that one or more of them is greater than N. This means that there must be at least one number in the range 0 .. N - 1 that is not in the list. So the problem of find the smallest missing number must therefore reduce to the problem of finding the smallest missing number less than N. This means that we don't need to keep track of numbers that are greater or equal to N ... because they won't be the answer.

The alternative to the previous paragraph is that the list is a permutation of the numbers from 0 .. N - 1. In this case, step 3 sets all elements of the array to true, and step 4 tells us that the first "missing" number is N.


The computational complexity of the algorithm is O(N) with a relatively small constant of proportionality. It makes two linear passes through the list, or just one pass if the list length is known to start with. There is no need to represent the hold the entire list in memory, so the algorithm's asymptotic memory usage is just what is needed to represent the array of booleans; i.e. O(N) bits.

(By contrast, algorithms that rely on in-memory sorting or partitioning assume that you can represent the entire list in memory. In the form the question was asked, this would require O(N) 64-bit words.)


@Jorn comments that steps 1 through 3 are a variation on counting sort. In a sense he is right, but the differences are significant:

  • A counting sort requires an array of (at least) Xmax - Xmin counters where Xmax is the largest number in the list and Xmin is the smallest number in the list. Each counter has to be able to represent N states; i.e. assuming a binary representation it has to have an integer type (at least) ceiling(log2(N)) bits.
  • To determine the array size, a counting sort needs to make an initial pass through the list to determine Xmax and Xmin.
  • The minimum worst-case space requirement is therefore ceiling(log2(N)) * (Xmax - Xmin) bits.

By contrast, the algorithm presented above simply requires N bits in the worst and best cases.

However, this analysis leads to the intuition that if the algorithm made an initial pass through the list looking for a zero (and counting the list elements if required), it would give a quicker answer using no space at all if it found the zero. It is definitely worth doing this if there is a high probability of finding at least one zero in the list. And this extra pass doesn't change the overall complexity.


EDIT: I've changed the description of the algorithm to use "array of booleans" since people apparently found my original description using bits and bitmaps to be confusing.

like image 43
Stephen C Avatar answered Dec 03 '22 04:12

Stephen C