An interesting interview question that a colleague of mine uses:
Suppose that you are given a very long, unsorted list of unsigned 64-bit integers. How would you find the smallest non-negative integer that does not occur in the list?
FOLLOW-UP: Now that the obvious solution by sorting has been proposed, can you do it faster than O(n log n)?
FOLLOW-UP: Your algorithm has to run on a computer with, say, 1GB of memory
CLARIFICATION: The list is in RAM, though it might consume a large amount of it. You are given the size of the list, say N, in advance.
First sort the array. Then initialize a variable to 1 and using a loop scan through the array. Check the value of the variable if it matches the current array element, increment it if that is the case. The value of the variable after the loop is the smallest positive missing integer.
If the datastructure can be mutated in place and supports random access then you can do it in O(N) time and O(1) additional space. Just go through the array sequentially and for every index write the value at the index to the index specified by value, recursively placing any value at that location to its place and throwing away values > N. Then go again through the array looking for the spot where value doesn't match the index - that's the smallest value not in the array. This results in at most 3N comparisons and only uses a few values worth of temporary space.
# Pass 1, move every value to the position of its value for cursor in range(N): target = array[cursor] while target < N and target != array[target]: new_target = array[target] array[target] = target target = new_target # Pass 2, find first location where the index doesn't match the value for cursor in range(N): if array[cursor] != cursor: return cursor return N
Here's a simple O(N)
solution that uses O(N)
space. I'm assuming that we are restricting the input list to non-negative numbers and that we want to find the first non-negative number that is not in the list.
N
.N
booleans, initialized to all false
. X
in the list, if X
is less than N
, set the X'th
element of the array to true
.0
, looking for the first element that is false
. If you find the first false
at index I
, then I
is the answer. Otherwise (i.e. when all elements are true
) the answer is N
.In practice, the "array of N
booleans" would probably be encoded as a "bitmap" or "bitset" represented as a byte
or int
array. This typically uses less space (depending on the programming language) and allows the scan for the first false
to be done more quickly.
This is how / why the algorithm works.
Suppose that the N
numbers in the list are not distinct, or that one or more of them is greater than N
. This means that there must be at least one number in the range 0 .. N - 1
that is not in the list. So the problem of find the smallest missing number must therefore reduce to the problem of finding the smallest missing number less than N
. This means that we don't need to keep track of numbers that are greater or equal to N
... because they won't be the answer.
The alternative to the previous paragraph is that the list is a permutation of the numbers from 0 .. N - 1
. In this case, step 3 sets all elements of the array to true
, and step 4 tells us that the first "missing" number is N
.
The computational complexity of the algorithm is O(N)
with a relatively small constant of proportionality. It makes two linear passes through the list, or just one pass if the list length is known to start with. There is no need to represent the hold the entire list in memory, so the algorithm's asymptotic memory usage is just what is needed to represent the array of booleans; i.e. O(N)
bits.
(By contrast, algorithms that rely on in-memory sorting or partitioning assume that you can represent the entire list in memory. In the form the question was asked, this would require O(N)
64-bit words.)
@Jorn comments that steps 1 through 3 are a variation on counting sort. In a sense he is right, but the differences are significant:
Xmax - Xmin
counters where Xmax
is the largest number in the list and Xmin
is the smallest number in the list. Each counter has to be able to represent N states; i.e. assuming a binary representation it has to have an integer type (at least) ceiling(log2(N))
bits.Xmax
and Xmin
.ceiling(log2(N)) * (Xmax - Xmin)
bits.By contrast, the algorithm presented above simply requires N
bits in the worst and best cases.
However, this analysis leads to the intuition that if the algorithm made an initial pass through the list looking for a zero (and counting the list elements if required), it would give a quicker answer using no space at all if it found the zero. It is definitely worth doing this if there is a high probability of finding at least one zero in the list. And this extra pass doesn't change the overall complexity.
EDIT: I've changed the description of the algorithm to use "array of booleans" since people apparently found my original description using bits and bitmaps to be confusing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With