I have an N-length array of positive random numbers that are certain to contain duplicates.
e.g. 10,4,5,7,10,9,10,9,8,10,5
Edit: N is likely to be 32, or some other power of two about that size.
I am trying to find the fastest way to replace the duplicates with the missing numbers from 0-(N-1). Using the above example, I want a result that looks like this:
10,4,5,7,0,9,1,2,8,3,6
The goal being to have one of each number from 0 to N-1, without just replacing all the numbers with 0-(N-1) (the random order is important).
Edit: It's also important that this replacement is deterministic, i.e. the same input will have the same output (not random).
Currently implemented in Java, uses 2 boolean arrays to keep track of used/unused numbers (unique numbers/missing numbers in the range [0,N) ), and has an approximate worst-case runtime of N+N*sqrt(N).
The code follows:
public byte[] uniqueify(byte[] input)
{
boolean[] usedNumbers = new boolean[N];
boolean[] unusedIndices = new boolean[N];
byte[] result = new byte[N];
for(int i = 0; i < N; i++) // first pass through
{
int newIdx = (input[i] + 128) % N; // first make positive
if(!usedNumbers[newIdx]) // if this number has not been used
{
usedNumbers[newIdx] = true; // mark as used
result[i] = newIdx; // save it in the result
}
else // if the number is used
{
unusedIndices[i] = true; // add it to the list of duplicates
}
}
// handle all the duplicates
for(int idx = 0; idx < N; idx++) // iterate through all numbers
{
if(unusedIndices[idx]) // if unused
for(int i = 0; i < N; i++) // go through all numbers again
{
if(!usedNumbers[i]) // if this number is still unused
{
usedNumbers[i] = true; // mark as used
result[i] = idx;
break;
}
}
}
return result;
}
This seems like the fastest I can hope for, but I thought I'd ask the internet, because there are people much more clever than I who may have a better solution.
N.B. Suggestions/solutions do not have to be in Java.
Thank you.
Edit: I forgot to mention that I am converting this to C++. I posted my java implementation because it's more complete.
Use a balanced binary search tree to keep track of used/unused numbers instead of a boolean array. Then you're running time will be n log n
.
The most straightforward solution would be this:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With