
Efficiently choose an integer distinct from all elements of a list

Tags: c, algorithm

I have a linked list of objects, each containing a 32-bit integer (and provably fewer than 2^32 such objects), and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?

asked Aug 15 '14 by R.. GitHub STOP HELPING ICE


2 Answers

Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.

Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.

The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range-based approaches: if you calculate the min and max of the data, you run the risk that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.

I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value reflects the fact that any such algorithm will be memory bound, i.e. your CPU has ample time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.

A further optimization would be to instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop, letting you stop as early as possible.
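
For illustration, here is a minimal C sketch of the multi-guess idea (not from the answer itself; the element type, field names, and the use of rand() are assumptions made for the sketch). It tests a small batch of random candidates per pass and retries with fresh guesses if all of them get busted; the replace-and-track refinement described above is omitted for brevity:

#include <stdlib.h>

/* Hypothetical node type; the question only says each object holds a
 * 32-bit integer, so these names are made up for the sketch. */
typedef struct element {
    unsigned id;
    struct element *next;
} element;

#define NGUESSES 16

/* Probabilistic search: with a sparse id space, the expected number of
 * passes over the list is close to one. */
unsigned find_unused_id(const element *head) {
    for (;;) {
        unsigned guess[NGUESSES];
        int alive = NGUESSES;
        /* rand() yields at least 15 random bits, so two calls cover
         * 31 bits; good enough for a sketch. */
        for (int i = 0; i < alive; i++)
            guess[i] = ((unsigned)rand() << 16) ^ (unsigned)rand();
        for (const element *e = head; e && alive; e = e->next) {
            for (int i = 0; i < alive; i++) {
                if (e->id == guess[i]) {
                    /* Busted: drop it by overwriting with the last
                     * live guess, then re-check that one against e. */
                    guess[i--] = guess[--alive];
                }
            }
        }
        if (alive)
            return guess[0]; /* survived a full pass, hence unused */
    }
}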

answered Oct 17 '22 by cmaster - reinstate monica


If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):

  1. Divide the range of possible IDs into equal subranges.
  2. Partition the elements into a singly-linked list per subrange.
  3. If some subrange is empty, choose any id in it.
  4. Otherwise, repeat with the elements of the subrange containing the fewest elements.

Example code using two subranges per iteration:

unsigned getunusedid(element* head) {
    unsigned start = 0, stop = -1; /* wraps to UINT_MAX: full 32-bit range */
    element* h;
    /* Copy the real links into the scratch 'next' pointers. */
    for(h = head; h; h = h->mainnext)
        h->next = h->mainnext;
    h = head;
    while(h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        /* Split [start, stop] into two equal halves around mid. */
        unsigned mid = start + (stop - start) / 2 + 1;
        while(h) {
            element* next = h->next;
            if(h->id < mid) {
                h->next = l;
                cl++;
                l = h;
            } else {
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        /* Continue with the half holding fewer elements; since there are
         * fewer list elements than ids in [start, stop], that half can
         * never be full. */
        if(cl < cr) {
            h = l;
            stop = mid - 1;
        } else {
            h = r;
            start = mid;
        }
    }
    return start;
}
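
For reference, the code above assumes an element type along these lines (the definition itself is not part of the original answer; only the field names id, mainnext, and next are taken from the code):

typedef struct element element;
struct element {
    unsigned id;        /* the 32-bit value stored in each object */
    element* mainnext;  /* the list's real link */
    element* next;      /* the one spare pointer used as scratch space */
};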

Some more remarks:

  1. Beware of bugs in the above code; I have only proved it correct, not tried it.
  2. Using more buckets per iteration (best kept to a power of 2 for easy and efficient handling) might be faster due to better data locality (though only try and measure that if it's not fast enough otherwise), as @MarkDickson rightly remarks.
  3. Without those extra pointers, you need a full sweep over the whole list in each iteration, raising the bound to O(n lg n); see the sketch below.
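
As an illustration of remark 3, here is a hedged sketch of the pointer-free variant (the function name is made up; it reuses the element type from above but never modifies the list). Each pass merely counts how many elements fall below the midpoint and narrows the range toward the sparser half:

/* Pointer-free variant: no scratch links needed, but every iteration
 * sweeps the whole list, giving O(n lg n) instead of O(n). */
unsigned getunusedid_counting(const element* head) {
    unsigned start = 0, stop = -1; /* wraps to UINT_MAX */
    unsigned count = 0;            /* elements with id in [start, stop] */
    for(const element* e = head; e; e = e->mainnext)
        count++;
    while(count) {
        unsigned mid = start + (stop - start) / 2 + 1; /* even split */
        unsigned cl = 0;
        for(const element* e = head; e; e = e->mainnext)
            if(e->id >= start && e->id < mid)
                cl++;
        if(cl < count - cl) { /* left half is sparser */
            stop = mid - 1;
            count = cl;
        } else {
            start = mid;
            count -= cl;
        }
    }
    return start; /* [start, stop] holds no element, so start is free */
}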

An alternative would be to use two or more extra pointers per element to maintain a balanced tree. That would speed up the id search, at the expense of some memory and insertion/removal overhead.

answered Oct 17 '22 by Deduplicator