
Efficiently choose an integer distinct from all elements of a list

Tags: c, algorithm

I have a linked list of objects, each containing a 32-bit integer (and provably fewer than 2^32 such objects), and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?

asked Aug 15 '14 by R.. GitHub STOP HELPING ICE


2 Answers

Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.

Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.

The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range-based approaches: if you calculate the min and max of the data, you run the risk that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.

I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value reflects the fact that any such algorithm will be memory bound, i.e. your CPU has ample time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.

A further optimization would be to instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop, letting you stop as early as possible.
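
For illustration, here is a minimal C sketch of the multi-guess idea (not from the answer itself; the element type, field names, and the use of rand() are assumptions made for the sketch). It tests a small batch of random candidates per pass and retries with fresh guesses if all of them get busted; the replace-and-track refinement described above is omitted for brevity:

#include <stdlib.h>

/* Hypothetical node type; the question only says each object holds a
 * 32-bit integer, so these names are made up for the sketch. */
typedef struct element {
    unsigned id;
    struct element *next;
} element;

#define NGUESSES 16

/* Probabilistic search: with a sparse id space, the expected number of
 * passes over the list is close to one. */
unsigned find_unused_id(const element *head) {
    for (;;) {
        unsigned guess[NGUESSES];
        int alive = NGUESSES;
        /* rand() yields at least 15 random bits, so two calls cover
         * 31 bits; good enough for a sketch. */
        for (int i = 0; i < alive; i++)
            guess[i] = ((unsigned)rand() << 16) ^ (unsigned)rand();
        for (const element *e = head; e && alive; e = e->next) {
            for (int i = 0; i < alive; i++) {
                if (e->id == guess[i]) {
                    /* Busted: drop it by overwriting with the last
                     * live guess, then re-check that one against e. */
                    guess[i--] = guess[--alive];
                }
            }
        }
        if (alive)
            return guess[0]; /* survived a full pass, hence unused */
    }
}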

answered Oct 17 '22 by cmaster - reinstate monica


If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):

  1. Divide the range of possible IDs into equal subranges.
  2. Partition the elements into a singly-linked list per subrange.
  3. If some subrange is empty, choose any id in it.
  4. Otherwise, repeat with the elements of the subrange containing the fewest elements.

Example code using two subranges per iteration:

unsigned getunusedid(element* head) {
    unsigned start = 0, stop = -1; /* wraps to UINT_MAX: full 32-bit range */
    element* h;
    /* Copy the real links into the scratch 'next' pointers. */
    for(h = head; h; h = h->mainnext)
        h->next = h->mainnext;
    h = head;
    while(h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        /* Split [start, stop] into two equal halves around mid. */
        unsigned mid = start + (stop - start) / 2 + 1;
        while(h) {
            element* next = h->next;
            if(h->id < mid) {
                h->next = l;
                cl++;
                l = h;
            } else {
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        /* Continue with the half holding fewer elements; since there are
         * fewer list elements than ids in [start, stop], that half can
         * never be full. */
        if(cl < cr) {
            h = l;
            stop = mid - 1;
        } else {
            h = r;
            start = mid;
        }
    }
    return start;
}
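
For reference, the code above assumes an element type along these lines (the definition itself is not part of the original answer; only the field names id, mainnext, and next are taken from the code):

typedef struct element element;
struct element {
    unsigned id;        /* the 32-bit value stored in each object */
    element* mainnext;  /* the list's real link */
    element* next;      /* the one spare pointer used as scratch space */
};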

Some more remarks:

  1. Beware of bugs in the above code; I have only proved it correct, not tried it.
  2. Using more buckets per iteration (best kept to a power of 2 for easy and efficient handling) might be faster due to better data locality (though only try and measure that if it's not fast enough otherwise), as @MarkDickson rightly remarks.
  3. Without those extra pointers, you need a full sweep over the whole list in each iteration, raising the bound to O(n lg n); see the sketch below.
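
As an illustration of remark 3, here is a hedged sketch of the pointer-free variant (the function name is made up; it reuses the element type from above but never modifies the list). Each pass merely counts how many elements fall below the midpoint and narrows the range toward the sparser half:

/* Pointer-free variant: no scratch links needed, but every iteration
 * sweeps the whole list, giving O(n lg n) instead of O(n). */
unsigned getunusedid_counting(const element* head) {
    unsigned start = 0, stop = -1; /* wraps to UINT_MAX */
    unsigned count = 0;            /* elements with id in [start, stop] */
    for(const element* e = head; e; e = e->mainnext)
        count++;
    while(count) {
        unsigned mid = start + (stop - start) / 2 + 1; /* even split */
        unsigned cl = 0;
        for(const element* e = head; e; e = e->mainnext)
            if(e->id >= start && e->id < mid)
                cl++;
        if(cl < count - cl) { /* left half is sparser */
            stop = mid - 1;
            count = cl;
        } else {
            start = mid;
            count -= cl;
        }
    }
    return start; /* [start, stop] holds no element, so start is free */
}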

An alternative would be to use two or more extra pointers per element to maintain a balanced tree. That would speed up the id search, at the expense of some memory and insertion/removal overhead.

answered Oct 17 '22 by Deduplicator