 

Fastest code C/C++ to select the median in a set of 27 floating point values

The selection algorithm is linear time (O(n)). Complexity-wise you can't do better than linear time, because it takes linear time to read in all the data. So you couldn't have made something that is faster complexity-wise. Perhaps you have something that is a constant factor faster on certain inputs? I doubt it would make much of a difference.

C++ already includes the linear-time selection algorithm. Why not just use it?

std::vector<YourType>::iterator first = yourContainer.begin();
std::vector<YourType>::iterator last = yourContainer.end();
std::vector<YourType>::iterator middle = first + (last - first) / 2;
std::nth_element(first, middle, last); // can specify comparator as optional 4th arg
YourType median = *middle;

Edit: Technically, that is only the median for a container of odd length. For one of even length, it will get the "upper" median. If you want the traditional definition of median for even length, you might have to run it twice, once for each of the two "middles" at first + (last - first) / 2 and first + (last - first) / 2 - 1 and then average them or something.
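
For illustration, here is a minimal sketch of that even-length case (the function name is a placeholder):

#include <algorithm>
#include <vector>

// Median of an even-length container: partition for the upper middle,
// then for the lower middle, and average the two.
double medianOfEven( std::vector<float> v )  // by value: nth_element reorders
{
    std::vector<float>::iterator mid = v.begin() + (v.end() - v.begin()) / 2;
    std::nth_element( v.begin(), mid, v.end() );      // upper middle
    float upper = *mid;
    std::nth_element( v.begin(), mid - 1, v.end() );  // lower middle
    return ( *(mid - 1) + upper ) / 2.0;
}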


EDIT: I have to apologize. The code below was WRONG. I have the fixed code, but need to find an icc compiler to redo the measurements.

The benchmark results of the algorithms considered so far

For the protocol and short description of algorithms see below. First value is mean time (seconds) over 200 different sequences and second value is stdDev.

HeapSort     : 2.287 0.2097
QuickSort    : 2.297 0.2713
QuickMedian1 : 0.967 0.3487
HeapMedian1  : 0.858 0.0908
NthElement   : 0.616 0.1866
QuickMedian2 : 1.178 0.4067
HeapMedian2  : 0.597 0.1050
HeapMedian3  : 0.015 0.0049 <-- best

Protocol: generate 27 random floats using random bits obtained from rand(). Apply each algorithm 5 million times in a row (including prior array copy) and compute average and stdDev over 200 random sequences. C++ code compiled with icc -S -O3 and run on Intel E8400 with 8GB DDR3.
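
For reference, a minimal sketch of such a measurement loop (this harness is my illustration of the protocol, not the original benchmark code):

#include <cstdlib>
#include <cstring>
#include <ctime>

double timeOneSequence( float (*median)(float *) )
{
    float rndVal[27], work[27];
    for( int i = 0; i < 27; ++i )
        rndVal[i] = (float)std::rand() / (float)RAND_MAX;  // one random sequence

    std::clock_t t0 = std::clock();
    for( long n = 0; n < 5000000; ++n )
    {
        std::memcpy( work, rndVal, sizeof(work) );  // prior array copy, included in the timing
        volatile float m = median( work );          // volatile keeps the call from being optimized away
        (void)m;
    }
    return (double)(std::clock() - t0) / CLOCKS_PER_SEC;
}
// repeat over 200 sequences to get the mean and stdDev reported above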

Algorithms:

HeapSort : full sort of sequence using heap sort and pick middle value. Naive implementation using subscript access.

QuickSort: full in place sort of sequence using quick sort and pick middle value. Naive implementation using subscript access.

QuickMedian1: quick select algorithm with swapping. Naive implementation using subscript access.

HeapMedian1: in place balanced heap method with prior swapping. Naive implementation using subscript access.

NthElement : uses the nth_element STL algorithm. Data is copied into the vector using memcpy( vct.data(), rndVal, ... );

QuickMedian2: uses the quick select algorithm with pointers and a copy into two buffers to avoid swapping. Based on the proposal of MSalters.

HeapMedian2 : variant of my invented algorithm using dual heaps with shared heads. The left heap has its biggest value as head, the right heap its smallest. Initialize with the first value as common head and first median guess. Add subsequent values to the left heap if they are smaller than the head, otherwise to the right heap, until one of the heaps is full. A heap is full when it contains 14 values. Then consider only the full heap. If it is the right heap: for every value bigger than the head, pop the head and insert the value; ignore all other values. If it is the left heap: for every value smaller than the head, pop the head and insert the value; ignore all other values. When all values have been processed, the common head is the median value. It uses integer indices into the array. A version using (64-bit) pointers turned out to be nearly twice as slow (~1 s).

HeapMedian3 : same algorithm as HeapMedian2 but optimized. It uses unsigned char indices, avoids value swapping, and various other little things. The mean and stdDev values are computed over 1000 random sequences. For nth_element I measured 0.508 s with a stdDev of 0.159537 on the same 1000 random sequences. HeapMedian3 is thus 33 times faster than the nth_element STL function. Each returned median value is checked against the median returned by heapSort, and they all match. I doubt a hash-based method would be significantly faster.

EDIT 1: This algorithm can be further optimized. The first phase, where elements are dispatched into the left or right heap based on the comparison result, doesn't need heaps. It is sufficient to simply append elements to two unordered sequences. Phase one stops as soon as one sequence is full, which means it contains 14 elements (including the median value). The second phase starts by heapifying the full sequence and then proceeds as described for HeapMedian3. I'll provide the new code and benchmark as soon as possible.
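
In the meantime, here is a sketch of how that two-phase variant reads, using STL heap primitives instead of the hand-rolled heaps (an illustration of the description above, not the benchmarked code):

#include <algorithm>
#include <functional>

float heapMedianPhased( const float *a )
{
    float left[13], right[13];
    int nLeft = 0, nRight = 0;
    const float *p = a;
    float median = *p++;                  // first value as median guess

    // phase one: plain dispatch into unordered buffers
    while( nLeft < 13 && nRight < 13 )
    {
        float v = *p++;
        if( v < median ) left[nLeft++] = v;
        else             right[nRight++] = v;
    }

    int remaining = 27 - (int)(p - a);
    if( nLeft == 13 )                     // low side is full (14 values with the median)
    {
        std::make_heap( left, left + 13 );                 // max-heap below the median
        for( ; remaining; --remaining )
        {
            float v = *p++;
            if( v >= median ) continue;                    // belongs to the other side
            if( v >= left[0] ) { median = v; continue; }   // v becomes the new median
            std::pop_heap( left, left + 13 );              // old max becomes the median
            median = left[12];
            left[12] = v;
            std::push_heap( left, left + 13 );
        }
    }
    else                                  // high side is full
    {
        std::make_heap( right, right + 13, std::greater<float>() );  // min-heap above
        for( ; remaining; --remaining )
        {
            float v = *p++;
            if( v <= median ) continue;
            if( v <= right[0] ) { median = v; continue; }
            std::pop_heap( right, right + 13, std::greater<float>() );
            median = right[12];
            right[12] = v;
            std::push_heap( right, right + 13, std::greater<float>() );
        }
    }
    return median;
}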

EDIT 2: I implemented and benchmarked the optimized algorithm, but there is no significant performance difference compared to HeapMedian3; it is even slightly slower on average. The results shown above stand. There might be a difference with much larger sets. Note also that I simply pick the first value as the initial median guess. As suggested, one could benefit from the fact that we search for a median in overlapping value sets: using a median-of-medians approach would give a much better initial guess.


Source code of HeapMedian3

// return the median value in a vector of 27 floats pointed to by a
float heapMedian3( float *a )
{
   float left[14], right[14], median, *p;
   unsigned char nLeft, nRight;

   // pick first value as median candidate
   p = a;
   median = *p++;
   nLeft = nRight = 1;

   for(;;)
   {
       // get next value
       float val = *p++;

       // if value is smaller than median, append to left heap
       if( val < median )
       {
           // move biggest value to the heap top
           unsigned char child = nLeft++, parent = (child - 1) / 2;
           while( parent && val > left[parent] )
           {
               left[child] = left[parent];
               child = parent;
               parent = (parent - 1) / 2;
           }
           left[child] = val;

           // if left heap is full
           if( nLeft == 14 )
           {
               // for each remaining value
               for( unsigned char nVal = 27 - (p - a); nVal; --nVal )
               {
                   // get next value
                   val = *p++;

                   // if value is to be inserted in the left heap
                   if( val < median )
                   {
                       child = left[2] > left[1] ? 2 : 1;
                       if( val >= left[child] )
                           median = val;
                       else
                       {
                           median = left[child];
                           parent = child;
                           child = parent*2 + 1;
                           while( child < 14 )
                           {
                               if( child < 13 && left[child+1] > left[child] )
                                   ++child;
                               if( val >= left[child] )
                                   break;
                               left[parent] = left[child];
                               parent = child;
                               child = parent*2 + 1;
                           }
                           left[parent] = val;
                       }
                   }
               }
               return median;
           }
       }

       // else append to right heap
       else
       {
           // move smallest value to the heap top
           unsigned char child = nRight++, parent = (child - 1) / 2;
           while( parent && val < right[parent] )
           {
               right[child] = right[parent];
               child = parent;
               parent = (parent - 1) / 2;
           }
           right[child] = val;

           // if right heap is full
           if( nRight == 14 )
           {
               // for each remaining value
               for( unsigned char nVal = 27 - (p - a); nVal; --nVal )
               {
                   // get next value
                   val = *p++;

                   // if value is to be inserted in the right heap
                   if( val > median )
                   {
                       child = right[2] < right[1] ? 2 : 1;
                       if( val <= right[child] )
                           median = val;
                       else
                       {
                           median = right[child];
                           parent = child;
                           child = parent*2 + 1;
                           while( child < 14 )
                           {
                               if( child < 13 && right[child+1] < right[child] )
                                   ++child;
                               if( val <= right[child] )
                                   break;
                               right[parent] = right[child];
                               parent = child;
                               child = parent*2 + 1;
                           }
                           right[parent] = val;
                       }
                   }
               }
               return median;
           }
       }
   }
} 

Since it sounds like you're performing a median filter on a large array of volume data, you might want to take a look at the Fast Median and Bilateral Filtering paper from SIGGRAPH 2006. That paper deals with 2D image processing, but you might be able to adapt the algorithm for 3D volumes. If nothing else, it might give you some ideas on how to step back and look at the problem from a slightly different perspective.


The question cannot easily be answered, for the simple reason that the performance of one algorithm relative to another depends as much on the compiler / processor / data structure combination as on the algorithm itself, as you surely know.

Therefore your approach of trying a couple of them seems good enough. And yes, quicksort should be pretty fast. If you haven't done so, you might want to try insertionsort, which often performs better on small data sets. This said, just settle on a sorting algo that does the job fast enough. You will typically not get a 10-times speed-up just by picking the "right" algo.

To get substantial speed-ups, the better way frequently is to use more structure. Some ideas that worked for me in the past with large-scale problems:

  • Can you efficiently pre-calculate while creating the voxels and store 28 instead of 27 floats?

  • Is an approximate solution good enough? If so, just look at the median of, say, 9 values, since "in general it can be expected that values are relatively close." Or you can replace the median with the average, as long as the values are relatively close (see the sketch after this list).

  • Do you really need the median for all billions of voxels? Maybe you have an easy test whether you need the median, and can then only calculate for the relevant sub-set.

  • If nothing else helps: look at the asm code that the compiler generates. You might be able to write asm code that is substantially faster (e.g. by doing all the calcs using registers).
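
To illustrate the approximation idea from the list above, a sketch that takes the exact median of a 9-value subsample (taking every third value is an arbitrary choice):

#include <algorithm>

float approxMedian9( const float *a )
{
    float s[9];
    for( int i = 0; i < 9; ++i )
        s[i] = a[3*i];                    // subsample 9 of the 27 values
    std::nth_element( s, s + 4, s + 9 );  // 5th smallest of 9
    return s[4];
}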

Edit: For what it's worth, I have attached the (partial) insertionsort code mentioned in the comment below (totally untested). If numbers[] is an array of size N and you want the smallest P floats sorted at the beginning of the array, call partial_insertionsort<N, P, float>(numbers);. Hence, if you call partial_insertionsort<27, 13, float>(numbers);, numbers[13] will contain the median. To gain additional speed, you would have to unroll the while loop, too. As discussed above, to get really fast you have to use your knowledge about the data (e.g. is the data already partially sorted? Do you know properties of the distribution of the data? I guess you get the drift).

template <long i> class Tag{};

template<long i, long N, long P, typename T>
inline void partial_insertionsort_for(T a[], Tag<N>, Tag<i>)
{
    long j = i <= P+1 ? i : P+1;      // partial sort: never shift past position P+1
    T temp = a[i];
    a[i] = a[j];                      // compiler should optimize this away where possible
    while(j > 0 && temp < a[j - 1])   // test j > 0 before reading a[j - 1]
    {
        a[j] = a[j - 1];
        j--;
    }
    a[j] = temp;
    partial_insertionsort_for<i+1,N,P,T>(a, Tag<N>(), Tag<i+1>());
}

template<long i, long N, long P, typename T>
inline void partial_insertionsort_for(T a[], Tag<N>, Tag<N>){}

template <long N, long P, typename T>
inline void partial_insertionsort(T a[])
{
    partial_insertionsort_for<0,N,P,T>(a, Tag<N>(), Tag<0>());
}

The most likely algorithm to use in your first attempt is just nth_element; it pretty much gives you what you want directly. Just ask for the 14th element.

On your second attempt, the goal is to take advantage of the fixed data size. You do not want to allocate any memory at all during your algorithm. So, copy your voxel values to a pre-allocated array of 27 elements. Pick a pivot, and copy it to the middle of a 53-element array. Copy the remaining values to either side of the pivot. Here you keep two pointers (float* left = base+25, *right = base+27). There are now three possibilities: the left side is larger, the right side is larger, or both have 13 elements. The last case is trivial: your pivot is the median. Otherwise, call nth_element on either the left side or the right side. The exact value of Nth depends on how many values were larger or smaller than the pivot. For instance, if the division is 12/14, you need the smallest element bigger than the pivot, so Nth=0; if the division was 14/12, you need the biggest element smaller than the pivot, so Nth=13. The worst cases are 26/0 and 0/26, when your pivot was an extreme, but those happen only in 2 out of 27 cases.

The third improvement (or the first, if you have to use C and do not have nth_element) replaces nth_element entirely. You still have the 53 element array, but this time you fill it directly from the voxel values (saving you an interim copy into a float[27]). The pivot in this first iteration is just voxel[0][0][0]. For subsequent iterations, you use a second pre-allocated float[53] (easier if both are the same size) and copy floats between the two. The basic iteration step here is still: copy the pivot to the middle, sort the rest to the left and the right. At the end of each step, you'll know whether the median is smaller or larger than the current pivot, so you can discard the floats bigger or smaller than that pivot. Per iteration, this eliminates between 1 and 12 elements, with an average of 25% of the remaining.

The final improvement, if you still need more speed, is based on the observation that most of your voxels overlap significantly. You pre-calculate the median value for every 3x3x1 slice. Then, when you need an initial pivot for your 3x3x3 voxel cube, you take the median of the three. You know a priori that there are 9 voxels smaller and 9 voxels larger than that median of medians (4+4+1). So, after the first pivoting step, the worst cases are a 9/17 and a 17/9 split. You'd then only need to find the 4th or 13th element in a float[17], instead of the 12th or 14th in a float[26].
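
A sketch of that pivot seeding (sliceMedian[] stands for the hypothetical precomputed array of 3x3x1 slice medians):

#include <algorithm>

// median of three values
inline float medianOf3( float a, float b, float c )
{
    return std::max( std::min( a, b ), std::min( std::max( a, b ), c ) );
}

// pivot for the cube spanning slices z, z+1, z+2:
// float pivot = medianOf3( sliceMedian[z], sliceMedian[z+1], sliceMedian[z+2] );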


Background: The idea of copying first a pivot and then the rest of a float[N] to a float[2N-1], using left and right pointers, is that you fill a float[N] subarray around the pivot, with all elements smaller than the pivot to the left (lower index) and all higher to the right (higher index). Now, if you want the Mth element, you might find yourself lucky and have M-1 elements smaller than the pivot, in which case the pivot is the element you need. If there are more than (M-1) elements smaller than the pivot, the Mth element is amongst them, so you can discard the pivot and anything bigger than the pivot, and search for the Mth element in all the lower values. If there are fewer than (M-1) elements smaller than the pivot, you're looking for a value higher than the pivot. So, you'll discard the pivot and anything smaller than it. Let L be the number of elements less than the pivot, i.e. to the left of the pivot. In the next iteration, you want the (M-L-1)th element of the (N-L-1) floats that are bigger than the pivot.

This kind of nth_element algorithm is fairly efficient because most of the work is spent copying floats between two small arrays, both of which will be in cache, and because your state is most of the time represented by 3 pointers (source pointer, left destination pointer, right destination pointer).

To show the basic code:

float in[27], out[53];
float pivot = out[26] = in[0];     // pivot
float *left = out+25, *right = out+27;
for(int i = 1; i != 27; ++i)
    if(in[i] < pivot) *left-- = in[i]; else *right++ = in[i];
// Post-condition: The range (left+1, right) is initialized.
// There are 25-(left-out) floats <pivot and (right-out)-27 floats >pivot
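
Putting the pieces together, here is a sketch of the complete selection loop described above (my code, following the description; for the median of 27 voxels call select_mth(values, 27, 13)):

#include <algorithm>

// returns the m-th smallest value (0-based) of the n floats in values
float select_mth( const float *values, int n, int m )
{
    float bufA[53], bufB[53];
    const float *src = values;
    float *dst = bufA, *spare = bufB;

    for(;;)
    {
        float pivot = dst[26] = src[0];        // pivot goes to the middle
        float *left  = dst + 25;               // fills downward: values < pivot
        float *right = dst + 27;               // fills upward:   values >= pivot
        for( int i = 1; i != n; ++i )
        {
            float v = src[i];
            if( v < pivot ) *left-- = v; else *right++ = v;
        }
        int smaller = 25 - (int)(left - dst);  // count of values below the pivot

        if( smaller == m )
            return pivot;                      // the pivot is the element we want
        if( smaller > m )
        {   // discard the pivot and everything above it
            src = left + 1;
            n = smaller;
        }
        else
        {   // discard the pivot and everything below it
            src = dst + 27;
            n = n - smaller - 1;
            m = m - smaller - 1;
        }
        std::swap( dst, spare );               // write into the other buffer next round
    }
}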

A sorting network generated using the Bose-Nelson algorithm will find the median directly with no loops/recursion using 173 comparisons. If you have the facility to perform comparisons in parallel, such as vector-arithmetic instructions, then you may be able to group the comparisons into as few as 28 parallel operations.
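
For illustration, the compare-exchange primitive such a network is built from; the network itself is then just a fixed sequence of these exchanges (the 27-input pair list is omitted here):

#include <algorithm>

// after the call, a holds the smaller and b the larger value; with SSE,
// _mm_min_ps/_mm_max_ps evaluate several such pairs in parallel
inline void compareExchange( float &a, float &b )
{
    float lo = std::min( a, b );
    float hi = std::max( a, b );
    a = lo;
    b = hi;
}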

If you are sure that the floats are normalized and not (quiet or signaling) NaNs, then you can use integer operations to compare IEEE-754 floats, which can perform more favorably on some CPUs.
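
A sketch of that integer trick; it is valid only for non-negative, non-NaN floats (negative values would need their bit patterns remapped first):

#include <cstdint>
#include <cstring>

// for finite, non-negative IEEE-754 floats the raw bit patterns sort in
// the same order as the values, so an integer compare can stand in for
// a float compare
inline bool lessAsInt( float a, float b )
{
    std::int32_t ia, ib;
    std::memcpy( &ia, &a, sizeof ia );   // bit-exact reinterpretation
    std::memcpy( &ib, &b, sizeof ib );
    return ia < ib;
}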

A direct conversion of this sorting network to C (gcc 4.2) yields a worst-case of 388 clock cycles on my Core i7.

Sorting Networks