Very fast sorting of fixed length arrays using comparator networks

Tags: c++, arrays, sorting, template-meta-programming, sorting-network

I have some performance critical code that involves sorting a very short fixed-length array with between around 3 and 10 elements in C++ (the parameter changes at compile time).

It occurred to me that a static sorting network specialised to each possible input size would perhaps be a very efficient way to do this: We do all the comparisons necessary to figure out which case we are in, then do the optimal number of swaps to sort the array.

To apply this, we use a bit of template magic to deduce the array length and apply the correct network:

```
#include <iostream>
using namespace std;

template< int K >
void static_sort(const double(&array)[K])
{
    cout << "General static sort\n" << endl;
}

template<>
void static_sort<3>(const double(&array)[3])
{
    cout << "Static sort for K=3" << endl;
}

int main()
{
    double array[3];

    // performance critical code.
    // ...
    static_sort(array);
    // ...
}
```
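To make the idea concrete, here is a rough sketch of what a K=3 specialisation could do once the print statements are replaced by a real network. The names `compare_exchange` and `network_sort` are made up for this illustration, and the three-comparator sequence is just one valid network for three inputs.

```cpp
#include <cassert>
#include <utility>

// Compare-exchange: after the call, a <= b.
inline void compare_exchange(double &a, double &b) {
    if (b < a) std::swap(a, b);
}

// Generic case (hypothetical): declared but left undefined, so only sizes
// that have a hand-written network will link.
template <int K>
void network_sort(double (&array)[K]);

// K = 3: three compare-exchanges are enough to sort any three values.
template <>
inline void network_sort<3>(double (&array)[3]) {
    compare_exchange(array[0], array[1]);
    compare_exchange(array[1], array[2]);  // largest value is now in array[2]
    compare_exchange(array[0], array[1]);  // order the remaining two
    // Debug builds only: check the postcondition (assumes no NaN values).
    assert(array[0] <= array[1] && array[1] <= array[2]);
}
```

Calling `network_sort(values)` on a `double values[3]` deduces K=3 and picks the specialisation; the comparison sequence is fixed regardless of the data, which is what makes it a sorting network.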

Obviously it's quite a hassle to code all this up, so:

Does anyone have any opinions on whether or not this is worth the effort?
Does anyone know if this optimisation exists in any standard implementations of, for example, std::sort?
Is there an easy place to get hold of code implementing this kind of sorting network?
Perhaps it would be possible to generate a sorting network like this statically using template magic?

For now I just use insertion sort with a static template parameter (as above), in the hope that it will encourage unrolling and other compile-time optimisations.
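As a rough illustration of that approach (a sketch only, not the code from any repository), a size-templated insertion sort could look like the following. The name `static_insertion_sort` is invented for the example; the only point is that `N` is a compile-time constant, so the compiler is free to unroll the loops.

```cpp
#include <cstddef>

// Insertion sort whose length N is deduced from the array type at compile
// time, so the loop trip counts are known to the optimiser.
template <class T, std::size_t N>
void static_insertion_sort(T (&array)[N]) {
    for (std::size_t i = 1; i < N; ++i) {
        T key = array[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right.
        while (j > 0 && key < array[j - 1]) {
            array[j] = array[j - 1];
            --j;
        }
        array[j] = key;
    }
}
```

Whether the loops actually get unrolled depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming it.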

Your thoughts welcome.

Update: I wrote some testing code to compare a 'static' insertion sort and std::sort. (When I say static, I mean that the array size is fixed and deduced at compile time, presumably allowing loop unrolling etc.) I get at least a 20% net improvement (note that the generation is included in the timing). Platform: clang, OS X 10.9.

The code is here: https://github.com/rosshemsley/static_sorting, in case you would like to compare it against your own standard library implementation.

I have yet to find a nice set of implementations for comparator network sorters.


asked Nov 05 '13 13:11

Ross Hemsley

2 Answers

Here is a little class that uses the Bose-Nelson algorithm to generate a sorting network at compile time.

```
/**
 * A Functor class to create a sort for fixed sized arrays/containers with a
 * compile time generated Bose-Nelson sorting network.
 * \tparam NumElements  The number of elements in the array or container to sort.
 * \tparam Compare      A comparator functor class that returns true if lhs < rhs.
 */
template <unsigned NumElements, class Compare = void> class StaticSort
{
    // Compare-exchange using the user-supplied comparator.
    template <class A, class C> struct Swap
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            T t = Compare()(v0, v1) ? v0 : v1; // Min
            v1 = Compare()(v0, v1) ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    // Compare-exchange using operator< (no comparator given).
    template <class A> struct Swap <A, void>
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            // Explicitly code out the Min and Max to nudge the compiler
            // to generate branchless code.
            T t = v0 < v1 ? v0 : v1; // Min
            v1 = v0 < v1 ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    // Merge step of Bose-Nelson: merges the run of length X starting at I
    // with the run of length Y starting at J (1-based positions).
    template <class A, class C, int I, int J, int X, int Y> struct PB
    {
        inline PB(A &a)
        {
            enum { L = X >> 1, M = (X & 1 ? Y : Y + 1) >> 1, IAddL = I + L, XSubL = X - L };
            PB<A, C, I, J, L, M> p0(a);
            PB<A, C, IAddL, J + M, XSubL, Y - M> p1(a);
            PB<A, C, IAddL, J, XSubL, M> p2(a);
        }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 1>
    {
        inline PB(A &a) { Swap<A, C> s(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 2>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J); Swap<A, C> s1(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 2, 1>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J - 1); Swap<A, C> s1(a, I, J - 1); }
    };

    // Split step of Bose-Nelson: sorts both halves, then merges them.
    template <class A, class C, int I, int M, bool Stop = false> struct PS
    {
        inline PS(A &a)
        {
            enum { L = M >> 1, IAddL = I + L, MSubL = M - L };
            PS<A, C, I, L, (L <= 1)> ps0(a);
            PS<A, C, IAddL, MSubL, (MSubL <= 1)> ps1(a);
            PB<A, C, I, IAddL, L, MSubL> pb(a);
        }
    };

    // Recursion end: a run of at most one element is already sorted.
    template <class A, class C, int I, int M> struct PS <A, C, I, M, true>
    {
        inline PS(A &a) {}
    };

public:
    /**
     * Sorts the array/container arr.
     * \param  arr  The array/container to be sorted.
     */
    template <class Container> inline void operator() (Container &arr) const
    {
        PS<Container, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    }

    /**
     * Sorts the array arr.
     * \param  arr  The array to be sorted.
     */
    template <class T> inline void operator() (T *arr) const
    {
        PS<T*, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    }
};

#include <cstdlib>   // rand
#include <iostream>
#include <vector>

int main(int argc, const char * argv[])
{
    enum { NumValues = 32 };

    // Arrays
    {
        int rands[NumValues];
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    std::cout << "\n";

    // STL Vector
    {
        std::vector<int> rands(NumValues);
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    return 0;
}
```
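To see which compare-exchange pairs the templates end up emitting, the same recursion can be mirrored at run time. The sketch below is only an inspection aid under that assumption: `bose_split` follows the `PS` template, `bose_merge` follows `PB`, and the names `Pairs`, `bose_nelson` and `apply_network` are invented for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Pairs = std::vector<std::pair<int, int>>;

// Mirrors the PB template: merge a run of length x at i with a run of
// length y at j (i and j are 1-based, emitted pairs are 0-based).
void bose_merge(int i, int j, int x, int y, Pairs &out) {
    if (x == 1 && y == 1) {
        out.emplace_back(i - 1, j - 1);
    } else if (x == 1 && y == 2) {
        out.emplace_back(i - 1, j);
        out.emplace_back(i - 1, j - 1);
    } else if (x == 2 && y == 1) {
        out.emplace_back(i - 1, j - 1);
        out.emplace_back(i, j - 1);
    } else {
        const int l = x / 2;
        const int m = ((x & 1) ? y : y + 1) / 2;
        bose_merge(i, j, l, m, out);
        bose_merge(i + l, j + m, x - l, y - m, out);
        bose_merge(i + l, j, x - l, m, out);
    }
}

// Mirrors the PS template: sort both halves, then merge them.
void bose_split(int i, int m, Pairs &out) {
    if (m <= 1) return;  // a single element is already sorted
    const int l = m / 2;
    bose_split(i, l, out);
    bose_split(i + l, m - l, out);
    bose_merge(i, i + l, l, m - l, out);
}

// Returns the compare-exchange pairs for n inputs, in execution order.
Pairs bose_nelson(int n) {
    assert(n >= 0);
    Pairs out;
    bose_split(1, n, out);
    return out;
}

// Runs a network on v; v.size() must match the n the network was built for.
void apply_network(const Pairs &net, std::vector<int> &v) {
    for (const auto &p : net)
        if (v[p.second] < v[p.first]) std::swap(v[p.first], v[p.second]);
}
```

For four inputs this yields the pairs (0,1), (2,3), (0,2), (1,3), (1,2), and for six inputs it yields 12 pairs, which matches the "Sorting Networks 12" entries in the benchmark table.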

Benchmarks

The following benchmarks were compiled with clang -O3 and run on my mid-2012 MacBook Air.

Time (in milliseconds) to sort 1 million arrays.
The times for arrays of size 2, 4 and 8 are 1.943 ms, 8.655 ms and 20.246 ms respectively.
[Figure: C++ Templated Bose-Nelson Static Sort timings]

Here are the average clocks per sort for small arrays of 6 elements. The benchmark code and examples can be found in the question "Fastest sort of fixed length 6 int array".

```
Direct call to qsort library function   : 342.26
Naive implementation (insertion sort)   : 136.76
Insertion Sort (Daniel Stutzbach)       : 101.37
Insertion Sort Unrolled                 : 110.27
Rank Order                              : 90.88
Rank Order with registers               : 90.29
Sorting Networks (Daniel Stutzbach)     : 93.66
Sorting Networks (Paul R)               : 31.54
Sorting Networks 12 with Fast Swap      : 32.06
Sorting Networks 12 reordered Swap      : 29.74
Reordered Sorting Network w/ fast swap  : 25.28
Templated Sorting Network (this class)  : 25.01
```

It performs as fast as the fastest example in the question for 6 elements.

The code used for the benchmarks can be found here.

It includes more features and further optimizations for more robust performance on real-world data.


answered Sep 24 '22 01:09

Vectorized

The other answers are interesting and fairly good, but I believe I can add a few more elements, point by point:

Is it worth the effort? Well, if you need to sort small collections of integers and the sorting networks are tuned to take advantage of some instructions as much as possible, it might be worth the effort. The following graph presents the results of sorting a million arrays of int of size 0-14 with different sorting algorithms. As you can see, the sorting networks can provide a significant speedup if you really need it.

No standard implementation of std::sort that I know of uses sorting networks; when they are not fine-tuned, they might be slower than a straight insertion sort. libc++'s std::sort has dedicated algorithms to sort 0 through 5 values at once, but it doesn't use sorting networks either. The only sorting algorithm I know of which uses sorting networks to sort a few values is WikiSort. That said, the research paper Applying Sorting Networks to Synthesize Optimized Sorting Libraries suggests that sorting networks could be used to sort small arrays or to improve recursive sorting algorithms such as quicksort, but only if they are fine-tuned to take advantage of specific hardware instructions.

The access aligned sort algorithm is some kind of bottom-up mergesort that apparently uses bitonic sorting networks implemented with SIMD instructions for the first pass. Apparently, the algorithm could be faster than the standard library one for some scalar types.
As for where to get hold of such code: I can actually provide that information, for the simple reason that I developed a C++14 sorting library that happens to provide efficient sorting networks of size 0 through 32 that implement the optimizations described in the previous section. I used it to generate the graph in the first section. I am still working on the sorting networks part of the library to provide size-optimal, depth-optimal and swaps-optimal networks. Small optimal sorting networks are found with brute force, while bigger sorting networks use results from the literature.

Note that none of the sorting algorithms in the library directly use sorting networks, but you can adapt them so that a sorting network will be picked whenever the sorting algorithm is given a small std::array or a small fixed-size C array:
```
using namespace cppsort;

// Sorters are function objects that can be
// adapted with sorter adapters from the
// library
using sorter = small_array_adapter<
    std_sorter,
    sorting_network_sorter
>;

// Now you can use it as a function
sorter sort;

// Instead of a size-agnostic sorting algorithm,
// sort will use an optimal sorting network for
// 5 inputs since the bound of the array can be
// deduced at compile time
int arr[] = { 2, 4, 7, 9, 3 };
sort(arr);
```
As mentioned above, the library provides efficient sorting networks for built-in integers, but you're probably out of luck if you need to sort small arrays of something else (e.g. my latest benchmarks show that they are not better than a straight insertion sort even for long long int).
You could probably use template metaprogramming to generate sorting networks of any size, but no known algorithm can generate the best sorting networks, so you might as well write the best ones by hand. I don't think the ones generated by simple algorithms can actually provide usable and efficient networks anyway (Batcher's odd-even sort and pairwise sorting networks might be the only usable ones) [Another answer seems to show that generated networks could actually work].
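For reference, Batcher's odd-even mergesort is simple enough that its comparator list can be produced by a few nested loops. The sketch below uses the well-known iterative formulation and generates the pairs at run time rather than with templates; the names `Network`, `batcher_network` and `run_network` are invented for the example and are not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Network = std::vector<std::pair<int, int>>;

// Comparator list of Batcher's odd-even mergesort for n inputs
// (iterative formulation; n does not have to be a power of two).
Network batcher_network(int n) {
    assert(n >= 0);
    Network net;
    for (int p = 1; p < n; p *= 2)                // size of the sorted runs
        for (int k = p; k >= 1; k /= 2)           // comparator distance
            for (int j = k % p; j <= n - 1 - k; j += 2 * k)
                for (int i = 0; i <= std::min(k - 1, n - j - k - 1); ++i)
                    // Only compare elements that belong to the same merge.
                    if ((i + j) / (2 * p) == (i + j + k) / (2 * p))
                        net.emplace_back(i + j, i + j + k);
    return net;
}

// Applies the network in place with compare-exchange operations.
template <class T>
void run_network(const Network &net, std::vector<T> &v) {
    for (const auto &c : net)
        if (v[c.second] < v[c.first]) std::swap(v[c.first], v[c.second]);
}
```

For four inputs this produces the five pairs (0,1), (2,3), (0,2), (1,3), (1,2). Such networks are not size-optimal in general, which is why hand-picked networks can still win for small sizes.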

answered Sep 25 '22 01:09

Morwenn
