Why is <code>std::rotate</code> so much faster than the equivalent function that cplusplus.com describes? cplusplus.com's implementation: <pre class="prettyprint"><code>template <class ForwardIterator> void rotate (ForwardIterator first, ForwardIterator middle, ForwardIterator last) { ForwardIterator next= middle; while (first != next) { swap (*first++, *next++); if(next == last) next= middle; else if (first==middle) middle= next; } } </code></pre> I have two insertion sorting algorithms which are entirely identical, with the exception that one uses <code>std::rotate</code>, and one uses cplusplus.com's equivalent function. I'm setting them to sort 1000 vectors with 1000 <code>int</code> elements. The sort which uses <code>std::rotate</code> takes 0.376 seconds, and the other takes 8.181 seconds. Why is this? I'm not intending to try and make something better than the STL functions but I'm still curious.

As the commentors already stated, it depends on your Standard Library implementation. But the code that you posted is valid even for forward iterators. As such, it imposes very little requirements (only that these iterators can be incremented and dereferenced). Stepanov's classic Elements of Programming devotes an entire chapter (10) to <code>rotate</code> and other rearrangement algorithms. For forward iterators, the series of swaps in your code gives <code>O(3N)</code> assignments. For bidirectional iterators, three consecutive calls to <code>reverse</code> yield another <code>O(3N)</code> algorithm. For random access iterators, <code>std::rotate</code> can be implemented as <code>O(N)</code> assignments by defining a permutation of indices w.r.t. to the starting iterator <code>first</code>. All the above algorithms are in-place. Using a memory buffer, it is possible that the random access version can benefit from the greater cache locality of <code>memcpy()</code> or <code>memmove()</code> (if the underlying value type is POD) in which entire blocks of contiguous memory can be swapped. If your insertion sort is done on an array or <code>std::vector</code>, it is likely that your Standard Library will take advantage of this optimization. TL;DR: trust your Standard Library and don't reinvent the wheel!

Why is std::rotate so fast?

Tags:

c++

algorithm

sorting

c++11

stl

Why is std::rotate so much faster than the equivalent function that cplusplus.com describes?

cplusplus.com's implementation:

template <class ForwardIterator>
  void rotate (ForwardIterator first, ForwardIterator middle, ForwardIterator last)
{
  ForwardIterator next= middle;

  while (first != next)
  {
    swap (*first++, *next++);

    if(next == last)
        next= middle;
    else if (first==middle)
        middle= next;
  }
}

I have two insertion sorting algorithms which are entirely identical, with the exception that one uses std::rotate, and one uses cplusplus.com's equivalent function. I'm setting them to sort 1000 vectors with 1000 int elements. The sort which uses std::rotate takes 0.376 seconds, and the other takes 8.181 seconds.

Why is this? I'm not intending to try and make something better than the STL functions but I'm still curious.

244

asked Jan 16 '14 11:01

bradshire

Video Answer

2 Answers

As the commentors already stated, it depends on your Standard Library implementation. But the code that you posted is valid even for forward iterators. As such, it imposes very little requirements (only that these iterators can be incremented and dereferenced).

Stepanov's classic Elements of Programming devotes an entire chapter (10) to rotate and other rearrangement algorithms. For forward iterators, the series of swaps in your code gives O(3N) assignments. For bidirectional iterators, three consecutive calls to reverse yield another O(3N) algorithm. For random access iterators, std::rotate can be implemented as O(N) assignments by defining a permutation of indices w.r.t. to the starting iterator first.

All the above algorithms are in-place. Using a memory buffer, it is possible that the random access version can benefit from the greater cache locality of memcpy() or memmove() (if the underlying value type is POD) in which entire blocks of contiguous memory can be swapped. If your insertion sort is done on an array or std::vector, it is likely that your Standard Library will take advantage of this optimization.

TL;DR: trust your Standard Library and don't reinvent the wheel!

answered Oct 07 '22 12:10

TemplateRex

Edit:

Since the context is not given, it is not clear if your code calls std::swap() or another swap(a,b) algorithm like

T tmp = a; a = b; b = tmp;

When a and b are vectors of 1000 ints each, this would copy all vector elements 3 times. The specialized version of std::swap() for containers like std::vector<T> call the container a.swap(b) method instead, essentially swapping only the dynamic data pointers of the containers.

Also, for different iterator types the std::rotate() implementation can utilize some optimizations (see my older, possibly misleading answer below).

Caveat: The implementation of std::rotate() is implementation-dependent. For different iterator categories, different algorithms can be utilized (e.g. look for __rotate( in bits/stl_algo.h header of GNU g++).

To shift n elements by m=std::distance(first,middle) a simple (naive) algorithm like m rotations by one element needs O(n*m) moving or copying operations. But only O(n) moves are needed, when each element is directly placed to its right position, which results in a (roughly) m times faster algorithm.

An example for illustration: Rotate a string s = "abcdefg" by three elements:

abcdefg : store 'a' in temporary place
dbcdefg : move s[3] to s[0] (where it belongs in the end, directly)
dbcgefg : move s[6] to s[3]
dbcgefc : move s[9%7] to s[6] (wrapping index modulo container size: 9%7 == 2)
dbfgefc : move s[5] to s[2]
dbfgebc : move s[1] to s[5] (another wrapping around)
defgebc : move s[4] to s[1]
defgabc : move 'a' from temporary place to s[4]

For n and m with greatest common divisor 1 you are done now. Otherwise you have to repeat that scheme n/m time for the first m consecutive elements (n > m assumed here). This little more complicated algorithm is much faster.

For bidirectional iterators another legendary O(3n) algorithm can be used, referred to as "flipping hands". According to Jon Bentley's book Programming Pearls it was used in early UNIX editors for moving text:

Place your hands in front of you, one above the other, thumbs up. Now

Turn one hand.
Turn the other.
Turn both, connected to each other.

In code:

reverse(first, middle);
reverse(middle, last);
reverse(first, last);

For random access iterators large chunks of memory can be relocated by swap_ranges() (or memmove() operations for POD types).

Microoptimization by utilising assembler operations can give a small extra amount of acceleration, it can be done on top of the fasted algorithm.

Algorithms using consecutive elements instead of "hopping around" in memory also result in smaller number of cache misses on modern computer architectures.

answered Oct 07 '22 13:10

René Richter

Related questions
                            
                                How do I correctly organize output into columns?
                            
                                How to use floating point tolerances in the Catch framework?
                            
                                Why are (member) function pointers behaving so weirdly in Visual C++?
                            
                                Qt QHBoxLayout percentage size
                            
                                Conditions under which compiler will not define implicits (constructor, destructor, copy constructor, copy assignment) [duplicate]
                            
                                Error: C2228: left of '' must have class/struct/union
                            
                                Are the addresses of two temporaries guaranteed to be different in the same expression?
                            
                                Handling overflow when casting doubles to integers in C
                            
                                What does the "L" mean at the end of an integer literal?
                            
                                Reading from ifstream won't read whitespace
                            
                                Why does passing object reference arguments to thread function fails to compile?
                            
                                Why doesn't floor return an integer? [duplicate]
                            
                                Efficiently find an integer not in a set of size 40, 400, or 4000
                            
                                How to be notified of file/directory change in C/C++, ideally using POSIX
                            
                                What are the different calling conventions in C/C++ and what do each mean?
                            
                                GCC error: explicit specialization in non-namespace scope
                            
                                What can a 'const' method change?
                            
                                What is the point of an SDL2 Texture?
                            
                                Why is dereferencing a pointer called dereferencing?
                            
                                Why doesn't PRIu64 work in this code?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With