Performance of x86 rep instructions on modern (pipelined/superscalar) processors

I've been writing x86 assembly lately (for fun) and was wondering whether rep-prefixed string instructions actually have a performance edge on modern processors, or whether they're only implemented for backward compatibility.

I can understand why Intel would have originally implemented the rep instructions back when processors only ran one instruction at a time, but is there a benefit to using them now?

With a loop that compiles to more instructions, there is more to fill the pipeline and/or be issued out of order. Are modern processors built to optimize these rep-prefixed instructions, or are rep instructions used so rarely in modern code that they don't matter to the manufacturers?

asked Dec 08 '11 by RyanS

3 Answers

There is a lot of space given to questions like this in both AMD's and Intel's optimization guides. Advice in this area has a "half-life", though: different CPU generations behave differently. For example:

  • AMD Software Optimization Guide (Sep/2005), section 8.3, pg. 167:
    Avoid using the REP prefix when performing string operations, especially when copying blocks of memory.
  • AMD Software Optimization Guide (Apr/2011), section 9.3, pg. 148:
    Use the REP prefix judiciously when performing string operations.

The Intel Architecture Optimization Manual gives performance comparison figures for various block copy techniques (including rep stosd) in Table 7-2, "Relative Performance of Memory Copy Routines" (pg. 7-37f.), for different CPUs; again, what's fastest on one might not be the fastest on others.

For many cases, recent x86 CPUs (those that have the SSE4.2 "string" instructions) can do string operations via the SIMD units; see this investigation.
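
To illustrate that approach, here is a minimal, hedged sketch (mine, not taken from the linked investigation) of an SSE4.2-based strlen() using the PCMPISTRI intrinsic _mm_cmpistri. It has to be compiled with -msse4.2, and a production version would first align the pointer to 16 bytes so the unaligned loads cannot stray into an unmapped page:

#include <nmmintrin.h>   /* SSE4.2 intrinsics (PCMPISTRI and friends) */
#include <stddef.h>

/* Sketch only: scan 16 bytes at a time. With an all-zero (empty) first
   operand and _SIDD_CMP_EQUAL_EACH, PCMPISTRI returns the index of the
   terminating '\0' within the chunk, or 16 if the chunk has no '\0'. */
size_t strlen_sse42(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    size_t offset = 0;

    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + offset));
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS |
                               _SIDD_CMP_EQUAL_EACH |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)
            return offset + (size_t)idx;   /* found the terminator */
        offset += 16;                      /* keep scanning */
    }
}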

To follow up on all this (and/or keep yourself updated when things change again, inevitably), read Agner Fog's Optimization guides/blogs.

answered by FrankH.


In addition to FrankH.'s excellent answer, I'd like to point out that the best method also depends on the length of the string, its alignment, and whether the length is fixed or variable.

For small strings (maybe up to about 16 bytes), doing it manually with simple instructions is probably faster, as it avoids the setup cost of more complex techniques (and for fixed-size strings the copy can easily be unrolled). For medium-sized strings (maybe from 16 bytes to 4 KiB), something like REP MOVSD (with a few MOVSB instructions thrown in if misalignment is possible) is likely to be best; a sketch of that idea follows below.
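
As a rough illustration of that medium-size approach (my sketch under the stated assumptions, not code from this answer), here is a GCC/Clang inline-asm version for x86-64. It copies whole 32-bit words with REP MOVSD (AT&T mnemonic movsl) and the 0-3 leftover bytes with REP MOVSB, and relies on the SysV ABI guarantee that the direction flag is already clear; the function name is illustrative only:

#include <stddef.h>

/* Illustrative sketch only. rep movsl/movsb advance rdi/rsi automatically,
   so the second instruction continues where the first one stopped. */
static void copy_rep_movsd(void *dst, const void *src, size_t n)
{
    size_t dwords = n >> 2;   /* whole 32-bit words for rep movsl (MOVSD) */
    size_t tail   = n & 3;    /* leftover bytes for rep movsb */

    asm volatile("rep movsl"
                 : "+D" (dst), "+S" (src), "+c" (dwords)
                 :
                 : "memory");
    asm volatile("rep movsb"
                 : "+D" (dst), "+S" (src), "+c" (tail)
                 :
                 : "memory");
}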

For anything larger than that, some people would be tempted to go for SSE/AVX, prefetching, etc. A better idea is to fix the caller(s) so that the copy (or strlen() or whatever) isn't needed in the first place; if you try hard enough, you'll almost always find a way. Note: also be very wary of supposedly fast memcpy() routines, which are typically tested on massive strings and not on the far more likely tiny/small/medium ones.

Also note that (for the purpose of optimisation rather than convenience), because of all these differences (likely length, alignment, fixed or variable size, CPU type, etc.), the idea of having one multi-purpose memcpy() covering all of these very different cases is short-sighted.

answered by Brendan


Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. About 60% of the objects to be copied are 16 bytes long, and another 30% or so are roughly 500 to 8000 bytes.

  • Precondition: dst, src and n are all multiples of 8.
  • Processor: AMD Phenom(tm) II X6 1090T, 64-bit Linux

Here are my three memcpy variants:

Hand-coded while-loop:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    ptr *end = dst + n_ptrs;
    while (dst < end) {
        *dst++ = *src++;
    }
}

(ptr is an alias to uintptr_t). Time: 101.16%

rep movsb

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    asm volatile("cld\n\t"
                 "rep ; movsb"
                 : "=D" (dst), "=S" (src)
                 : "c" (n), "D" (dst), "S" (src)
                 : "memory");
}

Time: 103.22%

rep movsq

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    asm volatile("cld\n\t"
                 "rep ; movsq"
                 : "=D" (dst), "=S" (src)
                 : "c" (n_ptrs), "D" (dst), "S" (src)
                 : "memory");
}

Time: 100.00%

rep movsq wins by a tiny margin.

answered by Björn Lindqvist