I've been writing in x86 assembly lately (for fun) and was wondering whether rep-prefixed string instructions actually have a performance edge on modern processors, or whether they're implemented only for backward compatibility.
I can understand why Intel would have originally implemented the rep instructions back when processors only ran one instruction at a time, but is there a benefit to using them now?
With a loop that compiles to more instructions, there is more to fill up the pipeline and/or be issued out-of-order. Are modern processors built to optimize for these rep-prefixed instructions, or are rep instructions used so rarely in modern code that they're not important to the manufacturers?
There is a lot of space given to questions like this in both AMD's and Intel's optimization guides. Advice in this area has a "half-life": different CPU generations behave differently. For example:
The Intel Architecture Optimization Manual gives performance comparison figures for various block copy techniques (including rep stosd) in Table 7-2, "Relative Performance of Memory Copy Routines", pg. 7-37f., for different CPUs; again, what's fastest on one might not be fastest on others.
For many cases, recent x86 CPUs (which have the "string" SSE4.2 operations) can do string operations via the SIMD unit, see this investigation.
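Because the winning technique varies by CPU generation, one option is to detect features at runtime and dispatch. As an illustrative sketch (not from the answers above): Intel advertises "Enhanced REP MOVSB/STOSB" (ERMSB) in CPUID leaf 7, subleaf 0, EBX bit 9, which can be queried with GCC/Clang's cpuid.h:

```c
#include <cpuid.h>

/* Sketch: detect Intel's "Enhanced REP MOVSB/STOSB" (ERMSB) feature,
 * advertised in CPUID leaf 7, subleaf 0, EBX bit 9.
 * Assumes GCC or Clang on x86 for <cpuid.h>. */
static int has_ermsb(void)
{
    unsigned eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 if the leaf is unsupported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 9) & 1;
}
```

On CPUs where this bit is set, rep movsb is typically tuned to be competitive with (or better than) hand-rolled copy loops for medium and large blocks.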
To follow up on all this (and/or keep yourself updated when things change again, inevitably), read Agner Fog's Optimization guides/blogs.
In addition to FrankH's excellent answer, I'd like to point out that which method is best also depends on the length of the string, its alignment, and whether the length is fixed or variable.
For small strings (maybe up to about 16 bytes) doing it manually with simple instructions is probably faster, as it avoids the setup costs of more complex techniques (and for fixed size strings can be easily unrolled). For medium sized strings (maybe from 16 bytes to 4 KiB) something like "REP MOVSD" (with some "MOVSB" instructions thrown in if misalignment is possible) is likely to be best.
For anything larger than that, some people would be tempted to go into SSE/AVX and prefetching, etc. A better idea is to fix the callers so that copying (or strlen() or whatever) isn't needed in the first place. If you try hard enough, you'll almost always find a way. Note: Also be very wary of "supposedly" fast memcpy() routines - typically they've been tested on massive strings and not on the far more likely tiny/small/medium strings.
Also note that (for the purpose of optimisation rather than convenience) due to all these differences (likely length, alignment, fixed or variable size, CPU type, etc) the idea of having one multi-purpose "memcpy()" for all of the very different cases is near-sighted.
Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. Of my objects to be copied, 60% are 16 bytes in length and another 30% are 500 - 8000 bytes or so. Precondition: dst, src and n are all multiples of 8. Here are my three memcpy variants:
Hand-coded while-loop:
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        ptr *end = dst + n_ptrs;
        while (dst < end) {
            *dst++ = *src++;
        }
    }
(ptr is an alias for uintptr_t). Time: 101.16%
rep movsb
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        /* "+" constraints: rep movsb modifies rdi, rsi and rcx. */
        asm volatile("cld\n\t"
                     "rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (n)
                     :
                     : "memory");
    }
Time: 103.22%
rep movsq
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        /* "+" constraints: rep movsq modifies rdi, rsi and rcx. */
        asm volatile("cld\n\t"
                     "rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                     :
                     : "memory");
    }
Time: 100.00%
rep movsq wins by a tiny margin.