When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset. These are usually implemented using streaming instructions if available, loop unrolled, alignment optimized, etc... I also recently encountered this 'bug' due to memcpy optimization in glibc. The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of <pre class="prettyprint"><code>rep stos </code></pre> and <pre class="prettyprint"><code>rep movs </code></pre> to be recognized as such, and do the fastest fill and copy as possible on their own architecture?

Cost. (Note that memcpy is coming to ARM, see below.) The cost of optimizing <code>memcpy</code> in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so when processor features change enough to warrant a rewrite. For example, GNU's <code>glibc</code> and Apple's <code>libSystem</code> both have a <code>memcpy</code> which is specifically optimized for SSE3. The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects: <ul> <li>Increased power consumption</li> <li>Increased unit cost</li> <li>Increased latency for certain CPU subsystems</li> <li>Lower maximum clock speed</li> </ul> In theory, it could have an overall negative impact on both performance and unit cost. Maxim: Don't do it in hardware if the software solution is good enough. But, memcpy is coming to ARM. As processors get larger and larger, the incremental cost of adding more instructions gets lower and lower, relative to the existing cost of a core. From Arm A-Profile Architecture Developments 2021: <blockquote> To address these concerns the 2021 extensions introduce new instructions specifically targeting the memcpy() and memset() family of functions. </blockquote> The key concerns mentioned are that the complicated software implementations of memcpy, while fast, may need to be rewritten for different microarchitectures in order to get better performance. They also need to account for alignment and size in different ways. Having a fast hardware implementation means that memcpy can be inlined and get good performance across different microarchitectures. Note: The bug you've cited is not really a bug in <code>glibc</code> w.r.t. the C specification. It's more complicated. Basically, the glibc folks say that <code>memcpy</code> behaves exactly as advertised in the standard, and some other folks are complaining that <code>memcpy</code> should be aliased to <code>memmove</code>. Time for a story: It reminds me of a complaint that a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process. But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors. The moral of the story is that better performance comes both from software and hardware improvements. In general, the trend seems to be in the opposite direction from the kind of hardware optimizations mentioned. While in x86 it's easy to write <code>memcpy</code> in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel IA64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which commands and which commands can be done at the same time, something that a modern x86 will do for you (unless it's an Atom). So writing <code>memcpy</code> suddenly gets much, much harder. These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessors while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices are is the major factor impeding wider adoption.

One thing I'd like to add to the other answers is that <code>rep movs</code> is not actually slow on all modern processors. For instance, <blockquote> Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned. </blockquote> [Highlighting is mine.] Reference: <a href="http://www.agner.org/optimize/optimizing_assembly.pdf" rel="noreferrer">Agner Fog, Optimizing subroutines in assembly language An optimization guide for x86 platforms. </a>,p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].

Why are complicated memcpy/memset superior?

Tags:

When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset. These are usually implemented using streaming instructions if available, loop unrolled, alignment optimized, etc... I also recently encountered this 'bug' due to memcpy optimization in glibc.

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of

rep stos

and

rep movs

to be recognized as such, and do the fastest fill and copy as possible on their own architecture?

383

asked Jan 13 '12 23:01

Yakov Galka

2 Answers

Cost.

(Note that memcpy is coming to ARM, see below.)

The cost of optimizing memcpy in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so when processor features change enough to warrant a rewrite. For example, GNU's glibc and Apple's libSystem both have a memcpy which is specifically optimized for SSE3.

The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects:

Increased power consumption
Increased unit cost
Increased latency for certain CPU subsystems
Lower maximum clock speed

In theory, it could have an overall negative impact on both performance and unit cost.

Maxim: Don't do it in hardware if the software solution is good enough.

But, memcpy is coming to ARM. As processors get larger and larger, the incremental cost of adding more instructions gets lower and lower, relative to the existing cost of a core. From Arm A-Profile Architecture Developments 2021:

To address these concerns the 2021 extensions introduce new instructions specifically targeting the memcpy() and memset() family of functions.

The key concerns mentioned are that the complicated software implementations of memcpy, while fast, may need to be rewritten for different microarchitectures in order to get better performance. They also need to account for alignment and size in different ways. Having a fast hardware implementation means that memcpy can be inlined and get good performance across different microarchitectures.

Note: The bug you've cited is not really a bug in glibc w.r.t. the C specification. It's more complicated. Basically, the glibc folks say that memcpy behaves exactly as advertised in the standard, and some other folks are complaining that memcpy should be aliased to memmove.

Time for a story: It reminds me of a complaint that a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process.

But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors.

The moral of the story is that better performance comes both from software and hardware improvements.

In general, the trend seems to be in the opposite direction from the kind of hardware optimizations mentioned. While in x86 it's easy to write memcpy in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel IA64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which commands and which commands can be done at the same time, something that a modern x86 will do for you (unless it's an Atom). So writing memcpy suddenly gets much, much harder.

These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessors while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices are is the major factor impeding wider adoption.

answered Sep 24 '22 02:09

Dietrich Epp

One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance,

Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned.

[Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language An optimization guide for x86 platforms. ,p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].

answered Sep 23 '22 02:09

PhiS

Related questions
                            
                                How to set breakpoint on a particular file in Perl program?
                            
                                Node.js / Express.js - How to override/intercept res.render function?
                            
                                Best way to store an array in MySQL database?
                            
                                Get output parameter value
                            
                                How do Ruby programmers do type checking?
                            
                                Can you recognize an infinite list in a Haskell program? [duplicate]
                            
                                What does JavaScript Routing buy you?
                            
                                Arguments for usage choice: RavenDB vs SQL Server - performance, reliability and simplicity [closed]
                            
                                Checking if ALL form inputs are empty with jQuery
                            
                                Django lazy QuerySet and pagination
                            
                                How to handle lack of JavaScript Object.bind() method in IE 8
                            
                                Add data to JSONObject

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With