
Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?

If so, why doesn't GCC generate SIMD instructions for these library functions by default?

Also, are there any other functions that could be improved by SIMD?

limi asked Mar 16 '11 05:03




2 Answers

Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library/compiler intrinsics included optimized versions, but that doesn't seem to be pervasive.

I have a custom SIMD memchr which is a hell-of-a-lot faster than the library version, especially when I'm finding the first of 2 or 3 characters (for example, to know whether there's an equation in this line of text, I search for the first of =, \n, \r).
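The answer does not show its implementation, so the following is only a minimal sketch of the general idea, assuming SSE2 is available. The function name `find_first_of3` and its signature are hypothetical, and the scalar loop serves both as the tail handler and as the fallback on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper (not from the answer): returns a pointer to the first
   byte in buf[0..len) equal to a, b, or c, or NULL if none is present. */
static const char *find_first_of3(const char *buf, size_t len,
                                  char a, char b, char c)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;

#if defined(__SSE2__)
    /* Broadcast each needle byte into all 16 lanes of a vector. */
    const __m128i va = _mm_set1_epi8(a);
    const __m128i vb = _mm_set1_epi8(b);
    const __m128i vc = _mm_set1_epi8(c);

    /* Compare 16 input bytes against all three needles per iteration. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i hit = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(chunk, va),
                         _mm_cmpeq_epi8(chunk, vb)),
            _mm_cmpeq_epi8(chunk, vc));
        int mask = _mm_movemask_epi8(hit);  /* one bit per matching byte */
        if (mask != 0) {
            size_t off = 0;
            while ((mask & 1) == 0) {       /* locate the lowest set bit */
                mask >>= 1;
                off++;
            }
            return buf + i + off;
        }
    }
#endif

    /* Scalar loop: handles the tail, or everything when SSE2 is absent. */
    for (; i < len; i++) {
        if (buf[i] == a || buf[i] == b || buf[i] == c)
            return buf + i;
    }
    return NULL;
}
```

For the equation example above, a call would look like `find_first_of3(line, line_len, '=', '\n', '\r')`. Whether this beats a given library's memchr depends on the library and the input sizes, so it is worth measuring before adopting it.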

On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.

Ben Voigt answered Sep 18 '22 17:09


It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.

You may need to explicitly instruct GCC to emit SSE opcodes with e.g. -msse -msse2; some GCCs do not enable them by default. Also, if you do not tell GCC to optimize (i.e., -O2), it won't even try to emit fast code.
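One way to check what your own compiler does is to write a plain memcpy with a compile-time-constant size and look at the generated assembly. The sketch below uses a hypothetical `struct packet` and `copy_packet` purely for illustration; what GCC actually emits for it depends on the version, target, and flags, so treat it as an experiment, not a guarantee.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 64-byte record, used only for illustration. */
struct packet {
    unsigned char bytes[64];
};

/* A plain memcpy with a constant size. With optimization enabled, GCC
   commonly expands such a call inline instead of calling the library
   routine; which instructions it picks depends on the target and flags. */
static void copy_packet(struct packet *dst, const struct packet *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

Compiling a file containing this with something like `gcc -O2 -msse2 -S` and reading the resulting assembly shows whether vector moves were used for the copy on your setup.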

The use of SIMD opcodes for memory work like this can have a massive performance impact, because the SSE instruction set also includes cache prefetches and other cache-control hints (such as non-temporal stores) that are important for optimizing bus access. But that doesn't mean that you need to emit them manually; even though most compilers stink at emitting SIMD ops generally, every one I've used at least handles them for the basic CRT memory functions.
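For reference, the cache-related instructions mentioned above are exposed as intrinsics. The following is a rough sketch, assuming SSE2, of a copy that issues a prefetch hint and uses non-temporal (streaming) stores. The name `stream_copy`, the alignment requirement, and the prefetch distance of four vectors are assumptions made for this illustration, not anything a particular library does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2; also pulls in the SSE prefetch/fence intrinsics */
#endif

/* Hypothetical helper: copies len bytes from src to dst.
   Assumes dst is 16-byte aligned and len is a multiple of 16. */
static void stream_copy(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert((len % 16) == 0);

#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    const size_t n = len / 16;

    for (size_t i = 0; i < n; i++) {
        /* Hint that data a few vectors ahead will be read soon.
           The distance of 4 is an arbitrary choice for this sketch. */
        if (i + 4 < n)
            _mm_prefetch((const char *)(s + i + 4), _MM_HINT_T0);

        __m128i v = _mm_loadu_si128(s + i);  /* source may be unaligned */
        _mm_stream_si128(d + i, v);          /* store that bypasses the cache */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    memcpy(dst, src, len);  /* fallback when SSE2 is not available */
#endif
}
```

Streaming stores tend to help only for large buffers that will not be read again soon, and hardware prefetchers often make the explicit hint unnecessary for sequential access, so this is something to benchmark, which is also why leaving it to the library routine is usually the easier choice.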

Basic math functions can also benefit a lot from setting the compiler to SSE mode (with GCC on 32-bit x86, for example, via -mfpmath=sse). You can easily get an 8x speedup on basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU.

Crashworks answered Sep 21 '22 17:09