I learned that memset(ptr, 0, nbytes)
is really fast, but is there a faster way (at least on x86)?
I assume that memset uses mov
, however when zeroing memory most compilers use xor
as it's faster, correct? edit1: Wrong, as GregS pointed out that only works with registers. What was I thinking?
Also I asked a person who knew of assembler more than me to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32 bit wide registers. However at that time I was very tired, so I'm not quite sure I understood it correctly.
edit2: I revisited this issue and did a little testing. Here is what I tested:
#include <stdio.h> #include <malloc.h> #include <string.h> #include <sys/time.h> #define TIME(body) do { \ struct timeval t1, t2; double elapsed; \ gettimeofday(&t1, NULL); \ body \ gettimeofday(&t2, NULL); \ elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \ printf("%s\n --- %f ---\n", #body, elapsed); } while(0) \ #define SIZE 0x1000000 void zero_1(void* buff, size_t size) { size_t i; char* foo = buff; for (i = 0; i < size; i++) foo[i] = 0; } /* I foolishly assume size_t has register width */ void zero_sizet(void* buff, size_t size) { size_t i; char* bar; size_t* foo = buff; for (i = 0; i < size / sizeof(size_t); i++) foo[i] = 0; // fixes bug pointed out by tristopia bar = (char*)buff + size - size % sizeof(size_t); for (i = 0; i < size % sizeof(size_t); i++) bar[i] = 0; } int main() { char* buffer = malloc(SIZE); TIME( memset(buffer, 0, SIZE); ); TIME( zero_1(buffer, SIZE); ); TIME( zero_sizet(buffer, SIZE); ); return 0; }
results:
zero_1 is the slowest, except for -O3. zero_sizet is the fastest with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet. (twice as slow for -O3). one thing of interest is that at -O3 zero_1 was equally fast as zero_sizet. however the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me, but no surprise here.
For now memset wins, previous results were distorted by CPU cache. (all tests were run on Linux) Further testing needed. I'll try assembler next :)
edit3: fixed bug in test code, test results are not affected
edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset
has a SSE optimized routine for zero. It will be hard to beat this.
memset can be faster since it is written in assembler, whereas std::fill is a template function which simply does a loop internally.
You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced. That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.
All zeroing operations that the pool allocator performs and many structure/array initializations that InitAll performs end up going through the memset function. Memset is one of the hottest functions on the operating system and is already quite optimized as a result.
The bzero() function is deprecated (marked as LEGACY in POSIX.
x86 is rather broad range of devices.
For totally generic x86 target, an assembly block with "rep movsd" could blast out zeros to memory 32-bits at time. Try to make sure the bulk of this work is DWORD aligned.
For chips with mmx, an assembly loop with movq could hit 64bits at a time.
You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or _m64. Target must be 8 byte aligned for the best performance.
for chips with sse, movaps is fast, but only if the address is 16 byte aligned, so use a movsb until aligned, and then complete your clear with a loop of movaps
Win32 has "ZeroMemory()", but I forget if thats a macro to memset, or an actual 'good' implementation.
memset
is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).
That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset
implementation by doing it yourself. memset
and its friends in the standard library are always fun targets for one-upmanship programming. :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With