Faster way to zero memory than with memset?

Tags:

libc

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov, whereas compilers usually zero things with xor because it's faster, correct? edit1: Wrong. As GregS pointed out, xor only works on registers. What was I thinking?

I also asked someone who knows more assembler than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not sure I understood him correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <stdlib.h>   /* malloc; replaces the non-standard <malloc.h> */
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                     \
        struct timeval t1, t2; double elapsed;                                  \
        gettimeofday(&t1, NULL);                                                \
        body                                                                    \
        gettimeofday(&t2, NULL);                                                \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;

    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        return 0;
    }

Results:

zero_1 was the slowest, except at -O3. zero_sizet was the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest is that at -O3 zero_1 was as fast as zero_sizet, although the disassembled function had roughly four times as many instructions (I think because of loop unrolling). I also tried optimizing zero_sizet further, but the compiler always outdid me; no surprise there.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)
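One way to reduce the ordering effect mentioned above is to run each candidate once untimed before measuring it, so that no routine pays for first touching the buffer (on Linux that first pass likely also includes page faults for a freshly malloc'ed 16 MB block). The following is only a sketch under that assumption; `zero_fn`, `zero_memset` and `best_time_ms` are hypothetical names that do not appear in the test program above, and `clock_gettime` is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper type: any routine that zeroes `size` bytes at `buf`. */
typedef void (*zero_fn)(void *buf, size_t size);

/* Adapter so memset can be timed through the same interface. */
static void zero_memset(void *buf, size_t size)
{
    memset(buf, 0, size);
}

/* Returns the best wall-clock time in milliseconds over `reps` timed calls
 * of fn(buf, size). One untimed warm-up call runs first, so the timed calls
 * start with the buffer already touched. */
static double best_time_ms(zero_fn fn, void *buf, size_t size, int reps)
{
    double best = -1.0;
    int r;

    assert(reps > 0);

    fn(buf, size);  /* warm-up, not timed */

    for (r = 0; r < reps; r++) {
        struct timespec t1, t2;
        double ms;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        fn(buf, size);
        clock_gettime(CLOCK_MONOTONIC, &t2);

        ms = (t2.tv_sec - t1.tv_sec) * 1000.0
           + (t2.tv_nsec - t1.tv_nsec) / 1e6;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

Calling `best_time_ms(zero_memset, buffer, SIZE, 10)` and then the same for the other candidates gives every routine the same starting conditions. Taking the minimum rather than the mean is a design choice that filters out scheduling noise; it does not make the numbers representative of a cold cache.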

edit3: fixed bug in test code, test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat this.

216

asked Sep 06 '10 23:09

maep

2 Answers

x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" can blast zeros out to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
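A minimal sketch of the string-store approach, assuming GCC or Clang extended inline assembly in the default AT&T syntax (where the 32-bit store is spelled `stosl`). The function name `zero_rep_stos` is hypothetical, the code does not try to align the destination first, and on other compilers or architectures it simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero nbytes at dst using the x86 string-store instructions. */
static void zero_rep_stos(void *dst, size_t nbytes)
{
    assert(dst != NULL || nbytes == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *p = dst;
    size_t dwords = nbytes / 4;   /* bulk: 32 bits per store */
    size_t tail = nbytes % 4;     /* leftover 0..3 bytes */

    /* EAX holds the value to store, (E/R)DI the destination and
     * (E/R)CX the repeat count; both pointer and count are updated. */
    __asm__ volatile("rep stosl"
                     : "+D"(p), "+c"(dwords)
                     : "a"(0)
                     : "memory");
    /* Finish the remaining bytes one at a time. */
    __asm__ volatile("rep stosb"
                     : "+D"(p), "+c"(tail)
                     : "a"(0)
                     : "memory");
#else
    memset(dst, 0, nbytes);       /* portable fallback */
#endif
}
```

Whether this beats the library memset depends heavily on the CPU and the buffer size, so it is worth measuring rather than assuming.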

For chips with MMX, an assembly loop with movq can write 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.

For chips with SSE, movaps is fast but requires a 16-byte-aligned address, so use stosb until the pointer is aligned, and then complete the clear with a loop of movaps.

Win32 has "ZeroMemory()", but I forget if that's a macro to memset, or an actual 'good' implementation.
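The 64-bit-pointer idea could look roughly like the sketch below. It uses `uint64_t` from `<stdint.h>` instead of long long, and the name `zero_u64` is hypothetical. The head and tail loops handle the bytes that do not fall on an 8-byte boundary, so the bulk loop only ever sees aligned addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero nbytes at dst, using 64-bit stores for the aligned middle part. */
static void zero_u64(void *dst, size_t nbytes)
{
    unsigned char *p = dst;
    uint64_t *q;

    assert(dst != NULL || nbytes == 0);

    /* Head: single bytes until p is 8-byte aligned. */
    while (nbytes > 0 && ((uintptr_t)p % 8) != 0) {
        *p++ = 0;
        nbytes--;
    }

    /* Bulk: one 64-bit store per iteration. */
    q = (uint64_t *)(void *)p;
    while (nbytes >= 8) {
        *q++ = 0;
        nbytes -= 8;
    }

    /* Tail: remaining 0..7 bytes. */
    p = (unsigned char *)q;
    while (nbytes > 0) {
        *p++ = 0;
        nbytes--;
    }
}
```

Note that an optimizing compiler may recognize loops like these and replace them with a call to memset anyway, in which case the hand-written version buys nothing; checking the disassembly is the only way to be sure.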
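As a rough illustration of the align-then-store pattern, here is a sketch that uses the SSE intrinsic `_mm_store_ps` (an aligned 16-byte store that compilers typically emit as movaps) rather than hand-written assembly, and a plain C byte loop in place of stosb for the unaligned head. The name `zero_sse` is hypothetical, and the `__SSE__` check assumes GCC or Clang; without SSE the byte loops do all the work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Zero nbytes at dst, using aligned 16-byte SSE stores for the bulk. */
static void zero_sse(void *dst, size_t nbytes)
{
    unsigned char *p = dst;

    assert(dst != NULL || nbytes == 0);

    /* Head: single bytes until p is 16-byte aligned. */
    while (nbytes > 0 && ((uintptr_t)p % 16) != 0) {
        *p++ = 0;
        nbytes--;
    }

#if defined(__SSE__)
    {
        const __m128 zero = _mm_setzero_ps();
        while (nbytes >= 16) {
            /* p is 16-byte aligned here, as the aligned store requires. */
            _mm_store_ps((float *)(void *)p, zero);
            p += 16;
            nbytes -= 16;
        }
    }
#endif

    /* Tail (or everything, when SSE is unavailable). */
    while (nbytes > 0) {
        *p++ = 0;
        nbytes--;
    }
}
```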

104

answered Oct 05 '22 16:10

Tim

memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).

That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

answered Oct 05 '22 16:10

Ben Zotto
