 

Beating or meeting OS X memset (and memset_pattern4)

My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses?

After reading it and being intrigued by it, I tried to replicate the results and see for myself which was faster: a naive loop, an unrolled naive loop, _mm_stream_ps (unrolled), _mm_store_ps (unrolled) and, last but not least, memset_pattern4 (the last one takes a 4-byte pattern, such as a float, and plasters it all over the destination array, which should do the same as all the other functions; it's probably OS X exclusive, though).
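For reference, a minimal usage sketch of memset_pattern4 (the detail that matters later: its length argument is in bytes, not elements):

#include <string.h>   /* memset_pattern4 is declared here on OS X */

void fill_with_fives(float *dst, size_t nfloats) {
    const float val = 5.0f;
    /* note: the length is in BYTES, not floats */
    memset_pattern4(dst, &val, nfloats * sizeof(float));
}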

I've made sure to align the start of my array on a cache line (64 bytes, I checked) and to pass the array as an argument, along with applying any other performance tweaks that were mentioned in the previous question.

Somebody else wanted to know the same thing on gamedev: http://www.gamedev.net/topic/532112-fast-memset/

The conclusions of that thread mirror my own: when the destination array is smaller than the largest (L3) cache, _mm_store_ps is faster than _mm_stream_ps. When the destination array is larger, _mm_stream_ps is faster. I'm not entirely sure why _mm_store_ps is faster in the first case, since I never use those values in the cache, but I get why _mm_stream_ps wins out in the latter case. It's made for exactly this situation: writing bytes to memory that you won't need immediately (or ever).

Here are some results with a destination array 256 times larger than L3 cache (in my case, 1.5GB), compiled with gcc 4.8:

gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream

bench L3-MASS, array 1610612736 bytes (402653184 floats, 0 remainder, 0x104803040 pointer)
warm up round...
      6% (  20.81148 ms) : MEMSET CHEAT
      8% (  28.49419 ms) : MEMSET PATTER
    100% ( 371.40385 ms) : NAIVE  NORMAL
     54% ( 202.01147 ms) : NAIVE  UNROLL
     31% ( 113.53433 ms) : STREAM NORMAL
     30% ( 111.41691 ms) : STREAM UNROLL
     51% ( 190.70412 ms) : STORE  NORMAL
     51% ( 189.15338 ms) : STORE  UNROLL
     51% ( 189.36182 ms) : STORE  PREFET

So what do we learn from this? memset_pattern4 is unbelievably fast. I included bog-standard memset for comparison, even though it only takes a 1-byte pattern. In essence, memset cheats, but memset_pattern4 does not, and it's still wicked fast.
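As an aside, here's why plain memset can only "cheat": filling every byte with 0x05 makes each float hold the bit pattern 0x05050505, which is a tiny number (roughly 6e-36), not 5.0f. A quick illustration:

#include <stdio.h>
#include <string.h>

int main(void) {
    float f;
    const unsigned char bytes[4] = { 0x05, 0x05, 0x05, 0x05 };
    memcpy(&f, bytes, sizeof(f));
    printf("%g\n", f);   /* prints roughly 6e-36, nowhere near 5.0 */
    return 0;
}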

I've tried looking at the assembly of what I believe is the source code of memset_pattern4 in the OS X string library:

  • Apple's libc, memset_pattern4: http://www.opensource.apple.com/source/Libc/Libc-825.25/string/memset_pattern.c?txt
  • This references a so-called bcopy function. Let's dig for that in the string library: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/
  • And most likely the SSE 4.2 version is being used in my case: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/bcopy_sse42.s

My knowledge of asm reaches (by now) far enough to see that they're using the movdqa instruction where it matters (in the LAlignedLoop section), which is basically an SSE move instruction for integers (not floats); the corresponding intrinsic is _mm_store_si128. Not that it should matter here, bits are bits, right?
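Out of curiosity, here's a sketch (my own, not Apple's code) of what the integer-domain equivalent of my aligned-store loop looks like; 5.0f has the bit pattern 0x40A00000:

#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

void fill_si128(float * const restrict arr, size_t length) {
    uint32_t bits;
    const float val = 5.0f;
    memcpy(&bits, &val, sizeof(bits));   /* 5.0f == 0x40A00000 */
    const __m128i buf = _mm_set1_epi32((int32_t)bits);

    /* movdqa instead of movaps: the same bytes hit memory either way */
    for (size_t i = 0; i < length; i += 4) {
        _mm_store_si128((__m128i *)&arr[i], buf);
    }
}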

  • There also seems to be a pure asm implementation of memset_pattern4, which seems different as it doesn't call bcopy: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/memset.s (EDIT: this is the correct one, as verified by running under gdb)

...damn, this one seems to use non-temporal stores (i.e. _mm_stream_ps) for very long arrays => movntdq %xmm0,(%rdi,%rcx)..., look in the LVeryLong section of the function, which is exactly what I do! So how can it be that much faster? Maybe that's not the memset_pattern4 I'm looking for.

So, what is memset_pattern4 doing under the hood, and why is it 5x faster than my best attempt? Even though I've been trying to learn enough x86 assembly to dissect the function, I'm afraid debugging performance issues in optimized-to-death functions is a bit out of my league for now.

NOTE: for those curious, this microbenchmark also serves to illustrate the sheer awesomeness of clang and its advanced vectorization (-fslp-vectorize); it manages to make the naive loop the fastest one, save for memset, in almost all cases. It seems to be about as good as the best combination of _mm_store_ps and _mm_stream_ps.

CODE: here's the code I use to perform my benchmark (as gist: https://gist.github.com/6571379):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/**
 * compile and run:
 *
 * OSX:
 *    clang stream.c -o stream -std=c11 -O3 -g -ftree-vectorize -fslp-vectorize -march=native -minline-all-stringops && ./stream
 *    gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream
 *
 * linux:
 *    clang stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -fslp-vectorize -march=native && ./stream
 *    gcc-4.8 stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -march=native && ./stream
 *
 * to generate the assembly:
 *    gcc-4.8 -S stream.c -o stream.s -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops
 *    gobjdump -dS stream > stream.obj.s
 *
 * clang is the (very clear) winner here, the SLP vectorizer is absolutely killer, it even turns the
 * plain naive loop into something hyper-performant
 */

/* posix headers */
#include <sys/time.h>
#include <time.h> /* clock_gettime */

/* intrinsics */
#include <x86intrin.h>

#define ARRAY_SIZE(x) ((sizeof(x)/sizeof(0[x])) / ((size_t)(!(sizeof(x) % sizeof(0[x])))))


/**
 * some stats from my system
 *
 * sudo sysctl -a | grep cache
 *
 * hw.cachelinesize = 64
 * hw.l1icachesize = 32768
 * hw.l1dcachesize = 32768
 * hw.l2cachesize = 262144
 * hw.l3cachesize = 6291456
 */

/* most processors these days (2013) have a 64 byte cache line */
#define FACTOR          1024
#define CACHE_LINE      64
#define FLOATS_PER_LINE (CACHE_LINE / sizeof(float))
#define L1_CACHE_BYTES  32768
#define L2_CACHE_BYTES  262144
#define L3_CACHE_BYTES  6291456


#ifdef __MACH__
#include <mach/mach_time.h>

double ns_conversion_factor;
double us_conversion_factor;
double ms_conversion_factor;

void timeinit() {
    mach_timebase_info_data_t timebase;
    mach_timebase_info(&timebase);

    ns_conversion_factor = (double)timebase.numer / (double)timebase.denom;
    us_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000;
    ms_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000000;
}

double nsticks() {
    return mach_absolute_time() * ns_conversion_factor;
}

double msticks() {
    return mach_absolute_time() * ms_conversion_factor;
}

#else

void timeinit() {
    /* do nothing */
}

double nsticks() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);

    return ((double)ts.tv_sec) * 1000000000.0 + ((double)ts.tv_nsec);
}

double msticks() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);

    return ((double)ts.tv_sec) * 1000.0 + ((double)ts.tv_nsec) / 1000000.0;
}

#endif


void *aligned_malloc(size_t size, size_t alignment) {
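    /* over-allocate: alignment-1 bytes of slack plus room to stash the
       original malloc pointer just below the aligned address, so that
       aligned_free can recover it */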
    void *pa, *ptr;

    pa = malloc((size+alignment-1)+sizeof(void *));
    if (!pa) return NULL;

    ptr=(void*)( ((intptr_t)pa+sizeof(void *)+alignment-1)&~(alignment-1) );
    *((void **)ptr-1)=pa;

    return ptr;
}

void aligned_free(void *ptr) {
    if (ptr) free(*((void **)ptr-1));
}

void pollute_cache(uint8_t volatile *arr, size_t length) {
    for (int i = 0; i < length; ++i) {
        arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
    }
}

void pollute_cache_standalone() {
    const size_t pollute_len = 2 * L3_CACHE_BYTES;
    uint8_t *arr             = aligned_malloc(pollute_len * sizeof(uint8_t), 64);

    for (int i = 0; i < pollute_len; ++i) {
        arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
    }

    aligned_free(arr);
}

/**
 * returns the time passed, in milliseconds
 */
double tim(const char *name, double baseline, void (*pre)(void), void (*func)(float *, size_t), float * restrict arr, size_t length) {

    if (pre) pre();

    const double ms1 = msticks();
    func(arr, length);
    const double ms2 = msticks();

    const double ms = (ms2 - ms1);

    if (baseline == -2.0) return ms;

    /* first run, equal to baseline (itself) by definition */
    if (baseline == -1.0) baseline = ms;

    if (baseline != 0.0) {
        fprintf(stderr, "%7.0f%% (%10.5f ms) : %s\n", (ms / baseline) * 100, ms, name);
    }
    else {
        fprintf(stderr, "%7.3f ms : %s\n", ms, name);
    }

    return ms;
}

void func0(float * const restrict arr, size_t length) {
    memset(arr, 0x05, length);
}

#ifdef __MACH__

void funcB(float * const restrict arr, size_t length) {
    const float val = 5.0f;
    memset_pattern4(arr, &val,length);
}

#endif

void func1(float * const restrict arr, size_t length) {
    for (int i = 0; i < length; ++i) {
        arr[i] = 5.0f;
    }
}

void func2(float * const restrict arr, size_t length) {
    for(int i = 0; i < length; i += 4) {
        arr[i]   = 5.0f;
        arr[i+1] = 5.0f;
        arr[i+2] = 5.0f;
        arr[i+3] = 5.0f;
    }
}

void func3(float * const restrict arr, size_t length) {
    const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 4) {
        _mm_stream_ps(&arr[i], buf);
    }

    _mm_mfence();
}

void func4(float * const restrict arr, size_t length) {
    const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i + 0], buf);
        _mm_stream_ps(&arr[i + 4], buf);
        _mm_stream_ps(&arr[i + 8], buf);
        _mm_stream_ps(&arr[i + 12], buf);
    }

    _mm_mfence();
}

void func5(float * const restrict arr, size_t length) {
    const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 4) {
        _mm_store_ps(&arr[i], buf);
    }
}

void fstore_prefetch(float * const restrict arr, size_t length) {
    const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 16) {
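        /* args: address, rw=1 (prefetch for a write), locality=0 (non-temporal) */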
        __builtin_prefetch(&arr[i + FLOATS_PER_LINE * 32], 1, 0);
        _mm_store_ps(&arr[i + 0], buf);
        _mm_store_ps(&arr[i + 4], buf);
        _mm_store_ps(&arr[i + 8], buf);
        _mm_store_ps(&arr[i + 12], buf);
    }
}

void func6(float * const restrict arr, size_t length) {
    const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 16) {
        _mm_store_ps(&arr[i + 0], buf);
        _mm_store_ps(&arr[i + 4], buf);
        _mm_store_ps(&arr[i + 8], buf);
        _mm_store_ps(&arr[i + 12], buf);
    }
}

#ifdef __AVX__

void func7(float * restrict arr, size_t length) {
    const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 8) {
        _mm256_stream_ps(&arr[i], buf);
    }
}

void func8(float * restrict arr, size_t length) {
    const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 32) {
        _mm256_stream_ps(&arr[i + 0], buf);
        _mm256_stream_ps(&arr[i + 8], buf);
        _mm256_stream_ps(&arr[i + 16], buf);
        _mm256_stream_ps(&arr[i + 24], buf);
    }
}

void func9(float * restrict arr, size_t length) {
    const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 8) {
        _mm256_store_ps(&arr[i], buf);
    }
}

void funcA(float * restrict arr, size_t length) {
    const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);

    for (int i = 0; i < length; i += 32) {
        _mm256_store_ps(&arr[i + 0], buf);
        _mm256_store_ps(&arr[i + 8], buf);
        _mm256_store_ps(&arr[i + 16], buf);
        _mm256_store_ps(&arr[i + 24], buf);
    }
}

#endif

void bench(const char * restrict name, float * restrict arr, size_t length) {
    fprintf(stderr, "bench %s, array %zu bytes (%zu floats, %zu remainder, %p pointer)\n", name, length, length / sizeof(float), length % sizeof(float), arr);

    size_t nfloats = length / sizeof(float);

    fprintf(stderr, "warm up round...");
    func1(arr, nfloats);
    fprintf(stderr, "done\n");

    double baseline = tim("func1: NAIVE ", -2.0, NULL, func1, arr, nfloats);

    tim("MEMSET CHEAT ", baseline, NULL, func0, arr, nfloats);
#ifdef __MACH__
    tim("MEMSET PATTER", baseline, NULL, funcB, arr, nfloats);
#endif
    tim("NAIVE  NORMAL", -1.0, NULL, func1, arr, nfloats);
    tim("NAIVE  UNROLL", baseline, NULL, func2, arr, nfloats);
    tim("STREAM NORMAL", baseline, NULL, func3, arr, nfloats);
    tim("STREAM UNROLL", baseline, NULL, func4, arr, nfloats);
    tim("STORE  NORMAL", baseline, NULL, func5, arr, nfloats);
    tim("STORE  UNROLL", baseline, NULL, func6, arr, nfloats);
    tim("STORE  PREFET", baseline, NULL, fstore_prefetch, arr, nfloats);

    // for (int i = 0; i < 1; ++i) {
    //     tim("func0: MEMSET (cache polluted)", NULL, func0, arr, nfloats);
    //     tim("func1: NAIVE  (cache polluted)", pollute_cache_standalone, func1, arr, nfloats);
    //     tim("func2: UNROLL (cache polluted)", pollute_cache_standalone, func2, arr, nfloats);
    //     tim("func3: STREAM (cache polluted)", pollute_cache_standalone, func3, arr, nfloats);
    //     tim("func4: STRUN  (cache polluted)", pollute_cache_standalone, func4, arr, nfloats);
    //     tim("func5: STORE  (cache polluted)", pollute_cache_standalone, func5, arr, nfloats);
    //     tim("func6: STOUN  (cache polluted)", pollute_cache_standalone, func6, arr, nfloats);
    // }
}

int main() {
    timeinit();

    static const struct {
        const char *name;
        size_t bytes;
    } sizes[] = {
        { "L1-HALF", L1_CACHE_BYTES / 2 },
        { "L1-FULL", L1_CACHE_BYTES },
        { "L2-HALF", L2_CACHE_BYTES / 2 },
        { "L2-FULL", L2_CACHE_BYTES },
        { "L3-HALF", L3_CACHE_BYTES / 2 },
        { "L3-FULL", L3_CACHE_BYTES },
        { "L3-DOUB", L3_CACHE_BYTES * 2 },
        { "L3-HUGE", L3_CACHE_BYTES * 64 },
        { "L3-MASS", L3_CACHE_BYTES * 256 }
    };

    for (int i = 0; i < ARRAY_SIZE(sizes); ++i) {
        size_t bytes = sizes[i].bytes;

        /* align to cache line */
        float *arr = aligned_malloc(bytes, CACHE_LINE);

        bench(sizes[i].name, arr, bytes);

        aligned_free(arr);
    }

    return 0;
}

EDIT: I went digging a bit further and edited the assembly that gcc generates to make it more or less the same as the one Apple uses (memset.s, label LVeryLong, i.e.: 4 unrolled movntdq instructions in a tight loop). To my surprise, I get the same performance as my functions that use _mm_store_ps (movaps). This perplexes me, as I would have expected it to be either

  1. as fast as memset_pattern4 (presumably unrolled movntdq)
  2. as fast as unrolled _mm_stream_ps (movntdq)

But no, it seems to be the same as _mm_store_ps, imagine that; maybe I'm doing something wrong. Running objdump on the resulting binary confirms that it is using movntdq, which surprises me even more. What the hell is going on?

Because I hit a dead end there, I decided to step through the executable in a debugger and set up a breakpoint at memset_pattern4. Stepping into the function, I noticed that it does exactly what I thought it would do: a tight loop with four unrolled movntdq instructions:

   0x00007fff92a5f7d2 <+318>:   jmp    0x7fff92a5f7e0 <memset_pattern4+332>
   0x00007fff92a5f7d4 <+320>:   nopw   0x0(%rax,%rax,1)
   0x00007fff92a5f7da <+326>:   nopw   0x0(%rax,%rax,1)
   0x00007fff92a5f7e0 <+332>:   movntdq %xmm0,(%rdi,%rcx,1)
   0x00007fff92a5f7e5 <+337>:   movntdq %xmm0,0x10(%rdi,%rcx,1)
   0x00007fff92a5f7eb <+343>:   movntdq %xmm0,0x20(%rdi,%rcx,1)
   0x00007fff92a5f7f1 <+349>:   movntdq %xmm0,0x30(%rdi,%rcx,1)
   0x00007fff92a5f7f7 <+355>:   add    $0x40,%rcx
=> 0x00007fff92a5f7fb <+359>:   jne    0x7fff92a5f7e0 <memset_pattern4+332>
   0x00007fff92a5f7fd <+361>:   sfence

So, what makes Apple's sauce that much more magical than mine, I wonder...

EDIT 2: I was wrong twice here. Apple's magic sauce is not that magical: I was just passing in an array that was 4x smaller than what I was passing to my own functions (thanks to @PaulR for noticing!). Secondly, I was editing the assembly of the function, but gcc had already inlined it, so I was editing a copy that was never used.

CONCLUSION:

Some other things I found out:

  • Clang and gcc are really good, with the right intrinsics, they optimize a lot (and clang even does a great job without intrinsics, when the SLP vectorizer is enabled). They will also inline function pointers.
  • Clang will replace a naive loop that stores a constant with a memset call, which cleared up another confusing result I had.
  • non-temporal store (i.e.: stream) is only beneficial with huge writes
  • memset is really well optimized: it automatically switches between regular stores and non-temporal (streaming) stores based on the length of the array to write (see the sketch just after this list). I'm not sure how much of this is true on platforms other than OS X.
  • when writing a benchmark, make absolutely sure the function does what you think it does, and that the compiler is not outsmarting you. The first case was my problem here, I had not supplied the correct arguments.
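To make that last dispatch idea concrete, here's a minimal sketch (the threshold is my guess; real memset implementations tune it per microarchitecture, and this assumes a 16-byte-aligned array with a multiple-of-4 length):

void fill_dispatch(float * const restrict arr, size_t nfloats) {
    const __m128 buf = _mm_set1_ps(5.0f);

    if (nfloats * sizeof(float) < L3_CACHE_BYTES) {
        /* small: regular stores, the lines may still be useful in cache */
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_store_ps(&arr[i], buf);
    } else {
        /* large: non-temporal stores bypass the cache entirely */
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_stream_ps(&arr[i], buf);
        _mm_sfence();   /* make the streaming stores globally visible */
    }
}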

EDIT: I recently stumbled upon the Intel optimization guide; if you're at all interested in these things, read some parts of it first (start at section 3.7.6, perhaps).

asked Sep 16 '13 by Aktau


1 Answer

I think you have a couple of bugs here:

void func0(float * const restrict arr, size_t length) {
    memset(arr, 0x05, length);
}

and similarly here:

void funcB(float * const restrict arr, size_t length) {
    const float val = 5.0f;
    memset_pattern4(arr, &val,length);
}

These should actually be:

void func0(float * const restrict arr, size_t length) {
    memset(arr, 0x05, length * sizeof(float));
}

and:

void funcB(float * const restrict arr, size_t length) {
    const float val = 5.0f;
    memset_pattern4(arr, &val, length * sizeof(float));
}

This will give timings which are 4x more optimistic than they should be for these two cases.
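A cheap way to catch this class of bug is to spot-check the array after each timed run; a hypothetical helper (not part of the original code) could look like this:

#include <assert.h>
#include <stddef.h>

void check_filled(const float *arr, size_t nfloats, float expected) {
    /* a length-in-bytes mix-up leaves the tail of the array untouched,
       so checking the last element catches it */
    assert(arr[0] == expected);
    assert(arr[nfloats / 2] == expected);
    assert(arr[nfloats - 1] == expected);
}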

On my 3 year old Core i7 MacBook Pro (8 GB RAM) the fixed code gives me:

bench L3-HUGE, array 402653184 bytes (100663296 floats, 0 remainder, 0x108ed8040 pointer)
warm up round...done
     99% (  69.43037 ms) : MEMSET CHEAT 
    106% (  73.98113 ms) : MEMSET PATTER
    100% (  72.40429 ms) : NAIVE  NORMAL
    120% (  83.98352 ms) : NAIVE  UNROLL
    102% (  71.75789 ms) : STREAM NORMAL
    102% (  71.59420 ms) : STREAM UNROLL
    115% (  80.63817 ms) : STORE  NORMAL
    123% (  86.58758 ms) : STORE  UNROLL
    123% (  86.22740 ms) : STORE  PREFET
bench L3-MASS, array 1610612736 bytes (402653184 floats, 0 remainder, 0x108ed8040 pointer)
warm up round...done
     83% ( 274.71955 ms) : MEMSET CHEAT 
     83% ( 275.19793 ms) : MEMSET PATTER
    100% ( 272.21942 ms) : NAIVE  NORMAL
     94% ( 309.73151 ms) : NAIVE  UNROLL
     82% ( 271.38751 ms) : STREAM NORMAL
     82% ( 270.27244 ms) : STREAM UNROLL
     94% ( 308.49498 ms) : STORE  NORMAL
     94% ( 308.72266 ms) : STORE  UNROLL
     95% ( 311.64157 ms) : STORE  PREFET
answered Nov 19 '22 by Paul R