My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses?
After reading it and being intrigued by it, I tried to replicate the results and see for myself which was faster: a naive loop, an unrolled naive loop, _mm_stream_ps (unrolled), _mm_store_ps (unrolled), and last but not least memset_pattern4. (The last one takes a 4-byte pattern, such as a float, and plasters it all over the destination array, which should do the same as all the other functions; it's probably OS X exclusive though.)
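For anyone not on OS X: memset_pattern4 isn't standard, but the behavior just described is easy to approximate. A minimal portable stand-in (my own naming, for reference only, assuming the length is a multiple of 4):

#include <stddef.h>
#include <string.h>

/* naive portable approximation of memset_pattern4: tile a 4-byte pattern
 * across the first len bytes of b (assumes len is a multiple of 4) */
static void memset_pattern4_fallback(void *b, const void *pattern4, size_t len) {
    unsigned char *dst = b;
    for (size_t i = 0; i < len; i += 4) {
        memcpy(dst + i, pattern4, 4);
    }
}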
I've made sure to align the start of my array on a cache line (64 bytes, I checked) and to pass the array as an argument, as well as applying any other performance tweaks that were mentioned in the previous question.
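(Verifying the alignment is just a check on the low pointer bits; a tiny helper of my own for illustration:)

#include <stdint.h>

/* a 64-byte (cache-line) aligned pointer has its low 6 bits clear */
static int is_cacheline_aligned(const void *p) {
    return ((uintptr_t)p & 63) == 0;
}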
Somebody else wanted to know the same thing on gamedev: http://www.gamedev.net/topic/532112-fast-memset/
The conclusions of that thread mirror my own: when the destination array is smaller than the largest (L3) cache, _mm_store_ps is faster than _mm_stream_ps. When the destination array is larger, _mm_stream_ps is faster. I'm not entirely sure why _mm_store_ps is faster in the first case, since I never use those values in the cache, but I get why _mm_stream_ps wins out in the latter case. It's made for this situation: writing bytes to memory that you won't need immediately (or ever).
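For reference, a minimal sketch of the two store flavors being compared; both take a 16-byte-aligned pointer and compile to a single store instruction (movaps for _mm_store_ps, movntps for _mm_stream_ps):

#include <x86intrin.h>

/* regular store: the written line is allocated in the cache hierarchy */
static void store16(float *p) {
    _mm_store_ps(p, _mm_set1_ps(5.0f));   /* movaps */
}

/* non-temporal store: write-combining, bypasses the caches */
static void stream16(float *p) {
    _mm_stream_ps(p, _mm_set1_ps(5.0f));  /* movntps */
    _mm_sfence();  /* order the NT store before the data is read elsewhere */
}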
Here are some results with a destination array 256 times larger than L3 cache (in my case, 1.5GB), compiled with gcc 4.8:
gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream
bench L3-MASS, array 1610612736 bytes (402653184 floats, 0 remainder, 0x104803040 pointer)
warm up round...
6% ( 20.81148 ms) : MEMSET CHEAT
8% ( 28.49419 ms) : MEMSET PATTER
100% ( 371.40385 ms) : NAIVE NORMAL
54% ( 202.01147 ms) : NAIVE UNROLL
31% ( 113.53433 ms) : STREAM NORMAL
30% ( 111.41691 ms) : STREAM UNROLL
51% ( 190.70412 ms) : STORE NORMAL
51% ( 189.15338 ms) : STORE UNROLL
51% ( 189.36182 ms) : STORE PREFET
So what do we learn from this? memset_pattern4 is unbelievably fast. For comparison I also included bog-standard memset, even though it only takes a 1-byte pattern. In essence, memset cheats (it fills every byte with 0x05 rather than writing the float 5.0f), but memset_pattern4 does not, and it's still wicked fast.
I've tried looking at the assembly for what I believe is the source code to memset_pattern4 in the OS X string library:
memset_pattern4: http://www.opensource.apple.com/source/Libc/Libc-825.25/string/memset_pattern.c?txt
My knowledge of asm reaches (by now) far enough that I can see they're using the movdqa instruction where it matters (in the LAlignedLoop section), which is basically an SSE move instruction for integers (not floats); the corresponding intrinsic is _mm_store_si128. Not that it should matter here, bits and bytes, right?
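(Indeed, the integer vs. float distinction is only about register domains; the stored bits are identical. A quick sketch:)

#include <x86intrin.h>

/* the same 16 bytes stored through the float and the integer domain */
static void store_as_float(float *p) {
    _mm_store_ps(p, _mm_set1_ps(5.0f));                  /* movaps */
}

static void store_as_int(float *p) {
    __m128i bits = _mm_castps_si128(_mm_set1_ps(5.0f));  /* reinterpret, no instruction emitted */
    _mm_store_si128((__m128i *)p, bits);                 /* movdqa */
}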
Then there's the x86_64 assembly version of memset_pattern4, which seems different as it doesn't call bcopy: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/memset.s (EDIT: this is the correct one, as verified by running under gdb)... damn, this one seems to use non-temporal (_mm_stream_ps-style) stores for very long arrays (=> movntdq %xmm0,(%rdi,%rcx); look in the LVeryLong section of the function), which is exactly what I do! So how can it be that much faster? Maybe that's not the memset_pattern4 I'm looking for.
So, what is memset_pattern4 doing under the hood, and why is it 5x faster than my best attempt? Even though I've been trying to learn enough x86 assembly to dissect the function, I'm afraid debugging performance issues in optimized-to-death functions is a bit out of my league for now.
NOTE: for those curious, this microbenchmark also serves to illustrate the sheer awesomeness of clang and its advanced vectorization (-fslp-vectorize): it manages to make the naive loop the fastest one, save for memset, in almost all cases. It seems to be about as good as the best combination of _mm_store_ps and _mm_stream_ps.
CODE: here's the code I use to perform my benchmark (as gist: https://gist.github.com/6571379):
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
/**
* compile and run:
*
* OSX:
* clang stream.c -o stream -std=c11 -O3 -g -ftree-vectorize -fslp-vectorize -march=native -minline-all-stringops && ./stream
* gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream
*
* linux:
* clang stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -fslp-vectorize -march=native && ./stream
* gcc-4.8 stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -march=native && ./stream
*
* to generate the assembly:
* gcc-4.8 -S stream.c -o stream.s -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops
* gobjdump -dS stream > stream.obj.s
*
* clang is the (very clear) winner here, the SLP vectorizer is absolutely killer, it even turns the
* plain naive loop into something hyper-performant
*/
/* posix headers */
#include <sys/time.h>
#include <time.h> /* clock_gettime on linux */
/* intrinsics */
#include <x86intrin.h>
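/* element count of a static array; the second factor is 1 when sizeof(x) is a
 * multiple of the element size and 0 otherwise, forcing a compile-time
 * division by zero if a pointer is passed instead of an array */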
#define ARRAY_SIZE(x) ((sizeof(x)/sizeof(0[x])) / ((size_t)(!(sizeof(x) % sizeof(0[x])))))
/**
* some stats from my system
*
* sudo sysctl -a | grep cache
*
* hw.cachelinesize = 64
* hw.l1icachesize = 32768
* hw.l1dcachesize = 32768
* hw.l2cachesize = 262144
* hw.l3cachesize = 6291456
*/
/* most processors these days (2013) have a 64 byte cache line */
#define FACTOR 1024
#define CACHE_LINE 64
#define FLOATS_PER_LINE (CACHE_LINE / sizeof(float))
#define L1_CACHE_BYTES 32768
#define L2_CACHE_BYTES 262144
#define L3_CACHE_BYTES 6291456
#ifdef __MACH__
#include <mach/mach_time.h>
double ns_conversion_factor;
double us_conversion_factor;
double ms_conversion_factor;
void timeinit() {
mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);
ns_conversion_factor = (double)timebase.numer / (double)timebase.denom;
us_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000;
ms_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000000;
}
double nsticks() {
return mach_absolute_time() * ns_conversion_factor;
}
double msticks() {
return mach_absolute_time() * ms_conversion_factor;
}
#else
void timeinit() {
/* do nothing */
}
double nsticks() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((double)ts.tv_sec) * 1000000000 + ((double)ts.tv_nsec);
}
double msticks() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ((double)ts.tv_sec) * 1000 + ((double)ts.tv_nsec) / 1000000;
}
#endif
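/* over-allocate, round up to the requested alignment, and stash the original
 * malloc pointer just before the aligned block so aligned_free can recover it */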
void *aligned_malloc(size_t size, size_t alignment) {
void *pa, *ptr;
pa = malloc((size+alignment-1)+sizeof(void *));
if (!pa) return NULL;
ptr=(void*)( ((intptr_t)pa+sizeof(void *)+alignment-1)&~(alignment-1) );
*((void **)ptr-1)=pa;
return ptr;
}
void aligned_free(void *ptr) {
if (ptr) free(*((void **)ptr-1));
}
void pollute_cache(uint8_t volatile *arr, size_t length) {
for (int i = 0; i < length; ++i) {
arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
}
}
void pollute_cache_standalone() {
const size_t pollute_len = 2 * L3_CACHE_BYTES;
uint8_t *arr = aligned_malloc(pollute_len * sizeof(uint8_t), 64);
for (int i = 0; i < pollute_len; ++i) {
arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
}
aligned_free(arr);
}
/**
* returns the time passed, in milliseconds
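* baseline == -2.0 runs silently and just returns ms;
* baseline == -1.0 makes this run its own baseline (prints 100%)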
*/
double tim(const char *name, double baseline, void (*pre)(void), void (*func)(float *, size_t), float * restrict arr, size_t length) {
if (pre) pre();
const double ms1 = msticks();
func(arr, length);
const double ms2 = msticks();
const double ms = (ms2 - ms1);
if (baseline == -2.0) return ms;
/* first run, equal to baseline (itself) by definition */
if (baseline == -1.0) baseline = ms;
if (baseline != 0.0) {
fprintf(stderr, "%7.0f%% (%10.5f ms) : %s\n", (ms / baseline) * 100, ms, name);
}
else {
fprintf(stderr, "%7.3f ms : %s\n", ms, name);
}
return ms;
}
void func0(float * const restrict arr, size_t length) {
memset(arr, 0x05, length);
}
#ifdef __MACH__
void funcB(float * const restrict arr, size_t length) {
const float val = 5.0f;
memset_pattern4(arr, &val,length);
}
#endif
void func1(float * const restrict arr, size_t length) {
for (int i = 0; i < length; ++i) {
arr[i] = 5.0f;
}
}
void func2(float * const restrict arr, size_t length) {
for(int i = 0; i < length; i += 4) {
arr[i] = 5.0f;
arr[i+1] = 5.0f;
arr[i+2] = 5.0f;
arr[i+3] = 5.0f;
}
}
void func3(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 4) {
_mm_stream_ps(&arr[i], buf);
}
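/* drain the write-combining buffers so the NT stores are globally visible
   before timing stops; _mm_sfence() would also suffice here */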
_mm_mfence();
}
void func4(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
_mm_stream_ps(&arr[i + 0], buf);
_mm_stream_ps(&arr[i + 4], buf);
_mm_stream_ps(&arr[i + 8], buf);
_mm_stream_ps(&arr[i + 12], buf);
}
_mm_mfence();
}
void func5(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 4) {
_mm_store_ps(&arr[i], buf);
}
}
void fstore_prefetch(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
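/* prefetch 2 KiB (32 cache lines) ahead, for write (1), non-temporal (0) */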
__builtin_prefetch(&arr[i + FLOATS_PER_LINE * 32], 1, 0);
_mm_store_ps(&arr[i + 0], buf);
_mm_store_ps(&arr[i + 4], buf);
_mm_store_ps(&arr[i + 8], buf);
_mm_store_ps(&arr[i + 12], buf);
}
}
void func6(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
_mm_store_ps(&arr[i + 0], buf);
_mm_store_ps(&arr[i + 4], buf);
_mm_store_ps(&arr[i + 8], buf);
_mm_store_ps(&arr[i + 12], buf);
}
}
#ifdef __AVX__
void func7(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 8) {
_mm256_stream_ps(&arr[i], buf);
}
}
void func8(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 32) {
_mm256_stream_ps(&arr[i + 0], buf);
_mm256_stream_ps(&arr[i + 8], buf);
_mm256_stream_ps(&arr[i + 16], buf);
_mm256_stream_ps(&arr[i + 24], buf);
}
}
void func9(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 8) {
_mm256_store_ps(&arr[i], buf);
}
}
void funcA(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 32) {
_mm256_store_ps(&arr[i + 0], buf);
_mm256_store_ps(&arr[i + 8], buf);
_mm256_store_ps(&arr[i + 16], buf);
_mm256_store_ps(&arr[i + 24], buf);
}
}
#endif
void bench(const char * restrict name, float * restrict arr, size_t length) {
fprintf(stderr, "bench %s, array %zu bytes (%zu floats, %zu remainder, %p pointer)\n", name, length, length / sizeof(float), length % sizeof(float), arr);
size_t nfloats = length / sizeof(float);
fprintf(stderr, "warm up round...");
func1(arr, nfloats);
fprintf(stderr, "done\n");
double baseline = tim("func1: NAIVE ", -2.0, NULL, func1, arr, nfloats);
tim("MEMSET CHEAT ", baseline, NULL, func0, arr, nfloats);
#ifdef __MACH__
tim("MEMSET PATTER", baseline, NULL, funcB, arr, nfloats);
#endif
tim("NAIVE NORMAL", -1.0, NULL, func1, arr, nfloats);
tim("NAIVE UNROLL", baseline, NULL, func2, arr, nfloats);
tim("STREAM NORMAL", baseline, NULL, func3, arr, nfloats);
tim("STREAM UNROLL", baseline, NULL, func4, arr, nfloats);
tim("STORE NORMAL", baseline, NULL, func5, arr, nfloats);
tim("STORE UNROLL", baseline, NULL, func6, arr, nfloats);
tim("STORE PREFET", baseline, NULL, fstore_prefetch, arr, nfloats);
// for (int i = 0; i < 1; ++i) {
// tim("func0: MEMSET (cache polluted)", NULL, func0, arr, nfloats);
// tim("func1: NAIVE (cache polluted)", pollute_cache_standalone, func1, arr, nfloats);
// tim("func2: UNROLL (cache polluted)", pollute_cache_standalone, func2, arr, nfloats);
// tim("func3: STREAM (cache polluted)", pollute_cache_standalone, func3, arr, nfloats);
// tim("func4: STRUN (cache polluted)", pollute_cache_standalone, func4, arr, nfloats);
// tim("func5: STORE (cache polluted)", pollute_cache_standalone, func5, arr, nfloats);
// tim("func6: STOUN (cache polluted)", pollute_cache_standalone, func6, arr, nfloats);
// }
}
int main() {
timeinit();
static const struct {
const char *name;
size_t bytes;
} sizes[] = {
{ "L1-HALF", L1_CACHE_BYTES / 2 },
{ "L1-FULL", L1_CACHE_BYTES },
{ "L2-HALF", L2_CACHE_BYTES / 2 },
{ "L2-FULL", L2_CACHE_BYTES },
{ "L3-HALF", L3_CACHE_BYTES / 2 },
{ "L3-FULL", L3_CACHE_BYTES },
{ "L3-DOUB", L3_CACHE_BYTES * 2 },
{ "L3-HUGE", L3_CACHE_BYTES * 64 },
{ "L3-MASS", L3_CACHE_BYTES * 256 }
};
for (int i = 0; i < ARRAY_SIZE(sizes); ++i) {
size_t bytes = sizes[i].bytes;
/* align to cache line */
float *arr = aligned_malloc(bytes, CACHE_LINE);
bench(sizes[i].name, arr, bytes);
aligned_free(arr);
}
return 0;
}
EDIT: I went digging a bit further: I edited the assembly that gcc generates to make it more or less the same as what Apple uses (memset.s, label LVeryLong, i.e. 4 unrolled movntdq instructions in a tight loop). To my surprise, I get the same performance as my functions that use _mm_store_ps (movaps). This perplexes me, as I would have expected it to perform like either:
- memset_pattern4 (presumably unrolled movntdq)
- _mm_stream_ps (movntdq)
But no, it seems to be the same as _mm_store_ps. Imagine that; maybe I'm doing something wrong. Running objdump on the resulting binary confirms that it is using movntdq, which surprises me even more. What the hell is going on?
Because I hit a dead end there, I decided to step through the executable in a debugger and set a breakpoint at memset_pattern4. Stepping into the function, I noticed that it does exactly what I thought it would: a tight loop with four unrolled movntdq instructions:
0x00007fff92a5f7d2 <+318>: jmp 0x7fff92a5f7e0 <memset_pattern4+332>
0x00007fff92a5f7d4 <+320>: nopw 0x0(%rax,%rax,1)
0x00007fff92a5f7da <+326>: nopw 0x0(%rax,%rax,1)
0x00007fff92a5f7e0 <+332>: movntdq %xmm0,(%rdi,%rcx,1)
0x00007fff92a5f7e5 <+337>: movntdq %xmm0,0x10(%rdi,%rcx,1)
0x00007fff92a5f7eb <+343>: movntdq %xmm0,0x20(%rdi,%rcx,1)
0x00007fff92a5f7f1 <+349>: movntdq %xmm0,0x30(%rdi,%rcx,1)
0x00007fff92a5f7f7 <+355>: add $0x40,%rcx
=> 0x00007fff92a5f7fb <+359>: jne 0x7fff92a5f7e0 <memset_pattern4+332>
0x00007fff92a5f7fd <+361>: sfence
So, what makes Apple's sauce that much more magical than mine, I wonder...
EDIT 2: I was wrong twice here. Apple's magic sauce is not that magical: I was just passing in an array that was 4x smaller than what I was passing to my own functions (credit to @PaulR for noticing!). Secondly, I was editing the assembly of the function, but gcc had already inlined it, so I was editing a copy that was never used.
CONCLUSION:
Some other things I found out:
- gcc inlines the memset call, clearing up another confusing result I had.
- memset is really well optimized: it automatically switches between regular stores and non-temporal (stream) stores based on the length of the array to write (a sketch of this kind of dispatch follows below). I'm not sure how much of this is true on platforms other than OS X.
EDIT: I recently stumbled upon the Intel optimization guide; if you're at all interested in these things, read some parts of it first (start at section 3.7.6, perhaps).
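As an aside, here is a minimal sketch of the kind of size-based dispatch described above. The threshold is an assumption for illustration, not the actual cutoff any library uses; it assumes a 16-byte-aligned array and a multiple-of-4 length:

#include <stddef.h>
#include <x86intrin.h>

/* fill nfloats floats with value: cached stores for small buffers,
 * non-temporal stores for large ones; the 1 MiB threshold is assumed */
static void fill_float(float *arr, size_t nfloats, float value) {
    const size_t nt_threshold = (1u << 20) / sizeof(float);
    const __m128 buf = _mm_set1_ps(value);
    if (nfloats < nt_threshold) {
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_store_ps(&arr[i], buf);   /* stays in cache */
    } else {
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_stream_ps(&arr[i], buf);  /* bypasses cache */
        _mm_sfence();
    }
}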
I think you have a couple of bugs here:
void func0(float * const restrict arr, size_t length) {
memset(arr, 0x05, length);
}
and similarly here:
void funcB(float * const restrict arr, size_t length) {
const float val = 5.0f;
memset_pattern4(arr, &val,length);
}
These should actually be:
void func0(float * const restrict arr, size_t length) {
memset(arr, 0x05, length * sizeof(float));
}
and:
void funcB(float * const restrict arr, size_t length) {
const float val = 5.0f;
memset_pattern4(arr, &val, length * sizeof(float));
}
This will give timings which are 4x more optimistic than they should be for these two cases, since they only write length bytes instead of length * sizeof(float) bytes.
On my 3 year old Core i7 MacBook Pro (8 GB RAM) the fixed code gives me:
bench L3-HUGE, array 402653184 bytes (100663296 floats, 0 remainder, 0x108ed8040 pointer)
warm up round...done
99% ( 69.43037 ms) : MEMSET CHEAT
106% ( 73.98113 ms) : MEMSET PATTER
100% ( 72.40429 ms) : NAIVE NORMAL
120% ( 83.98352 ms) : NAIVE UNROLL
102% ( 71.75789 ms) : STREAM NORMAL
102% ( 71.59420 ms) : STREAM UNROLL
115% ( 80.63817 ms) : STORE NORMAL
123% ( 86.58758 ms) : STORE UNROLL
123% ( 86.22740 ms) : STORE PREFET
bench L3-MASS, array 1610612736 bytes (402653184 floats, 0 remainder, 0x108ed8040 pointer)
warm up round...done
83% ( 274.71955 ms) : MEMSET CHEAT
83% ( 275.19793 ms) : MEMSET PATTER
100% ( 272.21942 ms) : NAIVE NORMAL
94% ( 309.73151 ms) : NAIVE UNROLL
82% ( 271.38751 ms) : STREAM NORMAL
82% ( 270.27244 ms) : STREAM UNROLL
94% ( 308.49498 ms) : STORE NORMAL
94% ( 308.72266 ms) : STORE UNROLL
95% ( 311.64157 ms) : STORE PREFET