I'm tried to improve performance of copy operation via SSE and AVX:
#include <immintrin.h>
const int sz = 1024;
float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
float a=0;
std::generate(mas, mas+sz, [&](){return ++a;});
const int nn = 1000;//Number of iteration in tester loops
std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
//std::copy testing
start1 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
end1 = std::chrono::system_clock::now();
float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=4, _tar+=4)
{
__m128 buffer = _mm_load_ps(_mas);
_mm_store_ps(_tar, buffer);
}
}
end2 = std::chrono::system_clock::now();
float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
//AVX-copy testing
start3 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m256 buffer = _mm256_load_ps(_mas);
_mm256_store_ps(_tar, buffer);
}
}
end3 = std::chrono::system_clock::now();
float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
_mm_free(mas);
_mm_free(tar);
It works. However, while the number of iterations in tester-loops - nn - increases, performance gain of simd-copy decreases:
nn=10: SSE-gain=3, AVX-gain=6;
nn=100: SSE-gain=0.75, AVX-gain=1.5;
nn=1000: SSE-gain=0.55, AVX-gain=1.1;
Can anybody explain what is the reason of mentioned performance decrease effect and is it advisable to manually vectorization of copy operation?
The problem is that your test does a poor job to migrate some factors in the hardware that make benchmarking hard. To test this, I've made my own test case. Something like this:
for blah blah:
sleep(500ms)
std::copy
sse
axv
output:
SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy
So in this case, AVX is a bunch faster than std::copy
. What happens when I change to test case to..
for blah blah:
sleep(500ms)
sse
axv
std::copy
Notice that absolutely nothing changed, except the order of the tests.
SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy
Woah! how is that possible? The CPU takes a while to ramp up to full speed, so tests that are run later have an advantage. This question has 3 answers now, including an 'accepted' answer. But only the one with the lowest amount of upvotes was on the right track.
This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information of their setup. It isn't just the code that can go wrong. Power saving features and weird drivers can completely mess up your benchmark. One time i've measured an factor 7 difference in performance by toggling a switch in the bios that less than 1% of notebooks offer.
This is an very interesting question, but I believe non of the answers so far is correct because the question itself is so misleading.
The title should be changed to "How does one reach the theoretical memory I/O bandwidth ?"
No matter what instruction set is used, CPU is so much faster than RAM that pure block memory copy is 100% I/O bounded. And this explains why there is little difference between SSE and AVX performance.
For small buffers hot in L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell where 256b loads/stores really do use a 256b data path to L1D cache instead of splitting into two 128b operations.
Ironically, ancient X86 instruction rep stosq performs much better than SSE and AVX in terms of memory copy!
The article here explains how to saturate memory bandwidth really well and it has rich references to explore further as well.
See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and non-RFO stores done by rep stosb/stosq
) vs. regular stores, and how single-core memory bandwidth is often limited by max concurrency / latency, not by the memory controller itself.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With