 

SSE micro-optimization instruction order

I have noticed that sometimes MSVC 2010 doesn't reorder SSE instructions at all. I thought I didn't have to care about instruction order inside my loop since the compiler handles that best, which doesn't seem to be the case.

How should I think about this? What determines the best instruction order? I know some instructions have higher latency than others and that some instructions can run in parallel/asynchronously at the CPU level. Which metrics are relevant in this context? Where can I find them?

I know that I could sidestep this question by profiling, but such profilers are expensive (VTune XE) and I would like to understand the theory behind it, not just the empirical results.

Also, should I care about software prefetching (_mm_prefetch), or can I assume that the CPU will do a better job than I can?

Let's say I have the following function. Should I interleave some of the instructions? Should I do the stores before the streams, do all the loads in order and then do the calculations, etc.? Do I need to consider USWC vs. non-USWC, and temporal vs. non-temporal?

            auto cur128     = reinterpret_cast<__m128i*>(cur);
            auto prev128    = reinterpret_cast<const __m128i*>(prev);
            auto dest128    = reinterpret_cast<__m128i*>(dest);
            auto end        = cur128 + count/16;

            while(cur128 != end)            
            {
                auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
                auto xmm1 = _mm_add_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
                auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
                auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));

                                    // dest128 is USWC memory
                _mm_stream_si128(dest128+0, xmm0);  
                _mm_stream_si128(dest128+1, xmm1);
                _mm_stream_si128(dest128+2, xmm2);
                _mm_stream_si128(dest128+3, xmm3);

                                    // cur128 is temporal, and will be used next time, which is why I choose store over stream
                _mm_store_si128 (cur128+0, xmm0);               
                _mm_store_si128 (cur128+1, xmm1);                   
                _mm_store_si128 (cur128+2, xmm2);                   
                _mm_store_si128 (cur128+3, xmm3);

                cur128  += 4;
                dest128 += 4;
                prev128 += 4;
            }

           std::swap(cur, prev);
asked Sep 01 '11 by ronag
2 Answers

I agree with everyone that testing and tweaking is the best approach. But there are a few tricks that can help.

First of all, MSVC does reorder SSE instructions. Your example is probably too simple or already optimal.

Generally speaking, if you have enough registers to do so, full interleaving tends to give the best results. To take it a step further, unroll your loops enough to use all the registers, but not so much that they spill. In your example, the loop is completely bound by memory accesses, so there isn't much room to do better.

In most cases, it isn't necessary to get the order of the instructions perfect to achieve optimal performance. As long as it's "close enough", either the compiler, or the hardware's out-of-order execution will fix it for you.

The method I use to determine if my code is optimal is critical-path and bottleneck analysis. After I write the loop, I look up which instructions use which resources. Using that information, I can calculate an upper bound on performance, which I then compare with the actual results to see how close/far I am from optimal.

For example, suppose I have a loop with 100 adds and 50 multiplies. On both Intel and AMD (pre-Bulldozer), each core can sustain one SSE/AVX add and one SSE/AVX multiply per cycle. Since my loop has 100 adds, I know I cannot do any better than 100 cycles. Yes, the multiplier will be idle half the time, but the adder is the bottleneck.

Now I go and time my loop and I get 105 cycles per iteration. That means I'm pretty close to optimal and there's not much more to gain. But if I get 250 cycles, then that means something's wrong with the loop and it's worth tinkering with it more.

Critical-path analysis follows the same idea. Look up the latencies for all the instructions and find the cycle time of the critical path of the loop. If your actual performance is very close to it, you're already optimal.

Agner Fog has a great reference for the internal details of the current processors: http://www.agner.org/optimize/microarchitecture.pdf

answered Sep 29 '22 by Mysticial

I've just built this with the VS2010 32-bit compiler and I get the following:

void F (void *cur, const void *prev, void *dest, int count)
{
00901000  push        ebp  
00901001  mov         ebp,esp  
00901003  and         esp,0FFFFFFF8h  
  __m128i *cur128     = reinterpret_cast<__m128i*>(cur);
00901006  mov         eax,220h  
0090100B  jmp         F+10h (901010h)  
0090100D  lea         ecx,[ecx]  
  const __m128i *prev128    = reinterpret_cast<const __m128i*>(prev);
  __m128i *dest128    = reinterpret_cast<__m128i*>(dest);
  __m128i *end        = cur128 + count/16;

  while(cur128 != end)            
  {
    auto xmm0 = _mm_add_epi8(_mm_load_si128(cur128+0), _mm_load_si128(prev128+0));
00901010  movdqa      xmm0,xmmword ptr [eax-220h]  
    auto xmm1 = _mm_add_epi8(_mm_load_si128(cur128+1), _mm_load_si128(prev128+1));
00901018  movdqa      xmm1,xmmword ptr [eax-210h]  
    auto xmm2 = _mm_add_epi8(_mm_load_si128(cur128+2), _mm_load_si128(prev128+2));
00901020  movdqa      xmm2,xmmword ptr [eax-200h]  
    auto xmm3 = _mm_add_epi8(_mm_load_si128(cur128+3), _mm_load_si128(prev128+3));
00901028  movdqa      xmm3,xmmword ptr [eax-1F0h]  
00901030  paddb       xmm0,xmmword ptr [eax-120h]  
00901038  paddb       xmm1,xmmword ptr [eax-110h]  
00901040  paddb       xmm2,xmmword ptr [eax-100h]  
00901048  paddb       xmm3,xmmword ptr [eax-0F0h]  

    // dest128 is USWC memory
    _mm_stream_si128(dest128+0, xmm0);  
00901050  movntdq     xmmword ptr [eax-20h],xmm0  
    _mm_stream_si128(dest128+1, xmm1);
00901055  movntdq     xmmword ptr [eax-10h],xmm1  
    _mm_stream_si128(dest128+2, xmm2);
0090105A  movntdq     xmmword ptr [eax],xmm2  
    _mm_stream_si128(dest128+3, xmm3);
0090105E  movntdq     xmmword ptr [eax+10h],xmm3  

    // cur128 is temporal, and will be used next time, which is why I choose store over stream
    _mm_store_si128 (cur128+0, xmm0);               
00901063  movdqa      xmmword ptr [eax-220h],xmm0  
    _mm_store_si128 (cur128+1, xmm1);                   
0090106B  movdqa      xmmword ptr [eax-210h],xmm1  
    _mm_store_si128 (cur128+2, xmm2);                   
00901073  movdqa      xmmword ptr [eax-200h],xmm2  
    _mm_store_si128 (cur128+3, xmm3);
0090107B  movdqa      xmmword ptr [eax-1F0h],xmm3  

    cur128  += 4;
00901083  add         eax,40h  
00901086  lea         ecx,[eax-220h]  
0090108C  cmp         ecx,10h  
0090108F  jne         F+10h (901010h)  
    dest128 += 4;
    prev128 += 4;
  }
}

which shows that the compiler is reordering the instructions, following the general rule of "don't use a register immediately after writing to it". It has also turned two loads and an add into a single load and an add-from-memory. There's no reason you couldn't write the code like this yourself and use all the SIMD registers rather than the four you're currently using. You may want to match the total number of bytes loaded to the size of a cache line. This gives the hardware prefetcher a chance to fill the next cache line before you need it.

Also, prefetching, especially in code that reads memory sequentially, is often not necessary. The hardware prefetcher can track up to four streams at a time.

answered Sep 29 '22 by Skizz