
How to improve performance of following loop

I have a simple loop in C where I convert magnitude and angle to real and imaginary parts. I have two versions of the loop. Version 1 is a plain for loop that performs the conversion with the following code:

for(k = 0; k < n; k++){
    xReal[k] = Mag[k] * cos(Angle[k]);
    xImag[k] = Mag[k] * sin(Angle[k]);
}

And Version 2, where AVX intrinsics are used to vectorize the loop (SysData->totNumOfBus is the same n as above):

__m256d cosVec, sinVec;
__m256d resultReal, resultImag;
__m256d angVec, voltVec;
for(k = 0; k < SysData->totNumOfBus; k+=4){

    voltVec = _mm256_loadu_pd(volt + k);
    angVec = _mm256_loadu_pd(theta + k);

    sinVec = _mm256_sincos_pd(&cosVec, angVec);

    resultImag = _mm256_mul_pd(voltVec, sinVec);
    resultReal = _mm256_mul_pd(voltVec, cosVec);

    _mm256_storeu_pd(xReal+k, resultReal);
    _mm256_storeu_pd(xImag+k, resultImag);

}

On a Core i7 2600k @3.4GHz processor, these loops give the following results:

Version 1: n = 18562320, Time: 0.2sec
Version 2: n = 18562320, Time: 0.16sec

A simple calculation with these values shows that in Version 1 each iteration takes almost 36 cycles, whereas each iteration of Version 2 (which processes four elements) takes about 117 cycles. Considering that evaluating sine and cosine is inherently expensive, these numbers do not seem terrible. However, this loop is a serious bottleneck in my function, as profiling shows that almost 1/3 of the total time is spent inside it. So I am wondering whether there is any way to speed this loop up (e.g. by calculating the sine and cosine functions differently). Any help working around this problem, or pointers on whether there is room to improve the performance of this loop, would be appreciated.

Thanks in advance for your help

PS: I am compiling with icc. I should also mention that the data are not aligned (and cannot be); in any case, aligning the data gave only a minor performance improvement (less than 1 percent).

Asked by Pouya, Aug 12 '13


2 Answers

I recommend building a Taylor-series-based sin/cos and using _mm256_stream_pd() to store the data. Here is a basic sample (a Factorial() helper that computes n! is assumed):

    __m256d sin_req[10];
    __m256d cos_req[10];
    __m256d one_pd =  _mm256_set1_pd(1.0);

    for(int i=0; i<10; ++i)
    {
        sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+1) );
        cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+0) );
    }

    for(int i=0; i<count; i+=4)
    {
            __m256d voltVec = _mm256_load_pd(volt + i);
            __m256d angVec = _mm256_load_pd(theta + i);

            // sin/cos by taylor series
            __m256d angleSq = angVec * angVec;
            __m256d sinVec = angVec;
            __m256d cosVec = one_pd;
            __m256d sin_serise = sinVec;
            __m256d cos_serise = one_pd;
            for(int j=0; j<10; ++j)
            {
                sin_serise = sin_serise * angleSq; // [1]
                cos_serise = cos_serise * angleSq;
                sinVec = sinVec + sin_serise * sin_req[j];
                cosVec = cosVec + cos_serise * cos_req[j];
            }

            __m256d resultReal = voltVec * cosVec; // real part uses cos
            __m256d resultImag = voltVec * sinVec; // imaginary part uses sin

            _mm256_store_pd(xReal + i, resultReal);
            _mm256_store_pd(xImag + i, resultImag );
    }

I could get 57~58 CPU cycles per 4-component calculation.

I searched Google and ran some tests on the accuracy of my sin/cos. Some articles say 10 iterations give double-precision accuracy for -M_PI/2 < angle < +M_PI/2, and my test results show it is more accurate than math.h's sin/cos over the range -M_PI < angle < +M_PI. You can increase the iteration count for more accuracy at larger angle values if needed.
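A scalar model of the same series makes that accuracy claim easy to check against math.h (hypothetical taylor_sin/taylor_cos names; 10 correction terms beyond the leading term, matching the vector code's coefficient pattern):

```c
#include <math.h>

/* sin x = x - x^3/3! + x^5/5! - ...  (10 correction terms) */
double taylor_sin(double x)
{
    double x2 = x * x, term = x, sum = x, sign = -1.0;
    for (int i = 1; i <= 10; i++) {
        term *= x2 / ((2 * i) * (2 * i + 1));  /* x^(2i+1)/(2i+1)! */
        sum  += sign * term;
        sign  = -sign;
    }
    return sum;
}

/* cos x = 1 - x^2/2! + x^4/4! - ...  (10 correction terms) */
double taylor_cos(double x)
{
    double x2 = x * x, term = 1.0, sum = 1.0, sign = -1.0;
    for (int i = 1; i <= 10; i++) {
        term *= x2 / ((2 * i - 1) * (2 * i));  /* x^(2i)/(2i)! */
        sum  += sign * term;
        sign  = -sign;
    }
    return sum;
}
```

The truncation error of the first dropped term, x^23/23!, is about 3.6e-12 at x = 3, which is consistent with the claimed accuracy near the edge of the -M_PI..+M_PI range.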

However, I'll go deeper into optimizing this code. It has a latency problem in the Taylor-series calculation: AVX multiply latency is 5 CPU cycles, which means one iteration of the inner loop cannot run faster than 5 cycles, because the line marked [1] uses the result from the previous iteration.

We can simply unroll it like this:

    for(int i=0; i<count; i+=8)
    {
        __m256d voltVec0 = _mm256_load_pd(volt + i + 0);
        __m256d voltVec1 = _mm256_load_pd(volt + i + 4);
        __m256d angVec0  = _mm256_load_pd(theta + i + 0);
        __m256d angVec1  = _mm256_load_pd(theta + i + 4);
        __m256d sinVec0;
        __m256d sinVec1;
        __m256d cosVec0;
        __m256d cosVec1;

        __m256d angleSq0 = angVec0 * angVec0;
        __m256d angleSq1 = angVec1 * angVec1;
        sinVec0 = angVec0;
        sinVec1 = angVec1;
        cosVec0 = one_pd;
        cosVec1 = one_pd;
        __m256d sin_serise0 = sinVec0;
        __m256d sin_serise1 = sinVec1;
        __m256d cos_serise0 = one_pd;
        __m256d cos_serise1 = one_pd;

        for(int j=0; j<10; ++j)
        {
            sin_serise0 = sin_serise0 * angleSq0;
            cos_serise0 = cos_serise0 * angleSq0;
            sin_serise1 = sin_serise1 * angleSq1;
            cos_serise1 = cos_serise1 * angleSq1;
            sinVec0 = sinVec0 + sin_serise0 * sin_req[j];
            cosVec0 = cosVec0 + cos_serise0 * cos_req[j];
            sinVec1 = sinVec1 + sin_serise1 * sin_req[j];
            cosVec1 = cosVec1 + cos_serise1 * cos_req[j];
        }

        __m256d realResult0 = voltVec0 * cosVec0; // real part uses cos
        __m256d imagResult0 = voltVec0 * sinVec0;
        __m256d realResult1 = voltVec1 * cosVec1;
        __m256d imagResult1 = voltVec1 * sinVec1;

        _mm256_store_pd(xReal + i + 0, realResult0);
        _mm256_store_pd(xImag + i + 0, imagResult0);
        _mm256_store_pd(xReal + i + 4, realResult1);
        _mm256_store_pd(xImag + i + 4, imagResult1);
    }

This gives 51~51.5 cycles per 4-component calculation (102~103 cycles for 8 components).

It eliminates the multiply latency in the Taylor-series loop and keeps the AVX multiply unit about 85% busy. Unrolling solves a lot of latency problems, as long as it does not spill registers to memory. Generate an asm file while compiling and check how your compiler handles your code. I tried unrolling further, but the result was worse because the loop no longer fit in the 16 AVX registers.
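The same dependency-breaking trick in scalar form, as a minimal illustration of why two independent chains help (hypothetical product-reduction example, not from the answer's code):

```c
#include <stddef.h>

/* A single multiply chain stalls ~5 cycles per iteration waiting for
 * the previous result.  Two independent chains let the pipelined
 * multiplier overlap the work; the chains are combined once at the end. */
double product_two_chains(const double *a, size_t n)  /* n assumed even */
{
    double p0 = 1.0, p1 = 1.0;
    for (size_t i = 0; i < n; i += 2) {
        p0 *= a[i];      /* chain 0 */
        p1 *= a[i + 1];  /* chain 1, independent of chain 0 */
    }
    return p0 * p1;
}
```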

Now for the memory optimization: replace _mm256_store_pd() with _mm256_stream_pd().

    _mm256_stream_pd(xReal + i + 0, realResult0);
    _mm256_stream_pd(xImag + i + 0, imagResult0);
    _mm256_stream_pd(xReal + i + 4, realResult1);
    _mm256_stream_pd(xImag + i + 4, imagResult1);

Replacing the memory-write code gives 48 cycles per 4-component calculation.

_mm256_stream_pd() is always faster if you are not going to read the data back soon. It bypasses the cache hierarchy and sends the data directly to the memory controller, so it does not pollute your cache; you get more data-bus bandwidth and cache space for reading data. (Note that, like _mm256_store_pd(), it requires a 32-byte-aligned address.)
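Since the streaming (and aligned) stores need 32-byte-aligned destinations, the output buffers would have to be allocated accordingly; a sketch using C11 aligned_alloc with a hypothetical helper name (and remember an _mm_sfence() after the store loop if another thread will read the results):

```c
#include <stdlib.h>

/* Allocate n doubles on a 32-byte boundary.  aligned_alloc requires
 * the size to be a multiple of the alignment, so round it up. */
double *alloc_avx_buffer(size_t n)
{
    size_t bytes = (n * sizeof(double) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);
}
```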

Let's try prefetching:

    for(int i=0; i<count; i+=8)
    {
        _mm_prefetch((const char *)(volt + i + 5 * 8), _MM_HINT_T0);
        _mm_prefetch((const char *)(theta + i + 5 * 8), _MM_HINT_T0);

        // calculations here.
    }

Now I get 45.6~45.8 CPU cycles per 4-component calculation, with the AVX multiply unit 94% busy.

Prefetching hints the cache for faster reads. I recommend prefetching 400~500 CPU cycles ahead, based on physical memory's RAS-CAS latency. Physical memory latency can be up to 300 cycles in the worst case; it varies with the hardware configuration, and will not drop below 200 cycles even with expensive low-RAS-CAS-delay memory.

0.064 sec (count = 18562320)

End of sin/cos optimization. :-)

Answered by zupet, Nov 14 '22


Please check:

  1. whether the start addresses of the arrays are aligned to 32 bytes; the i7 supports high-latency unaligned AVX loads without complaining with a "bus error", so misalignment can silently cost you.

  2. the cache hit and miss rates, using a profiling tool. Memory access looks like the bottleneck of Version 2 of the loop.

  3. whether you can lower the precision, or use a lookup table for the sine and cosine calculation.

  4. how much performance improvement you can realistically achieve. Since the loop takes only about 1/3 of the total running time, even optimizing it down to zero would improve overall performance by only about 33%.

Answered by Kun Ling, Nov 14 '22