
How to improve performance of following loop

I have a simple loop in C where I convert magnitude and angle to real and imaginary parts. I have two versions of the loop. Version 1 is a plain for loop that performs the conversion with the following code:

for(k = 0; k < n; k++){
    xReal[k] = Mag[k] * cos(Angle[k]);
    xImag[k] = Mag[k] * sin(Angle[k]);
}

And Version 2, where AVX intrinsics are used to vectorize the loop (SysData->totNumOfBus is the same n as above):

__m256d cosVec, sinVec;
__m256d resultReal, resultImag;
__m256d angVec, voltVec;
for(k = 0; k < SysData->totNumOfBus; k+=4){

    voltVec = _mm256_loadu_pd(volt + k);
    angVec = _mm256_loadu_pd(theta + k);

    sinVec = _mm256_sincos_pd(&cosVec, angVec);

    resultImag = _mm256_mul_pd(voltVec, sinVec);
    resultReal = _mm256_mul_pd(voltVec, cosVec);

    _mm256_storeu_pd(xReal+k, resultReal);
    _mm256_storeu_pd(xImag+k, resultImag);

}

On a Core i7 2600k @3.4GHz processor, these loops give the following results:

Version 1: n = 18562320, Time: 0.2sec
Version 2: n = 18562320, Time: 0.16sec

A simple calculation with these values shows that in Version 1 each iteration takes almost 36 cycles, whereas each iteration of Version 2 (which processes four elements) takes about 117 cycles. Considering that evaluating sine and cosine is inherently expensive, these numbers do not seem terrible. However, this loop is a serious bottleneck in my function, as profiling shows that almost 1/3 of the total time is spent inside it. So I am wondering whether there is any way to speed this loop up (e.g. by calculating the sine and cosine functions differently). Any help working around this problem, or pointers on whether there is room to improve the performance of this loop, would be appreciated.

Thanks in advance for your help

PS: I am compiling with icc. I should also mention that the data are not aligned (and cannot be); in any case, aligning the data gave only a minor performance improvement (less than 1 percent).

Asked by Pouya, Aug 12 '13


2 Answers

I recommend building a Taylor-series-based sin/cos and using _mm256_stream_pd() to store the data. Here is a basic sample (a Factorial() helper that computes n! is assumed):

    __m256d sin_req[10];
    __m256d cos_req[10];
    __m256d one_pd =  _mm256_set1_pd(1.0);

    for(int i=0; i<10; ++i)
    {
        sin_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+1) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+1) );
        cos_req[i] = i%2 == 0 ? _mm256_set1_pd(-1.0/Factorial((i+1)*2+0) ) : _mm256_set1_pd(+1.0/Factorial((i+1)*2+0) );
    }

    for(int i=0; i<count; i+=4)
    {
            __m256d voltVec = _mm256_load_pd(volt + i);
            __m256d angVec = _mm256_load_pd(theta + i);

            // sin/cos by taylor series
            __m256d angleSq = angVec * angVec;
            __m256d sinVec = angVec;
            __m256d cosVec = one_pd;
            __m256d sin_serise = sinVec;
            __m256d cos_serise = one_pd;
            for(int j=0; j<10; ++j)
            {
                sin_serise = sin_serise * angleSq; // [1]
                cos_serise = cos_serise * angleSq;
                sinVec = sinVec + sin_serise * sin_req[j];
                cosVec = cosVec + cos_serise * cos_req[j];
            }

            __m256d resultReal = voltVec * cosVec; // real part uses cos
            __m256d resultImag = voltVec * sinVec; // imaginary part uses sin

            _mm256_store_pd(xReal + i, resultReal);
            _mm256_store_pd(xImag + i, resultImag );
    }

I could get 57~58 CPU cycles per 4-component calculation.

I searched Google and ran some tests on the accuracy of my sin/cos. Some articles say 10 iterations give double-precision accuracy for -M_PI/2 < angle < +M_PI/2, and my test results show it is more accurate than math.h's sin/cos over the range -M_PI < angle < +M_PI. You can increase the iteration count for more accuracy at larger angle values if needed.
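A scalar model of the same series makes that accuracy claim easy to check against math.h (hypothetical taylor_sin/taylor_cos names; 10 correction terms beyond the leading term, matching the vector code's coefficient pattern):

```c
#include <math.h>

/* sin x = x - x^3/3! + x^5/5! - ...  (10 correction terms) */
double taylor_sin(double x)
{
    double x2 = x * x, term = x, sum = x, sign = -1.0;
    for (int i = 1; i <= 10; i++) {
        term *= x2 / ((2 * i) * (2 * i + 1));  /* x^(2i+1)/(2i+1)! */
        sum  += sign * term;
        sign  = -sign;
    }
    return sum;
}

/* cos x = 1 - x^2/2! + x^4/4! - ...  (10 correction terms) */
double taylor_cos(double x)
{
    double x2 = x * x, term = 1.0, sum = 1.0, sign = -1.0;
    for (int i = 1; i <= 10; i++) {
        term *= x2 / ((2 * i - 1) * (2 * i));  /* x^(2i)/(2i)! */
        sum  += sign * term;
        sign  = -sign;
    }
    return sum;
}
```

The truncation error of the first dropped term, x^23/23!, is about 3.6e-12 at x = 3, which is consistent with the claimed accuracy near the edge of the -M_PI..+M_PI range.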

However, I'll go deeper into optimizing this code. It has a latency problem in the Taylor-series calculation: AVX multiply latency is 5 CPU cycles, which means one iteration of the inner loop cannot run faster than 5 cycles, because the line marked [1] uses the result from the previous iteration.

We can simply unroll it like this:

    for(int i=0; i<count; i+=8)
    {
        __m256d voltVec0 = _mm256_load_pd(volt + i + 0);
        __m256d voltVec1 = _mm256_load_pd(volt + i + 4);
        __m256d angVec0  = _mm256_load_pd(theta + i + 0);
        __m256d angVec1  = _mm256_load_pd(theta + i + 4);
        __m256d sinVec0;
        __m256d sinVec1;
        __m256d cosVec0;
        __m256d cosVec1;

        __m256d angleSq0 = angVec0 * angVec0;
        __m256d angleSq1 = angVec1 * angVec1;
        sinVec0 = angVec0;
        sinVec1 = angVec1;
        cosVec0 = one_pd;
        cosVec1 = one_pd;
        __m256d sin_serise0 = sinVec0;
        __m256d sin_serise1 = sinVec1;
        __m256d cos_serise0 = one_pd;
        __m256d cos_serise1 = one_pd;

        for(int j=0; j<10; ++j)
        {
            sin_serise0 = sin_serise0 * angleSq0;
            cos_serise0 = cos_serise0 * angleSq0;
            sin_serise1 = sin_serise1 * angleSq1;
            cos_serise1 = cos_serise1 * angleSq1;
            sinVec0 = sinVec0 + sin_serise0 * sin_req[j];
            cosVec0 = cosVec0 + cos_serise0 * cos_req[j];
            sinVec1 = sinVec1 + sin_serise1 * sin_req[j];
            cosVec1 = cosVec1 + cos_serise1 * cos_req[j];
        }

        __m256d realResult0 = voltVec0 * cosVec0; // real part uses cos
        __m256d imagResult0 = voltVec0 * sinVec0;
        __m256d realResult1 = voltVec1 * cosVec1;
        __m256d imagResult1 = voltVec1 * sinVec1;

        _mm256_store_pd(xReal + i + 0, realResult0);
        _mm256_store_pd(xImag + i + 0, imagResult0);
        _mm256_store_pd(xReal + i + 4, realResult1);
        _mm256_store_pd(xImag + i + 4, imagResult1);
    }

This gives 51~51.5 cycles per 4-component calculation (102~103 cycles for 8 components).

It eliminates the multiply latency in the Taylor-series loop and keeps the AVX multiply unit about 85% busy. Unrolling solves a lot of latency problems, as long as it does not spill registers to memory. Generate an asm file while compiling and check how your compiler handles your code. I tried unrolling further, but the result was worse because the loop no longer fit in the 16 AVX registers.
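The same dependency-breaking trick in scalar form, as a minimal illustration of why two independent chains help (hypothetical product-reduction example, not from the answer's code):

```c
#include <stddef.h>

/* A single multiply chain stalls ~5 cycles per iteration waiting for
 * the previous result.  Two independent chains let the pipelined
 * multiplier overlap the work; the chains are combined once at the end. */
double product_two_chains(const double *a, size_t n)  /* n assumed even */
{
    double p0 = 1.0, p1 = 1.0;
    for (size_t i = 0; i < n; i += 2) {
        p0 *= a[i];      /* chain 0 */
        p1 *= a[i + 1];  /* chain 1, independent of chain 0 */
    }
    return p0 * p1;
}
```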

Now for the memory optimization: replace _mm256_store_pd() with _mm256_stream_pd().

    _mm256_stream_pd(xReal + i + 0, realResult0);
    _mm256_stream_pd(xImag + i + 0, imagResult0);
    _mm256_stream_pd(xReal + i + 4, realResult1);
    _mm256_stream_pd(xImag + i + 4, imagResult1);

Replacing the memory-write code gives 48 cycles per 4-component calculation.

_mm256_stream_pd() is always faster if you are not going to read the data back soon. It bypasses the cache hierarchy and sends the data directly to the memory controller, so it does not pollute your cache; you get more data-bus bandwidth and cache space for reading data. (Note that, like _mm256_store_pd(), it requires a 32-byte-aligned address.)
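Since the streaming (and aligned) stores need 32-byte-aligned destinations, the output buffers would have to be allocated accordingly; a sketch using C11 aligned_alloc with a hypothetical helper name (and remember an _mm_sfence() after the store loop if another thread will read the results):

```c
#include <stdlib.h>

/* Allocate n doubles on a 32-byte boundary.  aligned_alloc requires
 * the size to be a multiple of the alignment, so round it up. */
double *alloc_avx_buffer(size_t n)
{
    size_t bytes = (n * sizeof(double) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);
}
```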

Let's try prefetching:

    for(int i=0; i<count; i+=8)
    {
        _mm_prefetch((const char *)(volt + i + 5 * 8), _MM_HINT_T0);
        _mm_prefetch((const char *)(theta + i + 5 * 8), _MM_HINT_T0);

        // calculations here.
    }

Now I get 45.6~45.8 CPU cycles per 4-component calculation, with the AVX multiply unit 94% busy.

Prefetching hints the cache for faster reads. I recommend prefetching 400~500 CPU cycles ahead, based on physical memory's RAS-CAS latency. Physical memory latency can be up to 300 cycles in the worst case; it varies with the hardware configuration, and will not drop below 200 cycles even with expensive low-RAS-CAS-delay memory.

0.064 sec (count = 18562320)

End of sin/cos optimization. :-)

Answered by zupet, Nov 14 '22


Please check:

  1. whether the start addresses of the arrays are aligned to 32 bytes; the i7 supports high-latency unaligned AVX loads without complaining with a "bus error", so misalignment can silently cost you.

  2. the cache hit and miss rates, using a profiling tool. Memory access looks like the bottleneck of Version 2 of the loop.

  3. whether you can lower the precision, or use a lookup table for the sine and cosine calculation.

  4. how much performance improvement you can realistically achieve. Since the loop takes only about 1/3 of the total running time, even optimizing it down to zero would improve overall performance by only about 33%.

Answered by Kun Ling, Nov 14 '22