Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SIMD code runs slower than scalar code

elma and elmc are both unsigned long arrays. So are res1 and res2.

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {
        //res1[i + k] ^= _mulpre1[u1][k];  
        //res2[i + k] ^= _mulpre2[u2][k];               

        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdc = _mm_xor_si128 (simda, simdb);  
        _mm_store_si128 (p, simdc);  
        res1[i + k] = simdstore[0];  
        res2[i + k] = simdstore[1];                     
    }     
}  

Within the for loop is included both the non-simd and simd version of XOR of elements. First two lines within the second for loop do the explicit XOR, whereas the rest implements the simd version of the same operation.

This loop is called from outside hundreds of times, so optimizing this loop will help bring down the total computation time.

The problem is simd code runs many times slower than the scalar code.

EDIT: Done partial unrolling

__m128i *p1, *p2, *p3, *p4;  
p1 = (__m128i *) simdstore1;  
p2 = (__m128i *) simdstore2;  
p3 = (__m128i *) simdstore3;  
p4 = (__m128i *) simdstore4;  

for (i = 0; i < 20; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k = k + 4)  
    {
        simda1  = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simda2  = _mm_set_epi64x (_mulpre2[u2][k + 1], _mulpre1[u1][k + 1]);  
        simda3  = _mm_set_epi64x (_mulpre2[u2][k + 2], _mulpre1[u1][k + 2]);  
        simda4  = _mm_set_epi64x (_mulpre2[u2][k + 3], _mulpre1[u1][k + 3]);  

        simdb1  = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdb2  = _mm_set_epi64x (res2[i + k + 1], res1[i + k + 1]);  
        simdb3  = _mm_set_epi64x (res2[i + k + 2], res1[i + k + 2]);  
        simdb4  = _mm_set_epi64x (res2[i + k + 3], res1[i + k + 3]);  

        simdc1  = _mm_xor_si128 (simda1, simdb1);  
        simdc2  = _mm_xor_si128 (simda2, simdb2);  
        simdc3  = _mm_xor_si128 (simda3, simdb3);  
        simdc4  = _mm_xor_si128 (simda4, simdb4);  

        _mm_store_si128 (p1, simdc1);  
        _mm_store_si128 (p2, simdc2);  
        _mm_store_si128 (p3, simdc3);  
        _mm_store_si128 (p4, simdc4);  

        res1[i + k]= simdstore1[0];  
        res2[i + k]= simdstore1[1]; 
        res1[i + k + 1]= simdstore2[0];  
        res2[i + k + 1]= simdstore2[1];   
        res1[i + k + 2]= simdstore3[0];  
        res2[i + k + 2]= simdstore3[1]; 
        res1[i + k + 3]= simdstore4[0];  
        res2[i + k + 3]= simdstore4[1];   
    }  
}  

But, the result does not change much; it still takes twice as long as scalar code.

like image 678
anup Avatar asked Dec 09 '10 04:12

anup


1 Answers

Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline since you try to access your results right away.

It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or storing the result into memory, you're stalling.

The best mode of operation when dealing with SSE or VMX is: Load, process, store. Load the data into your vector registers, do all the vector processing, then store it to memory.

I would recommend: Reserve several __m128i registers, unroll your loop several times, then store it.

EDIT: Also, if you unroll, and if you align res1 and res2 by 16 bytes, you can store your results directly in memory without going through this simdstore indirection, which is probably a LHS and another stall.

EDIT: Forgot the obvious. If your polylen is typically large, don't forget to do a data cache prefetch on every iteration.

like image 97
EboMike Avatar answered Sep 21 '22 06:09

EboMike