
Using SSE to speed up computation - store, load and alignment

Tags:

c++

sse

In my project I have implemented a basic class, CVector. This class contains a float* pointer to a raw floating-point array, which is allocated dynamically using the standard malloc() function.

Now I have to speed up some computations on these vectors. Unfortunately, since the memory isn't allocated with _mm_malloc(), it is not guaranteed to be 16-byte aligned.

As I understand I have two options:

1) Rewrite the code that allocates memory to use _mm_malloc(), and then use code like this:

void sub(float* v1, float* v2, float* v3, int size) 
{  
    __m128* p_v1 = (__m128*)v1;  
    __m128* p_v2 = (__m128*)v2;  
    __m128 res;

    for(int i = 0; i < size/4; ++i)  
    {  
        res = _mm_sub_ps(*p_v1,*p_v2);  
        _mm_store_ps(v3,res);  
        ++p_v1;  
        ++p_v2;  
        v3 += 4;  
    }
}

2) The second option is to use the _mm_loadu_ps() intrinsic to load each __m128 from unaligned memory and then use it for the computation:

void sub(float* v1, float* v2, float* v3, int size)
{  
    __m128 p_v1;  
    __m128 p_v2;  
    __m128 res;

    for(int i = 0; i < size/4; ++i)  
    {  
        p_v1 = _mm_loadu_ps(v1);   
        p_v2 = _mm_loadu_ps(v2);  
        res = _mm_sub_ps(p_v1,p_v2);    
        _mm_storeu_ps(v3,res);  // v3 is unaligned too, so the store must be the unaligned variant
        v1 += 4;  
        v2 += 4;  
        v3 += 4;  
    }
}

So my question is: which option will be better or faster?

user606521 asked Feb 25 '11 14:02


2 Answers

Reading unaligned SSE values is extraordinarily expensive. Check the Intel manuals, volume 4, chapter 2.2.5.1. The core type makes a difference; the i7 has extra hardware to make it less costly. But reading a value that straddles a CPU cache-line boundary is still 4.5 times slower than reading an aligned value, and ten times slower on previous architectures.

That's massive, so get the memory aligned to avoid that perf hit. I've never heard of _mm_malloc; use _aligned_malloc() from the Microsoft CRT to get properly aligned memory from the heap.
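As a rough illustration of the aligned route, here is a minimal sketch. It assumes GCC or Clang on x86, where <xmmintrin.h> also declares _mm_malloc() and _mm_free() (with the Microsoft CRT, _aligned_malloc() and _aligned_free() would take their place). The names alloc_aligned_floats, is_aligned16 and sub_aligned are hypothetical helpers made up for this example, not part of the question's CVector class. A scalar loop handles the elements left over when size is not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics; GCC/Clang also expose _mm_malloc/_mm_free here (assumption)

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// The result must be released with _mm_free(), not free().
inline float* alloc_aligned_floats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// v3[i] = v1[i] - v2[i]. All three pointers must be 16-byte aligned.
void sub_aligned(const float* v1, const float* v2, float* v3, std::size_t size)
{
    assert(is_aligned16(v1) && is_aligned16(v2) && is_aligned16(v3));

    std::size_t i = 0;
    // Main loop: four floats per iteration with aligned loads and stores.
    for (; i + 4 <= size; i += 4)
    {
        __m128 a = _mm_load_ps(v1 + i);
        __m128 b = _mm_load_ps(v2 + i);
        _mm_store_ps(v3 + i, _mm_sub_ps(a, b));
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < size; ++i)
        v3[i] = v1[i] - v2[i];
}
```

If the buffers cannot be reallocated, the same loop with _mm_loadu_ps() and _mm_storeu_ps() stays correct for any alignment; which variant is faster on a given CPU is something to measure rather than assume.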

Hans Passant answered Oct 21 '22 06:10


Take a look at Bullet Physics. It has been used for a handful of movies and well-known games (GTA4 and others). You can either study its heavily optimized vector, matrix, and other math classes, or just use them directly. It's published under the zlib license, so you can use it as you wish. Don't reinvent the wheel: Bullet, NVIDIA PhysX, Havok, and other physics libraries are well tested and optimized by really smart guys.

cppanda answered Oct 21 '22 08:10