Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compilation of a simple c++ program using SSE intrinsics

Tags:

c++

x86

simd

sse

I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming

I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU

Here is a code based on the article which I attempted:

For two arrays of length ARRAY_SIZE it calculates

fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

Here is the code

#include <iostream>
#include <iomanip>
#include <ctime>
#include <stdlib.h>
#include <xmmintrin.h> // Contain the SSE compiler intrinsics
#include <malloc.h>
void myssefunction(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize/ 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;


    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}

int main(int argc, char *argv[])
{
  int ARRAY_SIZE = atoi(argv[1]);
  float* m_fArray1 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray2 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray3 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

  for (int i = 0; i < ARRAY_SIZE; ++i)
    {
      m_fArray1[i] = ((float)rand())/RAND_MAX;
      m_fArray2[i] = ((float)rand())/RAND_MAX;
    }

  myssefunction(m_fArray1 , m_fArray2 , m_fArray3, ARRAY_SIZE);

  _aligned_free(m_fArray1);
   _aligned_free(m_fArray2);
   _aligned_free(m_fArray3);

  return 0;
}

I get the following compiltation error

[Programming/SSE]$ g++ -g -Wall -msse sseintro.cpp 
sseintro.cpp: In function ‘int main(int, char**)’:
sseintro.cpp:41: error: ‘_aligned_malloc’ was not declared in this scope
sseintro.cpp:53: error: ‘_aligned_free’ was not declared in this scope
[Programming/SSE]$ 

Where am I messing up? Am I missing some header files? I seem to have included all the relevant ones.

like image 240
smilingbuddha Avatar asked Aug 21 '12 13:08

smilingbuddha


2 Answers

_aligned_malloc and _aligned_free are Microsoft-isms. Use posix_memalign or memalign on Linux et al. For Mac OS X you can just use malloc, as it is always 16 byte aligned. For portable SSE code you generally want to implement wrapper functions for aligned memory allocations, e.g.

void * malloc_simd(const size_t size)
{
#if defined WIN32           // WIN32
    return _aligned_malloc(size, 16);
#elif defined __linux__     // Linux
    return memalign(16, size);
#elif defined __MACH__      // Mac OS X
    return malloc(size);
#else                       // other (use valloc for page-aligned memory)
    return valloc(size);
#endif
}

Implementation of free_simd is left as an exercise for the reader.

like image 118
Paul R Avatar answered Oct 11 '22 21:10

Paul R


Short answer: use _mm_malloc and _mm_free from xmmintrin.h instead of _aligned_malloc and _aligned_free.

Discussion

You should not use _aligned_malloc, _aligned_free, posix_memalign, memalign, or whatever else when you are writing SSE/AVX code. These are all compiler/platform-specific functions (either MSVC or GCC or POSIX).

Intel introduced functions _mm_malloc and _mm_free in Intel compiler specifically for SIMD computations (see this). The other compilers with x86 target architecture added them too (just as they add Intel intrinsics regularly). In this sense they are the only cross-platform solution: they should be available in every compiler supporting SSE.

These functions are declared in xmmintrin.h header. Any header for later SSE/AVX version automatically includes previous ones, so it would be enough to include only smmintrin.h or emmintrin.h for example.

like image 23
stgatilov Avatar answered Oct 11 '22 21:10

stgatilov