
Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times over a large data set. In some cases these take a considerable time to process.

I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so that once the data has been read into the cache there shouldn't be any cache misses to stall it. However, I'm not sure how to go about this.

  • Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.

  • I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, such as some kind of "run-time linker" that links in either the basic or the SSE-optimised code, based on the CPU running it, when the process is started?

  • What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.

Fire Lancer asked Dec 12 '09 19:12

1 Answer

For your second point, there are several solutions, as long as you can separate out the differences into different functions:

  • plain old C function pointers
  • dynamic linking (which generally relies on C function pointers)
  • if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.

Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher-level functionality, or you may lose whatever gains you get from the optimized instructions to the call overhead. In other words, don't abstract the individual SSE operations; abstract the work you're doing.
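To make the granularity point concrete, here is a rough sketch of what a pair of whole-array routines might look like. It is only an illustration, not code from the question: the names `scale_floats_plain` and `scale_floats_sse` and the `HAVE_SSE_INTRINSICS` macro are made up, and it works on `float` data because packed single-precision multiply is part of the original SSE set. The intrinsics come from `<xmmintrin.h>`, a header that MSVC, GCC and Clang all provide, so the same source should build with each of them on x86.

```c
#include <stddef.h>

/* Only pull in the SSE header when the compiler is targeting a CPU with SSE.
   HAVE_SSE_INTRINSICS is a hypothetical macro used just for this sketch. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_INTRINSICS 1
#endif

/* Baseline version: plain C, runs on any CPU. */
void scale_floats_plain(float scalar, float *data, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i)
        data[i] *= scalar;
}

#ifdef HAVE_SSE_INTRINSICS
/* SSE version: multiplies four floats per loop iteration. */
void scale_floats_sse(float scalar, float *data, size_t count)
{
    const __m128 factor = _mm_set1_ps(scalar);  /* scalar copied into all 4 lanes */
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);      /* unaligned load of 4 floats */
        v = _mm_mul_ps(v, factor);
        _mm_storeu_ps(data + i, v);             /* unaligned store back in place */
    }
    for (; i < count; ++i)                      /* leftover 0-3 elements */
        data[i] *= scalar;
}
#endif
```

Each routine handles the entire array, so the cost of one indirect call is spread over `count` elements rather than being paid for every multiply.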

Here's an example using function pointers:

typedef int (*scale_func_ptr)( int scalar, int* pData, int count);

int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs to be done, without SSE so it'll work on older CPUs

    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE

    return 0;
}

// at initialization (useSSE is assumed to have been set by a CPU feature check)

scale_func_ptr scale_func = non_sse_scale;

if (useSSE) {
    scale_func = sse_scale;
}

// now, when you want to do the work:

scale_func( 12, theData_ptr, 512);  // this will call the routine that's tailored to SSE
                                    // if the CPU supports it, otherwise it calls the non-SSE
                                    // version of the function
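The `useSSE` flag has to come from somewhere. One possible way to set it, sketched here on the assumption of an x86 target built with MSVC, GCC or Clang, is to query CPUID once at start-up. The helper name `cpu_has_sse2` is invented for this illustration, and the fallback branch simply reports "no SSE2" on any platform the sketch does not recognise.

```c
/* Hypothetical helper: returns 1 if the running CPU reports SSE2, else 0. */
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))

#include <intrin.h>

static int cpu_has_sse2(void)
{
    int regs[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);            /* leaf 1: processor feature bits */
    return (regs[3] >> 26) & 1;  /* EDX bit 26 = SSE2 (bit 25 = SSE) */
}

#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))

#include <cpuid.h>

static int cpu_has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                /* CPUID leaf 1 not available */
    return (int)((edx >> 26) & 1u);
}

#else

/* Unknown compiler or non-x86 target: stay on the non-SSE path. */
static int cpu_has_sse2(void)
{
    return 0;
}

#endif
```

At initialization this could be used as `int useSSE = cpu_has_sse2();` before the function pointer is chosen. GCC and Clang also offer `__builtin_cpu_supports("sse2")`, which would replace the middle branch if MSVC support is not needed.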
Michael Burr answered Oct 29 '22 05:10