I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster (~20%) than the Java version. However, it's still not fast enough.
Unfortunately, my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot-spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
    // Load the 4 GF elements from the message fragment and add the log of the coefficient to them.
    __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
    __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
    __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

    // Unpack the 4 log-sums, exponentiate each through the lookup table,
    // and XOR the packed results into the encoded fragment.
    int* logSumArray = (int*)(&logSumVector);
    __m128i valuesToXor = _mm_set_epi32(expTable[logSumArray[3]], expTable[logSumArray[2]],
                                        expTable[logSumArray[1]], expTable[logSumArray[0]]);
    __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
    _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

    encodedFragmentResultArray++;
    currentMessageFragmentPtr++;
}
SSE stands for Streaming SIMD Extensions, where SIMD = Single Instruction, Multiple Data. It is useful for performing a single arithmetic or logical operation on many values at once, as is typically done for matrix or vector math.
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
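For instance, here is a minimal, self-contained illustration (my own example, not code from the question) of what a SIMD intrinsic buys you: a single _mm_add_epi32 adds four 32-bit integers in one instruction. Compile with gcc -msse2:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <stdio.h>

    int main(void) {
        // Pack four 32-bit integers into each 128-bit register.
        // Note: _mm_set_epi32 takes its arguments from the highest lane down.
        __m128i a = _mm_set_epi32(4, 3, 2, 1);
        __m128i b = _mm_set_epi32(40, 30, 20, 10);

        // One instruction adds all four lanes at once.
        __m128i sum = _mm_add_epi32(a, b);

        int out[4];
        _mm_storeu_si128((__m128i*)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // prints: 11 22 33 44
        return 0;
    }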
Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will, in your case, probably be implemented as a series of unpacklo/hi instructions.
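To make that concrete, here is a rough sketch (my own illustration, not the compiler's exact output) of the kind of sequence _mm_set_epi32 expands to: four scalar table loads, each moved into a vector register, then merged pairwise with unpack instructions:

    // Four scalar table loads, each placed in the low lane of its own register...
    __m128i v0 = _mm_cvtsi32_si128(expTable[logSumArray[0]]);
    __m128i v1 = _mm_cvtsi32_si128(expTable[logSumArray[1]]);
    __m128i v2 = _mm_cvtsi32_si128(expTable[logSumArray[2]]);
    __m128i v3 = _mm_cvtsi32_si128(expTable[logSumArray[3]]);

    // ...then merged pairwise: unpacklo interleaves the low elements.
    __m128i lo = _mm_unpacklo_epi32(v0, v1);           // [e0, e1, 0, 0]
    __m128i hi = _mm_unpacklo_epi32(v2, v3);           // [e2, e3, 0, 0]
    __m128i valuesToXor = _mm_unpacklo_epi64(lo, hi);  // [e0, e1, e2, e3]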
Most of the "work" in this loop is from packing these 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled (see the sketch below).
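For reference, the non-vectorized-but-unrolled variant I mean would look roughly like this. It reuses the question's variables and assumes a scalar coefficientLogValue (the value that was broadcast into coefficientLogValueVector) and a fragment length that is a multiple of 4; those assumptions and the variable names here are mine:

    // Scalar version: no packing needed, since the table lookups stay scalar.
    // Unrolled by 4 so loads and XORs from independent elements can overlap.
    uint32_t* msg = (uint32_t*)currentMessageFragmentPtr;
    uint32_t* res = (uint32_t*)encodedFragmentResultArray;
    size_t n = numberOfGFVectorsInFragment * 4;  // 4 GF elements per 128-bit vector

    for (size_t j = 0; j < n; j += 4) {
        res[j + 0] ^= expTable[msg[j + 0] + coefficientLogValue];
        res[j + 1] ^= expTable[msg[j + 1] + coefficientLogValue];
        res[j + 2] ^= expTable[msg[j + 2] + coefficientLogValue];
        res[j + 3] ^= expTable[msg[j + 3] + coefficientLogValue];
    }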
If you're willing to use SSE4.1, you can try this. It might be faster, it might not:
// SSE4.1 alternative to the _mm_set_epi32() gather: build valuesToXor one
// lane at a time with _mm_insert_epi32 (requires #include <smmintrin.h>).
int* logSumArray = (int*)(&logSumVector);
__m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);
I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code any chance of performing well; a sketch of the idea follows.
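Here is what that structure could look like, unrolled by two for brevity (extend to four the same way). It reuses the question's variables, assumes numberOfGFVectorsInFragment is even (remainder handling omitted), and leaves the actual instruction scheduling to the compiler:

    for (i = 0; i < numberOfGFVectorsInFragment; i += 2) {
        // Start both iterations' loads and log-sums up front so they can overlap.
        __m128i msg0 = _mm_load_si128(currentMessageFragmentPtr);
        __m128i msg1 = _mm_load_si128(currentMessageFragmentPtr + 1);
        __m128i enc0 = _mm_load_si128(encodedFragmentResultArray);
        __m128i enc1 = _mm_load_si128(encodedFragmentResultArray + 1);
        __m128i logSum0 = _mm_add_epi32(coefficientLogValueVector, msg0);
        __m128i logSum1 = _mm_add_epi32(coefficientLogValueVector, msg1);

        // Interleave the two gathers so table loads from both streams overlap.
        int* ls0 = (int*)(&logSum0);
        int* ls1 = (int*)(&logSum1);
        __m128i x0 = _mm_cvtsi32_si128(expTable[ls0[0]]);
        __m128i x1 = _mm_cvtsi32_si128(expTable[ls1[0]]);
        x0 = _mm_insert_epi32(x0, expTable[ls0[1]], 1);
        x1 = _mm_insert_epi32(x1, expTable[ls1[1]], 1);
        x0 = _mm_insert_epi32(x0, expTable[ls0[2]], 2);
        x1 = _mm_insert_epi32(x1, expTable[ls1[2]], 2);
        x0 = _mm_insert_epi32(x0, expTable[ls0[3]], 3);
        x1 = _mm_insert_epi32(x1, expTable[ls1[3]], 3);

        _mm_store_si128(encodedFragmentResultArray,     _mm_xor_si128(enc0, x0));
        _mm_store_si128(encodedFragmentResultArray + 1, _mm_xor_si128(enc1, x1));

        encodedFragmentResultArray += 2;
        currentMessageFragmentPtr += 2;
    }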
What you really need is a hardware gather instruction, like the ones coming in Intel's AVX2. But that's a few years down the road...
Maybe try http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593/. The functions with "region" in their names are supposedly fast. They don't seem to use any special instruction sets, but maybe they've been optimized in other ways...