Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can the multiplication of chars/digits be made more performant?

Tags:

performance

c

gcc

I have following code where a sum is calculated, based on a very large series.

The series char *a is a char array, which contains digits only (0..9).

I wanted to ask if there is any possibility to make the code faster. It is currently a bottle neck in a distributed computing application.

A small reproduction code. Not the actual code, and more simplified.

int top = 999999999;

char *a;
a = (char*) calloc(top+1, sizeof(char));

// ... fill a with initial values ...

for (int i=0; i<10; ++i) {
    unsigned long long int sum = 0;

    for (m = 1, k = top; m < k; ++m, --k) {
        // Here is the bottle neck!!
        sum += a[m]*a[k];
    }

    printf("%d\n", sum);

    // ... Add something at the end of a, and increase top ...
}

I have already tried following:

  1. Optimizing the code with -O3 (gcc compiler). The compiler line is now:

    gcc -c -Wall -fopenmp -Wno-unused-function -O3 -std=c99 -g0 -march=native -pipe -D_FILE_OFFSET_BITS=64 -m64 -fwhole-program -fprefetch-loop-arrays -funsafe-loop-optimizations -Wunsafe-loop-optimizations -fselective-scheduling -fselective-scheduling2 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops -fgcse-sm -fgcse-lm -fgcse-las -fmodulo-sched -fgcse-after-reload -fsee -DLIBDIVIDE_USE_SSE2 -DLIBDIVIDE_USE_SSE4_1 xxx.c -o xxx.o
    
  2. Using of GNU openMP to split the for-loop to multiple cores

    unsigned long long int halfway = (top>>1) + 1; // = top/2 + 1
    // digits is defined as top+1
    
    #pragma omp parallel // firstprivate/*shared*/(a, digits, halfway)
    for (unsigned long long int m = 1; m < halfway; ++m) {
        sum += a[m] * a[digits-m];
    }
    

    Result: Much, much faster, but requires more cores, and I still would like to make it faster.

  3. Casting a[m] to unsigned long long int before multiplication

    sum += (unsigned long long int)a[m] * a[k];
    

    Result: A small performance boost.

  4. Using a multiplication lookup table, because an array-lookup is faster than the actual multiplication.

    sum += multiply_lookup[a[m]][a[k]]; // a[m]*a[k];
    

    Result: A small performance boost.

  5. I have tried to find a mathematical solution to reduce operations, but it seems like nothing can be optimized, mathematically seen.

I have following idea for optimization:

I have read that the multiplication of floats (asm fmul) is much faster than the multiplication of integers (asm mul). Just changing int to float doesn't help -- but I think the code might become much more performant if the work is done using MMX or SSE instruction sets, or if the work is done by the FPU. Although I have some assembler knowledge, I have no knowledge about these topics.

However, if you have additional ideas how to optimize it, I am glad to hear them.

Update Some additional information:

  • The series grows by 1 element after each loop.
  • While the series grows, top gets increased.
  • When top is reaching the array limit, a will get increased by 100000 bytes using realloc().
  • Platform: Debian Linux Jessie x64, on an Intel(R) Xeon(R) CPU X3440 @ 2.53GHz

Additional off-topic question: Do you know the mathematical name of this sum, where the pairs of elements of the series are multiplied from outside to inside?

like image 787
Daniel Marschall Avatar asked Dec 13 '15 17:12

Daniel Marschall


People also ask

What is the fastest multiplication method?

The Karatsuba algorithm was the first multiplication algorithm asymptotically faster than the quadratic "grade school" algorithm. The Toom–Cook algorithm (1963) is a faster generalization of Karatsuba's method, and the Schönhage–Strassen algorithm (1971) is even faster, for sufficiently large n.

Is multiplication more efficient than division?

Multiplication is faster than division. At university I was taught that division takes six times that of multiplication. The actual timings are architecture dependent but in general multiplication will never be slower or even as slow as division.

Is floating point multiplication faster than addition?

There is probably very little difference in time between multiplication and addition. division on the other hand is still significantly slower then multiplication because of its recursive nature.


1 Answers

You can use the little-known PMADDUBSW (Multiply and Add Packed Signed and Unsigned Bytes) for this. The signed/unsigned business doesn't matter here, everything is in the interval [0 .. 9] anyway. The add is saturating, but that doesn't matter here because 9*9 is only 81. With intrinsics that's _mm_maddubs_epi16. Because the k index goes down, you have to byte-reverse it, which you can do with PSHUFB (_mm_shuffle_epi8). An annoying thing happens when the indexes "meet" in the middle, you can do that part one by one..

Here's a try, only slightly tested:

__m128i sum = _mm_setzero_si128();
int m, k;
for (m = 1, k = top - 15; m + 15 < k; m += 16, k -= 16) {
   __m128i am = _mm_loadu_si128((__m128i*)(a + m));
   __m128i ak = _mm_loadu_si128((__m128i*)(a + k));
   ak = _mm_shuffle_epi8(ak, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ,15));
   sum = _mm_add_epi16(sum, _mm_maddubs_epi16(am, ak));
}
// could use phaddw, but I do this the long way to avoid overflow slightly longer
sum = _mm_add_epi32(_mm_unpacklo_epi16(sum, _mm_setzero_si128()),
                    _mm_unpackhi_epi16(sum, _mm_setzero_si128()));
sum = _mm_hadd_epi32(sum, sum);
sum = _mm_hadd_epi32(sum, sum);
int s = _mm_cvtsi128_si32(sum);
// this is for the "tail"
k += 15;
for (; m < k; ++m, --k)
    s += a[m] * a[k];

Also I ignore overflow. You can do this for (216-1)/(2*81) = 404 iterations and still definitely have no overflow. If you need more, periodically add this to a 32bit result.

In a quick benchmark, this is about 7 times as fast as the simple way (tested with 2KB of random data on a 4770K, taking the best out of a hundred runs for each).

Using pointers as suggested by an other answer improves it further, to about 9 times as fast as the simple way. With indices there was some weird sign-extension going on.

int foobar(char* a, int top)
{
    __m128i sum = _mm_setzero_si128();

    char *m, *k;
    for (m = a + 1, k = a + top - 15; m + 15 < k; m += 16, k -= 16) {
       __m128i am = _mm_loadu_si128((__m128i*)(m));
       __m128i ak = _mm_loadu_si128((__m128i*)(k));
       ak = _mm_shuffle_epi8(ak, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15));
       sum = _mm_add_epi16(sum, _mm_maddubs_epi16(am, ak));
    }

    sum = _mm_add_epi32(_mm_unpacklo_epi16(sum, _mm_setzero_si128()),
                        _mm_unpackhi_epi16(sum, _mm_setzero_si128()));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int s = _mm_cvtsi128_si32(sum);

    k += 15;
    for (; m < k; ++m, --k)
        s += *m * *k;

    return s;
}

Split up in parts, still about 9 times as fast as the original despite the extra logic:

int foobar(char* a, int top)
{
    int s = 0;
    char *m, *k;
    for (m = a + 1, k = a + top - 15; m + 15 < k;) {
        __m128i sum = _mm_setzero_si128();
        for (int i = 0; i < 404 && m + 15 < k; m += 16, k -= 16, ++i) {
           __m128i am = _mm_loadu_si128((__m128i*)(m));
           __m128i ak = _mm_loadu_si128((__m128i*)(k));
           ak = _mm_shuffle_epi8(ak, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ,15));
           sum = _mm_add_epi16(sum, _mm_maddubs_epi16(am, ak));
        }
        sum = _mm_add_epi32(_mm_unpacklo_epi16(sum, _mm_setzero_si128()),
                            _mm_unpackhi_epi16(sum, _mm_setzero_si128()));
        sum = _mm_hadd_epi32(sum, sum);
        sum = _mm_hadd_epi32(sum, sum);
        s += _mm_cvtsi128_si32(sum);
    }

    k += 15;
    for (; m < k; ++m, --k)
        s += *m * *k;

    return s;
}
like image 88
harold Avatar answered Oct 10 '22 11:10

harold