
Fast weighted mean & variance of 10 bins

I would like to speed up a part of my code, but I don't think there is a better way to do the following calculation:

float invSum = 1.0f / float(sum);

// normalise the histogram so the bins sum to 1
for (int i = 0; i < numBins; ++i)
{
    histVec[i] *= invSum;
}

// weighted mean of the bin midpoints
for (int i = 0; i < numBins; ++i)
{
    float midPoint = (float)i*binSize + binOffset;
    float f = histVec[i];
    fmean += f * midPoint;
}

// weighted variance around that mean
for (int i = 0; i < numBins; ++i)
{
    float midPoint = (float)i*binSize + binOffset;
    float f = histVec[i];
    float diff = midPoint - fmean;
    var += f * hwk::sqr(diff);
}

numBins in the for loops is typically 10, but this bit of code is called very often (80 frames per second, at least 8 calls per frame).

I tried some SSE methods, but it only speeds this code up slightly. I think I could avoid calculating midPoint twice, but I am not sure how. Is there a better way to compute fmean and var?

Here is the SSE code:

// make hist contain a multiple of 4 valid values
    for (int i = numBins; i < ((numBins + 3) & ~3); i++) 
        hist[i] = 0;

// find sum of bins in inHist
    __m128i iSum4 = _mm_set1_epi32(0);
    for (int i = 0; i < numBins; i += 4) 
    {
        __m128i a = *((__m128i *) &inHist[i]);
        iSum4 = _mm_add_epi32(iSum4, a);
    }

    int iSum = iSum4.m128i_i32[0] + iSum4.m128i_i32[1] + iSum4.m128i_i32[2] + iSum4.m128i_i32[3];

    //float stdevB, meanB;

    if (iSum == 0)
    {
        stdev = 0.0;
        mean = 0.0;
    }
    else
    {   
        // Set histVec to normalised values in inHist
        __m128 invSum = _mm_set1_ps(1.0f / float(iSum));
        for (int i = 0; i < numBins; i += 4) 
        {
            __m128i a = *((__m128i *) &inHist[i]);
            __m128  b = _mm_cvtepi32_ps(a);
            __m128  c = _mm_mul_ps(b, invSum);
            _mm_store_ps(&histVec[i], c);
        }

        float binSize = 256.0f / (float)numBins;
        float halfBinSize = binSize * 0.5f;
        float binOffset = halfBinSize;

        __m128 binSizeMask = _mm_set1_ps(binSize);
        __m128 binOffsetMask = _mm_set1_ps(binOffset);
        __m128 fmean4 = _mm_set1_ps(0.0f);
        for (int i = 0; i < numBins; i += 4)
        {
            __m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
            __m128  idx_m128 = _mm_cvtepi32_ps(idx4);
            __m128  histVec4 = _mm_load_ps(&histVec[i]);
            __m128  midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
            fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
        }
        fmean4 = _mm_hadd_ps(fmean4, fmean4); // 01 23 01 23
        fmean4 = _mm_hadd_ps(fmean4, fmean4); // 0123 0123 0123 0123

        float fmean = fmean4.m128_f32[0]; 

        //fmean4 = _mm_set1_ps(fmean);
        __m128 var4 = _mm_set1_ps(0.0f);
        for (int i = 0; i < numBins; i+=4)
        {
            __m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
            __m128  idx_m128 = _mm_cvtepi32_ps(idx4);
            __m128  histVec4 = _mm_load_ps(&histVec[i]);
            __m128  midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
            __m128  diff4 = _mm_sub_ps(midPoint4, fmean4);
            var4 = _mm_add_ps(var4, _mm_mul_ps(histVec4, _mm_mul_ps(diff4, diff4)));
        }

        var4 = _mm_hadd_ps(var4, var4); // 01 23 01 23
        var4 = _mm_hadd_ps(var4, var4); // 0123 0123 0123 0123
        float var = var4.m128_f32[0]; 

        stdev = sqrt(var);
        mean = fmean;
    }

I might be doing something wrong, since I am not getting as much improvement as I expected. Is there something in the SSE code that might be slowing it down?

(editor's note: the SSE part of this question was originally asked as https://stackoverflow.com/questions/31837817/foor-loop-optimisation-sse-comparison, which was closed as a duplicate.)

asked Aug 05 '15 by trexgris

1 Answer

I only just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.

Keeping everything integer lets us increment the loop-counter vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)). Aki Suihkonen's answer has some transformations that should let it optimize a lot better: in particular, the auto-vectorizer can do more even without -ffast-math, and in fact it does quite well. You could do better still with intrinsics, especially by saving some vector 32-bit multiplies and shortening the dependency chain.
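As a rough scalar sketch of that idea (this is my sketch, not Aki's exact code; it folds the normalisation and both passes into one loop over the raw integer counts in inHist, and only converts to float once at the end; the E[x^2] - mean^2 form can lose precision when the values are huge relative to their spread, which is not a concern for bin indices 0..numBins-1):

long long s0 = 0, s1 = 0, s2 = 0;            // integer moments of the bin index
for (int i = 0; i < numBins; ++i)
{
    long long h = inHist[i];
    s0 += h;                                 // sum of counts
    s1 += h * i;                             // sum of count * index
    s2 += h * i * i;                         // sum of count * index^2
}
// s0 == 0 still needs the special case, as in the SSE version
float meanIdx = (float)s1 / (float)s0;
float varIdx  = (float)s2 / (float)s0 - meanIdx * meanIdx;
mean  = meanIdx * binSize + binOffset;       // midPoint = i*binSize + binOffset
stdev = binSize * sqrt(varIdx);              // a linear shift doesn't change the spread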


My old answer, based on just trying to optimize your code as written, assuming FP input:

You may be able to combine all 3 loops into one, using the algorithm @Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
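(I'm not sure exactly which algorithm that is, but the usual one-pass weighted update in the West/Welford style looks like the sketch below. Note the per-element division, which is the cost mentioned above; on the other hand it works on un-normalised weights, so the normalisation pass disappears too. Variable names are mine.)

float wSum = 0.0f, runMean = 0.0f, S = 0.0f;
for (int i = 0; i < numBins; ++i)
{
    float w = histVec[i];                        // weight; pre-normalised or not, both work
    if (w == 0.0f) continue;
    float x = (float)i * binSize + binOffset;    // bin midpoint
    wSum    += w;
    float delta = x - runMean;
    runMean += (w / wSum) * delta;               // running weighted mean
    S       += w * delta * (x - runMean);        // uses the *updated* mean
}
float fmean = runMean;
float var   = (wSum > 0.0f) ? S / wSum : 0.0f;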

Start by reading the guides at http://agner.org/optimize/. A couple of the techniques in his Optimising Assembly guide will speed up your SSE attempt (which I edited into this question for you).

  • combine your loops where possible, so you do more with the data each time it's loaded / stored.

  • multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add takes 3 cycles on recent Intel CPUs.) This won't apply for really short arrays like your case.

  • instead of int->float conversion on every iteration, use a float loop counter as well as the int loop counter (add a vector of _mm_set1_ps(4.0f) every iteration; see the sketch right after this list). _mm_set... with variable args is something to avoid in loops when possible: it takes several instructions (especially when each arg to setr has to be calculated separately).
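
As a minimal sketch of that last point, here is the mean loop from the SSE code above with a float index vector that is stepped by 4.0f, instead of rebuilding idx4 with _mm_set_epi32 and converting it every iteration (same variable names as in the question; untested):

__m128 fIdx4  = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);   // float bin indices for the first group of 4
__m128 fStep4 = _mm_set1_ps(4.0f);
__m128 fmean4 = _mm_setzero_ps();
for (int i = 0; i < numBins; i += 4)
{
    __m128 histVec4  = _mm_load_ps(&histVec[i]);
    __m128 midPoint4 = _mm_add_ps(_mm_mul_ps(fIdx4, binSizeMask), binOffsetMask);
    fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
    fIdx4  = _mm_add_ps(fIdx4, fStep4);                 // one addps instead of _mm_set_epi32 + cvtdq2ps
}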

gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math, it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies, e.g. adding up the array in 4 elements of a vector and only combining the 4 accumulators at the end.

Telling gcc that the input pointer is aligned by 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).

// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
    numBins += 3;
    numBins &= ~3;  // round up to multiple of 4.  This is just a quick hack to make the code fast and simple.
    histVec = (float*)__builtin_assume_aligned(histVec, 16);

    float invSum = 1.0f / float(sum);
    float var = 0, fmean = 0;

    for (int i = 0; i < numBins; ++i)
    {
        histVec[i] *= invSum;
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        fmean += f * midPoint;
    }

    for (int i = 0; i < numBins; ++i)
    {
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        float diff = midPoint - fmean;
//        var += f * hwk::sqr(diff);
        var += f * (diff * diff);
    }
    *variance_p = var;
    return fmean;
}

gcc generates some weird code for the 2nd loop.

        # broadcasting fmean after the 1st loop
        subss   %xmm0, %xmm2    # fmean, D.2466
        shufps  $0, %xmm2, %xmm2        # vect_cst_.16
.L5: ## top of 2nd loop 
        movdqa  %xmm3, %xmm5    # vect_vec_iv_.8, vect_vec_iv_.8
        cvtdq2ps        %xmm3, %xmm3    # vect_vec_iv_.8, vect__32.9
        movq    %rcx, %rsi      # D.2465, D.2467
        addq    $1, %rcx        #, D.2465
        mulps   %xmm1, %xmm3    # vect_cst_.11, vect__33.10
        salq    $4, %rsi        #, D.2467
        paddd   %xmm7, %xmm5    # vect_cst_.7, vect_vec_iv_.8
        addps   %xmm2, %xmm3    # vect_cst_.16, vect_diff_39.15
        mulps   %xmm3, %xmm3    # vect_diff_39.15, vect_powmult_53.17
        mulps   (%rdi,%rsi), %xmm3      # MEM[base: histVec_10, index: _107, offset: 0B], vect__41.18
        addps   %xmm3, %xmm4    # vect__41.18, vect_var_42.19
        cmpq    %rcx, %rax      # D.2465, bnd.26
        ja      .L8     #,   ### <--- This is insane.
        haddps  %xmm4, %xmm4    # vect_var_42.19, tmp160
        haddps  %xmm4, %xmm4    # tmp160, vect_var_42.21
.L2:
        movss   %xmm4, (%rdx)   # var, *variance_p_44(D)
        ret
        .p2align 4,,10
        .p2align 3
.L8:
        movdqa  %xmm5, %xmm3    # vect_vec_iv_.8, vect_vec_iv_.8
        jmp     .L5     #

So instead of just jumping back to the top every iteration, gcc decides to jump ahead to copy a register, and then unconditionally jmp back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it didn't copy xmm5->xmm3 and then xmm3->xmm5 every iteration; the conditional jump should just go to the top of the loop.

Also note the technique gcc used to get a float version of the loop counter: start with an integer vector of 1 2 3 4, and add set1_epi32(4). Use that as an input for packed int->float cvtdq2ps. On Intel hardware, that instruction runs on the FP-add port and has 3-cycle latency, the same as packed FP add. gcc would probably have done better to just add a vector of set1_ps(4.0), even though that creates a 3-cycle loop-carried dependency chain, instead of a 1-cycle vector int add with a 3-cycle convert forking off it every iteration.

small iteration count

You say this will often be used on exactly 10 bins? A specialized version for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.

With that small a problem size, you can have the FP weights just sitting there in memory, instead of re-computing them with integer->float conversion every time.

Also, 10 bins is going to mean a lot of horizontal operations relative to the amount of vertical operations, since you only have 2 and a half vectors worth of data.

If exactly 10 is really common, specialize a version for that. If under-16 is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)

You probably will want to use intrinsics for the specialized small-problem versions, rather than auto-vectorization.

Having zero-padding after the end of the useful data in your array might still be a good idea in your specialized version(s). However, you can load the last 2 floats and clear the upper 64b of a vector register with a movq instruction. (__m128i _mm_cvtsi64_si128 (__int64 a)). Cast this to __m128 and you're good to go.
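To make that concrete, here is a rough sketch of what a fixed-10-bin version could look like: precomputed midpoints, two full vector loads, and the movq-style tail load for the last two bins. The names and layout are mine and it is untested; it assumes histVec holds the already-normalised weights in 16-byte-aligned memory, SSE3 is available for haddps, and it uses the memory-operand form _mm_loadl_epi64 to get the movq behaviour.

#include <pmmintrin.h>   // SSE3 _mm_hadd_ps; pulls in the SSE/SSE2 headers too
#include <cmath>

static inline float hsum_ps(__m128 v)       // horizontal sum of the 4 lanes
{
    v = _mm_hadd_ps(v, v);
    v = _mm_hadd_ps(v, v);
    return _mm_cvtss_f32(v);
}

void stats10(const float *histVec, float binSize, float binOffset,
             float &mean, float &stdev)
{
    alignas(16) float mid[12];              // could be a static table if binSize/binOffset are fixed
    for (int i = 0; i < 12; ++i)
        mid[i] = (float)i * binSize + binOffset;

    __m128 h0 = _mm_load_ps(&histVec[0]);
    __m128 h1 = _mm_load_ps(&histVec[4]);
    // movq load of bins 8..9: the upper 64 bits are zeroed, so the padding lanes contribute nothing
    __m128 h2 = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)&histVec[8]));

    __m128 m0 = _mm_load_ps(&mid[0]);
    __m128 m1 = _mm_load_ps(&mid[4]);
    __m128 m2 = _mm_load_ps(&mid[8]);

    // weighted mean of the midpoints
    __m128 acc = _mm_mul_ps(h0, m0);
    acc = _mm_add_ps(acc, _mm_mul_ps(h1, m1));
    acc = _mm_add_ps(acc, _mm_mul_ps(h2, m2));
    float fmean = hsum_ps(acc);

    // weighted variance around that mean
    __m128 fmean4 = _mm_set1_ps(fmean);
    __m128 d0 = _mm_sub_ps(m0, fmean4);
    __m128 d1 = _mm_sub_ps(m1, fmean4);
    __m128 d2 = _mm_sub_ps(m2, fmean4);
    __m128 v  = _mm_mul_ps(h0, _mm_mul_ps(d0, d0));
    v = _mm_add_ps(v, _mm_mul_ps(h1, _mm_mul_ps(d1, d1)));
    v = _mm_add_ps(v, _mm_mul_ps(h2, _mm_mul_ps(d2, d2)));

    mean  = fmean;
    stdev = std::sqrt(hsum_ps(v));
}

Everything stays in registers except the 10 input floats and the 12-entry midpoint table, and there is no loop overhead at all.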

answered Sep 30 '22 by Peter Cordes