
INTEL SIMD: why is inplace multiplication so slow?

I have written some vector methods that do simple math either in place or into a separate destination buffer, and they all share the same penalty for the in-place variant.

The simplest can be boiled down to something like this:

void scale(float* dst, const float* src, int count, float factor)
{
    __m128 factorV = _mm_set1_ps(factor);   // broadcast the scale factor to all 4 lanes

    for(int i = 0; i < count; i+= 4)
    {
        __m128 in = _mm_load_ps(src);
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);

        dst += 4;
        src += 4;
    }
}

testing code:

for(int i = 0; i < 1000000; i++)
{
    scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}

When testing, i.e. repeatedly running this function on the SAME buffers, I found that if dst and src are the same, the speed is the same. If they are different, it's about a factor of 70 faster, with most of the cycles burned on the write (i.e. _mm_store_ps).

Interestingly, the same behaviour does not hold for addition: += works nicely, only *= is a problem.

--

This has been answered in the comments. It's denormals during artificial testing.

asked Mar 06 '23 by Eike

1 Answer

Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.

(Turns out, yes that was the problem for the OP).

Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
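
To put a number on that feedback loop, here's a minimal standalone sketch (my own illustration, not the OP's code): starting from 1.0f, multiplying by 0.5f reaches FLT_MIN (2^-126) after 126 steps and goes subnormal on the 127th.

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float x = 1.0f;
        const float factor = 0.5f;    /* any factor below 1.0 underflows eventually */

        for (int i = 1; i <= 1000; i++)
        {
            x *= factor;              /* same feedback as scaling a buffer in-place */
            if (fpclassify(x) == FP_SUBNORMAL)
            {
                printf("subnormal after %d multiplies: %g (FLT_MIN = %g)\n",
                       i, x, FLT_MIN);
                break;
            }
        }
        return 0;
    }

Assuming randomFloatAbsRange1 stays below 1.0 in magnitude (as the name suggests), only the first few hundred of the million benchmark passes work on normal numbers; the rest operate entirely on subnormal data.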

It doesn't take extra time to produce a +-Inf or NaN result, but it does for gradual underflow to subnormal, on Intel CPUs at least (which is presumably why the in-place += version stays fast: repeated addition makes the values grow rather than underflow). That's one reason -ffast-math sets DAZ/FTZ: denormals-are-zero and flush-to-zero on underflow.
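
If you want that behaviour without compiling with -ffast-math, the MXCSR bits can be set directly with the standard SSE intrinsics. A sketch (the wrapper name is mine); it affects only the calling thread, and tiny results are then no longer IEEE-conformant, which is the trade-off -ffast-math makes anyway:

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE: FTZ, flush subnormal results to zero */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE: DAZ, treat subnormal inputs as zero */

    static void enable_ftz_daz(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }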


I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.

There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's about as intrusive for the front-end and OoO exec as that sounds. It's definitely slow, though.)


Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?

Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

answered Mar 15 '23 by Peter Cordes