
Convert signed short to float in C++ SIMD

Tags: c++, simd, sse, avx2

I have an array of signed short that I want to divide by 2048 and get an array of float as a result.

I found SSE: convert short integer to float, which shows how to convert unsigned shorts to floats, but I want to handle signed shorts as well.

The code below works but only for positive shorts.

// We want to divide some signed short by 2048 and get a float.
const auto floatScale = _mm256_set1_ps(2048);

short* shortsInput = /* values from somewhere */;
float* floatsOutput = /* initialized */;

__m128i* m128iInput = (__m128i*)&shortsInput[0];

// Converts the short vectors to 2 float vectors. This works, but only for positive shorts.
__m128i m128iLow = _mm_unpacklo_epi16(m128iInput[0], _mm_setzero_si128());
__m128i m128iHigh = _mm_unpackhi_epi16(m128iInput[0], _mm_setzero_si128());
__m128 m128Low = _mm_cvtepi32_ps(m128iLow);
__m128 m128High = _mm_cvtepi32_ps(m128iHigh);

// Puts the 2 __m128 vectors into 1 __m256.
__m256 singleComplete = _mm256_castps128_ps256(m128Low);
singleComplete = _mm256_insertf128_ps(singleComplete, m128High, 1);

// Finally do the math
__m256 scaledVect = _mm256_div_ps(singleComplete, floatScale);

// and puts the result where needed.
_mm256_storeu_ps(&floatsOutput[0], scaledVect);

How can I convert my signed shorts to floats? Or maybe there's a better way to tackle this problem?


EDIT: I compared the different answers against a non-SIMD algorithm, running each 10M times over a 2048-element array on an AMD Ryzen 7 2700 at ~3.2GHz. I'm using Visual Studio 15.7.3 with mostly the default config:

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl 
/Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE"
/D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope
/arch:AVX2 /Gd /Oi /MD /openmp /FC /Fa"x64\Release\" /EHsc /nologo
/Fo"x64\Release\" /Fp"x64\Release\test.pch" /diagnostics:classic 

Note that I'm very new to SIMD and haven't used C++ in ages. Here's what I get (I ran each test separately rather than one after the other, which gave better results):

  • No SIMD: 7300ms
  • wim's answer: 2300ms
  • chtz's SSE2 answer: 1650ms
  • chtz's AVX2 answer: 2100ms

So I get a nice speedup from using SIMD, and chtz's SSE2 answer, though more verbose and harder to understand, is faster. (At least when compiled with AVX enabled, so it avoids extra register-copy instructions by using 3-operand VEX-coded instructions. On Intel CPUs, the AVX2 versions should be significantly faster than the 128-bit version.)

Here's my test code:

const int size = 2048;
const int loopSize = (int)1e7;

float* noSimd(short* shortsInput) {
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j++) {
            floatsOutput[j] = shortsInput[j] / 2048.0f;
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld noSimd\n", totalTime);

    return floatsOutput;
}

float* wimMethod(short* shortsInput) {
    const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {
            __m128i short_vec = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            __m256i int_vec = _mm256_cvtepi16_epi32(short_vec);
            __m256  singleComplete = _mm256_cvtepi32_ps(int_vec);

            // Finally do the math
            __m256 scaledVect = _mm256_mul_ps(singleComplete, floatScale);

            // and puts the result where needed.
            _mm256_storeu_ps(&floatsOutput[j], scaledVect);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld wimMethod\n", totalTime);

    return floatsOutput;
}

float* chtzMethodSSE2(short* shortsInput) {
    float* floatsOutput = new float[size];

    // Constants used in the loop below:
    const __m128i const0x8000 = _mm_set1_epi16((short)0x8000);
    const __m128i const0x4580 = _mm_set1_epi16(0x4580);
    const __m128  constFloat  = _mm_set_ps1(float((1 << 23) + (1 << 15)) / 2048.f);

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {
            // get input:
            __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            // add 0x8000 to wrap to unsigned short domain:
            val = _mm_add_epi16(val, const0x8000);
            // interleave with upper part of float(1<<23)/2048.f:
            __m128i lo = _mm_unpacklo_epi16(val, const0x4580);
            __m128i hi = _mm_unpackhi_epi16(val, const0x4580);
            // interpret as float and subtract float((1<<23) + (0x8000))/2048.f
            __m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), constFloat);
            __m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), constFloat);
            // store:
            _mm_storeu_ps(&floatsOutput[j], lo_f);
            _mm_storeu_ps(&floatsOutput[j] + 4, hi_f);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld chtzMethod\n", totalTime);

    return floatsOutput;
}

float* chtzMethodAVX2(short* shortsInput) {
    const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);
    float* floatsOutput = new float[size];

    auto startTime = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < loopSize; i++) {
        for (int j = 0; j < size; j += 8) {

            // get input:
            __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);
            // interleave with 0x0000
            __m256i val_unpacked = _mm256_cvtepu16_epi32(val);

            // 0x4580'8000
            const __m256 magic = _mm256_set1_ps(float((1 << 23) + (1 << 15)) / 2048.f);
            const __m256i magic_i = _mm256_castps_si256(magic);

            /// convert by xor-ing and subtracting magic value:
            // VPXOR avoids port5 bottlenecks on Intel CPUs before SKL
            __m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i));
            __m256 converted = _mm256_sub_ps(val_f, magic);
            // store:
            _mm256_storeu_ps(&floatsOutput[j], converted);
        }
    }

    auto stopTime = std::chrono::high_resolution_clock::now();
    long long totalTime = (stopTime - startTime).count();

    printf("%lld chtzMethod2\n", totalTime);

    return floatsOutput;
}
asked May 30 '18 by user276648


2 Answers

You can replace the standard approach of converting epi16 -> epi32 -> float and multiplying by 1.f/2048.f with manually composing a float.

This works because the divisor is a power of 2, so manually composing the float just means a different exponent.
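To see the trick in scalar form, here is a minimal standalone sketch (my illustration, not part of the answer), assuming 32-bit IEEE-754 floats: sticking the offset-binary 16-bit value into the low half of the bit pattern 0x4580'8000 produces the float 4112.0f + v/2048.0f, and subtracting 4112.0f leaves v/2048.0f.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Scalar sketch of the magic-number trick (assumes 32-bit IEEE-754 float).
float shortToFloatMagic(int16_t v) {
    const float magic = float((1 << 23) + (1 << 15)) / 2048.f;  // == 4112.0f
    const uint32_t magicBits = 0x45808000u;                     // bit pattern of 4112.0f

    // XOR-ing the zero-extended short with the magic pattern:
    //  - the low 16 bits become v + 32768 (offset binary),
    //  - the high 16 bits become 0x4580 (exponent 2^12, so one mantissa step is 1/2048),
    // so the pattern now encodes 4112.0f + v / 2048.0f.
    uint32_t bits = magicBits ^ (uint16_t)v;

    float composed;
    std::memcpy(&composed, &bits, sizeof composed);  // safe type-pun
    return composed - magic;                         // leaves v / 2048.0f
}

int main() {
    for (int v : { -32768, -1, 0, 1, 2047, 32767 })
        std::printf("%6d -> %f (expected %f)\n", v, shortToFloatMagic((int16_t)v), v / 2048.0f);
}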

Thanks to @PeterCordes, here is an optimized AVX2 version of this idea, using XOR to set the upper bytes of the 32-bit float at the same time as flipping the sign bit of the integer value. FP SUB turns those low bits of the mantissa into the correct FP value:

// get input:
__m128i val = _mm_loadu_si128((__m128i*)input);
// interleave with 0x0000
__m256i val_unpacked = _mm256_cvtepu16_epi32(val);

// 0x4580'8000
const __m256 magic = _mm256_set1_ps(float((1<<23) + (1<<15))/2048.f);
const __m256i magic_i = _mm256_castps_si256(magic);

/// convert by xor-ing and subtracting magic value:
// VPXOR avoids port5 bottlenecks on Intel CPUs before SKL
__m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i));
__m256 converted = _mm256_sub_ps(val_f, magic);
// store:
_mm256_storeu_ps(output, converted);

See it on the Godbolt compiler explorer with gcc and clang; on Skylake i7-6700k, a 2048-element loop that's hot in cache takes ~360 clock cycles, the same speed (to within measurement error) as @wim's version that does the standard sign-extend/convert/multiply (with a similar amount of loop unrolling). Tested by @PeterCordes with Linux perf. But on Ryzen, this could be significantly faster, because we avoid _mm256_cvtepi32_ps (Ryzen has 1 per 2 clock throughput for vcvtdq2ps ymm: http://agner.org/optimize/).

XOR-ing the lower half with 0x8000 is equivalent to adding/subtracting 0x8000, since overflow/carry is ignored. And coincidentally, this lets the same magic constant be used for both the XOR and the subtraction.

Strangely, gcc and clang prefer to replace the subtraction with an addition of -magic, which won't re-use the constant... They prefer add because it's commutative, but in this case there's no benefit because they're not using it with a memory operand.


Here's an SSE2 version that does the signed/unsigned flip separately from setting the high 2 bytes of a 32-bit FP bit-pattern.

We're using one _mm_add_epi16 (or _mm_xor_si128), two _mm_unpackXX_epi16 and two _mm_sub_ps for 8 values (the _mm_castsi128_ps are no-ops, and the _mm_set constants get cached in registers):

// get input:
__m128i val = _mm_loadu_si128((__m128i*)input);
// add 0x8000 to wrap to unsigned short domain:
// val = _mm_add_epi16(val, _mm_set1_epi16(0x8000));
val = _mm_xor_si128(val, _mm_set1_epi16(0x8000));  // PXOR runs on more ports, avoids competing with FP add/sub or unpack on Sandybridge/Haswell.

// interleave with upper part of float(1<<23)/2048.f:
__m128i lo = _mm_unpacklo_epi16(val, _mm_set1_epi16(0x4580));
__m128i hi = _mm_unpackhi_epi16(val, _mm_set1_epi16(0x4580));
// interpret as float and subtract float((1<<23) + (0x8000))/2048.f
__m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f));
__m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f));
// store:
_mm_storeu_ps(output, lo_f);
_mm_storeu_ps(output+4, hi_f);

Usage demonstration: https://ideone.com/b8BfJd

If your input had been unsigned short, the _mm_add_epi16 would not be necessary (and the 1<<15 in the _mm_sub_ps would have to be removed, of course). Then you'd have Marat's answer to SSE: convert short integer to float.
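For reference, here is a sketch of that unsigned variant (my adaptation of the snippet above, with input assumed to point to unsigned short data): the sign-bit flip disappears and the magic constant loses its 1<<15 term.

// get input (already in the unsigned domain, so no sign-bit flip needed):
__m128i val = _mm_loadu_si128((const __m128i*)input);
// interleave with the upper part of float(1<<23)/2048.f:
__m128i lo = _mm_unpacklo_epi16(val, _mm_set1_epi16(0x4580));
__m128i hi = _mm_unpackhi_epi16(val, _mm_set1_epi16(0x4580));
// interpret as float and subtract float(1<<23)/2048.f:
__m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), _mm_set_ps1(float(1 << 23) / 2048.f));
__m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), _mm_set_ps1(float(1 << 23) / 2048.f));
// store:
_mm_storeu_ps(output, lo_f);
_mm_storeu_ps(output + 4, hi_f);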

The signed SSE2 version above can easily be ported to AVX2 with twice as many conversions per iteration, but care has to be taken regarding the order of output elements (thanks @wim for pointing this out).
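For illustration, one way such a port could look (a sketch of my own, not code from the answer; input/output as above): the 256-bit unpacks work within 128-bit lanes, so a cross-lane shuffle is needed at the end to put the results back in order.

// load 16 shorts and flip their sign bits:
__m256i val = _mm256_xor_si256(_mm256_loadu_si256((const __m256i*)input),
                               _mm256_set1_epi16((short)0x8000));
// in-lane interleave with the upper half of the magic constant:
__m256i lo = _mm256_unpacklo_epi16(val, _mm256_set1_epi16(0x4580)); // elements 0..3, 8..11
__m256i hi = _mm256_unpackhi_epi16(val, _mm256_set1_epi16(0x4580)); // elements 4..7, 12..15
// interpret as float and subtract the magic value:
const __m256 magic = _mm256_set1_ps(float((1 << 23) + (1 << 15)) / 2048.f);
__m256 lo_f = _mm256_sub_ps(_mm256_castsi256_ps(lo), magic);
__m256 hi_f = _mm256_sub_ps(_mm256_castsi256_ps(hi), magic);
// fix the element order with cross-lane shuffles, then store:
_mm256_storeu_ps(output,     _mm256_permute2f128_ps(lo_f, hi_f, 0x20)); // elements 0..7
_mm256_storeu_ps(output + 8, _mm256_permute2f128_ps(lo_f, hi_f, 0x31)); // elements 8..15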


Also, for a pure SSE solution, one could simply use _mm_cvtpi16_ps, but that's an Intel library function. There's no single instruction that does this.

// cast input pointer:
__m64* input64 = (__m64*)input;
// convert and scale:
__m128 lo_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[0]), _mm_set_ps1(1.f/2048.f));
__m128 hi_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[1]), _mm_set_ps1(1.f/2048.f));

I did not benchmark any of these solutions (nor check theoretical throughputs or latencies).

answered Sep 28 '22 by chtz


With AVX2 it is not necessary to convert the high and low pieces separately:

const auto floatScale = _mm256_set1_ps(1.0f/2048.0f);

short* shortsInput = /* values from somewhere */;
float* floatsOutput = /* initialized */;

__m128i short_vec = _mm_loadu_si128((__m128i*)shortsInput);
__m256i int_vec =  _mm256_cvtepi16_epi32 (short_vec);
__m256  singleComplete = _mm256_cvtepi32_ps (int_vec);

// Finally do the math
__m256 scaledVect = _mm256_mul_ps(singleComplete, floatScale);

// and puts the result where needed.
_mm256_storeu_ps(floatsOutput, scaledVect);

This compiles nicely on the Godbolt compiler explorer, and with input/output hot in L1d cache and aligned input/output arrays, converts an array of 2048 elements in ~360 clock cycles on Skylake i7-6700k (tested in a repeat loop). That's ~0.18 cycles per element, or ~5.7 conversions per clock cycle. Or ~1.4 cycles per vector, including the store. It's mostly bottlenecked on front-end throughput (3.75 fused-domain uops per clock), even with clang's loop unrolling, because a conversion is 5 uops.

Note that vpmovsxwd ymm, [mem] can't micro-fuse into a single uop even with a simple addressing mode on Haswell/Skylake, so in this case it's good that recent gcc/clang transform pointer-increments into indexed addressing with a single loop counter. With most memory-source vector instructions (like vpmovsxwd xmm, [mem]), that would have cost an extra uop: Micro fusion and addressing modes.

With one load and one store, it's ok that the stores can't run on Haswell/Skylake's port7 store AGU, which only handles non-indexed addressing modes.

Loop unrolling is needed for max throughput on Intel CPUs (if there are no memory bottlenecks), because the load + convert + store is already 4 uops. Same as with @chtz's answer.
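As a concrete illustration, a manual unroll by two of the conversion loop could look like this (a sketch reusing shortsInput, floatsOutput and size from the question's test code, and assuming size is a multiple of 16):

const __m256 floatScale = _mm256_set1_ps(1.0f / 2048.0f);
for (int j = 0; j < size; j += 16) {
    __m256i i0 = _mm256_cvtepi16_epi32(_mm_loadu_si128((const __m128i*)&shortsInput[j]));
    __m256i i1 = _mm256_cvtepi16_epi32(_mm_loadu_si128((const __m128i*)&shortsInput[j + 8]));
    _mm256_storeu_ps(&floatsOutput[j],     _mm256_mul_ps(_mm256_cvtepi32_ps(i0), floatScale));
    _mm256_storeu_ps(&floatsOutput[j + 8], _mm256_mul_ps(_mm256_cvtepi32_ps(i1), floatScale));
}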

Ideally use the vector result for further computation right away, if you only need to read the float values a couple times. It's only 3 instructions (but does have some latency for out-of-order exec to hide). Redoing the conversion when needed might be better than having a bigger cache footprint to store the twice-as-large float[] result in memory; it depends on your use-case and the hardware.
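As a hypothetical example of consuming the result right away, a dot product against a float array could fold the conversion in instead of materialising a float[] (my own sketch; weights is an assumed second input, and FMA is available under /arch:AVX2):

__m256 acc = _mm256_setzero_ps();
const __m256 floatScale = _mm256_set1_ps(1.0f / 2048.0f);
for (int j = 0; j < size; j += 8) {
    __m256i int_vec = _mm256_cvtepi16_epi32(_mm_loadu_si128((const __m128i*)&shortsInput[j]));
    __m256 scaled   = _mm256_mul_ps(_mm256_cvtepi32_ps(int_vec), floatScale);
    acc = _mm256_fmadd_ps(scaled, _mm256_loadu_ps(&weights[j]), acc);  // use the floats immediately, no store
}
// (a horizontal sum of acc then gives the dot product)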

answered Sep 28 '22 by wim