Why is _mm_set_epi16 sometimes faster than _mm_load_si128?

I've understood it's best to avoid _mm_set_epi*, and instead rely on _mm_load_si128 (or even _mm_loadu_si128 with a small performance hit if the data is not aligned). However, the impact this has on performance seems inconsistent to me. The following is a good example.

Consider the two following functions that utilize SSE intrinsics:

#include <cstdint>
#include <immintrin.h>  // _mm_clmulepi64_si128, _mm_set_epi16, _mm_extract_epi32

static uint32_t clmul_load(uint16_t x, uint16_t y)
{
    // Note: this reads 16 bytes starting at a 2-byte object, and &x is not
    // guaranteed to be 16-byte aligned as _mm_load_si128 requires; it happens
    // to work here, but it is formally undefined behavior.
    const __m128i c = _mm_clmulepi64_si128(
      _mm_load_si128((__m128i const*)(&x)),
      _mm_load_si128((__m128i const*)(&y)), 0);

    return _mm_extract_epi32(c, 0);
}

static uint32_t clmul_set(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
      _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, x),
      _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, y), 0);

    return _mm_extract_epi32(c, 0);
}

The following function benchmarks the performance of the two:

#include <chrono>
#include <cstdint>
#include <ctime>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

template <typename F>
void benchmark(int t, F f)
{
    std::mt19937 rng(static_cast<unsigned int>(std::time(0)));
    std::uniform_int_distribution<uint32_t> uint_dist10(
      0, std::numeric_limits<uint32_t>::max());

    std::vector<uint32_t> vec(t);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < t; ++i)
    {
        // Each uint32_t from the distribution is truncated to uint16_t here.
        vec[i] = f(uint_dist10(rng), uint_dist10(rng));
    }

    auto duration = std::chrono::duration_cast<
      std::chrono::milliseconds>(
      std::chrono::high_resolution_clock::now() -
      start);

    std::cout << (duration.count() / 1000.0) << " seconds.\n";
}

Finally, the following main program does some testing:

int main()
{
    const int N = 10000000; 
    benchmark(N, clmul_load);
    benchmark(N, clmul_set);
}

On an i7 Haswell with MSVC 2013, a typical output is

0.208 seconds.  // _mm_load_si128
0.129 seconds.  // _mm_set_epi16

Using GCC with parameters -O3 -std=c++11 -march=native (with slightly older hardware), a typical output is

0.312 seconds.  // _mm_load_si128
0.262 seconds.  // _mm_set_epi16

What explains this? Are there actually cases where _mm_set_epi* is preferable to _mm_load_si128? There are other times where I've noticed _mm_load_si128 performing better, but I can't really characterize those observations.

asked Nov 16 '25 by Gideon

2 Answers

Your compiler is optimizing away the "gather" behavior of your _mm_set_epi16() call since it really isn't needed. From g++ 4.8 (-O3) and gdb:

(gdb) disas clmul_load
Dump of assembler code for function clmul_load(uint16_t, uint16_t):
   0x0000000000400b80 <+0>:     mov    %di,-0xc(%rsp)
   0x0000000000400b85 <+5>:     mov    %si,-0x10(%rsp)
   0x0000000000400b8a <+10>:    vmovdqu -0xc(%rsp),%xmm0
   0x0000000000400b90 <+16>:    vmovdqu -0x10(%rsp),%xmm1
   0x0000000000400b96 <+22>:    vpclmullqlqdq %xmm1,%xmm0,%xmm0
   0x0000000000400b9c <+28>:    vmovd  %xmm0,%eax
   0x0000000000400ba0 <+32>:    retq
End of assembler dump.

(gdb) disas clmul_set
Dump of assembler code for function clmul_set(uint16_t, uint16_t):
   0x0000000000400bb0 <+0>:     vpxor  %xmm0,%xmm0,%xmm0
   0x0000000000400bb4 <+4>:     vpxor  %xmm1,%xmm1,%xmm1
   0x0000000000400bb8 <+8>:     vpinsrw $0x0,%edi,%xmm0,%xmm0
   0x0000000000400bbd <+13>:    vpinsrw $0x0,%esi,%xmm1,%xmm1
   0x0000000000400bc2 <+18>:    vpclmullqlqdq %xmm1,%xmm0,%xmm0
   0x0000000000400bc8 <+24>:    vmovd  %xmm0,%eax
   0x0000000000400bcc <+28>:    retq
End of assembler dump.

The vpinsrw (insert word) path is slightly faster than the unaligned double-quadword moves in clmul_load, likely because of what the load version actually does: each 2-byte argument is spilled to the stack with a mov and then reloaded as a full 16 bytes with vmovdqu, and store-to-load forwarding fails when the reload is wider than the store it overlaps. vpinsrw, by contrast, inserts the value straight from a general-purpose register. If you were doing more arbitrary loads, this advantage would go away.
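As an aside, here is a sketch of a third variant (mine, not part of the original benchmark; the name clmul_cvt is made up): under the x86-64 calling convention both arguments already arrive in general-purpose registers, so _mm_cvtsi32_si128 can move each one into a vector register with a single vmovd, avoiding both the stack round-trip of clmul_load and the vpxor/vpinsrw pair of clmul_set.

static uint32_t clmul_cvt(uint16_t x, uint16_t y)
{
    // _mm_cvtsi32_si128 zero-extends a 32-bit value into the low lane of an
    // XMM register (a single vmovd); no stack spill, no vpxor + vpinsrw.
    const __m128i c = _mm_clmulepi64_si128(
      _mm_cvtsi32_si128(x),
      _mm_cvtsi32_si128(y), 0);

    return _mm_extract_epi32(c, 0);
}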

answered Nov 18 '25 by Jeff


The slowness of _mm_set_epi* comes from the need to scrape together various variables into a single vector. You'd have to examine the generated assembly to be certain, but my guess is that since most of the arguments to your _mm_set_epi16 calls are constants (and zeroes, at that), GCC is generating a fairly short and fast set of instructions for the intrinsic.
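For illustration (a sketch of mine, not from the answer): when all eight lanes are runtime values, the "scraping together" is unavoidable and compilers typically emit a chain of word inserts, or stores followed by a load; with seven compile-time zeros, as in the question, it collapses to a vpxor plus a single vpinsrw, exactly as the disassembly in the other answer shows.

#include <cstdint>
#include <emmintrin.h>

// All eight lanes are runtime values: a genuine gather.
static __m128i gather_all(const uint16_t v[8])
{
    // _mm_set_epi16 lists its lanes from most significant to least significant.
    return _mm_set_epi16(v[7], v[6], v[5], v[4], v[3], v[2], v[1], v[0]);
}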

answered Nov 18 '25 by Sneftel


