I've understood it's best to avoid _mm_set_epi*, and instead rely on _mm_load_si128 (or even _mm_loadu_si128 with a small performance hit if the data is not aligned). However, the impact this has on performance seems inconsistent to me. The following is a good example.
Consider the following two functions, which use SSE intrinsics:
#include <immintrin.h> // _mm_clmulepi64_si128, _mm_set_epi16, _mm_load_si128, _mm_extract_epi32
#include <cstdint>

static uint32_t clmul_load(uint16_t x, uint16_t y)
{
// note: this reads a full 16 bytes starting at &x, which is neither 16 bytes wide nor guaranteed to be 16-byte aligned
const __m128i c = _mm_clmulepi64_si128(
_mm_load_si128((__m128i const*)(&x)),
_mm_load_si128((__m128i const*)(&y)), 0);
return _mm_extract_epi32(c, 0);
}
static uint32_t clmul_set(uint16_t x, uint16_t y)
{
const __m128i c = _mm_clmulepi64_si128(
_mm_set_epi16(0, 0, 0, 0, 0, 0, 0, x),
_mm_set_epi16(0, 0, 0, 0, 0, 0, 0, y), 0);
return _mm_extract_epi32(c, 0);
}
The following function benchmarks the performance of the two:
#include <chrono>
#include <ctime>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

template <typename F>
void benchmark(int t, F f)
{
std::mt19937 rng(static_cast<unsigned int>(std::time(0)));
std::uniform_int_distribution<uint32_t> uint_dist10(
0, std::numeric_limits<uint32_t>::max());
std::vector<uint32_t> vec(t);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < t; ++i)
{
vec[i] = f(uint_dist10(rng), uint_dist10(rng));
}
auto duration = std::chrono::duration_cast<
std::chrono::milliseconds>(
std::chrono::high_resolution_clock::now() -
start);
std::cout << (duration.count() / 1000.0) << " seconds.\n";
}
Finally, the following main program does some testing:
int main()
{
const int N = 10000000;
benchmark(N, clmul_load);
benchmark(N, clmul_set);
}
On an i7 Haswell with MSVC 2013, a typical output is
0.208 seconds. // _mm_load_si128
0.129 seconds. // _mm_set_epi16
Using GCC with -O3 -std=c++11 -march=native (on slightly older hardware), a typical output is
0.312 seconds. // _mm_load_si128
0.262 seconds. // _mm_set_epi16
What explains this? Are there actually cases where _mm_set_epi* is preferable to _mm_load_si128? There are other cases where I've noticed _mm_load_si128 performing better, but I can't really characterize those observations.
Your compiler is optimizing away the "gather" behavior of your _mm_set_epi16() call since it really isn't needed. From g++ 4.8 (-O3) and gdb:
(gdb) disas clmul_load
Dump of assembler code for function clmul_load(uint16_t, uint16_t):
0x0000000000400b80 <+0>: mov %di,-0xc(%rsp)
0x0000000000400b85 <+5>: mov %si,-0x10(%rsp)
0x0000000000400b8a <+10>: vmovdqu -0xc(%rsp),%xmm0
0x0000000000400b90 <+16>: vmovdqu -0x10(%rsp),%xmm1
0x0000000000400b96 <+22>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400b9c <+28>: vmovd %xmm0,%eax
0x0000000000400ba0 <+32>: retq
End of assembler dump.
(gdb) disas clmul_set
Dump of assembler code for function clmul_set(uint16_t, uint16_t):
0x0000000000400bb0 <+0>: vpxor %xmm0,%xmm0,%xmm0
0x0000000000400bb4 <+4>: vpxor %xmm1,%xmm1,%xmm1
0x0000000000400bb8 <+8>: vpinsrw $0x0,%edi,%xmm0,%xmm0
0x0000000000400bbd <+13>: vpinsrw $0x0,%esi,%xmm1,%xmm1
0x0000000000400bc2 <+18>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400bc8 <+24>: vmovd %xmm0,%eax
0x0000000000400bcc <+28>: retq
End of assembler dump.
The vpinsrw (insert word) path is ever-so-slightly faster than what clmul_load does: each 16-bit argument is stored to the stack and then immediately reloaded with an unaligned double-quadword move, and the load/store unit likely copes better with the small register-to-vector inserts than with a wide reload of data it has just stored. If you were doing more arbitrary loads of data that already lives in memory, this advantage would obviously go away.
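For contrast, here is a minimal sketch (the buffer name and contents are purely illustrative) of the case where the data already lives in properly aligned memory; there _mm_load_si128 is the natural choice, is well-defined, and compiles to a single aligned 16-byte load with no stack round-trip:
#include <immintrin.h>
#include <cstdint>

alignas(16) static const uint16_t coeffs[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

static __m128i load_coeffs()
{
// the array is 16-byte aligned and exactly 16 bytes wide, so the aligned load is legal and becomes a single movdqa/vmovdqa
return _mm_load_si128(reinterpret_cast<const __m128i*>(coeffs));
}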
In general, the cost of _mm_set_epi* comes from the need to scrape together several separate variables into a single vector. Here, though, most of the arguments to your _mm_set_epi16 calls are constants (and zeroes, at that), so GCC generates the short, fast vpxor-plus-vpinsrw sequence shown above instead of a full gather.
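If the goal is simply to get a 16-bit scalar from a general-purpose register into the low lane of a vector, a third variant is worth sketching (assuming SSE2 plus PCLMUL, the same requirements as above): _mm_cvtsi32_si128 moves the value directly and zeroes the rest of the register, typically compiling to a single vmovd, so it avoids both the stack round-trip of clmul_load and the vpxor/vpinsrw pair of clmul_set:
static uint32_t clmul_cvt(uint16_t x, uint16_t y)
{
// vmovd places the zero-extended scalar in bits [31:0] and clears the upper 96 bits,
// which is exactly the operand shape the selector-0 carry-less multiply needs here
const __m128i c = _mm_clmulepi64_si128(
_mm_cvtsi32_si128(x),
_mm_cvtsi32_si128(y), 0);
return _mm_extract_epi32(c, 0);
}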