Proper way to enable SSE4 on a per-function / per-block of code basis?

For one of my OS X programs, I have a few optimized cases which use SSE4.1 instructions. On SSE3-only machines, the non-optimized branch is run:

// SupportsSSE4_1 returns true on CPUs that support SSE4.1, false otherwise
if (SupportsSSE4_1()) {

    // Code that uses _mm_dp_ps, an SSE4 instruction

    ...

    __m128 hDelta   = _mm_sub_ps(here128, right128);
    __m128 vDelta   = _mm_sub_ps(here128, down128);

    hDelta = _mm_sqrt_ss(_mm_dp_ps(hDelta, hDelta, 0x71));
    vDelta = _mm_sqrt_ss(_mm_dp_ps(vDelta, vDelta, 0x71));

    ...

} else {
    // Equivalent code that uses SSE3 instructions
    ...
}

In order to get the above to compile, I had to set CLANG_X86_VECTOR_INSTRUCTIONS to sse4.1.

However, this seems to instruct clang that it's ok to use the ROUNDSD instruction anywhere in my program. Hence, the program is crashing on SSE3-only machines with SIGILL: ILL_ILLOPC.

What's the best practice for enabling SSE4.1 for just the code inside the true branch of the SupportsSSE4_1() if block?

There is currently no way to target different ISA extensions at block / function granularity in clang. You can only do it at file granularity (put your SSE4.1 code into a separate file and specify that file to use -msse4.1). If this is an important feature for you, please file a bug report to request it!
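For reference, here is a minimal sketch of the dispatch pattern. It rests on an assumption about the toolchain: GCC and later clang releases accept `__attribute__((target("sse4.1")))` on a single function, which confines SSE4.1 code generation to that function. With the file-granularity approach described above, the same function would instead live in its own file compiled with -msse4.1, and the attribute would be dropped. The names `Dot3_SSE41`, `Dot3_Fallback`, and `Dot3` are hypothetical, and `__builtin_cpu_supports` stands in for the question's SupportsSSE4_1().

```c
#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_dp_ps

// Hypothetical SSE4.1 path. The target attribute is an assumption about the
// compiler (GCC or a recent clang): it permits SSE4.1 in this function only.
__attribute__((target("sse4.1")))
static float Dot3_SSE41(const float a[4], const float b[4]) {
    // Mask 0x71: multiply lanes 0..2, sum them, store the sum in lane 0.
    __m128 dp = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0x71);
    return _mm_cvtss_f32(dp);
}
#endif

// Baseline path that is safe on every CPU.
static float Dot3_Fallback(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Dispatcher: plays the role of the if (SupportsSSE4_1()) branch.
float Dot3(const float a[4], const float b[4]) {
#if defined(__x86_64__) || defined(__i386__)
    // __builtin_cpu_supports is a GCC/clang builtin, used here in place of
    // the question's SupportsSSE4_1().
    if (__builtin_cpu_supports("sse4.1")) {
        return Dot3_SSE41(a, b);
    }
#endif
    return Dot3_Fallback(a, b);
}
```

Because the rest of the translation unit is still compiled for the baseline instruction set, the compiler has no license to emit ROUNDSD or other SSE4.1 instructions outside `Dot3_SSE41`.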

However, I should note that the actual benefit of DPPS is pretty small in most real scenarios (and using DPPS even slows down some code sequences!). Unless this particular code sequence is critical, and you have carefully measured the effect of using DPPS, it may not be worth the hassle to special-case for SSE4.1 even if that compiler feature is available.
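To make the comparison concrete, here is one possible SSE3-only stand-in for `_mm_dp_ps(a, b, 0x71)`, built from a multiply and two horizontal adds. This is a sketch, not the code elided in the question; `Dot3_SSE3` and `Dot3_Scalar` are hypothetical names, and the per-function target attribute is assumed to be supported by the compiler. Whether it is faster or slower than DPPS depends on the CPU and the surrounding code, so it should be measured rather than assumed.

```c
// Scalar reference, handy for checking the vector version.
float Dot3_Scalar(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

#if defined(__x86_64__) || defined(__i386__)
#include <pmmintrin.h>  // SSE3 intrinsics, including _mm_hadd_ps

// Three-element dot product using only SSE3 and earlier instructions.
__attribute__((target("sse3")))
float Dot3_SSE3(const float a[4], const float b[4]) {
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // Zero lane 3 so it does not contribute, mirroring the 0x7 in mask 0x71.
    const __m128 keep012 = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    prod = _mm_and_ps(prod, keep012);
    __m128 sum = _mm_hadd_ps(prod, prod);  // (p0+p1, p2+p3, p0+p1, p2+p3)
    sum = _mm_hadd_ps(sum, sum);           // every lane now holds p0+p1+p2+p3
    return _mm_cvtss_f32(sum);             // read lane 0
}
#endif
```

Unlike DPPS with mask 0x71, this leaves the sum in all four lanes rather than zeroing lanes 1 to 3; that makes no difference when only lane 0 is read, as with `_mm_sqrt_ss` in the question.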

Tags: xcode, llvm, clang, sse, iccir

Answer by Stephen Canon
