I have this C:
#include <stddef.h>
size_t findChar(unsigned int length, char* __attribute__((aligned(16))) restrict string) {
for (size_t i = 0; i < length; i += 2) {
if (string[i] == '[' || string[i] == ' ') {
return i;
}
}
return -1;
}
It checks every other character of a string and returns the first index of the string that is [
or
. With x86-64 GCC 10.2 -O3 -march=skylake -mtune=skylake
, this is the assembly output:
findChar:
mov edi, edi
test rdi, rdi
je .L4
xor eax, eax
.L3:
movzx edx, BYTE PTR [rsi+rax]
cmp dl, 91
je .L1
cmp dl, 32
je .L1
add rax, 2
cmp rax, rdi
jb .L3
.L4:
mov rax, -1
.L1:
ret
It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?
How do I write my code to signal to the compiler that this code can be optimized?
Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73
Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1
This is the version of the code modified so that the function would only return every 100 characters: https://godbolt.org/z/h8MjbP1cf
Each element is a u8 , and so that means that each slice would be 128 bits of data. Using SIMD, we could put both a and b into 128 bit registers, add them together in a single instruction, and then copy the resulting 128 bits into c . That'd be much faster!
One approach to leverage vector hardware are SIMD intrinsics, available in all modern C or C++ compilers. SIMD stands for “single Instruction, multiple data”.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations.
Following are the Optimization: 1. O1: Optimizing compilation at O1 includes more time and memory to break down larger functions. The compiler makes an attempt to reduce both code and execution time. At O1 hardly any optimizations produce great results, but O1 is a setback for an attempt for better optimizations.
Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem: Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best.
If codes are converted into executable programs with a compiler, and the compiler itself is also an executable program, how was the source code of the first compiler turned into an executable program?
The compiler makes an attempt to reduce both code and execution time. At O1 hardly any optimizations produce great results, but O1 is a setback for an attempt for better optimizations. Below is the implementation of previous program with O1 optimization: << " secs."; Execution time: 0.384945 secs. 2.
I don’t know how to convince compiler to emit good auto-vectorized code for that. But I know how to vectorize manually. Since you’re compiling for Skylake, here’s AVX2 version of your function. Untested.
#include <stddef.h>
#include <immintrin.h>
ptrdiff_t findCharAvx2( size_t length, const char* str )
{
const __m256i andMask = _mm256_set1_epi16( 0xFF );
const __m256i search1 = _mm256_set1_epi16( '[' );
const __m256i search2 = _mm256_set1_epi16( ' ' );
const char* const ptrStart = str;
const char* const ptrEnd = str + length;
const char* const ptrEndAligned = str + ( length / 32 ) * 32;
for( ; str < ptrEndAligned; str += 32 )
{
// Load 32 bytes, zero out half of them
__m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
vec = _mm256_and_si256( andMask, vec );
// Compare 16-bit lanes for equality, combine with OR
const __m256i cmp1 = _mm256_cmpeq_epi16( vec, search1 );
const __m256i cmp2 = _mm256_cmpeq_epi16( vec, search2 );
const __m256i any = _mm256_or_si256( cmp1, cmp2 );
const int mask = _mm256_movemask_epi8( any );
// If neither character is found, mask will be 0.
// Otherwise, the least significant set bit = index of the first matching byte in `any` vector
#ifdef _MSC_VER
unsigned long bitIndex;
// That's how actual instruction works, it returns 2 things at once, flag and index
if( 0 == _BitScanForward( &bitIndex, (unsigned long)mask ) )
continue;
#else
if( 0 == mask )
continue;
const int bitIndex = __builtin_ctz( mask );
#endif
return ( str - ptrStart ) + bitIndex;
}
// Handle the remainder
for( ; str < ptrEnd; str += 2 )
{
const char c = *str;
if( c == '[' || c == ' ' )
return str - ptrStart;
}
return -1;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With