How can I get the compiler to output faster code for a string search loop, using SIMD vectorization and/or parallelization?

I have this C:

#include <stddef.h>
size_t findChar(unsigned int length, char*  __attribute__((aligned(16))) restrict string) {
    for (size_t i = 0; i < length; i += 2) {
        if (string[i] == '[' || string[i] == ' ') {
            return i;
        }
    }
    return -1;
}

It checks every other character of a string and returns the first index of the string that is '[' or ' '. With x86-64 GCC 10.2 at -O3 -march=skylake -mtune=skylake, this is the assembly output:

findChar:
        mov     edi, edi
        test    rdi, rdi
        je      .L4
        xor     eax, eax
.L3:
        movzx   edx, BYTE PTR [rsi+rax]
        cmp     dl, 91
        je      .L1
        cmp     dl, 32
        je      .L1
        add     rax, 2
        cmp     rax, rdi
        jb      .L3
.L4:
        mov     rax, -1
.L1:
        ret

It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?

How do I write my code to signal to the compiler that this code can be optimized?

Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73

Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1

Here is a version of the code modified so that the function can only return every 100 characters (a rough sketch of that idea follows below): https://godbolt.org/z/h8MjbP1cf
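Roughly, the idea behind that modification is something like this (a sketch only, not necessarily the exact code at the link; the block size of 100 and the name findCharBlocked are just placeholders). The inner loop accumulates matches without an early exit, so the function only branches once per block:

#include <stddef.h>

size_t findCharBlocked(size_t length, const char *string) {
    size_t i = 0;
    for (; i + 100 <= length; i += 100) {
        unsigned found = 0;
        // No early exit inside the block; matches are OR-ed into a flag
        for (size_t j = 0; j < 100; j += 2)
            found |= (string[i + j] == '[') | (string[i + j] == ' ');
        if (found)
            break;   // at most one taken branch per 100-character block
    }
    // Find the exact index within the matching block, or scan the tail
    for (; i < length; i += 2) {
        if (string[i] == '[' || string[i] == ' ')
            return i;
    }
    return (size_t)-1;
}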

asked Apr 05 '21 by noɥʇʎԀʎzɐɹƆ



1 Answer

I don’t know how to convince the compiler to emit good auto-vectorized code for that, but I know how to vectorize manually. Since you’re compiling for Skylake, here’s an AVX2 version of your function. Untested.

#include <stddef.h>
#include <immintrin.h>
#ifdef _MSC_VER
#include <intrin.h>   // for _BitScanForward
#endif

ptrdiff_t findCharAvx2( size_t length, const char* str )
{
    const __m256i andMask = _mm256_set1_epi16( 0xFF );
    const __m256i search1 = _mm256_set1_epi16( '[' );
    const __m256i search2 = _mm256_set1_epi16( ' ' );

    const char* const ptrStart = str;
    const char* const ptrEnd = str + length;
    const char* const ptrEndAligned = str + ( length / 32 ) * 32;
    for( ; str < ptrEndAligned; str += 32 )
    {
        // Load 32 bytes, zero out half of them
        __m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
        vec = _mm256_and_si256( andMask, vec );

        // Compare 16-bit lanes for equality, combine with OR
        const __m256i cmp1 = _mm256_cmpeq_epi16( vec, search1 );
        const __m256i cmp2 = _mm256_cmpeq_epi16( vec, search2 );
        const __m256i any = _mm256_or_si256( cmp1, cmp2 );
        const int mask = _mm256_movemask_epi8( any );

        // If neither character is found, mask will be 0.
        // Otherwise, the least significant set bit = index of the first matching byte in `any` vector
#ifdef _MSC_VER
        unsigned long bitIndex;
        // That's how actual instruction works, it returns 2 things at once, flag and index
        if( 0 == _BitScanForward( &bitIndex, (unsigned long)mask ) )
            continue;
#else
        if( 0 == mask )
            continue;
        const int bitIndex = __builtin_ctz( mask );
#endif
        return ( str - ptrStart ) + bitIndex;
    }

    // Handle the remainder
    for( ; str < ptrEnd; str += 2 )
    {
        const char c = *str;
        if( c == '[' || c == ' ' )
            return str - ptrStart;
    }
    return -1;
}
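Not part of the original answer, but here is a hypothetical driver showing how the function might be called (the string and variable names are made up for illustration; it assumes findCharAvx2 from above is in the same file, compiled with -mavx2 or -march=skylake):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // The first '[' or ' ' at an even index in this string is the space at index 36
    const char haystack[] = "abcdefghijklmnopqrstuvwxyz0123456789 trailing [bracket]";
    const ptrdiff_t idx = findCharAvx2(strlen(haystack), haystack);
    if (idx >= 0)
        printf("first match at even index %td\n", idx);
    else
        printf("no match\n");
    return 0;
}

Note the _mm256_set1_epi16(0xFF) mask: it zeroes the odd byte of every 16-bit lane, so the 16-bit compares can only match characters at even indices, mirroring the i += 2 stride of the original loop.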
answered Nov 03 '22 by Soonts