How can I get the compiler to output faster code for a string search loop, using SIMD vectorization and/or parallelization?

Tags: c, vectorization, compiler-optimization, assembly, simd

I have this C code:

#include <stddef.h>
size_t findChar(unsigned int length, char*  __attribute__((aligned(16))) restrict string) {
    for (size_t i = 0; i < length; i += 2) {
        if (string[i] == '[' || string[i] == ' ') {
            return i;
        }
    }
    return -1;
}

It checks every other character of a string and returns the index of the first one that is '[' or ' ' (a space). With x86-64 GCC 10.2 and -O3 -march=skylake -mtune=skylake, this is the assembly output:

findChar:
        mov     edi, edi
        test    rdi, rdi
        je      .L4
        xor     eax, eax
.L3:
        movzx   edx, BYTE PTR [rsi+rax]
        cmp     dl, 91
        je      .L1
        cmp     dl, 32
        je      .L1
        add     rax, 2
        cmp     rax, rdi
        jb      .L3
.L4:
        mov     rax, -1
.L1:
        ret

It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?

How do I write my code to signal to the compiler that this code can be optimized?

Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73

Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1

This is the version of the code modified so that the function would only return every 100 characters: https://godbolt.org/z/h8MjbP1cf
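The idea behind that last variant, testing for a hit once per block instead of once per character, can be sketched roughly as below. This is a minimal sketch and not the code behind the Godbolt link: findCharBlocked and BLOCK are hypothetical names, the block size of 64 is an arbitrary assumption, and whether a given compiler actually vectorizes the inner loop is not guaranteed. The inner loop has a fixed trip count and no early exit, which is the loop shape auto-vectorizers generally handle best.

```c
#include <stddef.h>

#define BLOCK 64 /* hypothetical block size; must be even to keep the stride-2 parity */

/* Returns the first even index i < length with s[i] == '[' or ' ',
   or (size_t)-1 if there is none. */
size_t findCharBlocked(size_t length, const char *s) {
    size_t base = 0;

    /* Full blocks: accumulate a flag, with no early exit inside the inner loop. */
    for (; base + BLOCK <= length; base += BLOCK) {
        unsigned found = 0;
        for (size_t j = 0; j < BLOCK; j += 2) {
            const char c = s[base + j];
            found |= (unsigned)(c == '[') | (unsigned)(c == ' ');
        }
        if (found)
            break; /* a match lies somewhere in this block */
    }

    /* Scalar scan: pins down the exact index and also covers the tail. */
    for (size_t i = base; i < length; i += 2) {
        if (s[i] == '[' || s[i] == ' ')
            return i;
    }
    return (size_t)-1;
}
```

The block containing the match is scanned twice, once by the flag loop and once by the scalar loop, which is the price paid for keeping the hot loop free of data-dependent exits.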


asked Apr 05 '21 20:04

noɥʇʎԀʎzɐɹƆ

1 Answer

I don’t know how to convince the compiler to emit good auto-vectorized code for that, but I do know how to vectorize it manually. Since you’re compiling for Skylake, here’s an AVX2 version of your function. Untested.

#include <stddef.h>
#include <immintrin.h>
#ifdef _MSC_VER
#include <intrin.h> // _BitScanForward
#endif

ptrdiff_t findCharAvx2( size_t length, const char* str )
{
    const __m256i andMask = _mm256_set1_epi16( 0xFF );
    const __m256i search1 = _mm256_set1_epi16( '[' );
    const __m256i search2 = _mm256_set1_epi16( ' ' );

    const char* const ptrStart = str;
    const char* const ptrEnd = str + length;
    const char* const ptrEndAligned = str + ( length / 32 ) * 32;
    for( ; str < ptrEndAligned; str += 32 )
    {
        // Load 32 bytes, zero out half of them
        __m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
        vec = _mm256_and_si256( andMask, vec );

        // Compare 16-bit lanes for equality, combine with OR
        const __m256i cmp1 = _mm256_cmpeq_epi16( vec, search1 );
        const __m256i cmp2 = _mm256_cmpeq_epi16( vec, search2 );
        const __m256i any = _mm256_or_si256( cmp1, cmp2 );
        const int mask = _mm256_movemask_epi8( any );

        // If neither character is found, mask will be 0.
        // Otherwise, the least significant set bit = index of the first matching byte in `any` vector
#ifdef _MSC_VER
        unsigned long bitIndex;
        // That's how actual instruction works, it returns 2 things at once, flag and index
        if( 0 == _BitScanForward( &bitIndex, (unsigned long)mask ) )
            continue;
#else
        if( 0 == mask )
            continue;
        const int bitIndex = __builtin_ctz( mask );
#endif
        return ( str - ptrStart ) + bitIndex;
    }

    // Handle the remainder
    for( ; str < ptrEnd; str += 2 )
    {
        const char c = *str;
        if( c == '[' || c == ' ' )
            return str - ptrStart;
    }
    return -1;
}
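To see why the lowest set bit of the mask is directly the byte offset of the first match, here is a scalar model of one 32-byte iteration. It is only an illustration under stated assumptions: modelMask32 and lowestSetBit are hypothetical helper names, not part of any library, and the model assumes the little-endian lane layout used on x86, where the low byte of 16-bit lane k is byte 2k of the chunk.

```c
#include <stdint.h>

/* Scalar model of the mask computed in one iteration of the AVX2 loop. */
uint32_t modelMask32(const unsigned char *chunk) {
    uint32_t mask = 0;
    for (int lane = 0; lane < 16; lane++) {
        /* The AND with 0x00FF keeps only the low byte of each 16-bit lane,
           which is the even-indexed byte; the odd byte is zeroed. */
        const uint16_t v = chunk[2 * lane];
        /* A 16-bit equality compare sets both bytes of a matching lane to 0xFF,
           and movemask then takes one bit per byte, so a match sets two bits. */
        if (v == '[' || v == ' ')
            mask |= (uint32_t)3 << (2 * lane);
    }
    return mask;
}

/* Index of the lowest set bit; the caller must pass a nonzero mask. */
int lowestSetBit(uint32_t mask) {
    int index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}
```

Because each matching lane contributes bits 2k and 2k+1, the lowest set bit always has an even index equal to the byte offset inside the chunk, which is why the function can return ( str - ptrStart ) + bitIndex without further adjustment.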

answered Nov 03 '22 14:11

Soonts
