
Loading data for GCC's vector extensions

GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware specific intrinsics (or auto-vectorization).

A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef char v16qi __attribute__ ((vector_size(16)));

static uint8_t checksum(uint8_t *buf, size_t size)
{
    assert(size%16 == 0);
    uint8_t sum = 0;

    v16qi vec = {0};
    for (size_t i=0; i<(size/16); i++)
    {
        // XXX: Yuck! Is there a better way?
        // The cast must apply to (buf + i*16); (v16qi*) buf + i*16 would advance in 16-byte steps.
        vec += *((v16qi*) (buf+i*16));
    }

    // Sum up the vector
    sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7] + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];

    return sum;
}

Casting a pointer to the vector type appears to work, but I'm worried this might explode in a horrible fashion if SIMD hardware expects the vector types to be correctly aligned.

The only other option I've thought of is to use a temporary vector and explicitly load the values (via either a memcpy or element-wise assignment), but in testing this counteracts most of the speedup gained from using SIMD instructions. Ideally I'd imagine this would be something like a generic __builtin_load() function, but none seems to exist.
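For reference, the memcpy variant described above might look like the following minimal sketch. The names load_v16u8 and checksum_memcpy are hypothetical, and the element type is switched to uint8_t (an assumption, not part of the question) so that the byte sums wrap with well-defined unsigned arithmetic. Optimizing compilers often turn a fixed-size 16-byte memcpy into a single unaligned load, but that depends on the compiler and flags, so it is worth checking the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: unsigned elements so per-lane overflow wraps modulo 256. */
typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Hypothetical helper: copy 16 bytes from any address into a vector. */
static inline v16u8 load_v16u8(const uint8_t *p)
{
    v16u8 v;
    memcpy(&v, p, sizeof v);   /* no alignment requirement on p */
    return v;
}

static uint8_t checksum_memcpy(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16)
        acc += load_v16u8(buf + i);

    /* Horizontal sum of the 16 lanes, truncated to 8 bits. */
    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum += acc[k];
    return sum;
}
```

Because the load goes through memcpy, buf may start at any address, for example checksum_memcpy(data + 1, 16).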

What's a safer way of loading data into a vector without risking alignment issues?

dcoles asked Feb 16 '12 19:02


1 Answer

Edit (thanks Peter Cordes) You can cast pointers:

typedef char v16qi __attribute__ ((vector_size (16), aligned (16)));

v16qi vec = *(v16qi*)&buf[i]; // load
*(v16qi*)(buf + i) = vec; // store whole vector

This compiles to vmovdqa for the load and vmovaps for the store. If the data isn't known to be aligned, use aligned (1) in the typedef instead to generate vmovdqu. (godbolt)
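Applied to the checksum from the question, the aligned (1) variant could be sketched roughly as follows. The type names v16u8 and v16u8_u and the function checksum_unaligned are made up for illustration. The may_alias attribute and the uint8_t element type are additions not mentioned above: the first is there to keep the pointer cast clear of strict-aliasing problems, the second so that lane overflow wraps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register-side type with the natural 16-byte alignment. */
typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Memory-side type: same layout, but alignment 1 and allowed to alias
 * other types (assumption: may_alias added for strict-aliasing safety). */
typedef uint8_t v16u8_u
    __attribute__ ((vector_size (16), aligned (1), may_alias));

static uint8_t checksum_unaligned(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16) {
        /* Dereferencing the aligned(1) type asks for an unaligned load. */
        v16u8 chunk = *(const v16u8_u *)(buf + i);
        acc += chunk;
    }

    /* Horizontal sum of the 16 lanes, truncated to 8 bits. */
    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum += acc[k];
    return sum;
}
```

Keeping two typedefs means only the memory accesses are marked unaligned, while local vector variables keep their normal alignment.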

Note that there are also several special-purpose intrinsics for loading and storing these registers (Edit 2):

v16qi vec = _mm_loadu_si128((__m128i*)&buf[i]); // _mm_load_si128 for aligned
_mm_storeu_si128((__m128i*)&buf[i], vec); // _mm_store_si128 for aligned

It seems to be necessary to use -flax-vector-conversions to convert between __m128i (the type these intrinsics take and return) and v16qi.
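One way to avoid relying on that flag is to cast explicitly between the two vector types, which GCC permits when both have the same size. The sketch below is x86-only (SSE2); the helper name add_16_bytes and the v16u8 typedef are hypothetical and not taken from the code above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */
#include <stdint.h>

typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Hypothetical helper: dst[k] += src[k] for 16 bytes, any alignment. */
static void add_16_bytes(uint8_t *dst, const uint8_t *src)
{
    /* Explicit casts reinterpret the same 128 bits as another vector type. */
    v16u8 a = (v16u8)_mm_loadu_si128((const __m128i *)dst);
    v16u8 b = (v16u8)_mm_loadu_si128((const __m128i *)src);

    a += b;  /* element-wise add using the vector extension */

    _mm_storeu_si128((__m128i *)dst, (__m128i)a);
}
```

The casts change only the type, not the bits, so the arithmetic can still be written with the generic vector operators while the loads and stores go through the intrinsics.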

See also: C - How to access elements of vector using GCC SSE vector extension
See also: SSE loading ints into __m128

(Tip: The best phrase to google is something like "gcc loading __m128i".)

ZachB answered Sep 27 '22 23:09