Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

uint32_t * uint32_t = uint64_t vector multiplication with gcc

I'm trying to multiply vectors of uint32_t producing the full 64-bit result in an uint64_t vector in gcc. The result I expect is for gcc to emit a single VPMULUDQ instruction. But what gcc outputs as code is horrible shuffling around of the individual uint32_t of the source vectors and then a full 64*64=64 multiplication. Here is what I've tried:

#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

v4llu mul(v8lu x, v8lu y) {
    x[1] = 0; x[3] = 0; x[5] = 0; x[7] = 0;
    y[1] = 0; y[3] = 0; y[5] = 0; y[7] = 0;
    return (v4llu)x * (v4llu)y;
}

The first masks out the unwanted parts of the uint32_t vector in the hope that gcc would optimize away the unneeded parts of the 64*64=64 multiplication and then see the masking is pointless as well. No such luck.

v4llu mul2(v8lu x, v8lu y) {
    v4llu tx = {x[0], x[2], x[4], x[6]};
    v4llu ty = {y[0], y[2], y[4], y[6]};
    return tx * ty;
}

Here I try to create a uint64_t vector from scratch with only the used parts set. Again gcc should see the top 32 bits of each uint64_t are 0 and not do a full 64*64=64 multiply. Instead, a lot of extracting and putting back of the values happens, and a 64*64=64 multiply.

v4llu mul3(v8lu x, v8lu y) {
    v4llu t = {x[0] * (uint64_t)y[0], x[2] * (uint64_t)y[2], x[4] * (uint64_t)y[4], x[6] * (uint64_t)y[6]};
    return t;
}

Let's build the result vector by multiplying the parts. Maybe gcc sees that it can use VPMULUDQ to achieve exactly that. No luck, it falls back to 4 IMUL opcodes.

Is there a way to tell gcc what I want it to do (32*32=64 multiplication with everything perfectly placed)?

Note: Inline asm or the intrinsic isn't the answer. Writing the opcode by hand obviously works. But then I would have to write different versions of the code for many target architectures and feature sets. I want gcc to understand the problem and produce the right solution from a single source code.

like image 429
Goswin von Brederlow Avatar asked Nov 13 '19 13:11

Goswin von Brederlow


1 Answers

As noted in the comments by chtz both mul1 and mul2 are optimized right by clang. Code similar to mul3 but using a for loop will be optimized too (but not as well).

So to me it looks like the syntax is correct to express what the code should do and gcc simply lacks the smarts so far to optimize this properly.

like image 89
Goswin von Brederlow Avatar answered Nov 06 '22 21:11

Goswin von Brederlow