I am trying to vectorize the following function with clang, following this clang reference. It takes a byte vector and applies a mask to it as described in this RFC.
static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) {
#pragma clang loop vectorize(enable) interleave(enable)
    for (size_t i = 0; i < payload.size(); i++) {
        payload[i] = payload[i] ^ masking_key[i % 4];
    }
}
The following flags are passed to clang:
-O3
-Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize
However, the vectorization fails with the following error:
WebSocket.cpp:5:
WebSocket.h:14:
In file included from boost/asio/io_service.hpp:767:
In file included from boost/asio/impl/io_service.hpp:19:
In file included from boost/asio/detail/service_registry.hpp:143:
In file included from boost/asio/detail/impl/service_registry.ipp:19:
c++/v1/vector:1498:18: remark: loop not vectorized: could not determine number
of loop iterations [-Rpass-analysis]
return this->__begin_[__n];
^
c++/v1/vector:1498:18: error: loop not vectorized: failed explicitly specified
loop vectorization [-Werror,-Wpass-failed]
How do I vectorize this for loop?
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
GCC autovectorization flags: GCC is an advanced compiler, and with the optimization flags -O3 or -ftree-vectorize it searches for loops it can vectorize (remember to specify the -mavx flag too if AVX instructions should be used). The source code remains the same, but the machine code GCC generates is completely different.
Loop vectorization transforms a scalar loop so that each iteration processes several elements at once, with each SIMD lane handling one pair of operands. Programs often spend most of their time within such loops, so vectorization can significantly accelerate them, especially over large data sets.
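As a minimal sketch of a loop shape that vectorizers tend to handle well, the trip count below is read once before the loop and the data is accessed through a raw pointer. The name xor_all is hypothetical and not part of the question; whether a given compiler actually vectorizes it still depends on the flags and target.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: XOR every byte of `data` with a single constant.
void xor_all(std::vector<std::uint8_t>& data, std::uint8_t key) {
    std::uint8_t* p = data.data();
    // Read the size once so the number of iterations is fixed before the loop.
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        p[i] ^= key;  // same operation on every element, no cross-iteration dependency
    }
}
```

Compiling with -O3 -Rpass=loop-vectorize, as in the question, should show whether clang vectorized this loop.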
Thanks to @PaulR and @PeterCordes. Unrolling the loop by a factor of 4, so that each iteration XORs one 32-bit word of the payload with the whole masking key, works:
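void apply_mask(vector<uint8_t> &payload, const uint8_t (&masking_key)[4]) {
    const size_t size = payload.size();   // read once, so the trip count is known before the loop
    const size_t size4 = size / 4;        // number of whole 32-bit words
    size_t i = 0;
    uint8_t *p = payload.data();          // unlike &payload[0], also valid for an empty vector
    // Note: these casts assume the buffer is suitably aligned and that the
    // compiler tolerates the uint8_t -> uint32_t type punning.
    uint32_t *p32 = reinterpret_cast<uint32_t *>(p);
    const uint32_t m = *reinterpret_cast<const uint32_t *>(&masking_key[0]);
    // Main loop: XOR four payload bytes at a time with the packed key.
#pragma clang loop vectorize(enable) interleave(enable)
    for (i = 0; i < size4; i++) {
        p32[i] = p32[i] ^ m;
    }
    // Tail: mask the remaining 0 to 3 bytes one at a time.
    for (i = (size4 * 4); i < size; i++) {
        p[i] = p[i] ^ masking_key[i % 4];
    }
}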
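The reinterpret_cast to uint32_t in the version above relies on the compiler tolerating that kind of type punning and on the buffer being suitably aligned. A hedged alternative sketch is shown below: it does the same four-bytes-at-a-time XOR but moves the bytes with std::memcpy, which optimizing compilers typically turn into plain loads and stores. The name apply_mask_portable is hypothetical, and whether this variant vectorizes as well as the original has not been checked here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical variant of apply_mask that avoids pointer casts.
void apply_mask_portable(std::vector<std::uint8_t>& payload,
                         const std::uint8_t (&masking_key)[4]) {
    std::uint8_t* p = payload.data();
    const std::size_t size = payload.size();
    const std::size_t whole = size - size % 4;  // bytes covered by full 4-byte words

    // Pack the key bytes in memory order, so the result is endian-independent.
    std::uint32_t m;
    std::memcpy(&m, masking_key, sizeof m);

    for (std::size_t i = 0; i < whole; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, p + i, sizeof word);  // load 4 payload bytes
        word ^= m;
        std::memcpy(p + i, &word, sizeof word);  // store them back
    }
    // Tail: the remaining 0 to 3 bytes are masked individually.
    for (std::size_t i = whole; i < size; ++i) {
        p[i] ^= masking_key[i % 4];
    }
}
```

Because XOR masking is its own inverse, calling the function twice with the same key restores the original payload, which makes for an easy sanity check.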