Do compilers such as GCC, Visual C++, the Intel C++ compiler, Clang, etc. vectorize code like the following?
std::vector<unsigned char> img( height * width * 3 );
unsigned char channelMultiplier[3];
// ... initialize img and channelMultiplier ...
for ( int y = 0; y < height; ++y )
    for ( int x = 0; x < width; ++x )
        for ( int b = 0; b < 3; ++b )   // b must be declared
            img[ b+3*(x+width*y) ] = img[ b+3*(x+width*y) ] *
                                     channelMultiplier[b] / 0x100;
How about the same for 32-bit image processing?
I do not think your triple loop will auto-vectorize. IMO the problems are:
First, img is accessed through another pointer (the one held inside the std::vector), and this will most likely block vectorization. Basically, you need a plain array, and you need to tell the compiler that no other pointer refers to the same memory. I think you can do that with __restrict, which tells the compiler that this pointer is the only one pointing to that memory, so there is no risk of side effects through aliasing.
Second, the memory is not aligned by default, and even if the compiler manages to auto-vectorize, unaligned vector accesses can be slower than aligned ones. To get the most out of auto-vectorization, the memory should be 32-byte aligned for AVX and 16-byte aligned for SSE, i.e. always align to 32 bytes. You can do this dynamically via:
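// needs <stdlib.h> for posix_memalign and free
double* buffer = NULL;
// 32-byte alignment covers both AVX and SSE; posix_memalign returns 0 on success
if (posix_memalign((void**) &buffer, 32, size*sizeof(double)) != 0)
    buffer = NULL;   // allocation failed; handle the error here
...
free(buffer);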
In MSVC you can declare an aligned array with __declspec(align(32)) double array[size], but check the documentation of the specific compiler you are using to make sure you use the correct alignment directive.
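As a rough illustration of the alignment point, here is a minimal sketch that uses the standard C++11 alignas specifier instead of a compiler-specific directive and checks the resulting address. The names is_aligned and aligned_array are made up for this example, and the array size of 1024 is arbitrary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the answer): true if p lies on an
// `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Standard C++11 way to request 32-byte alignment for a static array.
// 32 bytes matches the width of an AVX register; 1024 is an arbitrary size.
alignas(32) static double aligned_array[1024];
```

Calling is_aligned(aligned_array, 32) should return true. The same check can be applied to a pointer obtained from posix_memalign.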
Another important thing: if you use the GNU compiler, pass the flag -ftree-vectorizer-verbose=6 to check whether your loop is being auto-vectorized (newer GCC releases replace this flag with -fopt-info-vec). If you use the Intel compiler, use -vec-report5. Note that there are several levels of verbosity (the 6 and the 5 above), so check the compiler documentation. The higher the verbosity level, the more vectorization information you get for every loop in your code, but the slower the compilation of an optimized (Release) build.
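To make the aliasing point and the report flags concrete, here is a minimal sketch of the question's loop rewritten as a kernel over raw pointers. The function name scale_channels, its parameters, and the file name kernel.cpp are assumptions for this example. __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and nothing here guarantees that a given compiler will actually vectorize the loop.

```cpp
#include <cstddef>

// Hypothetical kernel: scales interleaved 3-channel 8-bit pixels in place.
// __restrict promises the compiler that img and mult never overlap.
//
// Possible ways to ask for a vectorization report (file name is made up):
//   g++     -O3 -fopt-info-vec        -c kernel.cpp
//   clang++ -O3 -Rpass=loop-vectorize -c kernel.cpp
void scale_channels(unsigned char* __restrict img,
                    const unsigned char* __restrict mult,
                    std::size_t pixel_count) {
    for (std::size_t p = 0; p < pixel_count; ++p) {
        for (int b = 0; b < 3; ++b) {
            // Same arithmetic as in the question: multiply, then divide by 0x100.
            img[3 * p + b] =
                static_cast<unsigned char>(img[3 * p + b] * mult[b] / 0x100);
        }
    }
}
```

With the std::vector from the question, the call would look like scale_channels(img.data(), channelMultiplier, static_cast<std::size_t>(width) * height). The compiler's report then tells you whether this form was vectorized.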
In general, I have always been surprised at how hard it is to get the compiler to auto-vectorize. It is a common mistake to assume that because a loop looks canonical, the compiler will auto-vectorize it auto-magically.
UPDATE: One more thing: make sure your img is actually page-aligned, e.g. posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double)); (which implies AVX and SSE alignment). The problem is that a big image spans many pages, so this loop will most likely keep crossing page boundaries during execution, and that is also very expensive. I think these are the so-called TLB misses.
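The page-aligned allocation above can be wrapped up roughly as follows. This is a POSIX-only sketch; the helper name alloc_page_aligned is an assumption for this example, and the element type double simply mirrors the snippet in the answer.

```cpp
#include <cstddef>
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <unistd.h>   // sysconf, _SC_PAGESIZE (POSIX)

// Hypothetical helper: allocate `count` doubles starting on a page boundary.
// Returns nullptr on failure; release the result with free().
double* alloc_page_aligned(std::size_t count) {
    const long page = sysconf(_SC_PAGESIZE);   // page size in bytes
    if (page <= 0) {
        return nullptr;                        // page size unavailable
    }
    void* p = nullptr;
    // posix_memalign returns 0 on success and an error number otherwise.
    if (posix_memalign(&p, static_cast<std::size_t>(page),
                       count * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(p);
}
```

A typical use would be double* buffer = alloc_page_aligned(size); followed by a null check and, once done, free(buffer);.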