This question is just to try to get some more insight into loop vectorization, particularly using OpenMP 4. The code given below generates 'size' random samples; from these samples we then extract a piece 'q' of 'qsize' samples starting at position 'qpos'. The program then finds the position of 'q' in the 'samples' array again. This is the code:
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <mm_malloc.h>

// SIMD size in floats, assuming 1 float = 4 bytes
#define VEC_SIZE 8
#define ALIGN (VEC_SIZE*sizeof(float))

int main(int argc, char *argv[])
{
    if (argc!=4)
    {
        printf("Usage: %s <size> <qsize> <qpos>\n",argv[0]);
        exit(1);
    }
    int size = atoi(argv[1]);
    int qsize = atoi(argv[2]);
    int qpos = atoi(argv[3]);
    assert(qsize < size);
    assert((qpos < size - qsize) && (qpos >= 0));
    float *samples;
    float *q;
    samples = (float *) malloc(size*sizeof(float));
    q = (float *) _mm_malloc(size*sizeof(float),ALIGN);

    // Initialization
    // - Randomly filling the samples
    samples[0] = 0.0;
    for (int i = 1 ; i < size; i++) //LOOP1
        samples[i] = samples[i-1] + rand()/((float)RAND_MAX) - 0.5;

    // - Getting q from the samples
    #pragma omp simd aligned(q:ALIGN)
    for (int i = 0; i < qsize; i++) //LOOP2
        q[i] = samples[qpos+i];

    // Finding the best match (since q is taken from the samples themselves,
    // the position of the best match must be qpos)
    float best_dist = FLT_MAX;
    int pos = -1;
    for (int i = 0; i < size - qsize; i++) //LOOP3
    {
        float dist = 0;
        #pragma omp simd aligned(q:ALIGN) reduction(+:dist)
        for (int j = 0; j < qsize; j++) //LOOP4
            dist += (q[j] - samples[i+j]) * (q[j] - samples[i+j]);
        if (dist < best_dist)
        {
            best_dist = dist;
            pos = i;
        }
    }
    assert(pos==qpos);
    printf("Done!\n");
    free(samples);
    _mm_free(q);
}
I'm compiling this with icc 15.0.0 and gcc 4.9.2, using the following commands:
icc vec-test.c -o icc-vec-test -std=c11 -qopt-report=3 -qopt-report-phase=vec -qopt-report-file=icc.vec -O3 -xHost -fopenmp
gcc vec-test.c -o gcc-vec-test -std=c11 -fopt-info-vec-missed-optimized=gcc.vec -O3 -march=native -fopenmp
'q' is aligned by using _mm_malloc(). It does not make sense to do the same for 'samples', since the innermost loop (LOOP4) will access unaligned elements of it anyway.
Both gcc and icc reported vectorization of LOOP4 (actually, icc manages to autovectorize the loop even if we omit the '#pragma omp simd', which gcc refuses to do, but that's just an extra observation). From the vectorization reports, it seems that neither compiler generated a peeling loop. My questions:
1) How do the compilers handle the fact that 'samples' is not aligned?
2) How much can this affect performance?
3) icc had no problem vectorizing LOOP2. However, gcc cannot: "note: not vectorized: not enough data-refs in basic block". Any ideas?
Thanks!
Here is some experience I gained while running the STREAM benchmark to measure sustainable memory bandwidth.
1) As far as I know, the Intel compiler will not generate code to check alignment. It will use some equivalent of movdqu (an unaligned load; movups for packed floats) to load 'samples' and of movdqa (an aligned load; movaps for packed floats) to load 'q'.
2) This depends on the ratio of memory bandwidth to available flops. LOOP4 requires only a tiny amount of computation, so my guess is that on a modern HPC machine your current program will simply be memory-bandwidth bound, provided 'samples' and 'q' are large; fixing the alignment will not help much. However, if you limit the number of cores used to fewer than 4, you should be able to observe a speed gain from aligning 'samples'.
3) The compiler does not decide whether to vectorize based on alignment; it refuses to vectorize when doing so is unsafe because of data dependencies. I have little experience with gcc, so I cannot offer any suggestion for this.
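To make point 1 concrete, here is a minimal hand-written sketch of what a vectorized LOOP4 roughly corresponds to. It is not the actual output of either compiler. It assumes an x86 target and uses plain SSE intrinsics, so it handles 4 floats per step rather than the 8 implied by VEC_SIZE. The function name sq_dist_sse is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics (assumes an x86 target) */

/* Hypothetical helper: squared distance between q[0..n) and s[0..n).
   q must be 16-byte aligned; s may have any float alignment. */
float sq_dist_sse(const float *q, const float *s, size_t n)
{
    assert(((uintptr_t)q % 16) == 0);

    __m128 acc = _mm_setzero_ps();
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        __m128 vq = _mm_load_ps(q + j);   /* aligned load, q is aligned   */
        __m128 vs = _mm_loadu_ps(s + j);  /* unaligned load, s + j is not */
        __m128 d  = _mm_sub_ps(vq, vs);
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    /* Horizontal sum of the four partial sums. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float dist = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar remainder for the last n % 4 elements. */
    for (; j < n; j++) {
        float d = q[j] - s[j];
        dist += d * d;
    }
    return dist;
}
```

Calling sq_dist_sse(q, samples + i, qsize) for each i would play the role of LOOP4. Because the partial sums are accumulated per lane, the summation order differs from the scalar loop, so the result can differ in the last bits.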
For your information, checking the alignment at runtime and providing a specialized routine that uses aligned loads and in-register shifting can usually beat the compiler-generated code. You can check Intel's L1 BLAS routines to see how they do this.
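As a rough illustration of the runtime alignment check only (not of the in-register shifting, and not of how Intel's routines are written), the sketch below computes how many leading elements must be handled in scalar code before a pointer reaches a 32-byte boundary, then splits a simple sum into prologue, aligned body and epilogue. The names ALIGN_BYTES, peel_count and sum_peeled are hypothetical, and the 32-byte boundary is an assumption matching VEC_SIZE in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 32  /* assumption: 8 floats, as VEC_SIZE in the question */

/* Hypothetical helper: number of leading floats to skip so that
   p + result is ALIGN_BYTES-aligned. Assumes p is float-aligned. */
size_t peel_count(const float *p)
{
    uintptr_t mis = (uintptr_t)p % ALIGN_BYTES;
    return mis == 0 ? 0 : (ALIGN_BYTES - mis) / sizeof(float);
}

/* Hypothetical helper: sum of n floats, split into a scalar prologue,
   a body that starts on an aligned address, and a scalar epilogue. */
float sum_peeled(const float *p, size_t n)
{
    size_t peel = peel_count(p);
    if (peel > n)
        peel = n;

    float sum = 0.0f;
    size_t i = 0;
    for (; i < peel; i++)            /* prologue: up to the boundary */
        sum += p[i];

    /* Either everything is consumed, or p + i is now aligned. */
    assert(i == n || ((uintptr_t)(p + i) % ALIGN_BYTES) == 0);

    size_t body_end = peel + ((n - peel) / 8) * 8;
    for (; i < body_end; i++)        /* body: whole groups of 8 floats */
        sum += p[i];
    for (; i < n; i++)               /* epilogue: leftover elements */
        sum += p[i];
    return sum;
}
```

Note that in LOOP4 peeling can align only one of the two streams: 'q' is already aligned, while the offset of samples + i changes with every i, which is presumably why the shifting trick is needed on top of the check.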