I understand I can set mcpu and mattrs in EngineBuilder to generate vectorized code, but I find that the clang frontend has to be involved (via -mavx) to get AVX; otherwise the generated assembly uses only xmm registers.
Is there a way to let LLVM know that 8 floats can be packed into an AVX (ymm) register, without involving the frontend?
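Roughly, my setup looks like this (a sketch against the LLVM 3.x C++ API of that time; makeEngine and the CPU/feature strings are just illustrative):

// Sketch only: EngineBuilder with mcpu/mattrs set, no clang involved.
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/IR/Module.h"
#include <string>
#include <vector>

llvm::ExecutionEngine *makeEngine(llvm::Module *M) {
  std::vector<std::string> attrs;
  attrs.push_back("+avx");        // the feature clang's -mavx would imply
  std::string err;
  return llvm::EngineBuilder(M)   // newer LLVMs take a std::unique_ptr<Module>
      .setErrorStr(&err)
      .setMCPU("core-avx2")
      .setMAttrs(attrs)
      .create();
}

Even with this, the JIT'd assembly stays on xmm registers unless the IR itself is vectorized, which is what I'm asking about.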
My test code is just a vector addition:
float a[N], b[N];
float c[N];
// initialize a and b
for (int i = 0; i < N; ++i)
  c[i] = a[i] + b[i];
TL;DR: Yes. You just need to call opt and tell it to vectorize your code.
You can most definitely do it without clang. The vectorizers are all about LLVM IR; they're not in clang.
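For instance, here is a minimal sketch of running the loop vectorizer in-process, with no frontend anywhere (this assumes the legacy pass manager C++ API from roughly the same LLVM era as the IR below; header names moved around between releases):

#include "llvm/IR/LegacyPassManager.h"   // plain llvm/PassManager.h on older releases
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Vectorize.h"

void vectorize(llvm::Module &M) {
  llvm::legacy::PassManager PM;
  PM.add(llvm::createLoopVectorizePass()); // the pass behind opt -loop-vectorize
  PM.run(M);
}

One caveat: the vectorizer asks the target, via TargetTransformInfo, how wide its vector registers are, so to actually get 8-wide AVX operations you also need target information for the right CPU registered in the pass manager — that is exactly what -mcpu supplies in the opt invocations below.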
I got this IR from your example by using clang without optimizations (yeah, I cheated), and then annotated a bit or two. Note that the data layout and triple are important!
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.9.0"

define float* @f(i32 %N, float* nocapture readonly %a, float* nocapture readonly %b, float* %c) {
entry:
  %cmp10 = icmp sgt i32 %N, 0                     ; check for early exit
  br i1 %cmp10, label %for.body, label %for.end

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
  %arrayidx = getelementptr inbounds float* %a, i64 %indvars.iv
  %0 = load float* %arrayidx, align 4             ; %0 = a[i]
  %arrayidx2 = getelementptr inbounds float* %b, i64 %indvars.iv
  %1 = load float* %arrayidx2, align 4            ; %1 = b[i]
  %add = fadd float %0, %1                        ; %add = %0 + %1
  %arrayidx4 = getelementptr inbounds float* %c, i64 %indvars.iv
  store float %add, float* %arrayidx4, align 4    ; c[i] = %add
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %N        ; test for loop exit
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body, %entry
  ret float* %c
}
Now you want to vectorize the code. Let's run it through the loop vectorizer, then.
opt a.ll -S -march=x86-64 -mcpu=btver2 -loop-vectorize
(I ran it with -S to get the output at the console.)
Now we have vectorized IR, with a huge vector.body as well as some checks, a preheader, and additional bookkeeping code. You'll see this in the middle of the file:
  %171 = getelementptr inbounds float* %b, i64 %98
  %172 = insertelement <8 x float*> %170, float* %171, i32 7
  %173 = getelementptr float* %109, i32 0
  %174 = bitcast float* %173 to <8 x float>*
  %wide.load18 = load <8 x float>* %174, align 4
  %175 = getelementptr float* %109, i32 8
  %176 = bitcast float* %175 to <8 x float>*
  %wide.load19 = load <8 x float>* %176, align 4
  %177 = getelementptr float* %109, i32 16
  %178 = bitcast float* %177 to <8 x float>*
  %wide.load20 = load <8 x float>* %178, align 4
  %179 = getelementptr float* %109, i32 24
  %180 = bitcast float* %179 to <8 x float>*
  %wide.load21 = load <8 x float>* %180, align 4
  %181 = fadd <8 x float> %wide.load, %wide.load18
  %182 = fadd <8 x float> %wide.load15, %wide.load19
  %183 = fadd <8 x float> %wide.load16, %wide.load20
  %184 = fadd <8 x float> %wide.load17, %wide.load21
  %185 = getelementptr inbounds float* %c, i64 %5
  %186 = insertelement <8 x float*> undef, float* %185, i32 0
It's a bit complicated, but most of the floating point additions (fadd) are there, and they're only done on vectors. Let's make it simpler and run the other optimizations with -O2 or -O3. This will make the IR smaller and simpler by removing and/or collapsing the parts of it that wouldn't be needed or profitable.
opt a.ll -S -march=x86-64 -mcpu=btver2 -loop-vectorize -O3
Well… Since we now have IR that already works on vectors, we just need to emit it. Let's take that last step and call llc:
opt a.ll -S -march=x86-64 -mcpu=core-avx2 -loop-vectorize -O3 | llc -mcpu=core-avx2
Looking at the disassembly, you have a tight inner loop (if you have the same names as me, it should be at label LBB0_5), and a bunch of bookkeeping code.
Your code is now vectorized.
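To tie this back to the original EngineBuilder question: the same opt | llc pipeline can run entirely in-process before JITting. A sketch, again against that era's C++ API (treat the exact signatures as assumptions for your LLVM version):

#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include <string>

llvm::ExecutionEngine *optimizeAndJIT(llvm::Module *M) {
  // The equivalent of the "opt -O3 -loop-vectorize" step above.
  llvm::PassManagerBuilder PMB;
  PMB.OptLevel = 3;
  PMB.LoopVectorize = true;
  llvm::legacy::PassManager PM;
  PMB.populateModulePassManager(PM);
  PM.run(*M);

  // The equivalent of "llc -mcpu=core-avx2": codegen for the right CPU,
  // so the backend can actually use the ymm registers.
  std::string err;
  return llvm::EngineBuilder(M)
      .setErrorStr(&err)
      .setMCPU("core-avx2")
      .create();
}

PassManagerBuilder is the same machinery behind opt's -O levels, so this mirrors the command-line pipeline without ever touching clang.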