Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LLVM IR: efficiently summing a vector

I'm writing a compiler that's generating LLVM IR instructions. I'm working extensively with vectors.

I would like to be able to sum all the elements in a vector. Right now I'm just extracting each element individually and adding them up manually, but it strikes me that this is precisely the sort of thing that the hardware should be able to help with (as it sounds like a pretty common operation). But there doesn't seem to be an intrinsic to do it.

What's the best way to do this? I'm using LLVM 3.2.

like image 953
David Given Avatar asked Feb 07 '13 00:02

David Given


1 Answers

First of all, even without using intrinsics, you can generate log(n) vector additions (with n being vector length) instead of n scalar additions, here's an example with vector size 8:

define i32 @sum(<8 x i32> %a) {
  %v1 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %v2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sum1 = add <4 x i32> %v1, %v2
  %v3 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
  %v4 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
  %sum2 = add <2 x i32> %v3, %v4
  %v5 = extractelement <2 x i32> %sum2, i32 0
  %v6 = extractelement <2 x i32> %sum2, i32 1
  %sum3 = add i32 %v5, %v6
  ret i32 %sum3
}

If your target has support for these vector additions then it seems highly likely the above will be lowered to use those instructions, giving you performance.

Regarding intrinsics, there are no target-independent intrinsics to handle this. If you're compiling to x86, though, you do have access to the hadd instrinsics (e.g. llvm.x86.int_x86_ssse3_phadd_sw_128 to add two <4 x i32> vectors together). You'll still have to do something similar to the above, only the add instructions could be replaced.

For more information about this you can search for "horizontal sum" or "horizontal vector sum"; for instance, here are some relevant stackoverflow questions for a horizontal sum on x86:

  • horizontal sum of 8 packed 32bit floats
  • Fastest way to do horizontal vector sum with AVX instructions
  • Fastest way to do horizontal float vector sum on x86
like image 52
Oak Avatar answered Oct 03 '22 15:10

Oak