Background (there may be a better way to do this):
I am developing a Julia library in which I manually manage memory; I mmap
a large block, and then mostly treat it like a stack: functions receive the pointer as an argument, and if they allocate an object, they pass an incremented pointer to any callee. That callee itself will likely not increment the pointer further, and will just return the original pointer it received, if it returns the pointer at all.
Whenever a function returns, as far as my library is concerned, anything beyond the current position of the pointer is garbage. I would like LLVM to be aware of this, so that it can optimize away any unnecessary stores.
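The discipline can be sketched in plain Julia (a hypothetical toy version, not my library's actual API; an ordinary Vector stands in for the mmapped block, and alloc_f64 is a made-up helper name):

```julia
# Toy bump-pointer "stack" (hypothetical sketch, not the library's real API).
# Allocating n Float64s returns the object pointer plus the bumped stack
# pointer; once a caller discards the bumped pointer, that region is garbage.
const BUF = Vector{UInt8}(undef, 1 << 20)   # stand-in for the mmapped block

function alloc_f64(sp::Ptr{UInt8}, n::Int)
    return Ptr{Float64}(sp), sp + 8n        # object pointer, new stack pointer
end

sp0 = pointer(BUF)
v, sp1 = alloc_f64(sp0, 4)                  # "allocate" 4 doubles
for i in 1:4
    unsafe_store!(v, Float64(i), i)
end
total = sum(unsafe_load(v, i) for i in 1:4) # 1 + 2 + 3 + 4 = 10.0
println(total)
```

Callees receive sp1 and bump it further if they allocate; when they return, everything at or past sp1 is considered garbage again.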
Here is a test case demonstrating the problem: taking the dot product of two vectors of length 16. First, a few preliminary loads (these are my libraries, and are on GitHub: SIMDPirates, PaddedMatrices):
using SIMDPirates, PaddedMatrices
using SIMDPirates: lifetime_start, lifetime_end
b = @Mutable rand(16);
c = @Mutable rand(16);
a = FixedSizeVector{16,Float64}(undef);
b' * c # dot product
# 3.9704768664758925
Of course, we'd never include stores if we wrote a dot product by hand, but that's a lot harder to do when you're trying to generate code for arbitrary models. So we'll write a bad dot product that stores into a pointer:
@inline function storedot!(ptr, b, c)
    ptrb = pointer(b)
    ptrc = pointer(c)
    ptra = ptr
    for _ ∈ 1:4
        vb = vload(Vec{4,Float64}, ptrb)
        vc = vload(Vec{4,Float64}, ptrc)
        vstore!(ptra, vmul(vb, vc))
        ptra += 32
        ptrb += 32
        ptrc += 32
    end
    ptra = ptr
    out = vload(Vec{4,Float64}, ptra)
    for _ ∈ 1:3
        ptra += 32
        out = vadd(out, vload(Vec{4,Float64}, ptra))
    end
    vsum(out)
end
Instead of looping once and accumulating the dot product with fma instructions, we loop twice: first calculating and storing the products, then summing them.
What I want is for the compiler to figure out the correct thing.
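For reference, the code we would like the compiler to recover — a single loop accumulating with multiply-add and no stores — can be sketched in plain Julia with Base arrays (a sketch independent of the packages above; muladd lowers to an fma instruction where the hardware supports one):

```julia
# The target pattern: one pass, multiply-accumulate, no intermediate stores.
# Plain-Base sketch of what the well-optimized assembly below computes.
function fma_dot(b::AbstractVector{Float64}, c::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(b, c)
        s = muladd(b[i], c[i], s)   # fused multiply-add accumulation
    end
    return s
end

b = collect(1.0:16.0)
c = fill(0.5, 16)
println(fma_dot(b, c))   # 0.5 * sum(1:16) = 68.0
```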
Here are two versions calling it. The first uses the LLVM lifetime intrinsics to try to declare the pointer contents garbage:
function test_lifetime!(a, b, c)
    ptra = pointer(a)
    lifetime_start(Val(128), ptra)
    d = storedot!(ptra, b, c)
    lifetime_end(Val(128), ptra)
    d
end
and the second, instead of using a preallocated pointer, creates one with alloca:
function test_alloca(b, c)
    ptra = SIMDPirates.alloca(Val(16), Float64)
    storedot!(ptra, b, c)
end
Both of course get the correct answer:
test_lifetime!(a, b, c)
# 3.9704768664758925
test_alloca(b, c)
# 3.9704768664758925
But only the alloca version is optimized correctly. Its assembly (AT&T syntax):
# julia> @code_native debuginfo=:none test_alloca(b, c)
.text
vmovupd (%rsi), %ymm0
vmovupd 32(%rsi), %ymm1
vmovupd 64(%rsi), %ymm2
vmovupd 96(%rsi), %ymm3
vmulpd (%rdi), %ymm0, %ymm0
vfmadd231pd 32(%rdi), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
vfmadd231pd 64(%rdi), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
vfmadd231pd 96(%rdi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nopw %cs:(%rax,%rax)
nopl (%rax,%rax)
As you can see, there are no moves into memory, and we have one vmul and three vfmadds to calculate the dot product (before doing the vector reduction).
Unfortunately, this is not what we get from the version trying to use lifetimes:
# julia> @code_native debuginfo=:none test_lifetime!(a, b, c)
.text
vmovupd (%rdx), %ymm0
vmulpd (%rsi), %ymm0, %ymm0
vmovupd %ymm0, (%rdi)
vmovupd 32(%rdx), %ymm1
vmulpd 32(%rsi), %ymm1, %ymm1
vmovupd %ymm1, 32(%rdi)
vmovupd 64(%rdx), %ymm2
vmulpd 64(%rsi), %ymm2, %ymm2
vmovupd %ymm2, 64(%rdi)
vmovupd 96(%rdx), %ymm3
vaddpd %ymm0, %ymm1, %ymm0
vaddpd %ymm0, %ymm2, %ymm0
vfmadd231pd 96(%rsi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
Here, we just get the loops as written: vmul, store into memory, and then vadd. One of the four multiplies has, however, been replaced with a vfmadd (and its store eliminated).
Also, nothing ever reads from any of the stores, so I'd think the dead store elimination pass should have no trouble.
The associated LLVM IR:
;; julia> @code_llvm debuginfo=:none test_alloca(b, c)
define double @julia_test_alloca_17840(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) {
top:
%2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
%3 = addrspacecast %jl_value_t addrspace(11)* %2 to %jl_value_t*
%4 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
%5 = addrspacecast %jl_value_t addrspace(11)* %4 to %jl_value_t*
%ptr.i20 = bitcast %jl_value_t* %3 to <4 x double>*
%res.i21 = load <4 x double>, <4 x double>* %ptr.i20, align 8
%ptr.i18 = bitcast %jl_value_t* %5 to <4 x double>*
%res.i19 = load <4 x double>, <4 x double>* %ptr.i18, align 8
%res.i17 = fmul fast <4 x double> %res.i19, %res.i21
%6 = bitcast %jl_value_t* %3 to i8*
%7 = getelementptr i8, i8* %6, i64 32
%8 = bitcast %jl_value_t* %5 to i8*
%9 = getelementptr i8, i8* %8, i64 32
%ptr.i20.1 = bitcast i8* %7 to <4 x double>*
%res.i21.1 = load <4 x double>, <4 x double>* %ptr.i20.1, align 8
%ptr.i18.1 = bitcast i8* %9 to <4 x double>*
%res.i19.1 = load <4 x double>, <4 x double>* %ptr.i18.1, align 8
%res.i17.1 = fmul fast <4 x double> %res.i19.1, %res.i21.1
%10 = getelementptr i8, i8* %6, i64 64
%11 = getelementptr i8, i8* %8, i64 64
%ptr.i20.2 = bitcast i8* %10 to <4 x double>*
%res.i21.2 = load <4 x double>, <4 x double>* %ptr.i20.2, align 8
%ptr.i18.2 = bitcast i8* %11 to <4 x double>*
%res.i19.2 = load <4 x double>, <4 x double>* %ptr.i18.2, align 8
%res.i17.2 = fmul fast <4 x double> %res.i19.2, %res.i21.2
%12 = getelementptr i8, i8* %6, i64 96
%13 = getelementptr i8, i8* %8, i64 96
%ptr.i20.3 = bitcast i8* %12 to <4 x double>*
%res.i21.3 = load <4 x double>, <4 x double>* %ptr.i20.3, align 8
%ptr.i18.3 = bitcast i8* %13 to <4 x double>*
%res.i19.3 = load <4 x double>, <4 x double>* %ptr.i18.3, align 8
%res.i17.3 = fmul fast <4 x double> %res.i19.3, %res.i21.3
%res.i12 = fadd fast <4 x double> %res.i17.1, %res.i17
%res.i12.1 = fadd fast <4 x double> %res.i17.2, %res.i12
%res.i12.2 = fadd fast <4 x double> %res.i17.3, %res.i12.1
%vec_2_1.i = shufflevector <4 x double> %res.i12.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%vec_2_2.i = shufflevector <4 x double> %res.i12.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
%vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
%vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
%vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
%res.i = extractelement <1 x double> %vec_1.i, i32 0
ret double %res.i
}
It elided the alloca and the stores.
However, trying to use lifetimes:
;; julia> @code_llvm debuginfo=:none test_lifetime!(a, b, c)
define double @"julia_test_lifetime!_17839"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) {
top:
%3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
%4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
%.ptr = bitcast %jl_value_t* %4 to i8*
call void @llvm.lifetime.start.p0i8(i64 256, i8* %.ptr)
%5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
%6 = addrspacecast %jl_value_t addrspace(11)* %5 to %jl_value_t*
%7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
%8 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
%ptr.i22 = bitcast %jl_value_t* %6 to <4 x double>*
%res.i23 = load <4 x double>, <4 x double>* %ptr.i22, align 8
%ptr.i20 = bitcast %jl_value_t* %8 to <4 x double>*
%res.i21 = load <4 x double>, <4 x double>* %ptr.i20, align 8
%res.i19 = fmul fast <4 x double> %res.i21, %res.i23
%ptr.i18 = bitcast %jl_value_t* %4 to <4 x double>*
store <4 x double> %res.i19, <4 x double>* %ptr.i18, align 8
%9 = getelementptr i8, i8* %.ptr, i64 32
%10 = bitcast %jl_value_t* %6 to i8*
%11 = getelementptr i8, i8* %10, i64 32
%12 = bitcast %jl_value_t* %8 to i8*
%13 = getelementptr i8, i8* %12, i64 32
%ptr.i22.1 = bitcast i8* %11 to <4 x double>*
%res.i23.1 = load <4 x double>, <4 x double>* %ptr.i22.1, align 8
%ptr.i20.1 = bitcast i8* %13 to <4 x double>*
%res.i21.1 = load <4 x double>, <4 x double>* %ptr.i20.1, align 8
%res.i19.1 = fmul fast <4 x double> %res.i21.1, %res.i23.1
%ptr.i18.1 = bitcast i8* %9 to <4 x double>*
store <4 x double> %res.i19.1, <4 x double>* %ptr.i18.1, align 8
%14 = getelementptr i8, i8* %.ptr, i64 64
%15 = getelementptr i8, i8* %10, i64 64
%16 = getelementptr i8, i8* %12, i64 64
%ptr.i22.2 = bitcast i8* %15 to <4 x double>*
%res.i23.2 = load <4 x double>, <4 x double>* %ptr.i22.2, align 8
%ptr.i20.2 = bitcast i8* %16 to <4 x double>*
%res.i21.2 = load <4 x double>, <4 x double>* %ptr.i20.2, align 8
%res.i19.2 = fmul fast <4 x double> %res.i21.2, %res.i23.2
%ptr.i18.2 = bitcast i8* %14 to <4 x double>*
store <4 x double> %res.i19.2, <4 x double>* %ptr.i18.2, align 8
%17 = getelementptr i8, i8* %10, i64 96
%18 = getelementptr i8, i8* %12, i64 96
%ptr.i22.3 = bitcast i8* %17 to <4 x double>*
%res.i23.3 = load <4 x double>, <4 x double>* %ptr.i22.3, align 8
%ptr.i20.3 = bitcast i8* %18 to <4 x double>*
%res.i21.3 = load <4 x double>, <4 x double>* %ptr.i20.3, align 8
%res.i19.3 = fmul fast <4 x double> %res.i21.3, %res.i23.3
%res.i13 = fadd fast <4 x double> %res.i19.1, %res.i19
%res.i13.1 = fadd fast <4 x double> %res.i19.2, %res.i13
%res.i13.2 = fadd fast <4 x double> %res.i19.3, %res.i13.1
%vec_2_1.i = shufflevector <4 x double> %res.i13.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%vec_2_2.i = shufflevector <4 x double> %res.i13.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
%vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
%vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
%vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
%res.i = extractelement <1 x double> %vec_1.i, i32 0
call void @llvm.lifetime.end.p0i8(i64 256, i8* %.ptr)
ret double %res.i
}
The lifetime.start and lifetime.end calls are there, but so are three of the four stores. I can confirm that the fourth store is gone:
julia> fill!(a, 0.0)'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
julia> test_lifetime!(a, b, c)
3.9704768664758925
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.157677 0.152386 0.507693 0.00696963 0.0651712 0.241523 0.129705 0.175321 0.236032 0.0314141 0.199595 0.404153 0.0 0.0 0.0 0.0
While without the lifetime annotations, all four stores must of course happen:
julia> function test_store!(a, b, c)
           storedot!(pointer(a), b, c)
       end
test_store! (generic function with 1 method)
julia> fill!(a, 0.0); test_store!(a, b, c)
3.9704768664758925
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.157677 0.152386 0.507693 0.00696963 0.0651712 0.241523 0.129705 0.175321 0.236032 0.0314141 0.199595 0.404153 0.256597 0.0376403 0.889331 0.479269
Yet, unlike with the alloca, it was not able to elide all four stores.
For reference, I built Julia with LLVM 8.0.1.
I'm not using alloca in place of my stack pointer for two reasons:
a) I got bugs when calling non-inlined functions with alloca-created pointers. Replacing those pointers with others made the bugs disappear, as did inlining the functions. If there is a way to resolve that, I could at least use alloca in many more places.
b) I couldn't find a way to give Julia more than 4 MB of stack per thread for alloca to draw on. 4 MB is plenty for many of my use cases, but not all, and a limit like that isn't great if I'm aiming to write fairly general software.
My questions:
Is there something I can do so the lifetime version gets optimized like the alloca version? And is there a way to fix the problems I described with alloca?
After originally posting the question, I edited in the following bullet: could the problem be that LLVM cannot prove that ptra does not alias b or c?
Turns out it was exactly this problem. If ptra did alias b or c, eliding the stores would have been invalid.
Writing instead:
a = @Mutable rand(48);
a[Static(1:16)]' * a[Static(17:32)]
# 2.5295415040590425
function test_lifetime!(a)
    ptra = pointer(a)
    b = PtrVector{16,Float64,16}(ptra)
    c = PtrVector{16,Float64,16}(ptra + 128)
    ptra += 256
    lifetime_start(Val(128), ptra)
    d = storedot!(ptra, b, c)
    lifetime_end(Val(128), ptra)
    d
end
test_lifetime!(a)
# 2.5295415040590425
Does in fact elide all the stores:
# julia> @code_native debuginfo=:none test_lifetime!(a)
.text
vmovupd 128(%rdi), %ymm0
vmovupd 160(%rdi), %ymm1
vmovupd 192(%rdi), %ymm2
vmovupd 224(%rdi), %ymm3
vmulpd (%rdi), %ymm0, %ymm0
vfmadd231pd 32(%rdi), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
vfmadd231pd 64(%rdi), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
vfmadd231pd 96(%rdi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nop
So the answer is: LLVM knows the alloca'd pointer cannot alias either of the inputs, and therefore that it is safe not to store.
The behavior I wanted in my question (eliding the stores without an aliasing check) would have been unsafe and liable to produce incorrect results: one of the stores into ptra could have changed the contents of b or c. Hence all but the very last store actually do have to be performed.
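The hazard is easy to reproduce with plain Base arrays (a sketch, independent of the packages above): run the store-then-sum dot product once with a disjoint scratch buffer and once with a scratch buffer that overlaps an input.

```julia
# Store-then-sum dot product: if the scratch buffer overlaps an input,
# the stores feed later reads and change the answer, so a compiler that
# cannot rule out aliasing is not allowed to delete them.
function storedot(scratch, b, c)
    for i in eachindex(b)
        scratch[i] = b[i] * c[i]   # pass 1: store the products
    end
    s = 0.0
    for i in eachindex(b)
        s += scratch[i]            # pass 2: sum them
    end
    return s
end

a = collect(1.0:5.0)
c = fill(2.0, 4)
disjoint = storedot(zeros(4), view(a, 1:4), c)     # 2+4+6+8 = 20.0
aliased  = storedot(view(a, 2:5), view(a, 1:4), c) # scratch overlaps b
println((disjoint, aliased))   # (20.0, 30.0)
```

In the aliased call, each store into scratch[i] overwrites b[i+1] before it is read, so the two versions compute different results.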
In this last test, I defined each of a, b, and c at different offsets from the same pointer, so that stores into the scratch region are guaranteed not to change b or c, letting LLVM actually eliminate the stores. Perfect!
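The pattern generalizes: hand out disjoint ranges of a single buffer, and no two of them can overlap. A plain-Julia sketch with Base views (not the PtrVector machinery above):

```julia
# Carving provably disjoint regions from one allocation: b, c, and the
# scratch space are non-overlapping ranges of the same buffer, so stores
# into scratch can never change b or c.
buf = collect(1.0:48.0)
b       = view(buf,  1:16)
c       = view(buf, 17:32)
scratch = view(buf, 33:48)

scratch .= b .* c                      # stores land only in indices 33:48
println(sum(scratch) == sum(b .* c))   # true: b and c were left intact
```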