Background (there may be a better way to do this):
I am developing a Julia library in which I manually manage memory; I mmap
a large block, and then mostly treat it like a stack: functions receive the pointer as an argument, and if they allocate an object, they pass an incremented pointer to any callee. That callee itself will likely not increment the pointer further, and will just return the original pointer it received, if it returns the pointer at all.
Whenever a function returns, as far as my library is concerned, anything beyond the current position of the pointer is garbage. I would like LLVM to be aware of this, so that it can optimize away any unnecessary stores.
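The discipline can be sketched in plain Julia (a hypothetical toy version, not my library's actual API; an ordinary Vector stands in for the mmapped block, and alloc_f64 is a made-up helper name):

```julia
# Toy bump-pointer "stack" (hypothetical sketch, not the library's real API).
# Allocating n Float64s returns the object pointer plus the bumped stack
# pointer; once a caller discards the bumped pointer, that region is garbage.
const BUF = Vector{UInt8}(undef, 1 << 20)   # stand-in for the mmapped block

function alloc_f64(sp::Ptr{UInt8}, n::Int)
    return Ptr{Float64}(sp), sp + 8n        # object pointer, new stack pointer
end

sp0 = pointer(BUF)
v, sp1 = alloc_f64(sp0, 4)                  # "allocate" 4 doubles
for i in 1:4
    unsafe_store!(v, Float64(i), i)
end
total = sum(unsafe_load(v, i) for i in 1:4) # 1 + 2 + 3 + 4 = 10.0
println(total)
```

Callees receive sp1 and bump it further if they allocate; when they return, everything at or past sp1 is considered garbage again.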
Here is a test case demonstrating the problem: taking the dot product of two vectors of length 16. First, a few preliminary loads (these are my libraries, and are on GitHub: SIMDPirates, PaddedMatrices):
using SIMDPirates, PaddedMatrices
using SIMDPirates: lifetime_start, lifetime_end
b = @Mutable rand(16);
c = @Mutable rand(16);
a = FixedSizeVector{16,Float64}(undef);
b' * c # dot product
# 3.9704768664758925
Of course, we'd never include stores if we wrote a dot product by hand, but that's a lot harder to do when you're trying to generate code for arbitrary models. So we'll write a bad dot product that stores into a pointer:
@inline function storedot!(ptr, b, c)
    ptrb = pointer(b)
    ptrc = pointer(c)
    ptra = ptr
    for _ ∈ 1:4
        vb = vload(Vec{4,Float64}, ptrb)
        vc = vload(Vec{4,Float64}, ptrc)
        vstore!(ptra, vmul(vb, vc))
        ptra += 32
        ptrb += 32
        ptrc += 32
    end
    ptra = ptr
    out = vload(Vec{4,Float64}, ptra)
    for _ ∈ 1:3
        ptra += 32
        out = vadd(out, vload(Vec{4,Float64}, ptra))
    end
    vsum(out)
end
Instead of looping once and accumulating the dot product with fma instructions, we loop twice: first calculating and storing the products, then summing them.
What I want is for the compiler to figure out the correct thing.
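For reference, the code we would like the compiler to recover — a single loop accumulating with multiply-add and no stores — can be sketched in plain Julia with Base arrays (a sketch independent of the packages above; muladd lowers to an fma instruction where the hardware supports one):

```julia
# The target pattern: one pass, multiply-accumulate, no intermediate stores.
# Plain-Base sketch of what the well-optimized assembly below computes.
function fma_dot(b::AbstractVector{Float64}, c::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(b, c)
        s = muladd(b[i], c[i], s)   # fused multiply-add accumulation
    end
    return s
end

b = collect(1.0:16.0)
c = fill(0.5, 16)
println(fma_dot(b, c))   # 0.5 * sum(1:16) = 68.0
```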
Here are two versions calling it. The first uses the LLVM lifetime intrinsics to try to declare the pointer contents garbage:
function test_lifetime!(a, b, c)
    ptra = pointer(a)
    lifetime_start(Val(128), ptra)
    d = storedot!(ptra, b, c)
    lifetime_end(Val(128), ptra)
    d
end
and the second, instead of using a preallocated pointer, creates one with alloca:
function test_alloca(b, c)
    ptra = SIMDPirates.alloca(Val(16), Float64)
    storedot!(ptra, b, c)
end
Both of course get the correct answer:
test_lifetime!(a, b, c)
# 3.9704768664758925
test_alloca(b, c)
# 3.9704768664758925
But only the alloca version is optimized correctly. Its assembly (AT&T syntax):
# julia> @code_native debuginfo=:none test_alloca(b, c)
.text
vmovupd (%rsi), %ymm0
vmovupd 32(%rsi), %ymm1
vmovupd 64(%rsi), %ymm2
vmovupd 96(%rsi), %ymm3
vmulpd (%rdi), %ymm0, %ymm0
vfmadd231pd 32(%rdi), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
vfmadd231pd 64(%rdi), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
vfmadd231pd 96(%rdi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nopw %cs:(%rax,%rax)
nopl (%rax,%rax)
As you can see, there are no moves into memory, and we have one vmul and three vfmadds to calculate the dot product (before doing the vector reduction).
Unfortunately, this is not what we get from the version trying to use lifetimes:
# julia> @code_native debuginfo=:none test_lifetime!(a, b, c)
.text
vmovupd (%rdx), %ymm0
vmulpd (%rsi), %ymm0, %ymm0
vmovupd %ymm0, (%rdi)
vmovupd 32(%rdx), %ymm1
vmulpd 32(%rsi), %ymm1, %ymm1
vmovupd %ymm1, 32(%rdi)
vmovupd 64(%rdx), %ymm2
vmulpd 64(%rsi), %ymm2, %ymm2
vmovupd %ymm2, 64(%rdi)
vmovupd 96(%rdx), %ymm3
vaddpd %ymm0, %ymm1, %ymm0
vaddpd %ymm0, %ymm2, %ymm0
vfmadd231pd 96(%rsi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
Here, we just get the loops as written: vmul, store into memory, and then vadd. One of the four multiplies has, however, been replaced with a vfmadd (and its store eliminated).
Also, nothing ever reads from any of the stores, so I'd think the dead store elimination pass should have no trouble.
The associated LLVM IR:
;; julia> @code_llvm debuginfo=:none test_alloca(b, c)
define double @julia_test_alloca_17840(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) {
top:
%2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
%3 = addrspacecast %jl_value_t addrspace(11)* %2 to %jl_value_t*
%4 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
%5 = addrspacecast %jl_value_t addrspace(11)* %4 to %jl_value_t*
%ptr.i20 = bitcast %jl_value_t* %3 to <4 x double>*
%res.i21 = load <4 x double>, <4 x double>* %ptr.i20, align 8
%ptr.i18 = bitcast %jl_value_t* %5 to <4 x double>*
%res.i19 = load <4 x double>, <4 x double>* %ptr.i18, align 8
%res.i17 = fmul fast <4 x double> %res.i19, %res.i21
%6 = bitcast %jl_value_t* %3 to i8*
%7 = getelementptr i8, i8* %6, i64 32
%8 = bitcast %jl_value_t* %5 to i8*
%9 = getelementptr i8, i8* %8, i64 32
%ptr.i20.1 = bitcast i8* %7 to <4 x double>*
%res.i21.1 = load <4 x double>, <4 x double>* %ptr.i20.1, align 8
%ptr.i18.1 = bitcast i8* %9 to <4 x double>*
%res.i19.1 = load <4 x double>, <4 x double>* %ptr.i18.1, align 8
%res.i17.1 = fmul fast <4 x double> %res.i19.1, %res.i21.1
%10 = getelementptr i8, i8* %6, i64 64
%11 = getelementptr i8, i8* %8, i64 64
%ptr.i20.2 = bitcast i8* %10 to <4 x double>*
%res.i21.2 = load <4 x double>, <4 x double>* %ptr.i20.2, align 8
%ptr.i18.2 = bitcast i8* %11 to <4 x double>*
%res.i19.2 = load <4 x double>, <4 x double>* %ptr.i18.2, align 8
%res.i17.2 = fmul fast <4 x double> %res.i19.2, %res.i21.2
%12 = getelementptr i8, i8* %6, i64 96
%13 = getelementptr i8, i8* %8, i64 96
%ptr.i20.3 = bitcast i8* %12 to <4 x double>*
%res.i21.3 = load <4 x double>, <4 x double>* %ptr.i20.3, align 8
%ptr.i18.3 = bitcast i8* %13 to <4 x double>*
%res.i19.3 = load <4 x double>, <4 x double>* %ptr.i18.3, align 8
%res.i17.3 = fmul fast <4 x double> %res.i19.3, %res.i21.3
%res.i12 = fadd fast <4 x double> %res.i17.1, %res.i17
%res.i12.1 = fadd fast <4 x double> %res.i17.2, %res.i12
%res.i12.2 = fadd fast <4 x double> %res.i17.3, %res.i12.1
%vec_2_1.i = shufflevector <4 x double> %res.i12.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%vec_2_2.i = shufflevector <4 x double> %res.i12.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
%vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
%vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
%vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
%res.i = extractelement <1 x double> %vec_1.i, i32 0
ret double %res.i
}
It elided the alloca and the stores.
However, trying to use lifetimes:
;; julia> @code_llvm debuginfo=:none test_lifetime!(a, b, c)
define double @"julia_test_lifetime!_17839"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) {
top:
%3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
%4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
%.ptr = bitcast %jl_value_t* %4 to i8*
call void @llvm.lifetime.start.p0i8(i64 256, i8* %.ptr)
%5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
%6 = addrspacecast %jl_value_t addrspace(11)* %5 to %jl_value_t*
%7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
%8 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
%ptr.i22 = bitcast %jl_value_t* %6 to <4 x double>*
%res.i23 = load <4 x double>, <4 x double>* %ptr.i22, align 8
%ptr.i20 = bitcast %jl_value_t* %8 to <4 x double>*
%res.i21 = load <4 x double>, <4 x double>* %ptr.i20, align 8
%res.i19 = fmul fast <4 x double> %res.i21, %res.i23
%ptr.i18 = bitcast %jl_value_t* %4 to <4 x double>*
store <4 x double> %res.i19, <4 x double>* %ptr.i18, align 8
%9 = getelementptr i8, i8* %.ptr, i64 32
%10 = bitcast %jl_value_t* %6 to i8*
%11 = getelementptr i8, i8* %10, i64 32
%12 = bitcast %jl_value_t* %8 to i8*
%13 = getelementptr i8, i8* %12, i64 32
%ptr.i22.1 = bitcast i8* %11 to <4 x double>*
%res.i23.1 = load <4 x double>, <4 x double>* %ptr.i22.1, align 8
%ptr.i20.1 = bitcast i8* %13 to <4 x double>*
%res.i21.1 = load <4 x double>, <4 x double>* %ptr.i20.1, align 8
%res.i19.1 = fmul fast <4 x double> %res.i21.1, %res.i23.1
%ptr.i18.1 = bitcast i8* %9 to <4 x double>*
store <4 x double> %res.i19.1, <4 x double>* %ptr.i18.1, align 8
%14 = getelementptr i8, i8* %.ptr, i64 64
%15 = getelementptr i8, i8* %10, i64 64
%16 = getelementptr i8, i8* %12, i64 64
%ptr.i22.2 = bitcast i8* %15 to <4 x double>*
%res.i23.2 = load <4 x double>, <4 x double>* %ptr.i22.2, align 8
%ptr.i20.2 = bitcast i8* %16 to <4 x double>*
%res.i21.2 = load <4 x double>, <4 x double>* %ptr.i20.2, align 8
%res.i19.2 = fmul fast <4 x double> %res.i21.2, %res.i23.2
%ptr.i18.2 = bitcast i8* %14 to <4 x double>*
store <4 x double> %res.i19.2, <4 x double>* %ptr.i18.2, align 8
%17 = getelementptr i8, i8* %10, i64 96
%18 = getelementptr i8, i8* %12, i64 96
%ptr.i22.3 = bitcast i8* %17 to <4 x double>*
%res.i23.3 = load <4 x double>, <4 x double>* %ptr.i22.3, align 8
%ptr.i20.3 = bitcast i8* %18 to <4 x double>*
%res.i21.3 = load <4 x double>, <4 x double>* %ptr.i20.3, align 8
%res.i19.3 = fmul fast <4 x double> %res.i21.3, %res.i23.3
%res.i13 = fadd fast <4 x double> %res.i19.1, %res.i19
%res.i13.1 = fadd fast <4 x double> %res.i19.2, %res.i13
%res.i13.2 = fadd fast <4 x double> %res.i19.3, %res.i13.1
%vec_2_1.i = shufflevector <4 x double> %res.i13.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%vec_2_2.i = shufflevector <4 x double> %res.i13.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
%vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
%vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
%vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
%res.i = extractelement <1 x double> %vec_1.i, i32 0
call void @llvm.lifetime.end.p0i8(i64 256, i8* %.ptr)
ret double %res.i
}
The lifetime.start and lifetime.end calls are there, but so are three of the four stores. I can confirm that the fourth store is gone:
julia> fill!(a, 0.0)'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
julia> test_lifetime!(a, b, c)
3.9704768664758925
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.157677 0.152386 0.507693 0.00696963 0.0651712 0.241523 0.129705 0.175321 0.236032 0.0314141 0.199595 0.404153 0.0 0.0 0.0 0.0
While without the lifetime annotations, all four stores must of course happen:
julia> function test_store!(a, b, c)
           storedot!(pointer(a), b, c)
       end
test_store! (generic function with 1 method)
julia> fill!(a, 0.0); test_store!(a, b, c)
3.9704768664758925
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.157677 0.152386 0.507693 0.00696963 0.0651712 0.241523 0.129705 0.175321 0.236032 0.0314141 0.199595 0.404153 0.256597 0.0376403 0.889331 0.479269
Yet, unlike with the alloca, it was not able to elide all four stores.
For reference, I built Julia with LLVM 8.0.1.
I'm not using alloca in place of my stack pointer for two reasons:
a) I got bugs when calling non-inlined functions with alloca-created pointers. Replacing those pointers with others made the bugs disappear, as did inlining the functions. If there is a way to resolve that, I could at least use alloca in many more places.
b) I couldn't find a way to give Julia more than 4 MB of stack per thread for alloca to draw on. 4 MB is plenty for many of my use cases, but not all, and a limit like that isn't great if I'm aiming to write fairly general software.
My questions:
Is there something I can do so the lifetime version gets optimized like the alloca version? And is there a way to fix the problems I described with alloca?
After originally posting the question, I edited in the following bullet: could the problem be that LLVM cannot prove that ptra does not alias b or c?
Turns out it was exactly this problem. If ptra did alias b or c, eliding the stores would have been invalid.
Writing instead:
a = @Mutable rand(48);
a[Static(1:16)]' * a[Static(17:32)]
# 2.5295415040590425
function test_lifetime!(a)
    ptra = pointer(a)
    b = PtrVector{16,Float64,16}(ptra)
    c = PtrVector{16,Float64,16}(ptra + 128)
    ptra += 256
    lifetime_start(Val(128), ptra)
    d = storedot!(ptra, b, c)
    lifetime_end(Val(128), ptra)
    d
end
test_lifetime!(a)
# 2.5295415040590425
Does in fact elide all the stores:
# julia> @code_native debuginfo=:none test_lifetime!(a)
.text
vmovupd 128(%rdi), %ymm0
vmovupd 160(%rdi), %ymm1
vmovupd 192(%rdi), %ymm2
vmovupd 224(%rdi), %ymm3
vmulpd (%rdi), %ymm0, %ymm0
vfmadd231pd 32(%rdi), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
vfmadd231pd 64(%rdi), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
vfmadd231pd 96(%rdi), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
vzeroupper
retq
nop
So the answer is: LLVM knows the alloca'd pointer cannot alias either of the inputs, and therefore that it is safe not to store.
The behavior I wanted in my question (eliding the stores without an aliasing check) would have been unsafe and liable to produce incorrect results: one of the stores into ptra could have changed the contents of b or c. Hence all but the very last store actually do have to be performed.
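The hazard is easy to reproduce with plain Base arrays (a sketch, independent of the packages above): run the store-then-sum dot product once with a disjoint scratch buffer and once with a scratch buffer that overlaps an input.

```julia
# Store-then-sum dot product: if the scratch buffer overlaps an input,
# the stores feed later reads and change the answer, so a compiler that
# cannot rule out aliasing is not allowed to delete them.
function storedot(scratch, b, c)
    for i in eachindex(b)
        scratch[i] = b[i] * c[i]   # pass 1: store the products
    end
    s = 0.0
    for i in eachindex(b)
        s += scratch[i]            # pass 2: sum them
    end
    return s
end

a = collect(1.0:5.0)
c = fill(2.0, 4)
disjoint = storedot(zeros(4), view(a, 1:4), c)     # 2+4+6+8 = 20.0
aliased  = storedot(view(a, 2:5), view(a, 1:4), c) # scratch overlaps b
println((disjoint, aliased))   # (20.0, 30.0)
```

In the aliased call, each store into scratch[i] overwrites b[i+1] before it is read, so the two versions compute different results.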
In this last test, I defined each of a, b, and c at different offsets from the same pointer, so that stores into the scratch region are guaranteed not to change b or c, letting LLVM actually eliminate the stores. Perfect!
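The pattern generalizes: hand out disjoint ranges of a single buffer, and no two of them can overlap. A plain-Julia sketch with Base views (not the PtrVector machinery above):

```julia
# Carving provably disjoint regions from one allocation: b, c, and the
# scratch space are non-overlapping ranges of the same buffer, so stores
# into scratch can never change b or c.
buf = collect(1.0:48.0)
b       = view(buf,  1:16)
c       = view(buf, 17:32)
scratch = view(buf, 33:48)

scratch .= b .* c                      # stores land only in indices 33:48
println(sum(scratch) == sum(b .* c))   # true: b and c were left intact
```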