Ok, I'm doing a serie of tests lately. I have a MC simulation where I have several variables (20) which make sense to put them all inside a one-dimensional array because it makes several things easier to read.
But I'm having one problem, I need to sum the variables in every iteration and the simulation takes A LOT of iterations, so I run into this problem (reduced to 7 variables):
function sumtest1(N)
    s=0.0
    a=1.0
    b=2.0
    c=3.0
    d=4.0
    e=5.0
    f=6.0
    g=7.0
    for i = 1:N
        s += (a+b+c+d+e+f+g)
    end
    return s
end
function sumtest2(N)
    s=0.0
    A=[1.0,2.0,3.0,4.0,5.0,6.0,7.0]
    for i = 1:N
        s += sum(A)
    end
    return s
end
@time sumtest1(1_000_000_000)
elapsed time: 0.998272756 seconds (96 bytes allocated)
@time sumtest1(1_000_000_000)
elapsed time: 7.522198967 seconds (208 bytes allocated)
Is this expected? Or am I doing something wrong? I would really like to have my variables indexed because of other reasons too long to explain now, but this performance penalty is something I can't go with.
Let's look at the LLVM code that's being executed for sumtest1:
julia> @code_llvm sumtest1(10^9)
define double @julia_sumtest1_21391(i64) {
top:
  %1 = icmp sgt i64 %0, 0
  %2 = select i1 %1, i64 %0, i64 0
  %3 = icmp eq i64 %2, 0
  br i1 %3, label %L3, label %L.preheader
L.preheader:                                      ; preds = %top
  %4 = icmp sgt i64 %0, 0
  %smax = select i1 %4, i64 %0, i64 0
  br label %L
L:                                                ; preds = %L, %L.preheader
  %lsr.iv = phi i64 [ %smax, %L.preheader ], [ %lsr.iv.next, %L ]
  %s.0 = phi double [ %5, %L ], [ 0.000000e+00, %L.preheader ]
  %5 = fadd double %s.0, 2.800000e+01
  %lsr.iv.next = add i64 %lsr.iv, -1
  %6 = icmp eq i64 %lsr.iv.next, 0
  br i1 %6, label %L3, label %L
L3:                                               ; preds = %L, %top
  %s.1 = phi double [ 0.000000e+00, %top ], [ %5, %L ]
  ret double %s.1
}
That's pretty hard to grok, but one thing stands out in the body of the loop, L:
  %5 = fadd double %s.0, 2.800000e+01
For each iteration, the pre-computed constant 28.0 is being added to the accumulator, s. The compiler can tell that you never change any of the local variables and so it knows that the same sum is being added every time. The only reason that the loop has to execute at all is that repeated floating-point addition is not exactly equivalent to multiplication. If all the local variables are changed to integers, where repeated addition is exactly equivalent to multiplication, the loop is eliminated entirely:
julia> @time sumtest1_int(10^9)
  0.000005 seconds (6 allocations: 192 bytes)
28000000000
julia> @code_llvm sumtest1_int(10^9)
define i64 @julia_sumtest1_int_21472(i64) {
top:
  %1 = icmp slt i64 %0, 1
  br i1 %1, label %L3, label %L.preheader
L.preheader:                                      ; preds = %top
  %2 = icmp sgt i64 %0, 0
  %.op = mul i64 %0, 28
  %3 = select i1 %2, i64 %.op, i64 0
  br label %L3
L3:                                               ; preds = %L.preheader, %top
  %s.1 = phi i64 [ 0, %top ], [ %3, %L.preheader ]
  ret i64 %s.1
}
Which translates roughly back into Julia as:
sumtest1_int(N) = N < 1 ? 0 : ifelse(N > 0, N*28, 0)
That's a little redundant, since the body could be simplified to ifelse(N > 1, N*28, 0) (which could in turn be changed to just 28N since we don't care about negative values of N), but it's still much faster than executing a loop.
The function sumtest2 cannot be analyzed or optimized nearly so easily. It would require proving that the array A can never be changed, which is quite difficult. So the compiler has no choice but to do all the work, which is, of course, much slower than not doing it. In your simulation, it may still be faster to use local variables than storing values in an array, but it may not. You'll have to measure code that does something harder to optimize away completely to be sure.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With