
Why is this version of a matrix copy so slow?

I've noticed some strange behavior of Julia during a matrix copy.

Consider the following three functions:

function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
  A[1:n,1:n] = B[1:n,1:n]
  return nothing
end

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
  ii = 1; jj = 1;
  while ii <= n
    jj = 1 #(*)
    while jj <= n
      A[jj,ii] = B[jj,ii]
      jj += 1
    end
    ii += 1
  end
  return nothing
end

function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
  A[1:n,1:n] = view(B, 1:n, 1:n)
  return nothing
end    

Edit: 1) I was testing whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The timing results are from the fixed version, so they remain unchanged. 2) I've added the view variant. Thanks to @Colin T Bowers for addressing both issues.

It seems like these functions should compile to more or less the same code. Yet I get for

A = fill!(Matrix{Int32}(2^12,2^12),2); B = Int32.(eye(2^12));

the results

@timev priv_memcopyBtoA!(A,B, 2000)
  0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns):      152511699
bytes allocated:   16000304
pool allocs:       9
malloc() calls:    1
GC pauses:         1

and

@timev priv_memcopyBtoA2!(A,B, 2000)
 0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated:   160
pool allocs:       4

and

@timev priv_memcopyBtoA3!(A,B, 2000)
  0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated:   224
pool allocs:       7

That's a drastic difference, and a surprising one. I expected the first version to behave like memcpy, which is hard to beat for a large memory block.

The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=), and the bounds check in each assignment. Yet each assignment takes just ~3 ns.

Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference shrinks, but it remains: still a factor of ~2.5 between version 3 and version 2.

So why is the "memcopy" version not as efficient as the "assignment" version?

asked May 16 '18 by BlueLemon

2 Answers

Firstly, your code contains a bug. Run this:

A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)

then:

julia> A
2×2 Array{Int64,2}:
 5  2
 7  4

You need to re-assign jj back to 1 at the start of each inner while loop, i.e. once per outer iteration:

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
  ii = 1 
  while ii <= n
    jj = 1
    while jj <= n
      A[jj,ii] = B[jj,ii]
      jj += 1
    end
    ii += 1
  end
  return nothing
end

Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in Julia create temporary arrays. So in this line:

A[1:n,1:n] = B[1:n,1:n]

the right-hand side operation creates a temporary n×n array, which is then assigned to the left-hand side.
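You can observe the temporary directly with `@allocated` from Base. A rough sketch (sizes are illustrative; exact byte counts vary by Julia version):

```julia
A = zeros(Int, 2000, 2000); B = ones(Int, 2000, 2000);

# slice on the right-hand side: allocates a full 2000×2000 temporary
# (~30 MiB for 8-byte Int)
@allocated A[1:2000, 1:2000] = B[1:2000, 1:2000]

# view on the right-hand side: only a small, constant-size allocation
# for the SubArray wrapper
@allocated A[1:2000, 1:2000] = view(B, 1:2000, 1:2000)
```

The ~15 MiB allocation in the question's first timing is exactly this temporary (2000×2000 4-byte elements).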

If you wanted to avoid the temporary array allocation, you would instead write:

A[1:n,1:n] = view(B, 1:n, 1:n)

and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be slightly faster than the view method.

As for the garbage collection numbers, they are just an artifact of your timing method. It is much better to use @btime from the BenchmarkTools package, which uses various tricks to avoid traps like timing garbage collection.
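A minimal sketch of that workflow (assuming BenchmarkTools is installed; the `$` interpolates the globals so the benchmark isn't dominated by untyped-global access):

```julia
using BenchmarkTools

A = zeros(Int, 2^12, 2^12); B = ones(Int, 2^12, 2^12);

@btime priv_memcopyBtoA!($A, $B, 2000)   # slice: allocates a temporary
@btime priv_memcopyBtoA2!($A, $B, 2000)  # explicit while loops
@btime priv_memcopyBtoA3!($A, $B, 2000)  # view: no temporary
```

@btime runs each expression many times and reports the minimum, which filters out GC pauses and one-off compilation cost.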

answered Oct 17 '22 by Colin T Bowers

Why is A[1:n,1:n] = view(B, 1:n, 1:n), or a variant of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.

view returns a SubArray, a lightweight wrapper that contains a reference to the parent B and information on how to compute the indices that should be copied. A[1:n,1:n] = ... is lowered to a call to _setindex!(...). After that, a few calls down the call chain, the main work is done by:

# abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end
# multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1,tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end

getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex computes from the view V and the index I the indices of the corresponding element in B. That element is called r and is returned. Finally, r is written to A.

After each copy, inc increments the index I to the next element of A and tests whether we are done. Note that the code is from v0.6.3, but on master it's more or less the same.

In principle, the code could be reduced to a set of while loops, yet it is more general. It works for arbitrary views of B, for arbitrary slices of the form a:b:c, and for an arbitrary number of matrix dimensions N; in our case N is 2.
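Hand-specialized for N = 2, the recursive inc above collapses to something like the following sketch (inc2 is a hypothetical name for illustration):

```julia
# What the generic `inc` does for a 2-D index, written out by hand:
function inc2(state::NTuple{2,Int}, start::NTuple{2,Int}, stop::NTuple{2,Int})
    if state[1] < stop[1]
        return (state[1] + 1, state[2])  # advance along the first (fastest) dimension
    end
    (start[1], state[2] + 1)            # wrap the first dimension, advance the second
end
```

This is exactly the jj/ii bookkeeping of the while-loop version, but reached through recursion and tuple manipulation that the compiler must first specialize and inline.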

Since these functions are more complex, the compiler doesn't optimize them as well. In particular, there is a recommendation (@inline) that the compiler should inline them, but it doesn't, which shows that the functions are non-trivial.

For a set of loops, the compiler reduces the innermost loop to three additions (one each for the pointers into A and B, and one for the loop index) and a single copy instruction.

tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already the optimized special case.

Note that the performance depends on the compiler. In the one-dimensional case, A[1:n] = view(B, 1:n) is faster than a while loop because the compiler vectorizes the code. Yet for higher dimensions N > 2, the difference grows.
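To check the one-dimensional claim yourself, a sketch (assuming BenchmarkTools; copy_loop! is a helper written for this comparison):

```julia
using BenchmarkTools

# scalar while loop, 1-D analogue of priv_memcopyBtoA2!
function copy_loop!(A::Vector{Int}, B::Vector{Int}, n::Int)
    i = 1
    while i <= n
        A[i] = B[i]
        i += 1
    end
    return nothing
end

A = zeros(Int, 10^6); B = ones(Int, 10^6);

@btime $A[1:10^6] = view($B, 1:10^6)  # contiguous 1-D copy: vectorizes well
@btime copy_loop!($A, $B, 10^6)       # element-by-element while loop
```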

answered Oct 17 '22 by BlueLemon