I am trying to optimize the simulation of a simple dynamical system in which the response of a network as well as its parameters (weights) evolve according to simple linear equations. The simulation needs to run for tens of millions of time steps, but the network size will typically be small. Hence, performance is bound less by matrix-vector products than by temporary arrays, bounds checks, and other less visible factors. Since I am new to Julia, I'd appreciate any hints to further optimize performance.
using Devectorize

function train_network(A, T, Of, cs, dt)
    N, I = size(T)
    z = zeros(I)
    r = zeros(N)
    @inbounds for t in 1:size(cs, 1)
        # precompute
        Az = A*z
        Ofr = Of*r
        # compute training signal
        @devec z += dt.*(Az + cs[t] - 0.5.*z)
        I_teach = T*(Az + cs[t])
        Tz = T*z
        # rate updates
        @devec r += dt.*(I_teach - Ofr - 0.1.*r)
        # weight updates
        for i in 1:I
            @devec T[:, i] += dt.*1e-3.*(z[i].*r - T[:, i])
        end
        for n in 1:N
            @devec Of[:, n] += dt.*1e-3.*(Tz.*r[n] - Of[:, n])
        end
    end
end
# init parameters
N, I = 20, 2
dt = 1e-3
# init weights
T = rand(N, I)*N
A = rand(I, I)
Of = rand(N, N)/N
# simulation time & input
sim_T = 2000
ts = 0:dt:sim_T
cs = randn(size(ts, 1), I)
Timing the network (2,000,000 steps) with
@time train_network(A, T, Of, cs, dt)
yields the timings
3.420486 seconds (26.12 M allocations: 2.299 GB, 6.65% gc time)
Update 1
Following the advice by David Sanders, I got rid of the devec macro and wrote out the loops. This indeed reduces array allocations and boosts performance by about 25%; here are the new numbers:
2.648113 seconds (18.00 M allocations: 1.669 GB, 5.60% gc time)
The smaller the network size, the bigger the boost. A gist of the updated simulation code can be found here.
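To illustrate, the devectorized form of the rate update is just an explicit elementwise loop (a sketch in current Julia syntax, not the exact gist code; names follow the listing above):

```julia
# Explicit-loop version of
#   @devec r += dt.*(I_teach - Ofr - 0.1.*r)
# Updating element by element avoids allocating temporary arrays.
function rate_update!(r, I_teach, Ofr, dt)
    @inbounds for n in eachindex(r)
        r[n] += dt * (I_teach[n] - Ofr[n] - 0.1 * r[n])
    end
    return r
end
```

The other @devec lines unroll the same way.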
Update 2
Most of the memory allocation comes from the matrix-vector products. So, to get rid of those, I replaced the products with the in-place BLAS operation BLAS.gemv!, which reduces timings by another 25% and memory allocation by 90%:
1.990031 seconds (2.00 M allocations: 152.589 MB, 0.69% gc time)
Updated code here.
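The pattern is to allocate the result buffers once before the time loop and overwrite them on every step (a sketch; in current Julia the BLAS module lives in LinearAlgebra):

```julia
using LinearAlgebra

A  = rand(2, 2)
z  = rand(2)
Az = zeros(2)   # preallocated once, outside the time loop

# Az = 1.0*A*z + 0.0*Az, written into the existing buffer:
# no temporary vector is allocated per step.
BLAS.gemv!('N', 1.0, A, z, 0.0, Az)
```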
Update 3
The largest rank-1 update can also be replaced by two calls to in-place BLAS functions, namely BLAS.scal! for scaling and BLAS.ger! for the rank-1 update. The caveat is that both calls are fairly slow if more than one thread is used (a problem with OpenBLAS?), so it is best to set
blas_set_num_threads(1)
There is a 15% gain in timings for a network size of 20, and a gain of 50% for a network of size 50. There are no more memory allocations, and the new timings are
1.638287 seconds (11 allocations: 1.266 KB)
Again, the updated code can be found here.
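Concretely, since the update Of[:, n] += η*(Tz*r[n] - Of[:, n]) over all columns n is Of = (1-η)*Of + η*Tz*r', it splits into a scaling plus a rank-1 update (a sketch in current Julia syntax, where blas_set_num_threads is now BLAS.set_num_threads):

```julia
using LinearAlgebra

η  = 1e-3 * 1e-3   # dt * learning rate, as in the code above
Of = rand(3, 3)
Tz = rand(3)
r  = rand(3)

ref = (1 - η) .* Of .+ η .* (Tz * r')   # allocating reference, for checking

# vec(Of) shares memory with Of, so both calls mutate Of without allocating:
BLAS.scal!(length(Of), 1 - η, vec(Of), 1)   # Of .*= (1 - η) in place
BLAS.ger!(η, Tz, r, Of)                      # Of .+= η .* Tz * r' in place
```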
Update 4
I wrote a basic Cython script to compare the results so far. The main difference is that I do not use any calls to BLAS but use explicit loops: injecting low-level BLAS calls is a pain in Cython, and calls to numpy.dot have too much overhead for small network sizes (I tried...). Timings are
CPU times: user 3.46 s, sys: 6 ms, total: 3.47 s, Wall time: 3.47 s
which is roughly the same as the original Julia version (from which, so far, 50% has been shaved off).
Although you are using the Devectorize.jl
package, I suggest that you just write out all of those vectorized operations explicitly as simple loops. I expect that this will give you a significant performance boost.
The Devectorize
package is certainly a great contribution, but to see the hoops it jumps through to do the dirty work for you, you can do something like this (an example from the package README):
using Devectorize
a = rand(2,2);
b = rand(2,2);
c = rand(2,2);
julia> macroexpand(:(@devec r = exp(a + b) .* sum(c)))
Here, macroexpand
is a function that tells you the code to which the @devec
macro expands its argument (the code on the rest of the line).
I won't spoil the surprise by showing the output here, but it is not just the simple for
loop that you would write by hand.
Furthermore, the fact that you have huge allocation suggests that not all of the vector operations are being correctly dealt with.
By the way, don't forget to do a small run first so you are not timing the compilation step.
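That is, call the function once on (small) inputs first so that JIT compilation is not included in the measurement; a minimal sketch:

```julia
# Hypothetical stand-in for train_network, just to show the timing pattern.
f(x) = sum(x .^ 2)

f(rand(10))           # warm-up call: triggers compilation
@time f(rand(10^6))   # this now measures execution, not compilation
```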
[Tangential note: here, exp
is the function that applies the usual exponential function to each element of a matrix, equivalent to map(exp, a+b)
. expm
gives the exponential of a matrix. There has been talk of deprecating such uses of exp
.]
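(That deprecation did eventually happen: in current Julia, elementwise application is written with the broadcast dot, and plain exp on a square matrix is the matrix exponential. A sketch:)

```julia
using LinearAlgebra

Z = zeros(2, 2)
exp.(Z)   # elementwise: every entry becomes exp(0) = 1
exp(Z)    # matrix exponential: exp of the zero matrix is the identity
```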