I try to run the following function in julia command, but when timing the function I see too much memory allocations which I can't figure out why.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function in the julia command on my laptop, I get the following shocking numbers:
julia> @time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.
I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB allocations, taking you up to 24GB. I haven't looked in detail on the remaining code to get you from 24GB to 30GB allocations, but this should show you that it's not hard for the GB allocations to start adding up in your code.
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.
aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using this property we know that y = s+n is really y = sqrt(snr) * randn(L) + randn(L) and so we can instead do y = rvvar*randn(L) where rvvar= sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is because if you make all of your random numbers all at once it's faster since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl where you can use rands = ChunkedArray(randn,L) to initialize it and then everytime you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but like this you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
@inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
@inbounds Pd[m] = i/iters
end
end
which runs in half the time as using two randn calls. Indeed from the ProfileViewer we get:
@profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()

I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to to VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster psudo-rngs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With