I'm trying the multithread function of Julia 1.3
with the following Hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
if n < 2
return n
else
return F(n-1)+F(n-2)
end
end
@time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However when running the following code copied from the Julia page about multithreading
import Base.Threads.@spawn
function fib(n::Int)
if n < 2
return n
end
t = @spawn fib(n - 2)
return fib(n - 1) + fetch(t)
end
fib(43)
what happens is that the utilisation of RAM/CPU jumps from 3.2GB/6% to 15GB/25% without any output (for at least 1 minute, after which i decided to kill the julia session)
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that it allows for spawning many many threads in different functions and the scheduler will figure out an optimal workload.
The problem is that @spawn
has a non-trivial overhead of around 1µs
, so if you spawn a thread to do a task that takes less than 1µs
, you've probably hurt your performance. The recursive definition of fib(n)
has exponential time complexity of order 1.6180^n
[1], so when you call fib(43)
, you spawn something of order 1.6180^43
threads. If each one takes 1µs
to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and re-merge / sync threads which takes even more time.
Things like this where you spawn a thread for each step of a computation only make sense if each step of the computation takes a long time compared to the @spawn
overhead.
Note that there is work going into lessening the overhead of @spawn
, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib
implementation.
If you're curious about how we could modify the threaded fib
function to actually be beneficial, the easiest thing to do would be to only spawn a fib
thread if we think it will take significantly longer than 1µs
to run. On my machine (running on 16 physical cores), I get
function F(n)
if n < 2
return n
else
return F(n-1)+F(n-2)
end
end
julia> @btime F(23);
122.920 μs (0 allocations: 0 bytes)
so thats a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:
function fib(n::Int)
if n < 2
return n
elseif n > 23
t = @spawn fib(n - 2)
return fib(n - 1) + fetch(t)
else
return fib(n-1) + fib(n-2)
end
end
now, if I follow proper benchmark methodology with BenchmarkTools.jl [2] I find
julia> using BenchmarkTools
julia> @btime fib(43)
971.842 ms (1496518 allocations: 33.64 MiB)
433494437
julia> @btime F(43)
1.866 s (0 allocations: 0 bytes)
433494437
@Anush asks in the comments: This is a factor of 2 speed up using 16 cores it seems. Is it possible to get something closer to a factor of 16 speed up?
Yes it is. The problem with the above function is that the function body is larger than that of F
, with lots of conditionals, function / thread spawning and all that. I invite you to compare @code_llvm F(10)
@code_llvm fib(10)
. This means that fib
is much harder for julia to optimize. This extra overhead it makes a world of difference for the small n
cases.
julia> @btime F(20);
28.844 μs (0 allocations: 0 bytes)
julia> @btime fib(20);
242.208 μs (20 allocations: 320 bytes)
Oh no! all that extra code that never gets touched for n < 23
is slowing us down by an order of magnitude! There's an easy fix though: when n < 23
, don't recurse down to fib
, instead call the single threaded F
.
function fib(n::Int)
if n > 23
t = @spawn fib(n - 2)
return fib(n - 1) + fetch(t)
else
return F(n)
end
end
julia> @btime fib(43)
138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The BenchmarkTools @btime
macro from BenchmarkTools.jl will run functions multiple times, skipping the compilation time and average results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With