Multi-threaded parallelism performance problem with Fibonacci sequence in Julia (1.3)

I'm trying out the multithreading functionality of Julia 1.3 on the following hardware:

Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed:    2.8 GHz
Number of Processors:   1
Total Number of Cores:  4
L2 Cache (per Core):    256 KB
L3 Cache:   6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB

When running the following script:

function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end
@time F(43)

it gives me the following output

2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437

However, when running the following code (copied from the Julia page about multithreading),

import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end

fib(43)

what happens is that RAM/CPU utilisation jumps from 3.2 GB / 6% to 15 GB / 25% with no output (for at least a minute, after which I decided to kill the Julia session).

What am I doing wrong?

asked Nov 27 '19 by ecjb

1 Answer

Great question.

This multithreaded implementation of the Fibonacci function is not faster than the single-threaded version. It was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that they allow spawning many, many threads in different functions and that the scheduler will figure out an optimal workload.

The problem is that @spawn has a non-trivial overhead of around 1 µs, so if you spawn a thread to do a task that takes less than 1 µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something on the order of 1.6180^43 threads. If each one takes 1 µs to spawn, that's around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and to re-merge / synchronize the threads, which takes even longer.
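If you're curious, you can get a rough feel for that overhead on your own machine. As a rough illustration (not part of the benchmarks below; the exact numbers will vary with hardware and Julia version), time a spawn-and-fetch of a task that does essentially nothing:

import Base.Threads.@spawn
using BenchmarkTools

# Each spawn/fetch round trip should land in the rough vicinity of a microsecond;
# that's the per-task overhead discussed above.
@btime fetch(@spawn nothing)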

Things like this where you spawn a thread for each step of a computation only make sense if each step of the computation takes a long time compared to the @spawn overhead.

Note that there is work going into lessening the overhead of @spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
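Also note that @spawn can only make use of however many threads Julia was started with. On Julia 1.3 that's controlled by an environment variable at startup (just a sketch; pick a count that matches your machine):

$ JULIA_NUM_THREADS=4 julia

julia> Threads.nthreads()
4

If Threads.nthreads() returns 1, the threaded fib below will only add overhead without any parallelism.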


If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get

function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end


julia> @btime F(23);
  122.920 μs (0 allocations: 0 bytes)

so that's a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:

function fib(n::Int)
    if n < 2
        return n
    elseif n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return fib(n-1) + fib(n-2)
    end
end

Now, if I follow proper benchmark methodology with BenchmarkTools.jl [2], I find

julia> using BenchmarkTools

julia> @btime fib(43)
  971.842 ms (1496518 allocations: 33.64 MiB)
433494437

julia> @btime F(43)
  1.866 s (0 allocations: 0 bytes)
433494437

@Anush asks in the comments: "This is a factor of 2 speed-up using 16 cores, it seems. Is it possible to get something closer to a factor of 16 speed-up?"

Yes, it is. The problem with the above function is that its body is larger than that of F, with lots of conditionals, function / thread spawning, and so on. I invite you to compare @code_llvm F(10) and @code_llvm fib(10). This means that fib is much harder for Julia to optimize, and that extra overhead makes a world of difference in the small-n cases.
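For reference, those are ordinary REPL calls (the LLVM IR dumps are long, so the output is omitted here):

julia> @code_llvm F(10)    # compact, branch-light IR

julia> @code_llvm fib(10)  # much larger body: the extra cutoff branch plus the task spawn/fetch machinery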

julia> @btime F(20);
  28.844 μs (0 allocations: 0 bytes)

julia> @btime fib(20);
  242.208 μs (20 allocations: 320 bytes)

Oh no! All that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n <= 23, don't recurse down to fib; instead, call the single-threaded F.

function fib(n::Int)
    if n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return F(n)
    end
end

julia> @btime fib(43)
  138.876 ms (185594 allocations: 13.64 MiB)
433494437

which gives a result closer to what we'd expect for so many threads.
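(For the record, that's roughly 1.866 s / 138.876 ms ≈ 13.4x, which is in the right ballpark for 16 physical cores; the remaining gap is presumably the leftover spawn/fetch overhead and the serial work below the cutoff.)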

[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/

[2] The @btime macro from BenchmarkTools.jl runs the function many times, excludes compilation time, and reports the minimum time rather than a single noisy measurement.

answered Sep 22 '22 by Mason