I know that questions about multi-threading performance in Julia have been asked before (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads with Julia v1.5.3, and the speedup doesn't scale well compared to running the same loop with, for instance, Chapel.
I would like to know what I am doing wrong and how I could use multi-threading in Julia more efficiently.
Sequential code:
```julia
using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
```
Time: 8.034s
Code with `Threads.@threads` on 4 threads:

```julia
using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total[])
end

@btime slow(Int64(1e8), "9")
```
Time: 6.938s
Speedup: 1.2
Code with FLoops.jl:

```julia
using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
```
Time: 10.850s
No speedup: slower than the sequential code.
I tested the sequential and `Threads.@threads` code on a different machine and experimented with various numbers of threads. Here are the results:

| Number of threads | Speedup |
|---|---|
| 2 | 1.2 |
| 4 | 1.2 |
| 8 | 1.0 (no speedup) |
| 16 | 0.9 (takes longer than the sequential code) |
For heavier computations (`n = 1e9` in the code above), which would minimize the relative effect of any overhead, the results are very similar:

| Number of threads | Speedup |
|---|---|
| 2 | 1.1 |
| 4 | 1.3 |
| 8 | 1.1 |
| 16 | 0.8 (takes longer than the sequential code) |
Code run with Chapel v1.23.0:
```chapel
use Time;
var watch: Timer;

config const n = 1e8: int;
config const digits = "9";

var total = 0.0;

watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();

writef("total = %{###.###############} in %{##.##} seconds\n",
       total, watch.elapsed());
```
First run (same hardware as the first Julia tests):

| Number of threads | Time (s) | Speedup |
|---|---|---|
| 1 | 13.33 | n/a |
| 2 | 7.34 | 1.8 |

Second run (same hardware):

| Number of threads | Time (s) | Speedup |
|---|---|---|
| 1 | 13.59 | n/a |
| 2 | 6.83 | 2.0 |

Third run (different hardware):

| Number of threads | Time (s) | Speedup |
|---|---|---|
| 1 | 19.99 | n/a |
| 2 | 10.06 | 2.0 |
| 4 | 5.05 | 4.0 |
| 8 | 2.54 | 7.9 |
| 16 | 1.28 | 15.6 |
The number of threads can be specified either as an integer (`--threads=4`) or as `auto` (`--threads=auto`), where `auto` sets the number of threads to the number of local CPU threads. The `-t`/`--threads` command-line argument requires at least Julia 1.5; in older versions you must use the `JULIA_NUM_THREADS` environment variable instead.
So, theoretically, the optimal number of threads is the number of cores on your machine. If your cores are "hyper-threaded" (as Intel calls it), each core can run two threads, and in that case the optimal number of threads is double the number of cores.
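For example (a minimal illustration of the flags described above; the count of 4 is just a placeholder for your core count):

```
$ julia --threads=4          # or: julia -t auto, or: export JULIA_NUM_THREADS=4
julia> Threads.nthreads()    # confirm how many threads this session has
4
```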
Someone can make a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is far too light. Using an atomic add in this case implies huge overhead, because all the threads are constantly waiting on the same atomic variable.
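One way around that contention (a minimal sketch, not code from the question; the name `slow_chunked` and the manual chunking are mine) is to give each task a large chunk and a private accumulator, so shared state is touched only once per chunk instead of once per iteration:

```julia
using Base.Threads

# Sketch: one private accumulator per chunk instead of one shared Atomic.
# Each task sums its own chunk locally; the partial sums are combined at the end.
function slow_chunked(n::Int, digits::String)
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            s = 0.0                     # task-local partial sum, no contention
            for i in chunk
                if !occursin(digits, string(i))
                    s += 1.0 / i
                end
            end
            s
        end
    end
    total = sum(fetch, tasks)           # single reduction over a handful of values
    println("total = ", total)
end
```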
Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:
```julia
julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+, 1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"
```
This was run with 4 threads. I've tested other numbers of threads and confirmed that the scaling is pretty close to linear.
By the way, just as a general tip, you should avoid printing inside benchmarked code: it makes a mess when the code is timed repeatedly, and if your task is fast, STDIO can take a non-negligible share of the time.
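For example (a small illustration of this tip, not code from the answer): return the value and print outside the benchmark. Interpolating the arguments with `$` also keeps `@btime` from measuring access to global variables:

```julia
using BenchmarkTools

# Return the result instead of printing it inside the function.
function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    return total
end

n = 10^6
pat = "9"
total = @btime slow($n, $pat)   # $ interpolates the globals into the benchmark
println("total = ", total)      # print once, outside the timed code
```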
As jling suggests in the comments on their answer, the problem here is most likely that the Julia code allocates lots of memory that needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, and that could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated `Vector{UInt8}` instead of `String`:
```julia
using BenchmarkTools
using Transducers
using Distributed

function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end

println(@btime slow_no_garbage(UInt(1e8), "9"))
```
I do not recommend actually using this code, especially since I never properly clear a thread's buffer between iterations; that only works here because the numbers always grow in length, although it is easily fixed. Still, it results in almost linear scaling with the number of threads (table at the end of the answer).
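For instance (a sketch of that easy fix, assuming `string_vector!` from the snippet above; the helper name is mine), zeroing the tail of the buffer after each write makes the reuse safe regardless of iteration order:

```julia
# Hypothetical helper: write the digits of x, then zero the rest of the buffer
# so stale digit bytes from a previously longer number can never match.
function string_vector_clean!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    string_vector!(a, x)                   # writes into a[1:n]
    fill!(view(a, n+1:length(a)), 0x00)    # clear leftover bytes past position n
    a
end
```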
As jling also mentioned, if a lot of garbage is created, distribution may work better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:
```julia
using BenchmarkTools
using Transducers

function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end

println(@btime slow_transducers(Int64(1e8), "9"))
```
and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):
```julia
using BenchmarkTools
using Transducers
using Distributed

function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxd(+, Map(fun), 1:n)
    "total = $total"
end

addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))
```
The following table shows the results of running all versions with different numbers of threads/processes:
| Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
|---|---|---|---|---|---|
| 1 | 4.242 s | 4.224 s | 4.241 s | 2.743 s | 7.32 s |
| 2 | 2.952 s | 2.958 s | 2.168 s | 1.447 s | 3.73 s |
| 4 | 2.135 s | 2.147 s | 1.163 s | 0.716105 s | 1.9 s |
| 8 | 1.742 s | 1.741 s | 0.859058 s | 0.360469 s | |

Speedup:

| Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
|---|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.43699 | 1.42799 | 1.95618 | 1.89565 | 1.96247 |
| 4 | 1.98689 | 1.9674 | 3.6466 | 3.83044 | 3.85263 |
| 8 | 2.43513 | 2.42619 | 4.9368 | 7.60953 | |