race automatically divides operations on an iterable among several threads. For instance,
(Bool.roll xx 2000).race.sum
should automatically divide the summing of the 2000-element array among 4 threads. However, benchmarks show that this is much slower than not using race at all. This happens even if you make the array bigger.
It also happens even though the non-autothreaded version gets faster with each compiler release. (The auto-threaded version gets faster too, but it remains about twice as slow as not using it.)
So the question is: what is the minimum size of the atomic operation for which auto-threading is worthwhile? Is the overhead added to the sequential operation fixed, or can it be decreased somehow?
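For reference, here is a minimal timing sketch of the comparison described above (the loop count and the use of now for timing are my own illustrative choices, not part of the original measurements):
my @values = Bool.roll xx 2000;
# time the plain, sequential sum
my $start = now;
@values.sum for ^100;
say "sequential: {now - $start} seconds";
# time the auto-threaded sum
$start = now;
@values.race.sum for ^100;
say "race:       {now - $start} seconds";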
Update: in fact, the performance of hyper (similar to race, but with guaranteed ordered results) seems to be getting worse over time, at least for small sizes that are nonetheless integer multiples of the default batch size (64). The same happens with race.
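To make the parenthetical above concrete, a small sketch of my own (not from the original question): hyper keeps results in input order, while race may deliver batches in whatever order they finish:
say (^256).hyper(batch => 64).map(* ** 2).head(5);  # always (0 1 4 9 16)
say (^256).race(batch => 64).map(* ** 2).head(5);   # whichever batch finishes first, so order may differ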
The short answer: .sum isn't smart enough to calculate sums in batches.
So what you're effectively doing in this benchmark is setting up a HyperSeq / RaceSeq without then doing any parallel processing:
dd (Bool.roll xx 2000).race;
# RaceSeq.new(configuration => HyperConfiguration.new(batch => 64, degree => 4))
So you've been measuring the .hyper / .race overhead. You see, at the moment only .map and .grep have been implemented on HyperSeq / RaceSeq. If you give it something to do, like:
# find the 1000th prime number in a single thread
$ time perl6 -e 'say (^Inf).grep( *.is-prime ).skip(999).head'
real 0m1.731s
user 0m1.780s
sys 0m0.043s
# find the 1000th prime number concurrently
$ time perl6 -e 'say (^Inf).hyper.grep( *.is-prime ).skip(999).head'
real 0m0.809s
user 0m2.048s
sys 0m0.060s
As you can see, in this (small) example the concurrent version is more than twice as fast as the non-concurrent one, but it uses more CPU.
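If needed, the batch size and degree of parallelism can be tuned by hand via named arguments to .hyper / .race; for example (these particular values are only an illustration, not a recommendation):
$ time perl6 -e 'say (^Inf).hyper(batch => 32, degree => 8).grep( *.is-prime ).skip(999).head'
Smaller batches get work onto the workers sooner, at the cost of more of the per-batch overhead discussed below.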
Since .hyper and .race were fixed to work correctly, performance has improved slightly.
Other functions, such as .sum, could be implemented for .hyper / .race. However, I would hold off on that for the moment, as it will require a small refactor of the way we do .hyper and .race: at the moment, a batch cannot communicate back to the "supervisor" how quickly it finished its job. The supervisor needs that information if we want to allow it to adjust, e.g., the batch size, should it find out that the default batch size is too small and we have too much overhead.
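Until then, a workaround is to batch the reduction yourself, so that the partial sums run inside .map on the worker threads and only the per-batch results are combined sequentially. A sketch (the chunk size and degree are arbitrary illustrative values):
my @values = Bool.roll xx 2000;
# split into 4 chunks of 500 elements, sum each chunk on its own worker,
# then add up the 4 partial sums sequentially
say @values.batch(500).race(batch => 1, degree => 4).map(*.sum).sum;
This works because .map is one of the methods that is actually parallelized on RaceSeq.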