I am pretty new to Raku and I have a question about functional methods, in particular about reduce. I originally had the method:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    for @_ {
        $foo += ($_ - $mittel)**2;
    }
    $foo = sqrt($foo / @_.elems);
}
and it worked fine. Then I started to use reduce:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    $foo = @_.reduce({ $^a + ($^b - $mittel)**2 });
    $foo = sqrt($foo / @_.elems);
}
My execution time doubled (I am applying this to roughly 1000 elements) and the result differed by 0.004 (I guess a rounding error). If I use .race.reduce(...), my execution time is 4 times higher than with the original sequential code. Can someone tell me the reason for this? I thought about parallelism initialization time, but, as I said, I am applying this to 1000 elements, and if I change other for loops in my code to reduce it gets even slower!
Thanks for your help
In general, reduce and for do different things, and they are doing different things in your code. For example, compared with your for code, your reduce code involves twice as many arguments being passed and is doing one less iteration. I think that's likely at the root of the 0.004 difference.
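To make that concrete, here's a minimal sketch with made-up data (the mean is computed inline, standing in for your mittel sub). It shows that the unseeded reduce treats the first element differently from the for loop, and that seeding the reduction with 0 makes the two agree:

my @data   = 1, 4, 7, 10;
my $mittel = @data.sum / @data.elems;   # 5.5

# Unseeded: the first element is taken as-is and never gets ($_ - $mittel)**2
say @data.reduce({ $^a + ($^b - $mittel)**2 });        # 25.75

# Seeded with 0: every element is treated exactly as in the for loop
say (0, |@data).reduce({ $^a + ($^b - $mittel)**2 });  # 45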
Even if your for and reduce code did the same thing, an optimized version of such reduce code would never be faster than an equally optimized version of equivalent for code.
I thought that race didn't automatically parallelize reduce due to reduce's nature. (Though I see per your and @user0721090601's comments that I'm wrong.) But it will incur overhead -- currently a lot.
You could use race to parallelize your for loop instead, if it's slightly rewritten. That might speed it up.
for and reduce code

Here's the difference I meant:
say do for <a b c d> { $^a } # (a b c d) (4 iterations)
say do reduce <a b c d>: { $^a, $^b } # (((a b) c) d) (3 iterations)
For more details of their operation, see their respective docs (for, reduce).
You haven't shared your data, but I will presume that the for and/or reduce computations involve Nums (floats). Addition of floats isn't associative, so you may well get (typically small) discrepancies if the additions end up happening in a different order. I presume that explains the 0.004 difference.
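Here's a tiny illustration of that order sensitivity (this is standard IEEE floating-point behaviour, nothing Raku-specific; the e0 suffix forces Num literals rather than Rats):

say (0.1e0 + 0.2e0) + 0.3e0;                              # 0.6000000000000001
say 0.1e0 + (0.2e0 + 0.3e0);                              # 0.6
say (0.1e0 + 0.2e0) + 0.3e0 == 0.1e0 + (0.2e0 + 0.3e0);   # False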
reduce being 2X slower than your for

"my execution time doubled (I am applying this to roughly 1000 elements)"
First, your reduce code is different, as explained above. There are general abstract differences (e.g. taking two arguments per call instead of your for block's one) and perhaps your specific data leads to fundamental numeric computation differences (perhaps your for loop computation is primarily integer or float math while your reduce is primarily rational?). That might explain the execution time difference, or some of it.
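If you want to check what kinds of numbers your computation is actually pushing around, .WHAT on intermediate values tells you (decimal literals are Rats in Raku, e-notation literals are Nums, and sqrt always returns a Num):

say 2.5.WHAT;             # (Rat)
say 2.5e0.WHAT;           # (Num)
say ((2.5 - 1)**2).WHAT;  # (Rat)
say sqrt(2.25).WHAT;      # (Num)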
Another part of it may be the difference between, on the one hand, a reduce, which will by default compile into calls of a closure, with call overhead, two arguments per call, and temporary memory storing intermediate results, and, on the other, a for, which will by default compile into direct iteration, with the {...} being just inlined code rather than a call of a closure. (That said, it's possible a reduce will sometimes compile to inlined code; and it may even already be that way for your code.)
More generally, Rakudo optimization effort is still in its relatively early days. Most of it has been generic, speeding up all code. Where effort has been applied to particular constructs, the most widely used constructs have gotten the attention so far, and for is widely used while reduce is less so. So some or all of the difference may just be that reduce is poorly optimized.
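If you want to see the gap on your own machine, a rough comparison (not a rigorous benchmark; timings depend on hardware and Rakudo version) could look like this:

my @nums   = (^1000).map({ 1e3.rand });
my $mittel = @nums.sum / @nums.elems;

my $start   = now;
my $via-for = 0;
for @nums { $via-for += ($_ - $mittel)**2 }
say "for:    { now - $start } seconds";

$start = now;
my $via-reduce = (0, |@nums).reduce({ $^a + ($^b - $mittel)**2 });
say "reduce: { now - $start } seconds";

say $via-for == $via-reduce;   # True: same additions, same order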
reduce with race

"my execution time [for .race.reduce(...)] is 4 times higher than with the original sequential code"
I didn't think reduce would be automatically parallelizable with race. Per its doc, reduce works by "iteratively applying a function which knows how to combine two values", and one argument in each iteration is the result of the previous iteration. So it seemed to me it must be done sequentially.
(I see in the comments that I'm misunderstanding what could be done by a compiler with a reduction. Perhaps this is possible if the operation is associative?)
In summary, your code is incurring racing's overhead without gaining any benefit.
race in general

Let's say you're using some operation that is parallelizable with race.
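map is the standard example of such an operation, since each element is processed independently (note that race, unlike hyper, does not promise to keep results in their original order):

my @roots = (1..100_000).race.map(*.sqrt);
say @roots.elems;   # 100000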
First, as you noted, race incurs overhead. There'll be an initialization and teardown cost, at least some of which is paid repeatedly for each evaluation of an overall statement/expression that's being raced.
Second, at least for now, race means using threads running on CPU cores. For some payloads that can yield a useful benefit despite any initialization and teardown costs. But it will, at best, be a speed-up equal to the number of cores.
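You can ask the runtime how many cores it sees, which is that upper bound:

say $*KERNEL.cpu-cores;   # e.g. 8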
(One day it should be possible for compiler implementors to spot that a raced for loop is simple enough to be run on a GPU rather than a CPU, and go ahead and send it to a GPU to achieve a spectacular speed-up.)
Third, if you literally write .race.foo... you'll get default settings for some tunable aspects of the racing. The defaults are almost certainly not optimal and may be way off.
The currently tunable settings are :batch and :degree. See their doc for more details.
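For example (the numbers below are only placeholders; good values depend on your data and hardware):

my @data = (^1000).map({ 1e3.rand });
# :batch = how many elements a worker takes per work unit; :degree = how many workers
my @squared = @data.race(:batch(32), :degree(4)).map(* ** 2);
say @squared.elems;   # 1000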
More generally, whether parallelization speeds up code depends on the details of a specific use case such as the data and hardware in use.
race with for

If you rewrite your code a bit you can race your for:
$foo = sum do race for @_ { ($_ - $mittel)**2 }
To apply tuning you must repeat the race as a method, for example:
$foo = sum do race for @_.race(:degree(8)) { ($_ - $mittel)**2 }
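Putting that together, a sketch of the whole sub with a raced for loop might look like this (the mean is computed inline here as a stand-in for the original mittel sub, and the signature takes @values explicitly instead of using @_):

sub standardab(@values) {
    my $mittel = @values.sum / @values.elems;               # stand-in for mittel(@values)
    my $foo    = sum do race for @values { ($_ - $mittel)**2 };
    sqrt($foo / @values.elems);
}

say standardab((^1000).map({ 1e3.rand }));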