As an exercise, I took these Scala and Java examples of Akka to port to Frege. While it works fine, it runs slower(11s) than Scala(540ms) counterpart.
module mmhelloworld.akkatutorialfregecore.Pi where
import mmhelloworld.akkatutorialfregecore.Akka
data PiMessage = Calculate |
Work {start :: Int, nrOfElements :: Int} |
Result {value :: Double} |
PiApproximation {pi :: Double, duration :: Duration}
data Worker = private Worker where
calculatePiFor :: Int -> Int -> Double
calculatePiFor !start !nrOfElements = loop start nrOfElements 0.0 f where
loop !curr !n !acc f = if n == 0 then acc
else loop (curr + 1) (n - 1) (f acc curr) f
f !acc !i = acc + (4.0 * fromInt (1 - (i `mod` 2) * 2) / fromInt (2 * i + 1))
onReceive :: Mutable s UntypedActor -> PiMessage -> ST s ()
onReceive actor Work{start=start, nrOfElements=nrOfElements} = do
sender <- actor.sender
self <- actor.getSelf
sender.tellSender (Result $ calculatePiFor start nrOfElements) self
data Master = private Master {
nrOfWorkers :: Int,
nrOfMessages :: Int,
nrOfElements :: Int,
listener :: MutableIO ActorRef,
pi :: Double,
nrOfResults :: Int,
workerRouter :: MutableIO ActorRef,
start :: Long } where
initMaster :: Int -> Int -> Int -> MutableIO ActorRef -> MutableIO UntypedActor -> IO Master
initMaster nrOfWorkers nrOfMessages nrOfElements listener actor = do
props <- Props.forUntypedActor Worker.onReceive
router <- RoundRobinRouter.new nrOfWorkers
context <- actor.getContext
workerRouter <- props.withRouter router >>= (\p -> context.actorOf p "workerRouter")
now <- currentTimeMillis ()
return $ Master nrOfWorkers nrOfMessages nrOfElements listener 0.0 0 workerRouter now
onReceive :: MutableIO UntypedActor -> Master -> PiMessage -> IO Master
onReceive actor master Calculate = do
self <- actor.getSelf
let tellWorker start = master.workerRouter.tellSender (work start) self
work start = Work (start * master.nrOfElements) master.nrOfElements
forM_ [0 .. master.nrOfMessages - 1] tellWorker
return master
onReceive actor master (Result newPi) = do
let (!newNrOfResults, !pi) = (master.nrOfResults + 1, master.pi + newPi)
when (newNrOfResults == master.nrOfMessages) $ do
self <- actor.getSelf
now <- currentTimeMillis ()
duration <- Duration.create (now - master.start) TimeUnit.milliseconds
master.listener.tellSender (PiApproximation pi duration) self
actor.getContext >>= (\context -> context.stop self)
return master.{pi=pi, nrOfResults=newNrOfResults}
data Listener = private Listener where
onReceive :: MutableIO UntypedActor -> PiMessage -> IO ()
onReceive actor (PiApproximation pi duration) = do
println $ "Pi approximation: " ++ show pi
println $ "Calculation time: " ++ duration.toString
actor.getContext >>= ActorContext.system >>= ActorSystem.shutdown
calculate nrOfWorkers nrOfElements nrOfMessages = do
system <- ActorSystem.create "PiSystem"
listener <- Props.forUntypedActor Listener.onReceive >>= flip system.actorOf "listener"
let constructor = Master.initMaster nrOfWorkers nrOfMessages nrOfElements listener
newMaster = StatefulUntypedActor.new constructor Master.onReceive
factory <- UntypedActorFactory.new newMaster
masterActor <- Props.fromUntypedFactory factory >>= flip system.actorOf "master"
masterActor.tell Calculate
getLine >> return () --Not to exit until done
main _ = calculate 4 10000 10000
Am I doing something wrong with Akka or is it something to do with laziness in Frege for being slow? For example, when I initially had fold
(strict fold) in place of loop
in Worker.calculatePiFor
, it took 27s.
Dependencies:
I am not exactly familiar with Actors, but assuming that the tightest loop is indeed loop
you could avoid passing function f
as argument.
For one, applications of passed functions cannot take advantage of the strictness of the actual passed function. Rather, code generation must assume conservatively that the passed function takes its arguments lazily and returns a lazy result.
Second, in our case you use f
really just once here, so one can inline it. (This is how it is done in the scala code in the article you linked.)
Look at the code generated for the tail recursion in the following sample code that mimics yours:
test b c = loop 100 0 f
where
loop 0 !acc f = acc
loop n !acc f = loop (n-1) (acc + f (acc-1) (acc+1)) f -- tail recursion
f x y = 2*x + 7*y
We get there:
// arg2$f is the accumulator
arg$2 = arg$2f + (int)frege.runtime.Delayed.<java.lang.Integer>forced(
f_3237.apply(PreludeBase.INum_Int._minusƒ.apply(arg$2f, 1)).apply(
PreludeBase.INum_Int._plusƒ.apply(arg$2f, 1)
).result()
);
You see here that f
is called lazily which causes all the argument expressios to also be computed lazily. Note the number of method calls this requires!
In your case the code should still be something like:
(double)Delayed.<Double>forced(f.apply(acc).apply(curr).result())
This means, two closures are build with the boxed values acc and curr and then the result is computed, i.e. the function f
gets called with the unboxed arguments, and the result gets again boxed, just to get unboxed again (forced) for the next loop.
Now compare the following, where we just do not pass f
but call it directly:
test b c = loop 100 0
where
loop 0 !acc = acc
loop n !acc = loop (n-1) (acc + f (acc-1) (acc+1))
f x y = 2*x + 7*y
We get:
arg$2 = arg$2f + f(arg$2f - 1, arg$2f + 1);
Much better! Finally, in the case above we can do without a function call at all:
loop n !acc = loop (n-1) (acc + f) where
f = 2*x + 7*y
x = acc-1
y = acc+1
And this gets:
final int y_3236 = arg$2f + 1;
final int x_3235 = arg$2f - 1;
...
arg$2 = arg$2f + ((2 * x_3235) + (7 * y_3236));
Please try this out and let us know what happens. The main boost in performance should come from not passing f
, whereas the inlining will probably be done in the JIT anyway.
The additional cost with fold
is probably because you also had to create some list before applying it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With