I have been trying to design a system which allows a large number of concurrent users to be represented in memory at the same time. When setting out to design this system, I immediately thought of some sort of actor-based solution akin to Erlang.
The system has to be done in .NET, so I started working on a prototype in F# using MailboxProcessor but have run into serious performance problems with it. My initial idea was to use one actor (MailboxProcessor) per user to serialize the communication for that user.
I have isolated a small piece of code that reproduces the problem I am seeing:
open System.Threading;
open System.Diagnostics;

type Inc() =
    let mutable n = 0;
    let sw = new Stopwatch()
    member x.Start() =
        sw.Start()
    member x.Increment() =
        if Interlocked.Increment(&n) >= 100000 then
            printf "UpdateName Time %A" sw.ElapsedMilliseconds

type Message
    = UpdateName of int * string

type User = {
    Id : int
    Name : string
}

[<EntryPoint>]
let main argv =
    let sw = Stopwatch.StartNew()
    let incr = new Inc()
    let mb =
        Seq.initInfinite(fun id ->
            MailboxProcessor<Message>.Start(fun inbox ->
                let rec loop user =
                    async {
                        let! m = inbox.Receive()
                        match m with
                        | UpdateName(id, newName) ->
                            let user = {user with Name = newName};
                            incr.Increment()
                            do! loop user
                    }
                loop {Id = id; Name = sprintf "User%i" id}
            )
        )
        |> Seq.take 100000
        |> Array.ofSeq
    printf "Create Time %i\n" sw.ElapsedMilliseconds
    incr.Start()
    for i in 0 .. 99999 do
        mb.[i % mb.Length].Post(UpdateName(i, sprintf "User%i-UpdateName" i));
    System.Console.ReadLine() |> ignore
    0
Just creating the 100k actors takes around 800ms on my quad-core i7. Then submitting the UpdateName message to each one of the actors and waiting for them to complete takes about 1.8 seconds.
Now, I realize there is overhead from all the queuing on the ThreadPool, setting/resetting AutoResetEvents, etc. internally in the MailboxProcessor. But is this really the expected performance? From reading both MSDN and various blogs on the MailboxProcessor, I have gotten the idea that it is meant to be akin to Erlang actors, but from the abysmal performance I am seeing, this doesn't seem to hold true in reality?
I also tried a modified version of the code, which uses 8 MailboxProcessors, each of which holds a Map<int, User> used to look up a user by id. It yielded some improvement, bringing the total time for the UpdateName operation down to 1.2 seconds, but it still feels very slow. The modified code is here:
open System.Threading;
open System.Diagnostics;

type Inc() =
    let mutable n = 0;
    let sw = new Stopwatch()
    member x.Start() =
        sw.Start()
    member x.Increment() =
        if Interlocked.Increment(&n) >= 100000 then
            printf "UpdateName Time %A" sw.ElapsedMilliseconds

type Message
    = CreateUser of int * string
    | UpdateName of int * string

type User = {
    Id : int
    Name : string
}

[<EntryPoint>]
let main argv =
    let sw = Stopwatch.StartNew()
    let incr = new Inc()
    let mb =
        Seq.initInfinite(fun id ->
            MailboxProcessor<Message>.Start(fun inbox ->
                let rec loop users =
                    async {
                        let! m = inbox.Receive()
                        match m with
                        | CreateUser(id, name) ->
                            do! loop (Map.add id {Id=id; Name=name} users)
                        | UpdateName(id, newName) ->
                            match Map.tryFind id users with
                            | None ->
                                do! loop users
                            | Some(user) ->
                                incr.Increment()
                                do! loop (Map.add id {user with Name = newName} users)
                    }
                loop Map.empty
            )
        )
        |> Seq.take 8
        |> Array.ofSeq
    printf "Create Time %i\n" sw.ElapsedMilliseconds
    for i in 0 .. 99999 do
        mb.[i % mb.Length].Post(CreateUser(i, sprintf "User%i-UpdateName" i));
    incr.Start()
    for i in 0 .. 99999 do
        mb.[i % mb.Length].Post(UpdateName(i, sprintf "User%i-UpdateName" i));
    System.Console.ReadLine() |> ignore
    0
So my question is: am I doing something wrong? Have I misunderstood how the MailboxProcessor is supposed to be used? Or is this performance what is expected?
Update:
So I got hold of some guys on ##fsharp @ irc.freenode.net, who informed me that sprintf is very slow, and as it turns out, that is where a large part of my performance problems were coming from. But after removing the sprintf operations above and just using the same name for every User, I still end up with about 400ms for doing the operations, which feels really slow.
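To see the formatting cost in isolation, here is a minimal sketch (not from the original post) that builds the same 100,000 user names with sprintf and with plain string concatenation, and times both. The time helper and the choice of concatenation as the alternative are assumptions for illustration; absolute timings will depend on the machine and the FSharp.Core version.

```fsharp
open System.Diagnostics

// Hypothetical helper: run f once and print how long it took.
let time label (f: unit -> string[]) =
    let sw = Stopwatch.StartNew()
    let result = f ()
    printfn "%s: %fs" label sw.Elapsed.TotalSeconds
    result

let n = 100000

// Builds each name through F#'s general-purpose printf formatting.
let viaSprintf = time "sprintf" (fun () -> Array.init n (fun i -> sprintf "User%i" i))

// Builds the same names with Int32.ToString and string concatenation.
let viaConcat = time "concat" (fun () -> Array.init n (fun i -> "User" + string i))
```

Both arrays contain identical strings; only the cost of producing them differs, so any gap between the two printed times is formatting overhead rather than MailboxProcessor overhead.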
Answer:
> Now, I realize there is overhead from all the queuing on the ThreadPool, setting/resetting AutoResetEvents, etc. internally in the MailboxProcessor.
And printf, Map, Seq and contending for your global mutable Inc. And you're leaking heap-allocated stack frames. In fact, only a small proportion of the time taken to run your benchmark has anything to do with MailboxProcessor.
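One way to avoid the immutable Map churn in the 8-agent version is to let each agent own a mutable Dictionary. A MailboxProcessor handles one message at a time, so state touched only by that agent needs no locking. This is a hypothetical sketch: ShardMsg, makeShard, shardFor and the GetName query are illustrative names, and the GetName query was added only so the result can be checked.

```fsharp
open System.Collections.Generic

type User = { Id : int; Name : string }

type ShardMsg =
    | CreateUser of int * string
    | UpdateName of int * string
    | GetName of int * AsyncReplyChannel<string option>

// Each shard agent owns a mutable Dictionary. The agent handles one message
// at a time, so the Dictionary is only ever touched by one thread at once.
let makeShard () =
    MailboxProcessor<ShardMsg>.Start(fun inbox ->
        let users = Dictionary<int, User>()
        let rec loop () =
            async {
                let! msg = inbox.Receive()
                match msg with
                | CreateUser(id, name) ->
                    users.[id] <- { Id = id; Name = name }
                | UpdateName(id, newName) ->
                    let found, user = users.TryGetValue(id)
                    if found then users.[id] <- { user with Name = newName }
                | GetName(id, reply) ->
                    let found, user = users.TryGetValue(id)
                    reply.Reply(if found then Some user.Name else None)
                // return! keeps the loop a tail call
                return! loop ()
            }
        loop ())

// Hypothetical routing: user id modulo the number of shards.
let shards = Array.init 8 (fun _ -> makeShard ())
let shardFor id = shards.[id % shards.Length]

for i in 0 .. 999 do
    (shardFor i).Post(CreateUser(i, "User" + string i))
for i in 0 .. 999 do
    (shardFor i).Post(UpdateName(i, "Renamed" + string i))

// Messages posted from one thread to one agent are processed in order, so
// this query sees the earlier create and update for id 42.
let name42 = (shardFor 42).PostAndReply(fun reply -> GetName(42, reply))
printfn "%A" name42
```

The design choice here is to keep the agent as the unit of serialization and use fast mutable state inside it, instead of rebuilding an immutable Map on every message.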
> But is this really the expected performance?
I am not surprised by the performance of your program, but it does not say much about the performance of MailboxProcessor.
> From reading both MSDN and various blogs on the MailboxProcessor, I have gotten the idea that it is meant to be akin to Erlang actors, but from the abysmal performance I am seeing, this doesn't seem to hold true in reality?
The MailboxProcessor is conceptually somewhat similar to part of Erlang. The abysmal performance you're seeing is due to a variety of things, some of which are quite subtle and will affect any such program.
> So my question is: am I doing something wrong?
I think you're doing a few things wrong. Firstly, the problem you are trying to solve is not clear, so this sounds like an XY problem question. Secondly, you're trying to benchmark the wrong things: for example, you are complaining about the microseconds required to create a MailboxProcessor (800ms / 100,000 agents is about 8µs each) but may intend to do so only when a TCP connection is established, which takes several orders of magnitude longer. Thirdly, you have written a benchmark program that measures the performance of some things but have attributed your observations to completely different things.
Let's look at your benchmark program in more detail. Before we do anything else, let's fix some bugs. You should always use sw.Elapsed.TotalSeconds to measure time because it is more precise: it is a floating-point value with sub-millisecond resolution, whereas sw.ElapsedMilliseconds is truncated to whole milliseconds. You should always recur in an async workflow using return! and not do!, or you will leak stack frames.
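A minimal sketch of both fixes, using a hypothetical counter agent (CounterMsg, Incr and Get are illustrative names, not from the question): the loop recurs with return!, so the recursion does not build up a chain of pending continuations, and the elapsed time is read as a float from sw.Elapsed.TotalSeconds.

```fsharp
open System.Diagnostics

type CounterMsg =
    | Incr
    | Get of AsyncReplyChannel<int>

let counter =
    MailboxProcessor<CounterMsg>.Start(fun inbox ->
        let rec loop count =
            async {
                let! msg = inbox.Receive()
                match msg with
                | Incr ->
                    // return! (not do!) so the recursion is a tail call
                    return! loop (count + 1)
                | Get reply ->
                    reply.Reply count
                    return! loop count
            }
        loop 0)

let sw = Stopwatch.StartNew()
for _ in 1 .. 100000 do
    counter.Post(Incr)
// The agent processes messages in arrival order, so this reply is only
// sent after every earlier Incr has been handled.
let total = counter.PostAndReply(fun reply -> Get reply)
// Elapsed.TotalSeconds is a float; ElapsedMilliseconds is whole milliseconds.
printfn "%d messages in %fs" total sw.Elapsed.TotalSeconds
```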
My initial timings are:
Creation stage: 0.858s
Post stage: 1.18s
Next, let's profile the program to make sure it really is spending most of its time thrashing the F# MailboxProcessor:
77% Microsoft.FSharp.Core.PrintfImpl.gprintf(...)
4.4% Microsoft.FSharp.Control.MailboxProcessor`1.Post(!0)
Clearly not what we'd hoped: 77% of the time goes into printf formatting and only 4.4% into MailboxProcessor.Post. Thinking more abstractly, we are generating lots of data using things like sprintf and then applying it, but we're doing the generation and application together. Let's separate out our initialization code:
let ids = Array.init 100000 (fun id -> {Id = id; Name = sprintf "User%i" id})
...
  ids
  |> Array.map (fun id ->
    MailboxProcessor<Message>.Start(fun inbox ->
...
      loop id
...
printf "Create Time %fs\n" sw.Elapsed.TotalSeconds
let fxs =
  [|for i in 0 .. 99999 ->
      mb.[i % mb.Length].Post, UpdateName(i, sprintf "User%i-UpdateName" i)|]
incr.Start()
for f, x in fxs do
  f x
...
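A complete, scaled-down version of the same separation, as a sketch: all sprintf calls happen before the timed region, and a CountdownEvent (an assumption; the question uses the shared Inc counter instead) signals when every agent has processed its message. nAgents is reduced to 10,000 so the sketch runs quickly.

```fsharp
open System.Diagnostics
open System.Threading

type User = { Id : int; Name : string }
type Message = UpdateName of int * string

// Scaled down from the question's 100,000 agents so the sketch runs quickly.
let nAgents = 10000
// Hypothetical completion signal replacing the question's Inc counter.
let remaining = new CountdownEvent(nAgents)

// Generate all data up front, outside any timed region.
let users = Array.init nAgents (fun id -> { Id = id; Name = sprintf "User%i" id })
let messages = Array.init nAgents (fun i -> UpdateName(i, sprintf "User%i-UpdateName" i))

let sw = Stopwatch.StartNew()
let agents =
    users
    |> Array.map (fun user ->
        MailboxProcessor<Message>.Start(fun inbox ->
            let rec loop (user: User) =
                async {
                    let! msg = inbox.Receive()
                    match msg with
                    | UpdateName(_, newName) ->
                        remaining.Signal() |> ignore
                        return! loop { user with Name = newName }
                }
            loop user))
printfn "Create time %fs" sw.Elapsed.TotalSeconds

// Only the Post calls and the processing they trigger are timed here.
sw.Restart()
for i in 0 .. nAgents - 1 do
    agents.[i].Post(messages.[i])
remaining.Wait()
printfn "Post time %fs" sw.Elapsed.TotalSeconds
```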
Now we get:
Creation stage: 0.538s
Post stage: 0.265s
So creation is about 60% faster (0.858s / 0.538s ≈ 1.6x) and posting is about 4.5x faster (1.18s / 0.265s ≈ 4.5x).
Let's try completely rewriting your benchmark:
do
  for nAgents in [1; 10; 100; 1000; 10000; 100000] do
    let timer = System.Diagnostics.Stopwatch.StartNew()
    // Two participants: this thread and the last agent to finish.
    use barrier = new System.Threading.Barrier(2)
    // 1,000,000 messages in total per run, split evenly across the agents.
    let nMsgs = 1000000 / nAgents
    let nAgentsFinished = ref 0
    let makeAgent _ =
      new MailboxProcessor<_>(fun inbox ->
        let rec loop n =
          async { let! () = inbox.Receive()
                  let n = n+1
                  if n=nMsgs then
                    // The shared counter is touched once per agent, not once per message.
                    let n = System.Threading.Interlocked.Increment nAgentsFinished
                    if n = nAgents then
                      barrier.SignalAndWait()
                  else
                    return! loop n }
        loop 0)
    let agents = Array.init nAgents makeAgent
    for agent in agents do
      agent.Start()
    printfn "%fs to create %d agents" timer.Elapsed.TotalSeconds nAgents
    timer.Restart()
    for _ in 1..nMsgs do
      for agent in agents do
        agent.Post()
    // Block until the last agent has received its final message.
    barrier.SignalAndWait()
    printfn "%fs to post %d msgs" timer.Elapsed.TotalSeconds (nMsgs * nAgents)
    timer.Restart()
    // The use binding disposes each agent at the end of this loop body.
    for agent in agents do
      use agent = agent
      ()
    printfn "%fs to dispose of %d agents\n" timer.Elapsed.TotalSeconds nAgents
This version waits until each agent has received nMsgs messages before that agent increments the shared counter, greatly reducing the performance impact of that shared counter. This program also examines performance with different numbers of agents. On this machine I get:
Agents    Million msgs/s
1         2.24
10        6.67
100       7.58
1000      5.15
10000     1.15
100000    0.36
So it appears that part of the reason for the low throughput you are seeing is the unusually large number (100,000) of agents. With 10 to 1,000 agents the F# implementation is over 10x faster (5.15 to 7.58 million msgs/s) than it is with 100,000 agents (0.36 million msgs/s).
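Assuming the msgs/s figures come from the "to post ... msgs" line of the benchmark (the answer does not state this explicitly), every run posts the same total number of messages, so throughput is simply the reciprocal of the post time in seconds:

```latex
\text{M msgs/s} = \frac{n_{\text{Msgs}} \, n_{\text{Agents}}}{10^{6} \, t_{\text{post}}},
\qquad n_{\text{Msgs}} = \frac{10^{6}}{n_{\text{Agents}}}
\;\Rightarrow\;
\text{M msgs/s} = \frac{1}{t_{\text{post}}}

% 100,000 agents: t_post = 1 / 0.36 \approx 2.8 s;   100 agents: t_post = 1 / 7.58 \approx 0.13 s
% 7.58 / 0.36 \approx 21,   5.15 / 0.36 \approx 14
```

Even the slowest row in the 10 to 1,000 range (1,000 agents) is about 14x the 100,000-agent throughput, which is where the "over 10x" figure comes from.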
So if you can make do with this kind of performance, then you should be able to write your entire application in F#, but if you need to eke out much more performance, I would recommend using a different approach. You may not even have to sacrifice F# (and you can certainly use it for prototyping) if you adopt a design like the Disruptor. In practice, I have found that the time spent doing serialization on .NET tends to be much larger than the time spent in F# async and MailboxProcessor.
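The Disruptor's central idea is a pre-allocated ring buffer indexed by ever-increasing sequence numbers, which avoids per-message allocation and locks. The following is a heavily simplified, hypothetical single-producer/single-consumer sketch of that idea in F#. SpscRing, Publish and TryTake are illustrative names, not the API of the Disruptor or of any .NET port of it, and a real design would add batching, multiple consumers and configurable wait strategies.

```fsharp
open System.Threading

/// Hypothetical single-producer/single-consumer ring buffer: a pre-allocated
/// array indexed by ever-increasing sequence numbers, with no locks and no
/// per-message allocation. Not the Disruptor library's API.
type SpscRing<'T>(capacity: int) =
    do
        if capacity <= 0 || (capacity &&& (capacity - 1)) <> 0 then
            invalidArg "capacity" "capacity must be a positive power of two"
    let buffer : 'T[] = Array.zeroCreate capacity
    let mask = int64 (capacity - 1)
    let mutable writeSeq = 0L // next sequence number the producer will write
    let mutable readSeq = 0L  // next sequence number the consumer will read

    /// Producer thread only: wait while the ring is full, then publish one item.
    member this.Publish(item: 'T) =
        let w = writeSeq
        let mutable spinner = SpinWait()
        while w - Volatile.Read(&readSeq) >= int64 capacity do
            spinner.SpinOnce()
        buffer.[int (w &&& mask)] <- item
        // The slot write becomes visible to the consumer before the new writeSeq.
        Volatile.Write(&writeSeq, w + 1L)

    /// Consumer thread only: take the next item, or None if the ring is empty.
    member this.TryTake() : 'T option =
        let r = readSeq
        if r >= Volatile.Read(&writeSeq) then None
        else
            let item = buffer.[int (r &&& mask)]
            // Hand the slot back to the producer only after reading it.
            Volatile.Write(&readSeq, r + 1L)
            Some item

// Usage: one producer thread, with this thread acting as the consumer.
let ring = SpscRing<int64>(1024)
let count = 100000

let produce () =
    for i in 1 .. count do
        ring.Publish(int64 i)

let consume () =
    let mutable received = 0
    let mutable sum = 0L
    let mutable idle = SpinWait()
    while received < count do
        match ring.TryTake() with
        | Some v ->
            sum <- sum + v
            received <- received + 1
        | None -> idle.SpinOnce()
    received, sum

let producer = Thread(ThreadStart(produce))
producer.Start()
let received, sum = consume ()
producer.Join()
printfn "received %d items, sum %d" received sum
```

The ring is only safe with exactly one producer thread and one consumer thread; supporting more of either needs atomic sequence claiming, which is where most of the real Disruptor's complexity lies.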