I made a change to my library and now it is much slower, but I can't figure out where it spends all that additional time. Profiling reports are not helping. Please help me figure out what the reason might be.
I made a Redis client library called Hedis and have a benchmark program for it. Now, I made some internal changes to the library to clean up the architecture. This caused performance (in Redis requests per second, as measured by said benchmark) to drop by a factor of about 2.5.
The benchmark opens 50 network connections to a Redis server on localhost. The connections are handled differently between the two versions (a rough sketch of the two designs is given after the lists below):

- The fast version uses one thread per connection (50 threads in the benchmark); replies are read lazily by means of unsafeInterleaveIO (I describe my approach in broad strokes in a blog post here).
- I was a bit unhappy with the architecture, hence I changed things to Chans, with three threads per connection (150 threads running in the benchmark); this is the slow version.

Some more information that might be relevant:

- Both versions are compiled with the same flags (among them -threaded).
- Threads are created with forkIO, not the more expensive forkOS.
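To make the difference concrete, here is a stripped-down illustration of the two patterns. This is not the actual Hedis code; all the names (Conn, requestLazy, requestChan) and the 4096-byte read size are invented for the example.

-- Stripped-down sketch only; NOT the actual Hedis code. All names are invented.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forever)
import System.IO (Handle, hFlush)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString as B

-- Fast pattern: the caller's thread writes the request itself; the reply
-- is only read from the socket when the caller actually demands it.
requestLazy :: Handle -> B.ByteString -> IO B.ByteString
requestLazy h req = do
    B.hPut h req
    hFlush h
    unsafeInterleaveIO (B.hGetSome h 4096)  -- deferred until the reply is forced

-- Slow pattern: per connection, a writer thread and a reader thread,
-- connected to the caller through Chans.
data Conn = Conn
    { reqChan   :: Chan B.ByteString
    , replyChan :: Chan B.ByteString
    }

newConn :: Handle -> IO Conn
newConn h = do
    reqC   <- newChan
    replyC <- newChan
    _ <- forkIO . forever $ do           -- writer thread
        req <- readChan reqC
        B.hPut h req
        hFlush h
    _ <- forkIO . forever $ do           -- reader thread
        reply <- B.hGetSome h 4096
        writeChan replyC reply
    return (Conn reqC replyC)

requestChan :: Conn -> B.ByteString -> IO B.ByteString
requestChan conn req = do
    writeChan (reqChan conn) req         -- hand the request to the writer
    readChan (replyChan conn)            -- block until the reader delivers a reply

In the second pattern, the caller's thread plus the writer and the reader make three threads per connection, which would match the 150 threads mentioned above.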
Profiling does not give me a clear reason for the drop in performance. According to the profiling report, both versions spend more than 99% of their time in System.IO.hFlush and Data.ByteString.hGetSome. The number of times hFlush and hGetSome are called is the same in both versions. Since network traffic is also the same in both cases, these functions can't be the reason for the slow-down.
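(For reference, this kind of report is what GHC's profiler produces when the program is built with profiling enabled and run with +RTS -p, roughly as below; the file name is just an example, and on older GHCs -fprof-auto is spelled -auto-all.)

ghc -O2 -threaded -prof -fprof-auto -rtsopts hedis-benchmark.hs
./hedis-benchmark +RTS -p -s

This writes the cost-centre report to hedis-benchmark.prof.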
The only significant difference I can measure between the two versions is what time (the Unix utility) tells me: the slow version (with three times as many threads) spends significantly more time in "sys" rather than "user", compared to the fast version. The GHC +RTS -s flag reports this as reduced productivity.

Here are the program outputs for both versions with the +RTS -s flag, the fast version first, then the slow version:
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 33305.29 Req/s
get 25802.92 Req/s
mget 18215.94 Req/s
ping (pipelined) 268994.36 Req/s
5,118,163,904 bytes allocated in the heap
185,075,608 bytes copied during GC
4,084,384 bytes maximum residency (39 sample(s))
916,544 bytes maximum slop
10 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7416 colls,     0 par    0.38s    0.40s     0.0001s    0.0003s
  Gen  1        39 colls,     0 par    0.03s    0.03s     0.0007s    0.0009s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.93s ( 12.34s elapsed)
GC time 0.41s ( 0.43s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.33s ( 12.76s elapsed)
%GC time 4.9% (3.3% elapsed)
Alloc rate 645,587,554 bytes per MUT second
Productivity 95.1% of total user, 62.1% of total elapsed
real 0m12.772s
user 0m8.334s
sys 0m4.424s
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 11457.83 Req/s
get 11169.64 Req/s
mget 8446.96 Req/s
ping (pipelined) 130114.31 Req/s
6,053,055,680 bytes allocated in the heap
1,184,574,408 bytes copied during GC
9,750,264 bytes maximum residency (198 sample(s))
2,872,280 bytes maximum slop
26 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9105 colls,     0 par    2.11s    2.14s     0.0002s    0.0006s
  Gen  1       198 colls,     0 par    0.23s    0.24s     0.0012s    0.0093s
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.99s ( 27.92s elapsed)
GC time 2.34s ( 2.38s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.33s ( 30.30s elapsed)
%GC time 17.6% (7.8% elapsed)
Alloc rate 550,656,490 bytes per MUT second
Productivity 82.4% of total user, 36.3% of total elapsed
real 0m30.305s
user 0m13.333s
sys 0m16.964s
Do you have any ideas or hints where that additional time might come from?
According to the profiling report, most of the time is spent in hFlush and hGetSome. According to time, the slow version takes a lot more sys time. So, my hypothesis is that a lot of time is spent blocked and waiting, whether it's waiting for more input or locking and unlocking threads.
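One way to make that kind of blocking visible, rather than guessing, would be a GHC eventlog viewed in ThreadScope. These are the standard invocations (the executable path is the one from your output above; the binary has to be linked with -eventlog and -rtsopts for this to work):

./dist/build/hedis-benchmark/hedis-benchmark +RTS -l
threadscope hedis-benchmark.eventlog

ThreadScope shows how busy each capability is and where the GC pauses fall, which should at least narrow down whether the extra time goes into real work, GC, or waiting.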
Here's the first thing I would do: compile the code with -threaded
and see what happens. The threaded runtime uses a completely different IO manager, and I strongly suspect that this single change will fix your problem.
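If you want to double-check whether the binary you are benchmarking actually runs the threaded RTS, rtsSupportsBoundThreads from Control.Concurrent is True exactly when a program was linked with -threaded (enabled e.g. via ghc-options: -threaded in the .cabal file). A minimal check:

import Control.Concurrent (rtsSupportsBoundThreads)

-- Prints which runtime this binary was linked against.
main :: IO ()
main = putStrLn $ if rtsSupportsBoundThreads
                      then "threaded RTS"
                      else "non-threaded RTS"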
My guess would be that it's something to do with the overhead of Chan.
My first thought was increased GC time, but while the GC numbers do go up, they account for only a small part of the slowdown. So my second thought is that maybe all the locking and unlocking involved in using Chan (which is implemented on top of MVar) is the problem. But this is still only a guess.
You might try TChan (i.e., STM) instead and see if that makes the slightest shred of difference. (Perhaps you could code up a small skeleton just to compare the two and see if that's where the problem is, rather than reimplement your "real" code.)
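For example, something along these lines would do as that skeleton: it times n request/reply round trips over a Chan pair and over a TChan pair. The message count and all names are arbitrary, and it measures nothing Hedis-specific; it only isolates the channel overhead.

-- Small skeleton: n request/reply round trips over Chan vs. TChan.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Concurrent.STM (atomically, newTChanIO, readTChan, writeTChan)
import Control.Monad (replicateM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

pingPongChan :: Int -> IO ()
pingPongChan n = do
    reqC   <- newChan
    replyC <- newChan
    _ <- forkIO $ replicateM_ n $           -- echo thread
        readChan reqC >>= writeChan replyC
    replicateM_ n $ do                      -- "client": send, wait for reply
        writeChan reqC ()
        readChan replyC

pingPongTChan :: Int -> IO ()
pingPongTChan n = do
    reqC   <- newTChanIO
    replyC <- newTChanIO
    _ <- forkIO $ replicateM_ n $           -- echo thread
        atomically (readTChan reqC) >>= atomically . writeTChan replyC
    replicateM_ n $ do                      -- "client": send, wait for reply
        atomically (writeTChan reqC ())
        atomically (readTChan replyC)

timed :: String -> IO () -> IO ()
timed label act = do
    start <- getCurrentTime
    act
    end <- getCurrentTime
    putStrLn (label ++ ": " ++ show (diffUTCTime end start))

main :: IO ()
main = do
    let n = 100000 :: Int
    timed "Chan " (pingPongChan n)
    timed "TChan" (pingPongTChan n)

Build it the same way as the benchmark (e.g. with -threaded and optimizations); if the Chan version is clearly slower here as well, the channel plumbing is at least part of the problem.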
Other than that, I'm out of ideas.