 

A change in my library made it much slower. Profiling isn't helping me. What might be the reason for the slow-down?

My Problem, Briefly

I made a change to my library, now it's much slower but I can't figure out where it spends all that additional time. Profiling reports are not helping. Please help me figure out what the reason might be.

Some Context

I made a Redis client library called Hedis and have a benchmark program for it. I recently made some internal changes to the library to clean up the architecture. This caused performance (in Redis requests per second, as measured by said benchmark) to drop by a factor of about 2.5.

The benchmark opens 50 network connections to a Redis server on localhost. The connections are handled differently between the two versions:

  • The fast version uses one thread per connection (thus, the benchmark has 50 threads running concurrently). It reads from the socket handle using unsafeInterleaveIO (I describe my approach in broad strokes in a blog post here). I was a bit unhappy with the architecture, hence I changed things to
  • the slow version, which uses three threads per connection, communicating via two Chans (150 threads running in the benchmark).
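For illustration, the three-threads-per-connection design described above can be sketched roughly as follows. This is a minimal, self-contained mock (all names are hypothetical, and a third Chan stands in for the network socket; it is not the actual Hedis internals): the user thread pushes requests into one Chan, a forked sender thread forwards them, and a forked receiver thread delivers replies through the second Chan.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forever, replicateM)

main :: IO ()
main = do
  requests <- newChan  -- Chan 1: user thread   -> sender thread
  replies  <- newChan  -- Chan 2: receiver thread -> user thread
  socket   <- newChan  -- stand-in for the network socket (mock only)
  -- sender thread: drains the request Chan, writes to the "socket"
  _ <- forkIO $ forever $ readChan requests >>= writeChan socket
  -- receiver thread: reads from the "socket", delivers replies
  _ <- forkIO $ forever $ readChan socket >>= writeChan replies
  -- user thread: submit 100 requests, collect 100 replies
  mapM_ (writeChan requests) [1 .. 100 :: Int]
  rs <- replicateM 100 (readChan replies)
  print (sum rs)  -- prints 5050
```

Every request thus crosses two Chan boundaries (each Chan is built on MVars, so each hop involves locking and may wake or block a thread), which is the extra machinery the fast one-thread-per-connection version did not have.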

Some more information that might be relevant:

  • Compiled with GHC 7.2.2.
  • The benchmark program is unchanged between the two versions, so network traffic is the same.
  • Both versions use the single-threaded runtime (compiled without -threaded).
  • All threads are created with forkIO, not the more expensive forkOS.

Profiling Results

Profiling does not give me a clear reason for the drop in performance. According to the profiling report, both versions spend more than 99% of their time in System.IO.hFlush and Data.ByteString.hGetSome. The number of times hFlush and hGetSome are called is the same in both versions. Since network traffic is also the same in both cases, these functions can't be the reason for the slow-down.

The only significant difference I can measure between the two versions is what time (the Unix utility) tells me: the slow version (with three times as many threads) spends significantly more time in "sys" rather than "user", compared to the fast version. The GHC +RTS -s flag reports this as reduced productivity.

Here are the program outputs for both versions with the +RTS -s flag:

Benchmark of the Fast Version

$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping                   33305.29 Req/s
get                    25802.92 Req/s
mget                   18215.94 Req/s
ping (pipelined)      268994.36 Req/s
   5,118,163,904 bytes allocated in the heap
     185,075,608 bytes copied during GC
       4,084,384 bytes maximum residency (39 sample(s))
         916,544 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7416 colls,     0 par    0.38s    0.40s     0.0001s    0.0003s
  Gen  1        39 colls,     0 par    0.03s    0.03s     0.0007s    0.0009s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    7.93s  ( 12.34s elapsed)
  GC      time    0.41s  (  0.43s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.33s  ( 12.76s elapsed)

  %GC     time       4.9%  (3.3% elapsed)

  Alloc rate    645,587,554 bytes per MUT second

  Productivity  95.1% of total user, 62.1% of total elapsed


real    0m12.772s
user    0m8.334s
sys     0m4.424s

Benchmark of the Slow Version

$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping                   11457.83 Req/s
get                    11169.64 Req/s
mget                    8446.96 Req/s
ping (pipelined)      130114.31 Req/s
   6,053,055,680 bytes allocated in the heap
   1,184,574,408 bytes copied during GC
       9,750,264 bytes maximum residency (198 sample(s))
       2,872,280 bytes maximum slop
              26 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9105 colls,     0 par    2.11s    2.14s     0.0002s    0.0006s
  Gen  1       198 colls,     0 par    0.23s    0.24s     0.0012s    0.0093s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   10.99s  ( 27.92s elapsed)
  GC      time    2.34s  (  2.38s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   13.33s  ( 30.30s elapsed)

  %GC     time      17.6%  (7.8% elapsed)

  Alloc rate    550,656,490 bytes per MUT second

  Productivity  82.4% of total user, 36.3% of total elapsed


real    0m30.305s
user    0m13.333s
sys     0m16.964s

Do you have any ideas or hints where that additional time might come from?

asked by informatikr, Jan 31 '12
2 Answers

According to the profiling report, most of the time is spent in hFlush and hGetSome. According to time, the slow version takes a lot more sys time. So, my hypothesis is that a lot of time is spent blocked and waiting, whether it's waiting for more input or locking and unlocking threads.

Here's the first thing I would do: compile the code with -threaded and see what happens. The threaded runtime uses a completely different IO manager, and I strongly suspect that this single change will fix your problem.

answered by John L


My guess would be that it's something to do with the overhead of Chan.

My first thought was increased GC time, but that doesn't appear to be the case at all. So my second thought is that maybe all the locking and unlocking involved in using Chan (which is implemented on top of MVar) is the problem. But this is still only a guess.

You might try TChan (i.e., STM) instead and see if that makes the slightest shred of difference. (Perhaps you could code up a small skeleton just to compare the two and see if that's where the problem is, rather than reimplement your "real" code.)
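Following that suggestion, such a skeleton could look like this: the same pipeline shape as a two-Chan setup, but with STM TChans, so a timing comparison isolates the channel implementation. This is a hypothetical sketch for comparison purposes only, not Hedis code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM)

main :: IO ()
main = do
  requests <- newTChanIO  -- user thread -> worker thread
  replies  <- newTChanIO  -- worker thread -> user thread
  -- worker thread: echo each request back as a reply
  _ <- forkIO $ forever $
         atomically (readTChan requests >>= writeTChan replies)
  -- user thread: submit 100 requests, collect 100 replies
  mapM_ (atomically . writeTChan requests) [1 .. 100 :: Int]
  rs <- replicateM 100 (atomically (readTChan replies))
  print (sum rs)  -- prints 5050
```

Swapping Chan for TChan in a skeleton like this (and timing both) would show whether MVar-based locking in Chan is the bottleneck, since TChan uses STM transactions instead of per-operation MVar locks.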

Other than that, I'm out of ideas.

answered by MathematicalOrchid