 

What causes this performance drop?

I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:

          RS Decoder 1
        /             \
Producer-     ...     - Consumer
        \             /
          RS Decoder 8 
  • The producer reads blocks of 2064 bytes from disk into a byte buffer.
  • The 8 RS decoder consumers perform Reed-Solomon error correction in parallel.
  • The consumer writes files to disk.

In Disruptor DSL terms, the setup looks like this:

        RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
        for (int i = 0; i < numRsWorkers; i++) {
            rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
        }
        disruptor.handleEventsWith(rsWorkers)
                .then(writerHandler);

When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s. As soon as I add a consumer, even one that writes to /dev/null, or doesn't write at all but is declared as a dependent consumer, performance drops to 50-65 M/s.

I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:

Without an additional consumer: [CPU usage graph]

With an additional consumer: [CPU usage graph]

What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.

asked by Zoltán


2 Answers

Your hypothesis is correct: it is a thread synchronization issue.

From the API documentation for EventHandlerGroup<T>.then (emphasis mine):

Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event.

This method is generally used as part of a chain. For example, if handler A must process events before handler B:

        dw.handleEventsWith(A).then(B);

This will necessarily decrease throughput. Think of it like a funnel:

[Diagram: event funnel]

The consumer has to wait for every EventProcessor to finish before it can proceed through the bottleneck.
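
A minimal, runnable sketch of this gating behaviour (the Slot event and print-statement handlers are illustrative, not from the question); B is never invoked for a sequence until A has processed it:

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class ChainDemo {
        static class Slot { long value; }

        public static void main(String[] args) throws Exception {
            Disruptor<Slot> d =
                    new Disruptor<>(Slot::new, 1024, DaemonThreadFactory.INSTANCE);

            EventHandler<Slot> a = (e, seq, eob) -> System.out.println("A handled " + seq);
            // b's barrier tracks a's sequence, so for every slot it waits
            // on the slowest handler of the previous group.
            EventHandler<Slot> b = (e, seq, eob) -> System.out.println("B handled " + seq);

            d.handleEventsWith(a).then(b);
            d.start();

            d.getRingBuffer().publishEvent((e, seq) -> e.value = seq);
            Thread.sleep(100); // let the daemon consumer threads drain
        }
    }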

answered by durron597


I can see two possibilities here, based on what you've shown. You might be affected by one or both, so I'd recommend testing for both:

  1. An IO processing bottleneck.
  2. Contention from multiple threads writing to the same buffer.

IO processing

From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be the IO wait time while your consumer thread is writing. A context switch to perform a write() call is significantly more expensive than doing nothing, and your Decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call: in other words, open the output file, prepare the string for output, and just not issue the write call.
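
As a sketch of that test (the FrameEvent class here is hypothetical; substitute whatever event type your ring buffer actually carries):

    import com.lmax.disruptor.EventHandler;

    import java.nio.ByteBuffer;

    // Hypothetical event type standing in for the real ring buffer entry.
    class FrameEvent {
        final byte[] bytes = new byte[2064];
        byte[] getBytes() { return bytes; }
    }

    // Does all the per-event work except the actual write() syscall.
    // If kernel time drops with this handler, the write was the cost.
    class NoOpWriterHandler implements EventHandler<FrameEvent> {
        private final ByteBuffer scratch = ByteBuffer.allocate(2064);

        @Override
        public void onEvent(FrameEvent event, long sequence, boolean endOfBatch) {
            scratch.clear();
            scratch.put(event.getBytes()); // prepare the output exactly as before
            // out.write(...) deliberately omitted
        }
    }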

Suggestions

  • Try removing the write() call in the Consumer and see if it reduces kernel time (as sketched above).
  • Are you writing to a single flat file sequentially? If not, try this.
  • Are you using smart batching (i.e. buffering until the endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible? See the sketch after this list.
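
A minimal smart-batching sketch, reusing the hypothetical FrameEvent from above (the output path and buffer size are illustrative):

    import com.lmax.disruptor.EventHandler;

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Buffers writes in user space and issues one flush per ring buffer
    // batch instead of one syscall per event.
    class BatchingWriterHandler implements EventHandler<FrameEvent> {
        private final BufferedOutputStream out;

        BatchingWriterHandler(String path) throws IOException {
            out = new BufferedOutputStream(new FileOutputStream(path), 1 << 20);
        }

        @Override
        public void onEvent(FrameEvent event, long sequence, boolean endOfBatch) throws Exception {
            out.write(event.getBytes()); // cheap copy into the user-space buffer
            if (endOfBatch) {
                out.flush();             // one write() per batch, not per event
            }
        }
    }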

Contention on multiple writers

Based on your description, I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers, i.e. contention between CPUs writing to the same memory. One thing I would suggest is to have two disruptor rings:

  1. Producer writes to #1
  2. Decoder reads from #1, performs RS decode and writes the result to #2
  3. Consumer reads from #2, and writes to disk

Assuming your RBs are sufficiently large, this should result in good clean walking through memory.

The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
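
A rough wiring sketch of the two-ring layout (Disruptor 3.x-style API; FrameEvent, decodeInto and writerHandler stand in for your own components):

    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    // Ring #1 carries raw frames, ring #2 carries decoded frames.
    Disruptor<FrameEvent> inputRing =
            new Disruptor<>(FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE);
    Disruptor<FrameEvent> outputRing =
            new Disruptor<>(FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE);

    // Each decoder consumes from ring #1 and publishes its result onto
    // ring #2 instead of mutating the slot it just read. (Register one
    // handler per decoder thread, as in your existing setup; this
    // constructor defaults to ProducerType.MULTI, so several decoders
    // may publish to ring #2 concurrently.)
    inputRing.handleEventsWith((event, sequence, endOfBatch) ->
            outputRing.getRingBuffer().publishEvent(
                    (slot, slotSequence) -> decodeInto(event, slot)));

    // The writer is the sole consumer of ring #2.
    outputRing.handleEventsWith(writerHandler);

    inputRing.start();
    outputRing.start();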

I have a blog article which describes how to achieve this in more detail, including sample code: http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html

Other thoughts

  • It would also be helpful to know what WaitStrategy you are using, how many physical CPUs are in the machine, etc.
  • You should be able to significantly reduce CPU utilisation by moving to a different WaitStrategy, given that your biggest latency will be IO writes (see the sketch after this list).
  • Assuming you are using reasonably new hardware, you should be able to saturate the IO devices with only this setup.
  • You will also need to make sure the files are on different physical devices to achieve reasonable performance.
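
A minimal sketch of choosing a cheaper wait strategy, assuming the Disruptor 3.x five-argument constructor (FrameEvent and the ring size are illustrative):

    import com.lmax.disruptor.BlockingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    // BlockingWaitStrategy parks idle consumers instead of busy-spinning,
    // trading a little wake-up latency for far less CPU burn while the
    // pipeline waits on IO.
    Disruptor<FrameEvent> disruptor = new Disruptor<>(
            FrameEvent::new,
            1 << 14,                      // ring size: illustrative
            DaemonThreadFactory.INSTANCE,
            ProducerType.SINGLE,          // one producer thread reads from disk
            new BlockingWaitStrategy());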
answered by jasonk