I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:
            RS Decoder 1
           /            \
Producer -      ...      - Consumer
           \            /
            RS Decoder 8
In Disruptor DSL terms, the setup looks like this:
RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
for (int i = 0; i < numRsWorkers; i++) {
    rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
}
disruptor.handleEventsWith(rsWorkers)
         .then(writerHandler);
When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s. As soon as I add a consumer, even if it writes to /dev/null or doesn't write at all but is declared as a dependent consumer, performance drops to 50-65 M/s.
I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:
Without an additional consumer: [CPU usage graph]
With an additional consumer: [CPU usage graph]
What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.
Your hypothesis is correct: it is a thread synchronization issue.
From the API documentation for EventHandlerGroup<T>.then (emphasis mine):

"Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event. This method is generally used as part of a chain. For example if the handler A must process events before handler B:"
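The chained call that the Javadoc example refers to is the familiar DSL form (dw being the Disruptor instance, A and B the handlers):

dw.handleEventsWith(A).then(B);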
This necessarily decreases throughput. Think of it like a funnel: the consumer has to wait for every EventProcessor to finish an event before that event can proceed through the bottleneck.
I can see two possibilities here, based on what you've shown. You might be affected by one or both; I'd recommend testing both:

1) IO processing bottleneck.
2) Contention from multiple threads writing to the same buffer.
IO processing
From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be the IO wait time while your consumer thread is writing: a context switch to perform a write() call is significantly more expensive than doing nothing. Your Decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call; in other words, open the output file and prepare the string for output, but don't issue the write call.
Suggestions

- Remove the write() call in the Consumer and see if it reduces kernel time.
- Are you batching your writes (waiting for the endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible? A sketch of this pattern follows below.
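As an illustration of the second point, here is a minimal sketch of a dependent writer handler that buffers output in user space and only flushes on endOfBatch. FrameEvent and its payload field are hypothetical stand-ins for your event class; commenting out the flush (or the write entirely) is the quickest way to test the IO-wait hypothesis above.

import com.lmax.disruptor.EventHandler;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical event type standing in for the RS frame events in the question.
class FrameEvent {
    byte[] payload;
}

class BatchingWriterHandler implements EventHandler<FrameEvent> {

    private final BufferedOutputStream out;

    BatchingWriterHandler(String path) throws IOException {
        // Large user-space buffer so actual write() syscalls are rare.
        this.out = new BufferedOutputStream(new FileOutputStream(path), 1 << 20);
    }

    @Override
    public void onEvent(FrameEvent event, long sequence, boolean endOfBatch) throws Exception {
        out.write(event.payload);   // buffered, no syscall yet
        if (endOfBatch) {
            out.flush();            // one write() per batch instead of one per event
            // To test the IO-wait hypothesis, comment out the flush() (or the write())
            // and compare kernel time in Mission Control.
        }
    }
}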
Contention on multiple writers

Based on your description I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers, i.e. contention between the CPUs writing to that memory. One thing I would suggest is to have two disruptor rings:

- Producer writes to #1
- Decoder reads from #1, performs the RS decode and writes the result to #2
- Consumer reads from #2 and writes to disk

Assuming your RBs are sufficiently large, this should result in good clean walking through memory.
The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
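A rough wiring sketch of that layout, assuming Disruptor 3.x; FrameEvent, DecodedEvent, rsDecode() and writerHandler are placeholders for your own classes, and only the chaining of the two rings is shown:

import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Ring #1 carries raw frames from the Producer to the Decoders.
Disruptor<FrameEvent> inputDisruptor =
        new Disruptor<>(FrameEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);

// Ring #2 carries decoded results from the Decoders to the disk writer.
Disruptor<DecodedEvent> outputDisruptor =
        new Disruptor<>(DecodedEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);

// Each decoder reads from ring #1 and publishes its result into ring #2
// instead of writing back into the slot it has just read.
// (In your setup you would register all eight RsFrameEventHandlers here.)
inputDisruptor.handleEventsWith((event, sequence, endOfBatch) ->
        outputDisruptor.getRingBuffer().publishEvent(
                (decoded, seq, src) -> decoded.payload = rsDecode(src.payload), event));

// The writer is the only consumer of ring #2.
outputDisruptor.handleEventsWith(writerHandler);

outputDisruptor.start();
inputDisruptor.start();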
I have a blog article which describes in more detail how to achieve this, including sample code: http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html
Other thoughts

- It would help to know more about your setup: which WaitStrategy you are using, how many physical CPUs are in the machine, etc.
- You may be able to reduce CPU usage by choosing a different WaitStrategy, given that your biggest latency will be IO writes. A configuration sketch follows below.
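For instance, the wait strategy can be passed in when constructing the Disruptor. This is only a sketch; the buffer size, ProducerType and the FrameEvent factory are assumptions carried over from the earlier snippets:

import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

// BlockingWaitStrategy parks waiting consumer threads instead of busy-spinning,
// trading a little wake-up latency for much lower CPU usage.
Disruptor<FrameEvent> disruptor = new Disruptor<>(
        FrameEvent::new,                // event factory
        1 << 16,                        // ring buffer size (must be a power of two)
        DaemonThreadFactory.INSTANCE,
        ProducerType.SINGLE,            // a single producer, as in the setup described
        new BlockingWaitStrategy());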