I was wondering about possible ways to track down performance bottlenecks in distributed systems. I am aware of tools like X-Trace and its offspring (e.g. Dapper) but I am more curious about the methodology rather than specific tools.
In other words, given a distributed system without any obvious bottlenecks, how do you study and improve its performance?
I've used a method that has one pro and one con. The pro is that it works - it finds problems that, once fixed, result in nice snappy performance. The con is that it takes a good amount of manual work.
I even wrote a book and included the method in it. The work is to collect time-stamped event logs from each agent and merge them into a common timeline. Then you examine that timeline carefully, tracing the flow of related messages through the network of asynchronous agents. What you are looking for are needless message cycles, or delays that don't have to happen. For example, in this picture, receipt of a message is delayed by the task "post status to DB". Once that is understood, the fix is clear: the posting can be done on a separate thread, so it no longer holds up the message.
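To make the merging step concrete, here is a minimal Python sketch, not the exact tooling I used. The `Event` fields, the function names `merge_logs` and `largest_gaps`, and the sample log entries are all made up for illustration, and it assumes each agent's log is already sorted by time and that the clocks are synchronized closely enough to compare timestamps across machines.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Event(NamedTuple):
    timestamp: float   # seconds; assumes clocks are synchronized across agents
    agent: str         # which process or machine logged the event
    message_id: str    # hypothetical id tying related events together
    description: str


def merge_logs(*logs: Iterable[Event]) -> Iterator[Event]:
    """Merge per-agent logs (each already sorted by time) into one timeline."""
    return heapq.merge(*logs, key=lambda e: e.timestamp)


def largest_gaps(timeline: List[Event], message_id: str,
                 top: int = 3) -> List[Tuple[float, Event, Event]]:
    """Return the longest waits between consecutive events of one message."""
    related = [e for e in timeline if e.message_id == message_id]
    gaps = [(b.timestamp - a.timestamp, a, b)
            for a, b in zip(related, related[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Made-up sample logs, only to show the shape of the data.
client_log = [
    Event(0.000, "client", "m1", "send request"),
    Event(0.950, "client", "m1", "receive reply"),
]
server_log = [
    Event(0.010, "server", "m1", "receive request"),
    Event(0.020, "server", "m1", "post status to DB"),
    Event(0.900, "server", "m1", "send reply"),
]

timeline = list(merge_logs(client_log, server_log))
for gap, before, after in largest_gaps(timeline, "m1"):
    print(f"{gap:.3f}s between '{before.description}' and '{after.description}'")
```

The code only points you at the longest waits. Deciding whether a given wait is actually needless is still the manual part: you have to understand what the agent was doing during that interval.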