I am getting ready to perform a series of performance comparisons of various of the shelf products.
What do I need to do to show credibility in the tests? How do I design my benchmark tests so that they are respectable?
I am also interested in any suggestions on the actual design of the tests. Ways to load data without effecting the tests (Heisenberg Uncertainty Principle), or ways to monitor... etc
This is a bit tricky to answer without knowing what sort of "off the shelf" products you are trying to assess. Are you looking for UI responsiveness, throughput (e.g. email, transactions/sec), startup time, etc - all of these have different criteria for what measures you should track and different tools for testing or evaluating. But to answer some of your general questions:
Credibility - this is important. Try to make sure that whatever you are measuring has little run to run variance. Utilize the technique of doing several runs of the same scenario, get rid of outliers (i.e. your lowest and highest), and evaluate your avg/max/min/median values. If you're doing some sort of throughput test, consider making it long running so you have a good sample set. For example, if you are looking at something like Microsoft Exchange and thus are using their perf counters, try to make sure you are taking frequent samples (once per sec or every few secs) and have the test run for 20mins or so. Again, chop off the first few mins and the last few mins to eliminate any startup/shutdown noise.
Heisenburg - tricky. In most modern systems, depending on what application/measures you are measuring, you can minimize this impact by being smart about what/how you are measuring. Sometimes (like in the Exchange example), you'll see near 0 impact. Try to use as least invasive tools as possible. For example, if you're measuring startup time, consider using xperfinfo and utilize the events built into the kernel. If you're using perfmon, don't flood the system with extraneous counters that you don't care about. If you're doing some exteremely long running test, ratchet down your sampling interval.
Also try to eliminate any sources of environment variability or possible sources of noise. If you're doing something network intensive, consider isolating the network. Try to disable any services or applications that you don't care about. Limit any sort of disk IO, memory intensive operations, etc. If disk IO might introduce noise in something that is CPU bound, consider using SSD.
When designing your tests, keep repeatability in mind. If you doing some sort of microbenchmark type testing (e.g. perf unit test) then have your infrastructure support running the same operation n times exactly the same. If you're driving UI, try not to physically drive the mouse and instead use the underlying accessibility layer (MSAA, UIAutomation, etc) to hit controls directly programmatically.
Again, this is just general advice. If you have more specifics then I can try to follow up with more relavant guidance.
Enjoy!
Your question is very interesting, but a bit vague, because without knowing what to test it is not easy to give you some clues.
You can test performance from many different angles, then, depending on the use or target of the library you should try one approach or another; I will try to enumerate some of the things you may have to consider for measurement:
Tools:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With