Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does my logging library cause performance tests to run faster?

I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.

In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.

performance chart

As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.

I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many venues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem or that the cause will be obvious to someone who has more experience than me.

Some facts about my library and the benchmarks:

  • The library consists of a front-end API that pushes the log arguments onto a lockless queue (Boost.Lockless) and a back-end thread that performs string formatting and writes the log entries to disk.
  • The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
  • The benchmark is run on a 4-core Intel CPU (i7-3770K).
  • The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
  • The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
  • The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
  • The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
  • Source code for the Mandelbrot computation
  • Source code for the benchmark.
  • Root of the code repository and documentation.

My question is, how can I explain the apparent speed increase when my logging library is enabled?

Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here

Other things that I tried are:

  • Run it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret them differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
  • Removing the locks. This did have a significant effect on performance, but results were still weird and anyway I couldn't do the change in any of the multithreaded variants.
  • Compare the generated assembly code. There were differences but the logging build was clearly doing more things. Nothing stood out as being an obvious performance killer.
like image 298
flodin Avatar asked May 10 '15 13:05

flodin


1 Answers

When uninitialised memory is first accessed, page faults will affect timing.

So, before your first call to, std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.

like image 191
WonderWorker Avatar answered Nov 08 '22 19:11

WonderWorker