C++, ways to benchmark improvements in cache locality?

Question

I have an implementation of a class X that holds two pointers to two separate pieces of information. I have written a new implementation, class Y, that holds only one pointer, to a struct that contains the two pieces of information together as adjacent members. X's and Y's methods usually only need to manipulate one of the pieces of information, but both provide a get() method that returns a pointer to the second piece (class X just returns its pointer to that piece, and class Y returns the address of the struct's second member). In normal usage, calls to X's and Y's methods are interspersed with calls to get() and with work on the returned second piece.
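To make the two layouts concrete, here is a minimal sketch. The names `First`, `Second`, `touch()` and `counter()` are hypothetical stand-ins, not the real code; only the shape of X, Y and `get()` follows the description above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the two pieces of information.
struct First  { int counter = 0; };
struct Second { int payload = 0; };

// Layout X: two pointers, two separate heap allocations.
class X {
public:
    X() : first_(std::make_unique<First>()), second_(std::make_unique<Second>()) {}
    void touch() { ++first_->counter; }   // typical method: uses only the first piece
    Second* get() {                       // returns the stored pointer to the second piece
        assert(second_ != nullptr);
        return second_.get();
    }
    int counter() const { return first_->counter; }
private:
    std::unique_ptr<First> first_;
    std::unique_ptr<Second> second_;
};

// Layout Y: one pointer to a struct holding both pieces as adjacent members.
class Y {
    struct Both { First first; Second second; };
public:
    Y() : both_(std::make_unique<Both>()) {}
    void touch() { ++both_->first.counter; }
    Second* get() {                       // returns the address of the struct's second member
        assert(both_ != nullptr);
        return &both_->second;
    }
    int counter() const { return both_->first.counter; }
private:
    std::unique_ptr<Both> both_;
};
```

With X, a call to `touch()` followed by work on `get()` may touch two unrelated cache lines; with Y, both pieces sit in one allocation and, if they are small, probably on the same line.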

I expect that in real life situations there should be a performance improvement, now that the two pieces of information are next to one another in memory in the class Y implementation (because they are adjacent members of a struct), but I'm not seeing any difference in the benchmarks I've written (interspersing calls to X's and Y's methods with doing work on their second pieces in big loops). I suspect this is because everything fits in cache in either case in my tests. I don't want to try this in my real app yet because the semantics of X and Y differ in other subtle ways not related to this optimization and porting the using application will be some work, and these benchmarks are supposed to help justify doing that work in the first place.

What's the best way to observe the difference in performance due to better cache locality? If I do a bunch of dummy work on an array equal to the size of the cache in between calls is that sufficient? Or do I want to do work on an array slightly less than the cache size, so that work on my instances of my class will cause things to fall in and out of cache? I'm not sure how to code something that is robust against compiler optimizations and different cache sizes.

Soo Wei Tan · Accepted Answer

If you are on Linux, then using Cachegrind in conjunction with KCacheGrind might provide more insight into how your cache is behaving.

Steve Jessop · Answer

You could design a benchmark specifically to bust the cache. For instance, allocate the pointed-to data blocks such that they're all guaranteed to be on different cache lines (say, by using a custom memory allocator that pads allocations out to at least a few hundred bytes). Then repeatedly iterate over a number of objects too big to fit everything in even the L2 cache (very platform-dependent, since it depends on the number of lines in cache, but 1 million would cover most architectures and only require a few hundred megabytes of RAM in total).
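A rough sketch of such a cache-busting loop for the X-style layout follows. Everything named here (`PaddedBlock`, `ScatteredObject`, `sweep`, `time_sweep_ms`) is hypothetical, and the 256-byte block size is just an assumed value for "a few hundred bytes"; padding the struct stands in for a custom padding allocator.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Each block is padded to 256 bytes (assumed size), so the `value` fields of
// two different blocks can never share a cache line.
struct PaddedBlock {
    std::uint64_t value = 0;
    char padding[256 - sizeof(std::uint64_t)];
};

// X-style object: its two pieces live in two separate padded allocations.
struct ScatteredObject {
    std::unique_ptr<PaddedBlock> first = std::make_unique<PaddedBlock>();
    std::unique_ptr<PaddedBlock> second = std::make_unique<PaddedBlock>();
};

// Touches both pieces of every object `passes` times. The returned checksum
// depends on every access, so the compiler cannot discard the loop.
std::uint64_t sweep(std::vector<ScatteredObject>& objects, int passes) {
    assert(!objects.empty());
    std::uint64_t checksum = 0;
    for (int pass = 0; pass < passes; ++pass) {
        for (auto& object : objects) {
            object.first->value += 1;
            checksum += object.first->value + object.second->value;
        }
    }
    return checksum;
}

// Builds `object_count` objects, times the sweep, and returns milliseconds.
double time_sweep_ms(std::size_t object_count, int passes, std::uint64_t& checksum) {
    std::vector<ScatteredObject> objects(object_count);
    const auto start = std::chrono::steady_clock::now();
    checksum = sweep(objects, passes);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_sweep_ms(1000000, 10, checksum)` would allocate roughly 512 MB with these assumed sizes. Because the blocks are allocated and visited in the same order, a hardware prefetcher may hide some of the misses; visiting the objects in shuffled order is one way to reduce that effect.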

This will give you an upper limit on the performance gain made by the change from X to Y. But it does it by degrading the performance of X down to below any likely real-world usage. And to prove your case you need a lower-limit estimate, not an upper-limit estimate. So I'm not sure you'd achieve much, unless you discover that even this worst case still makes no significant difference and you needn't bother with the optimization.

Even if you don't aim for theoretical worst-case performance of X, any benchmark designed to exceed the cache is just picking an arbitrary point of bad performance of X, and looking to see if Y is better. It's not far off rigging the benchmark to make Y look good. It really doesn't matter how your code performs in dodgy benchmarks, except maybe for the purposes of marketing ~~lies~~ literature.

The best way to observe the real-world difference in performance is to measure a real-world client of your class. You say that "the semantics of X and Y differ in other subtle ways not related to this optimization", in which case I can only recommend that you write a class Z which differs from X only in respect of this optimization, and use that in your application as the comparison.
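One way to keep Z honest, sketched here under the assumption that the class logic can be separated from its storage, is to make the layout a template parameter so that X and Z share every line of method code. All names below (`SplitStorage`, `JoinedStorage`, `Widget`, `run_client_workload`) are hypothetical, and the workload is only a placeholder for the real client code.

```cpp
#include <cassert>
#include <memory>

struct First  { long hits = 0; };
struct Second { long payload = 0; };

// X's layout: two separate heap allocations.
class SplitStorage {
public:
    First& first() { return *first_; }
    Second& second() { return *second_; }
private:
    std::unique_ptr<First> first_ = std::make_unique<First>();
    std::unique_ptr<Second> second_ = std::make_unique<Second>();
};

// Z's layout: one allocation with both pieces as adjacent members.
class JoinedStorage {
    struct Both { First first; Second second; };
public:
    First& first() { return both_->first; }
    Second& second() { return both_->second; }
private:
    std::unique_ptr<Both> both_ = std::make_unique<Both>();
};

// The method bodies are written once, so the two variants can differ
// only in where the data lives.
template <class Storage>
class Widget {
public:
    void record_hit() { ++storage_.first().hits; }
    Second* get() { return &storage_.second(); }
    long hits() { return storage_.first().hits; }
private:
    Storage storage_;
};

using X = Widget<SplitStorage>;
using Z = Widget<JoinedStorage>;

// Placeholder client: interleaves method calls with work on the second piece.
template <class Impl>
long run_client_workload(Impl& object, int iterations) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        object.record_hit();
        object.get()->payload += i;
    }
    return object.hits() + object.get()->payload;
}
```

Both variants must produce the same result for the same workload; only then is a timing difference between `X` and `Z` attributable to the layout.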

Once your tests attempt to represent the worst realistic use, then if you aren't seeing any difference in performance there's probably no performance gain to be had.

All that said, if it makes logical sense (that is, it doesn't make the code any more astonishing), then I would advocate minimising the number of heap allocations in C++ simply as a rule of thumb. It doesn't tend to make speed or total memory usage worse, and it does tend to simplify your resource handling. A rule of thumb doesn't justify a re-write of working code, of course.

Tags:

c++

performance

pointers

caching

benchmarking

Joseph Garvin
