I have a small struct of per-CPU data in a Linux kernel module, where each CPU frequently writes and reads its own data. I know that I need to make sure these items of data aren't on the same cache line, because if they were then the cores would be forever dirtying each other's caches. However, is there anything at the page level that I need to worry about from an SMP performance point of view? I.e., would there be any performance impact from padding these per-CPU structures out to 4096 bytes and aligning them?
This is on Linux 2.6 on x86_64.
(Points about whether it's worth optimising and suggestions that I go benchmark it aren't needed -- what I'm looking for is whether there's any theoretical basis for worrying about page alignment).
Within a single NUMA node, separate pages only help if you want to apply different permissions to them or map them individually into processes. For performance, being on different cache lines is sufficient.
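To make the cache-line point concrete, here is a minimal userspace sketch in C11 (not kernel code). It assumes a 64-byte cache line, which is typical for x86_64 but should be checked for your hardware. The names `percpu_stats`, `stats`, `NR_SLOTS` and `slot_cache_line` are hypothetical and exist only for illustration. In a real module you would more likely reach for the kernel's own per-CPU and cache-line alignment helpers, such as `____cacheline_aligned` from `<linux/cache.h>`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical x86_64 cache line size */
#define NR_SLOTS 4         /* hypothetical number of CPUs */

/* Hypothetical per-CPU record. Aligning the first member to the cache
 * line size raises the alignment of the whole struct, so the compiler
 * pads sizeof() up to a multiple of CACHE_LINE_SIZE. */
struct percpu_stats {
    alignas(CACHE_LINE_SIZE) uint64_t events;
    uint64_t bytes;
};

/* One slot per CPU; each slot starts on its own cache line. */
struct percpu_stats stats[NR_SLOTS];

/* Index of the cache line (relative to the start of the array) that
 * holds the slot for the given CPU. */
size_t slot_cache_line(unsigned int cpu)
{
    uintptr_t base = (uintptr_t)&stats[0];
    uintptr_t addr = (uintptr_t)&stats[cpu];

    return (size_t)((addr - base) / CACHE_LINE_SIZE);
}

/* Compile-time checks: one slot occupies exactly one cache line. */
static_assert(sizeof(struct percpu_stats) == CACHE_LINE_SIZE,
              "each slot should fill exactly one cache line");
static_assert(alignof(struct percpu_stats) == CACHE_LINE_SIZE,
              "each slot should start on a cache line boundary");
```

Since every slot starts on its own line, a write by one CPU never invalidates the line holding another CPU's slot, and that is all the separation the coherence protocol cares about.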
On NUMA architectures, you may want to place a CPU's per-CPU structure on a page that is local to that CPU's node - but you still wouldn't pad the structure out to a page size to achieve that, because you can place the structures for multiple CPUs within the same NUMA node on the same page.
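As a rough illustration of why page padding is unnecessary, the following userspace sketch works out where a CPU's slot would land if cache-line-sized slots for all CPUs of one node were packed into node-local pages. The 4096-byte page and 64-byte slot sizes are assumptions taken from the question, and `slots_per_page`, `locate_slot` and `slot_location` are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096 /* assumption: 4 KiB pages, as in the question */
#define CACHE_LINE_SIZE 64   /* assumption: typical x86_64 cache line size */

static_assert(PAGE_SIZE_BYTES % CACHE_LINE_SIZE == 0,
              "a page should hold a whole number of cache lines");

/* Where a slot lives inside a node's block of pages. */
struct slot_location {
    size_t page;   /* index of the page within the node's allocation */
    size_t offset; /* byte offset of the slot within that page */
};

/* How many slots of the given size fit into one page. */
size_t slots_per_page(size_t slot_size)
{
    return PAGE_SIZE_BYTES / slot_size;
}

/* Locate the slot for the CPU with the given index within its node. */
struct slot_location locate_slot(size_t local_index, size_t slot_size)
{
    size_t per_page = slots_per_page(slot_size);
    struct slot_location loc = {
        .page = local_index / per_page,
        .offset = (local_index % per_page) * slot_size,
    };

    return loc;
}
```

Under these assumptions a single node-local page holds 64 cache-line-sized slots, so a node with up to 64 CPUs needs only one page, while padding each slot out to 4096 bytes would use one page per CPU with no extra isolation between cores.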
Even on a NUMA system, you probably won't benefit much by allocating memory pages local to each CPU (use kmalloc_node() if you're curious).
Node-local memory will be faster, but only for accesses that miss at every cache level. For anything used with any frequency, you probably won't be able to tell the difference. If you're allocating megabytes of CPU-local data, then it probably makes sense to allocate pages local to each CPU.