I'm trying to optimize some C++ (RK4) by using __builtin_prefetch. I can't figure out how to prefetch a whole structure. I don't understand how much of the const void *addr is read. I want to have the next values of from and to loaded.
    for (int i = from; i < to; i++)
    {
        double kv = myLinks[i].kv;
        particle* from = con[i].Pfrom;
        particle* to = con[i].Pto;
        //Prefetch values at con[i++].Pfrom & con[i].Pto;
        double pos = to->px - from->px;
        double delta = from->r + to->r - pos;
        double k1 = axcel(kv, delta, from->mass) * dt; //axcel is an inlined function
        double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
        double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
        double k4 = axcel(kv, delta + k3, from->mass) * dt;
    #define likely(x) __builtin_expect((x),1)
        if (likely(!from->bc))
        {
            from->x += ((k1 + 2 * k2 + 2 * k3 + k4) / 6);
        }
    }
Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/
The __builtin_prefetch() function prefetches memory from addr. The rationale is to minimize cache-miss latency by trying to move data into a cache before accessing the data. Possible use cases include frequently called sections of code in which it is known that the data in a given address is likely to be accessed soon.
You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the upper loop. However, modern compilers may not always emit such prefetch instructions. If you really want to do that, you should inspect the generated assembly to confirm the prefetch is actually there.
I think it just emits a single prefetch machine instruction, which basically fetches one cache line, whose size is processor-specific.
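Since one prefetch covers one cache line, a structure larger than a line needs one prefetch per line it spans. Here is a minimal sketch of that idea. It assumes 64-byte cache lines (common on current x86-64, but check your processor), and the names prefetch_range, prefetch_object and BigParticle are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. The real size is processor-specific.
constexpr std::size_t kCacheLineBytes = 64;

// Issues one prefetch per cache line overlapped by [p, p + bytes).
// Returns how many prefetches were issued, so the behaviour can be checked.
inline std::size_t prefetch_range(const void* p, std::size_t bytes) {
    if (bytes == 0) return 0;
    assert(p != nullptr);
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(kCacheLineBytes - 1);
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t first = begin & mask;               // line holding the first byte
    const std::uintptr_t last = (begin + bytes - 1) & mask;  // line holding the last byte
    std::size_t issued = 0;
    for (std::uintptr_t line = first;; line += kCacheLineBytes) {
        __builtin_prefetch(reinterpret_cast<const void*>(line));
        ++issued;
        if (line == last) break;
    }
    return issued;
}

// Convenience wrapper: prefetch every cache line of one object.
template <typename T>
inline std::size_t prefetch_object(const T* obj) {
    return prefetch_range(obj, sizeof(T));
}

// Hypothetical structure larger than one cache line (padded to 192 bytes).
struct alignas(64) BigParticle {
    double fields[20];
};
```

For a small structure that fits in one line (and does not straddle a line boundary), a single __builtin_prefetch(ptr) is already enough.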
You could, for instance, use __builtin_prefetch(con[i+3].Pfrom). In my (limited) experience, in such a loop it is better to prefetch several elements in advance.
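Prefetching a few iterations ahead could look roughly like the sketch below. Particle and Connection are simplified stand-ins for the question's types, relax is a made-up name, the RK4 update is replaced by a placeholder, and the distance of 3 is only a starting point to tune by measurement. The prefetch itself does not fault on a bad address, but evaluating con[i + 3] is an ordinary load, so the index is bounds-checked first. The optional second and third arguments are GCC's rw flag (1 = expect a write) and locality hint (0 to 3).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-ins for the question's types (assumptions, not the real ones).
struct Particle {
    double px = 0.0, x = 0.0, r = 0.0, mass = 1.0;
    bool bc = false;
};
struct Connection {
    Particle* Pfrom;
    Particle* Pto;
};

// Assumption: look 3 iterations ahead; tune this by benchmarking.
constexpr std::size_t kPrefetchDistance = 3;

// Walks all connections and applies a placeholder update to each "from" particle,
// prefetching the particles needed kPrefetchDistance iterations later.
inline void relax(std::vector<Connection>& con, double dt) {
    const std::size_t n = con.size();
    for (std::size_t i = 0; i < n; ++i) {
        // con[i + k] is a real load, so stay inside the array.
        if (i + kPrefetchDistance < n) {
            __builtin_prefetch(con[i + kPrefetchDistance].Pfrom, 1, 3);  // will be written
            __builtin_prefetch(con[i + kPrefetchDistance].Pto, 0, 3);    // read only
        }
        Particle* from = con[i].Pfrom;
        Particle* to = con[i].Pto;
        assert(from != nullptr && to != nullptr);
        const double pos = to->px - from->px;
        const double delta = from->r + to->r - pos;
        if (!from->bc) {
            from->x += delta * dt;  // placeholder for the RK4 step in the question
        }
    }
}
```

The results are identical with or without the two prefetch lines; only the timing can change, which is why the distance has to be chosen by measuring.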
Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). If you think you need them, measure the performance gain, and compile with GCC optimization enabled (at least -O2). If you are very lucky, a manual __builtin_prefetch could improve the performance of your loop by 10 or 20% (but it could also hurt it).
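One way to measure is to time the same loop with and without the prefetch. The sketch below is an illustrative harness, not a rigorous benchmark: gather_sum, time_once, compare_gather and the distance of 8 are all made up for the example, and the shuffled index array is only there to make the accesses hard for the hardware to predict.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sums data[idx[i]]; a Distance > 0 prefetches that many iterations ahead.
template <std::size_t Distance>
double gather_sum(const std::vector<double>& data,
                  const std::vector<std::size_t>& idx) {
    double sum = 0.0;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (Distance > 0 && i + Distance < n) {
            __builtin_prefetch(&data[idx[i + Distance]]);
        }
        sum += data[idx[i]];
    }
    return sum;
}

struct Timing {
    double plain_s;     // seconds without prefetch
    double prefetch_s;  // seconds with prefetch
    double sum;         // result, identical for both variants
};

// Runs f once, stores its result in out, and returns elapsed wall-clock seconds.
template <typename F>
double time_once(F f, double& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

inline Timing compare_gather(std::size_t n) {
    std::vector<double> data(n, 1.0);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 rng(42);                       // fixed seed for repeatable runs
    std::shuffle(idx.begin(), idx.end(), rng);  // random access order

    double plain = 0.0, prefetched = 0.0;
    const double t_plain = time_once([&] { return gather_sum<0>(data, idx); }, plain);
    const double t_pref = time_once([&] { return gather_sum<8>(data, idx); }, prefetched);
    assert(plain == prefetched);  // same additions in the same order
    return {t_plain, t_pref, plain};
}
```

To get meaningful numbers, compile with at least -O2, pick n large enough that the data does not fit in cache, and repeat the run several times, since a single timing is noisy.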
If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).
Also use a recent GCC compiler (the latest release at the time of writing was 4.6.2), since it is making a lot of progress in these areas.
(Added in January 2018:)
Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmark.