__builtin_prefetch, How much does it read?

I'm trying to optimize some C++ (RK4) by using __builtin_prefetch.

I can't figure out how to prefetch a whole structure.

I don't understand how much of the const void *addr is read. I want to have the next values of from and to loaded.

    for (int i = from; i < to; i++) {
        double kv = myLinks[i].kv;
        particle* from = con[i].Pfrom;
        particle* to = con[i].Pto;
        //Prefetch values at con[i++].Pfrom & con[i].Pto;
        double pos = to->px - from->px;
        double delta = from->r + to->r - pos;
        double k1 = axcel(kv, delta, from->mass) * dt; //axcel is an inlined function
        double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
        double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
        double k4 = axcel(kv, delta + k3, from->mass) * dt;
        #define likely(x)       __builtin_expect((x),1)
        if (likely(!from->bc))
        {
            from->x += ((k1 + 2 * k2 + 2 * k3 + k4) / 6);
        }
    }

Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

asked Dec 10 '11 22:12

Mikhail

1 Answer

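I think it just emits a single prefetch machine instruction, which basically fetches one cache line, whose size is processor specific (64 bytes on most current x86-64 processors). A structure larger than one cache line would therefore need one prefetch per line it spans.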
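As a rough sketch of what prefetching a whole structure could look like: the helper below issues one prefetch per cache line that an object overlaps. The name prefetch_object and the 64-byte line size are assumptions made for illustration, not something GCC provides, so check the line size of your own processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Verify this for your target CPU.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: prefetch every cache line that *obj overlaps.
// Returns the number of prefetches issued, which makes it easy to check.
template <typename T>
std::size_t prefetch_object(const T* obj) {
    assert(obj != nullptr);
    const auto first = reinterpret_cast<std::uintptr_t>(obj);
    const std::uintptr_t end = first + sizeof(T);
    const auto mask = ~(static_cast<std::uintptr_t>(kCacheLine) - 1);
    std::size_t issued = 0;
    // Start at the beginning of the line containing the first byte, so an
    // unaligned object that straddles a line boundary is still fully covered.
    for (std::uintptr_t line = first & mask; line < end; line += kCacheLine) {
        // 0 = prefetch for reading, 3 = keep in all cache levels (the defaults)
        __builtin_prefetch(reinterpret_cast<const void*>(line), 0, 3);
        ++issued;
    }
    return issued;
}
```

In the loop from the question this might be called as prefetch_object(con[i + 3].Pfrom). If particle fits in a single cache line, one plain __builtin_prefetch per pointer does the same job.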

For instance, you could use __builtin_prefetch(con[i+3].Pfrom). In my (limited) experience, in such a loop it is better to prefetch several elements ahead rather than just the next one.
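A minimal sketch of prefetching a few iterations ahead follows. Particle, Link, sum_overlaps and the distance kAhead = 3 are simplified stand-ins invented for this example, not the types from the question, and the right distance has to be found by measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { double px; double r; };        // hypothetical stand-in
struct Link { Particle* from; Particle* to; };   // hypothetical stand-in

constexpr std::size_t kAhead = 3;  // assumed prefetch distance; tune by measuring

// Sums from->r + to->r - (to->px - from->px) over all links, while asking
// the CPU to start loading the particles needed kAhead iterations later.
double sum_overlaps(const std::vector<Link>& con) {
    double total = 0.0;
    const std::size_t n = con.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {  // stay inside the array when reading con[]
            __builtin_prefetch(con[i + kAhead].from);
            __builtin_prefetch(con[i + kAhead].to);
        }
        const Particle* from = con[i].from;
        const Particle* to = con[i].to;
        assert(from != nullptr && to != nullptr);
        total += from->r + to->r - (to->px - from->px);
    }
    return total;
}
```

The bounds check is there because con[i + kAhead] is an ordinary array read. Only the prefetch itself is harmless on an invalid address.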

Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and use GCC optimization (at least -O2). If you are very lucky, manual __builtin_prefetch could increase the performance of your loop by 10 or 20% (but it could also hurt it).
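To measure, something as simple as the timing helper below may be enough. time_ms and best_of are hypothetical names for this sketch. Build with at least -O2 and run the loop with and without the prefetch on realistic data.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: wall-clock time of one call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: run f `runs` times and keep the fastest run,
// which reduces noise from other activity on the machine.
template <typename F>
double best_of(int runs, F&& f) {
    assert(runs > 0);
    double best = time_ms(f);
    for (int r = 1; r < runs; ++r) {
        const double t = time_ms(f);
        if (t < best) best = t;
    }
    return best;
}
```

A comparison could then be best_of(10, run_plain) against best_of(10, run_with_prefetch), where both callables are your own two variants of the loop. Make sure the compiler cannot optimize away the work being timed, for example by using the computed result afterwards.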

If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).

Also use a recent GCC compiler (the latest release as of this writing is 4.6.2), because GCC is making a lot of progress in these areas.

(Added in January 2018:)

Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmark.

answered Oct 07 '22 22:10

Basile Starynkevitch
