In section 6.3.2 of this excellent paper, Ulrich Drepper writes about software prefetching. He calls this the "familiar pointer chasing framework", which I gather is the test he introduces earlier: traversing randomized pointers. In his graph it makes sense that performance tails off when the working set exceeds the cache size, because then we go to main memory more and more often.
But why does prefetching help only 8% here? If we are telling the processor exactly what we want to load, and we tell it far enough ahead of time (he prefetches 160 cycles ahead), why isn't every access satisfied by the cache? He doesn't mention his node size, so maybe some bandwidth is wasted fetching a full cache line when only part of the data is needed?
I am trying to use _mm_prefetch with a tree and I see no noticeable speedup. I'm doing something like this:
_mm_prefetch((const char *)pNode->m_pLeft, _MM_HINT_T0);  // needs <xmmintrin.h>
// do some work
traverse(pNode->m_pLeft);
traverse(pNode->m_pRight);  // the right child was never prefetched
Now that should only help one side of the traversal, but I see no change at all in performance. I did add /arch:SSE to the project settings. I'm using Visual Studio 2012 with an i7 4770. In this thread a few people also report getting only a 1% speedup with prefetch. Why doesn't prefetch work wonders for random access of data that's in main memory?
Prefetch can't increase the throughput of your main memory; it can only help you get closer to using all of it.
If your code spends many cycles on computation before even requesting data from the next node in a linked list, it won't keep the memory 100% busy. A prefetch of the next node as soon as its address is known will help, but there's still an upper limit: approximately the throughput you'd get with no prefetching but also no work between loading a node and chasing the pointer to the next one, i.e. with the memory system fetching results 100% of the time.
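As a rough illustration of that overlap, here's a minimal sketch (the Node layout and do_work are hypothetical stand-ins, not from the question): the prefetch for the next node is issued before the work on the current one, so the computation and the memory access proceed in parallel.

#include <xmmintrin.h>

struct Node { struct Node *next; long payload; };

long total;  /* sink so the per-node work isn't optimized away */

static void do_work(struct Node *n) { total += n->payload; }  /* stand-in computation */

void walk(struct Node *head)
{
    struct Node *n = head;
    while (n) {
        struct Node *next = n->next;  /* the next address is known up front */
        if (next)
            _mm_prefetch((const char *)next, _MM_HINT_T0);  /* start the fetch early */
        do_work(n);  /* this work overlaps with the in-flight memory access */
        n = next;
    }
}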
Even issuing the prefetch before 160 cycles of work isn't far enough ahead for the data to be ready, according to the graph in that paper. Random-access latency is apparently really high, since DRAM has to open a new page, a new row, and a new column. For rough numbers: if a full random DRAM access takes on the order of 60-100 ns, that's around 200-300 cycles on a ~3 GHz CPU, so a prefetch issued only 160 cycles ahead still leaves part of the latency exposed.
I didn't read the paper in enough detail to see how he could prefetch multiple steps ahead, or to understand why a prefetch thread helped more than prefetch instructions. This was on a P4, not a Core or Sandybridge microarchitecture, and I don't think prefetch threads are still a thing. (Modern CPUs with hyperthreading have enough execution units and big enough caches that running two independent things on the two hardware threads of each core actually makes sense, unlike on P4, where there were fewer spare execution resources going unused for hyperthreading to utilize. And the I-cache was a particular problem on P4, because it only had that small trace cache.)
If your code already loads the next node essentially right away, prefetching can't magically make it faster. Prefetching helps when it can increase the overlap between CPU computation and waiting for memory. Or maybe in your tests, the ->left pointers were mostly sequential from when you allocated the memory, so HW prefetching worked? If the tree is shallow enough, prefetching the ->right node (into last-level cache, not L1) before descending the left subtree might be a win.
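A minimal sketch of that last idea, reusing the question's member names (the Node struct itself and the rearranged traverse are my assumption, not the asker's exact code): the T2 hint asks for the right child in the outer cache levels rather than L1, and the fetch then has the whole left subtree's traversal time to complete.

#include <xmmintrin.h>

struct Node { struct Node *m_pLeft, *m_pRight; /* payload... */ };

void traverse(struct Node *pNode)
{
    if (!pNode) return;
    if (pNode->m_pRight)  /* ask for the right child in LLC, not L1 */
        _mm_prefetch((const char *)pNode->m_pRight, _MM_HINT_T2);
    traverse(pNode->m_pLeft);   /* the prefetch completes while we're in here */
    traverse(pNode->m_pRight);
}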
Software prefetching is only needed when the access pattern is not recognizable by the CPU's hardware prefetchers. (They're quite good: they can spot patterns with a decent-size stride, and track something like 10 forward streams (increasing addresses). Check http://agner.org/optimize/ for details.)
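For contrast, a sketch of the kind of loop where software prefetch typically buys nothing, because a constant stride is exactly what the hardware prefetchers recognize on their own (sum_strided is just an illustrative name):

#include <stddef.h>

long sum_strided(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += stride)  /* fixed stride: the HW prefetcher streams this */
        sum += a[i];
    return sum;
}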