
When will a program benefit from prefetch and non-temporal load/store?

I ran a test with this loop:

    for (i32 i = 0; i < 0x800000; ++i)
    {
        // Hopefully this can disable hardware prefetch
        i32 k = (i * 997 & 0x7FFFFF) * 0x40;

        _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);

        for (i32 j = 0; j < 0x40; j += 0x10)
        {
            //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
            __m128 v = _mm_load_ps((float *)(data + k + j));

            // a_single_chain_computation: one dependent chain on v,
            // e.g. v = _mm_mul_ps(v, v) repeated N times

            //_mm_stream_ps((float *)(data2 + k + j), v);
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }

The results are weird:

  1. No matter how long a_single_chain_computation takes, the load latency is not hidden.
  2. What's more, the gap between the runs with and without prefetching grows as I add more computation. (With a single v = _mm_mul_ps(v, v), prefetching saves about 0.60 - 0.57 = 0.03 s. With 16 v = _mm_mul_ps(v, v), it saves about 1.1 - 0.75 = 0.35 s. Why?)
  3. Non-temporal loads/stores degrade performance with or without prefetching. (I can understand the load part, but why the stores, too?)
BlueWanderer asked Jun 26 '13 06:06



2 Answers

You need to separate two different things here (which unfortunately have similar names):

  • Non-temporal prefetching: this prefetches the line but marks it as least recently used when it fills the caches, so it is the first candidate for eviction the next time you touch the same set. That leaves you enough time to actually use it (unless you're very unlucky), yet it doesn't waste more than a single way of that set, since the next prefetch to come along simply replaces it. By the way, regarding your comments above: every prefetch still pollutes the L3 cache. It's inclusive, so you can't get away without it.

  • Non-temporal (streaming) loads/stores: these also avoid polluting the caches, but through a completely different mechanism: the accesses are treated as uncacheable (and write-combining). This does carry a performance penalty even if you never need these lines again, because a cacheable write has the luxury of staying buffered in the cache until it is evicted, so it doesn't have to be written out right away. An uncacheable write does, and in some scenarios that can eat into your memory bandwidth. On the other hand, you get the benefit of write combining and weak ordering, which may give you an edge in several cases. The bottom line is that you should use them only when they measurably help; don't assume they magically improve performance (nothing does that nowadays).
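To make the second bullet concrete, here is a minimal sketch of the same loop body written once with ordinary cacheable stores and once with streaming stores. The function names `square_cached` and `square_streamed` are made up for illustration, and the sketch assumes an x86 target with SSE and 16-byte-aligned buffers whose length is a multiple of 4 floats. It only shows how the two variants differ in code; which one is faster has to be measured on the machine in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_load_ps, _mm_store_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: square n floats with ordinary cacheable stores.
   The written lines may stay in the cache and are written back later. */
void square_cached(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, _mm_mul_ps(v, v));
    }
}

/* Hypothetical helper: same computation with non-temporal stores.
   The data goes out through write-combining buffers instead of
   being kept in the cache hierarchy. */
void square_streamed(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);
        _mm_stream_ps(dst + i, _mm_mul_ps(v, v));
    }
    /* Streaming stores are weakly ordered; fence before anything
       that relies on them being globally visible in order. */
    _mm_sfence();
}
```

Both functions produce identical results, so swapping one for the other in a benchmark isolates the effect of the store type.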

Regarding your questions:

  1. Your prefetching should work, but it isn't issued early enough to make an impact. Try replacing i+1 with a larger number. Actually, maybe even do a sweep; it would be interesting to see how many elements in advance you should peek.

  2. I'd guess this is the same as 1: with 16 muls, your iteration is long enough for the prefetch to work.

  3. As I said, your stores won't have the benefit of buffering in the lower-level caches and have to be flushed to memory. That's the downside of streaming stores. It's implementation-specific, of course, so it might improve, but at the moment it's not always effective.
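The sweep suggested in point 1 could be sketched roughly as follows. This is not the asker's exact benchmark: `scrambled_line` and `copy_scrambled` are hypothetical names, the computation is reduced to a plain copy, and the index scramble uses unsigned arithmetic so the multiplication wraps in a well-defined way. It assumes an x86 target with SSE, 16-byte-aligned buffers, and a line count that is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_prefetch, _mm_load_ps, _mm_store_ps */

enum { LINE_BYTES = 64 };

/* Map iteration i to a scrambled line index. `lines` must be a power of
   two; 997 is odd, so this is a permutation of [0, lines). */
uint32_t scrambled_line(uint32_t i, uint32_t lines)
{
    return (i * 997u) & (lines - 1u);
}

/* Copy every 64-byte line from src to dst in scrambled order, prefetching
   the line that will be needed `distance` iterations later.
   distance == 0 disables the explicit prefetch. */
void copy_scrambled(char *dst, const char *src, uint32_t lines, uint32_t distance)
{
    assert(lines != 0 && (lines & (lines - 1u)) == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

    for (uint32_t i = 0; i < lines; ++i) {
        size_t k = (size_t)scrambled_line(i, lines) * LINE_BYTES;

        if (distance != 0) {
            /* Near the end the index wraps around to early lines,
               which is harmless: a prefetch is only a hint. */
            size_t ahead = (size_t)scrambled_line(i + distance, lines) * LINE_BYTES;
            _mm_prefetch(src + ahead, _MM_HINT_NTA);
        }

        for (size_t j = 0; j < LINE_BYTES; j += 16) {
            __m128 v = _mm_load_ps((const float *)(src + k + j));
            _mm_store_ps((float *)(dst + k + j), v);
        }
    }
}
```

To run the sweep, one would call `copy_scrambled` on a buffer much larger than the last-level cache with distance set to 0, 1, 2, 4, 8, and so on, timing each call and looking for the distance where the run time stops improving.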

Leeor answered Sep 20 '22 14:09


If your computation chain is very short and you're reading memory sequentially, the CPU will prefetch well on its own and will actually run faster, since its decoder has less work to do.

Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncached write-combining (WC) memory, which you usually find when dealing with graphics surfaces. Explicit prefetching may work well on one architecture (CPU model) and have a negative effect on others, so use it as a last resort when optimizing.

egur answered Sep 19 '22 14:09