When will a program benefit from prefetch & non-temporal load/store?

Tags: c, sse, prefetch, temporal

I ran a test with this loop:

    for (i32 i = 0; i < 0x800000; ++i)
    {
        // Hopefully this can disable hardware prefetch
        i32 k = (i * 997 & 0x7FFFFF) * 0x40;

        _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);

        for (i32 j = 0; j < 0x40; j += 0x10)
        {
            //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
            __m128 v = _mm_load_ps((float *)(data + k + j));

            /* a_single_chain_computation: placeholder for one dependent chain,
               e.g. v = _mm_mul_ps(v, v); repeated N times */

            //_mm_stream_ps((float *)(data2 + k + j), v);
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }

Results are weird.

1. No matter how long a_single_chain_computation takes, the load latency is not hidden.
2. What's more, the difference in total time grows as I add more computation. (With a single v = _mm_mul_ps(v, v), prefetching saves about 0.60 - 0.57 = 0.03s. With 16 of them, it saves about 1.1 - 0.75 = 0.35s. Why?)
3. Non-temporal loads/stores degrade performance with or without prefetching. (I can understand the load part, but why the stores, too?)


asked Jun 26 '13 06:06

BlueWanderer

2 Answers

You need to separate two different things here (which unfortunately have similar names):

- Non-temporal prefetching: this prefetches the line but marks it as the least recently used one when it fills the cache, so it is first in line for eviction the next time that set is used. That leaves you enough time to actually use it (unless you're very unlucky), but it won't waste more than a single way of that set, since the next prefetch to come along simply replaces it. By the way, regarding your comments above: every prefetch still pollutes the L3 cache when the L3 is inclusive, so you can't get away without it.
- Non-temporal (streaming) loads/stores: these also avoid polluting the caches, but through a completely different mechanism, namely treating the accesses as uncacheable (with write combining). This does carry a performance penalty even if you really never need these lines again, because a cacheable write has the luxury of staying buffered in the cache until it is evicted, so it doesn't have to be written out right away. An uncacheable write does, and in some scenarios that can eat into your memory bandwidth. On the other hand, you get write combining and weak ordering, which may give you an edge in several cases. The bottom line is that you should use streaming accesses only when they help; don't assume they magically improve performance (nothing does that nowadays).
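To make the distinction concrete, here is a minimal sketch of the two mechanisms side by side. It is an illustration, not the asker's benchmark: the function names, the `AHEAD_FLOATS` distance and the `copies_match` helper are all made up for this example, and streaming loads (`_mm_stream_load_si128`, SSE4.1) are left out so that plain SSE is enough to compile it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_prefetch, _mm_stream_ps, _mm_sfence, _mm_malloc */

#define AHEAD_FLOATS 128 /* hypothetical prefetch distance: 8 lines of 16 floats */

/* Cacheable loads and stores, plus a non-temporal prefetch hint on the source.
   The data still passes through the cache; the hint only affects replacement. */
void copy_nta_prefetch(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0 && (uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* One hint per 64-byte line (16 floats), and only inside the array. */
        if (i % 16 == 0 && i + AHEAD_FLOATS < n)
            _mm_prefetch((const char *)(src + i + AHEAD_FLOATS), _MM_HINT_NTA);
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
    }
}

/* Cacheable loads, but streaming (write-combining) stores for the output. */
void copy_streaming_store(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0 && (uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence(); /* streaming stores are weakly ordered: fence before dst is read */
}

/* Runs both copies on freshly allocated, 64-byte aligned buffers.
   Returns 1 if both outputs equal the input, 0 if not, -1 if allocation failed. */
int copies_match(size_t n)
{
    float *src = _mm_malloc(n * sizeof(float), 64);
    float *a = _mm_malloc(n * sizeof(float), 64);
    float *b = _mm_malloc(n * sizeof(float), 64);
    int ok = (src && a && b) ? 1 : -1;
    if (ok == 1) {
        for (size_t i = 0; i < n; ++i)
            src[i] = (float)i;
        copy_nta_prefetch(a, src, n);
        copy_streaming_store(b, src, n);
        for (size_t i = 0; i < n; ++i)
            if (a[i] != src[i] || b[i] != src[i]) { ok = 0; break; }
    }
    _mm_free(src);
    _mm_free(a);
    _mm_free(b);
    return ok;
}
```

Both functions produce the same bytes; only the way the data moves through the memory hierarchy differs, so any speed difference has to be measured on the target machine.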

Regarding your questions:

1. Your prefetching should work, but it's not issued early enough to make an impact. Try replacing i+1 with a larger number. Actually, maybe even do a sweep; it would be interesting to see how many elements in advance you need to peek.
2. I'd guess this is the same as 1: with 16 muls your iteration is long enough for the prefetch to work.
3. As I said, your stores won't have the benefit of buffering in the lower-level caches and have to be flushed to memory. That's the downside of streaming stores. It's implementation specific, of course, so it might improve, but at the moment it's not always effective.
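The sweep suggested in point 1 could look roughly like the sketch below. It assumes the same scattered access pattern as the question, with one `_mm_mul_ps` standing in for the computation chain; `line_index`, `run_pass`, `check_pass` and `sweep_prefetch_distance` are names invented here. The index is computed in unsigned arithmetic, because `i * 997` overflows a signed 32-bit integer for large `i`.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <xmmintrin.h> /* SSE: _mm_prefetch, _mm_load_ps, _mm_store_ps, _mm_malloc */

#define LINE 64u /* bytes per cache line (0x40 in the question) */

/* Scattered line index. 997 is odd, so this visits each of the
   2^bits lines exactly once as i runs over [0, 2^bits). */
uint32_t line_index(uint32_t i, unsigned bits)
{
    return (i * 997u) & ((1u << bits) - 1u);
}

/* One pass over 2^bits lines, prefetching `dist` iterations ahead.
   Squares every float of src into dst; returns CPU seconds used. */
double run_pass(const char *src, char *dst, unsigned bits, uint32_t dist)
{
    uint32_t n = 1u << bits;
    clock_t t0 = clock();
    for (uint32_t i = 0; i < n; ++i) {
        size_t k = (size_t)line_index(i, bits) * LINE;
        /* Near the end the index wraps, so the address stays in bounds. */
        _mm_prefetch(src + (size_t)line_index(i + dist, bits) * LINE, _MM_HINT_NTA);
        for (size_t j = 0; j < LINE; j += 16) {
            __m128 v = _mm_load_ps((const float *)(src + k + j));
            v = _mm_mul_ps(v, v); /* stand-in for the computation chain */
            _mm_store_ps((float *)(dst + k + j), v);
        }
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Correctness check: 1 if every output is the square of its input,
   0 on a mismatch, -1 if allocation failed. */
int check_pass(unsigned bits, uint32_t dist)
{
    size_t bytes = ((size_t)1 << bits) * LINE;
    size_t n = bytes / sizeof(float);
    float *src = _mm_malloc(bytes, LINE);
    float *dst = _mm_malloc(bytes, LINE);
    int ok = (src && dst) ? 1 : -1;
    if (ok == 1) {
        for (size_t x = 0; x < n; ++x) { src[x] = (float)(x % 7); dst[x] = -1.0f; }
        run_pass((const char *)src, (char *)dst, bits, dist);
        for (size_t x = 0; x < n; ++x)
            if (dst[x] != src[x] * src[x]) { ok = 0; break; }
    }
    _mm_free(src);
    _mm_free(dst);
    return ok;
}

/* Times one pass for each candidate distance 1, 2, 4, ..., 64. */
void sweep_prefetch_distance(unsigned bits)
{
    size_t bytes = ((size_t)1 << bits) * LINE;
    char *src = _mm_malloc(bytes, LINE);
    char *dst = _mm_malloc(bytes, LINE);
    if (src && dst) {
        memset(src, 0, bytes); /* touch the pages before timing */
        memset(dst, 0, bytes);
        for (uint32_t dist = 1; dist <= 64; dist *= 2)
            printf("distance %2u: %.3f s\n", (unsigned)dist,
                   run_pass(src, dst, bits, dist));
    }
    _mm_free(src);
    _mm_free(dst);
}
```

Calling `sweep_prefetch_distance(23)` from `main` reproduces the question's sizes (0x800000 lines, i.e. 512 MiB per buffer); a smaller `bits` value that still exceeds the last-level cache is enough for a quick look. The distance at which the time stops improving depends on the CPU and on how long the computation chain is.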

answered Sep 20 '22 14:09

Leeor

If your computation chain is very short and you're reading memory sequentially, the CPU will prefetch well on its own and may actually run faster without explicit prefetches, since its decoder has less work to do.

Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncached write-combining (WC) memory, which is what you usually find when dealing with graphics surfaces. Explicit prefetching may work well on one architecture (CPU model) and have a negative effect on others, so use it as a last resort when optimizing.

answered Sep 19 '22 14:09

egur
