
Streaming loads and non-USWC memory

I just read this rather interesting article, Copying Accelerated Video Decode Frame Buffers, which explains how to copy from USWC memory as fast as possible using streaming loads.

My question is: why would this technique not also speed up normal copies from non-USWC memory?

A streaming load reads an entire cache line in one go, whereas a regular load only loads 16 bytes at a time. What am I missing? And copying from a fill buffer to the "cache buffer", which is then written to cache, can't have much overhead, can it?

ronag asked May 16 '11 07:05


1 Answer

From http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/:

"The streaming load instruction is intended to accelerate data transfers from the USWC memory type. For other memory types such as cacheable (WB) or Uncacheable (UC), the instruction behaves as a typical 16-byte MOVDQA load instruction. However, future processors may use the streaming load instruction for other memory types (such as WB) as a hint that the intended cache line should be streamed from memory directly to the core while minimizing cache pollution."

That is, "normal" memory is WB, and for WB memory the streaming load behaves like a normal load, so there is no advantage to using it there. Also, for normal cacheable memory, the first load from a cache line pulls the entire cache line into L1, similar to how the first streaming load pulls an entire cache line into the special "non-temporal buffer".
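To make the comparison concrete, here is a minimal sketch of a cache-line-at-a-time copy loop built on the streaming load intrinsic `_mm_stream_load_si128` (MOVNTDQA, SSE4.1). The function name `copy_lines_streaming` is made up for illustration, and the sketch assumes GCC or Clang on an x86 CPU with SSE4.1; on other targets it just falls back to `memcpy()`. Going by the Intel quote above, on WB memory this should behave like an ordinary MOVDQA loop, so any benefit would only be expected when the source is USWC memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>          /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_STREAM_LOAD 1
#endif

/* Hypothetical helper, not taken from the Intel article.
 * Copies `bytes` bytes from src to dst, one 64-byte cache line per
 * iteration. Both pointers must be 16-byte aligned and `bytes` must be
 * a multiple of 64. */
#ifdef HAVE_STREAM_LOAD
__attribute__((target("sse4.1")))   /* GCC/Clang: no -msse4.1 flag needed */
#endif
void copy_lines_streaming(void *dst, const void *src, size_t bytes)
{
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);
    assert(bytes % 64 == 0);

#ifdef HAVE_STREAM_LOAD
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;    /* some headers declare a non-const parameter */
    size_t n = bytes / 16;          /* number of 16-byte chunks */

    for (size_t i = 0; i < n; i += 4) {
        /* Four streaming loads cover one 64-byte line. On USWC memory
         * the first one fetches the whole line into a streaming buffer
         * and the other three read from that buffer. */
        __m128i a = _mm_stream_load_si128(s + i + 0);
        __m128i b = _mm_stream_load_si128(s + i + 1);
        __m128i c = _mm_stream_load_si128(s + i + 2);
        __m128i e = _mm_stream_load_si128(s + i + 3);

        /* Ordinary aligned stores into cacheable destination memory. */
        _mm_store_si128(d + i + 0, a);
        _mm_store_si128(d + i + 1, b);
        _mm_store_si128(d + i + 2, c);
        _mm_store_si128(d + i + 3, e);
    }
#else
    memcpy(dst, src, bytes);        /* portable fallback */
#endif
}
```

Timing this against the same loop with `_mm_load_si128` on ordinary malloc'd (WB) buffers would be one way to check the claim on a given CPU.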

As the quote above says, future processors may treat the streaming load as a hint not to pollute the cache. That might be a good idea in some cases, but it is maybe not the right choice for a general-purpose memcpy() implementation.

janneb answered Oct 11 '22 03:10