
Are there any such processors which have instructions to bypass the cache?

Are there any such processors which have instructions to bypass the cache for specific data? A related question has an answer which suggests that SSE4.2 instructions bypass the cache. Can somebody enlighten me on that?

MetallicPriest asked Jun 13 '13 17:06


5 Answers

In general, the caching policy is controlled by the Memory Management Unit (MMU): a caching policy is assigned to each address range. These tables are managed by the OS and are accessible only in system space. As a side answer to a question that you may have intended to ask: architectures that have a cache usually provide CPU instructions for synchronizing, invalidating, or flushing it. Much like the MMU tables, these instructions are often available only in system space, though not always (x86's clflush, for example, can be executed from user space).

levengli answered Oct 07 '22 16:10


Are there any such processors which have instructions to bypass the cache for specific data?

The SuperH family (or at least the SuperH-2) has both implicit and explicit bypassing of its cache memory. This is done by using different areas of the memory address space, rather than through special instructions.
By setting the top 3 bits of an address to 001, you access a cache-through mirror of the same address with the top 3 bits cleared. Some areas (like memory-mapped I/O registers) are never cached.
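The address arithmetic described above can be sketched in C. This is only an illustration under the assumption that the layout is exactly as stated in this answer (top 3 bits 000 for the cached view, 001 for the cache-through mirror); the macro and function names are made up for the example and do not come from any vendor header.

```c
#include <stdint.h>

/* Assumed layout (taken from the answer above, not from a datasheet):
   the top 3 address bits select the region; 000 is the cached view and
   001 is its cache-through mirror. */
#define SH2_REGION_MASK    UINT32_C(0xE0000000)  /* top 3 bits */
#define SH2_CACHE_THROUGH  UINT32_C(0x20000000)  /* top 3 bits = 001 */

/* Hypothetical helper: map an address to its cache-through alias by
   clearing the region bits and then setting them to 001. */
static inline uint32_t sh2_cache_through_alias(uint32_t addr)
{
    return (addr & ~SH2_REGION_MASK) | SH2_CACHE_THROUGH;
}
```

On a real target you would presumably cast the result to a `volatile` pointer before dereferencing it, so that the compiler does not optimize the uncached access away.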

Michael answered Oct 07 '22 14:10


The SSE cache-bypass store instructions exist to avoid polluting the cache when writing to a region that won't be touched again soon, i.e. you don't want to evict data that will be used again.

Also, x86 implementations normally read in a whole cache line when a write into any part of that line occurs. If the previous contents of the cache line are unneeded (e.g. the destination of memcpy or memset), this is a waste of memory bandwidth. I found some old discussion of this write-back (default) vs. write-combining (movntq / movntdq) effect for implementing memcpy. Be careful about using this if something else will read the output of memcpy right away: the data was written past the cache, so that reader has to fetch it from memory again.
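To make the movntdq case concrete, here is a minimal sketch of a memset-like fill using the SSE2 intrinsic `_mm_stream_si128`, which compilers normally emit as movntdq. The function name `fill_nt` is made up for this example, and the `__SSE2__` check is a GCC/Clang convention (other compilers may need a different guard). It is not tuned; a real memcpy/memset would only switch to non-temporal stores above some size threshold.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: fill n bytes at dst with value, using non-temporal
   (cache-bypassing) 16-byte stores for the aligned middle part. */
static void fill_nt(void *dst, unsigned char value, size_t n)
{
#if defined(__SSE2__)
    unsigned char *p = (unsigned char *)dst;

    /* Head: plain byte stores until p is 16-byte aligned,
       because _mm_stream_si128 requires an aligned address. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = value;
        n--;
    }

    /* Body: streaming stores that do not read the old line contents. */
    const __m128i v = _mm_set1_epi8((char)value);
    while (n >= 16) {
        _mm_stream_si128((__m128i *)p, v);
        p += 16;
        n -= 16;
    }

    /* NT stores are weakly ordered; fence before others rely on the data. */
    _mm_sfence();

    /* Tail: remaining bytes with plain stores. */
    while (n > 0) {
        *p++ = value;
        n--;
    }
#else
    /* Fallback for targets without SSE2: ordinary cached stores. */
    memset(dst, value, n);
#endif
}
```

The `_mm_sfence()` is there because non-temporal stores are not ordered like normal x86 stores; without it, another thread could observe a "done" flag before the filled data.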

Streaming loads (movntdqa) are only special when reading from USWC (uncacheable speculative write-combining) memory, where a normal memcpy performs horribly. Streaming loads from normal WB (write-back) memory are currently not special and work like regular movdqa loads (i.e. the NT hint is ignored). Intel's optimization manual says you can use prefetchnta for pollution-reducing loads.
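As a rough sketch of the prefetchnta suggestion, the loop below sums a buffer while issuing a non-temporal prefetch a fixed distance ahead, once per 64 bytes. The function name and the `PREFETCH_AHEAD` distance are assumptions made for illustration; the right distance (and whether this helps at all) depends on the microarchitecture, so treat it as something to measure rather than a recipe.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#endif

#define PREFETCH_AHEAD 256  /* bytes; hypothetical tuning value */

/* Hypothetical helper: sum n bytes, hinting that the data is non-temporal
   so that it pollutes the caches as little as possible. */
static uint64_t sum_bytes_nta(const unsigned char *src, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE__)
        /* Once per 64 bytes (one cache line if src is 64-byte aligned),
           prefetch ahead while staying inside the buffer. */
        if ((i % 64) == 0 && n - i > PREFETCH_AHEAD)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);
#endif
        sum += src[i];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, and only the cache footprint (and possibly the speed) changes.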


IDK if it's possible to write into cache (rather than bypassing it with movnt) without triggering a read. Possibly AVX-512 will solve this problem for memcpy, because a 512-bit ZMM register is 64 bytes, i.e. a full cache line. A 64-byte aligned store from a ZMM register to memory that wasn't already cached could be implemented in a way that doesn't read the RAM first, and still makes the store visible right away to other CPU cores in the system.

(AVX-512 is going to be in Skylake Xeon (not other Skylake CPUs). Also in Knights Landing, the massively parallel high-throughput Xeon Phi compute accelerator.)

Peter Cordes answered Oct 07 '22 16:10


The Altera Nios II architecture has dedicated load/store I/O instructions that bypass the cache: ldwio and stwio for words, plus byte and halfword variants such as ldbio and stbio. They're used for memory-mapped I/O.

http://www.csun.edu/~glaw/ee525/Lecture03Nios.pdf

Nios II is a soft processor generally used on Altera's FPGA boards. It can also be licensed for hard ASIC devices, but I don't know of any commercial CPUs based on this architecture.

phuclv answered Oct 07 '22 16:10


Depending on your definition of specific data, yes. Processors generally have cache control registers / tables which are used to define what regions of memory can be cached vs. which must not be cached. Generally, code running in user space is not able to access those tables.

mah answered Oct 07 '22 14:10