Avoiding cache pollution while loading a stream of numbers

On x86 processors is there a way to load data from regular write back memory into registers without going through the cache hierarchy?

My use case is that I have a big lookup structure (hash map or B-tree). I am working through a large stream of numbers (much bigger than my L3, but it fits in memory). What I am trying to do is very simple:

 int result = 0;
 for (int num : stream_numbers) {
     int lookup_result = lookup_using_b_tree(num);
     result += do_some_math_that_touches_registers_only(lookup_result);
 }
 return result;

Since I visit each number only once, and the stream as a whole is larger than L3, I imagine the stream will end up evicting cache lines that hold parts of my B-tree. Ideally I'd like the numbers from this stream not to enter the cache at all, since they have no temporal locality (each is read exactly once). That way I maximize the chances that my B-tree stays in cache and lookups are faster.

I have looked at the (v)movntdqa instructions available in SSE 4.1 for non-temporal (streaming) loads. That doesn't seem to be a good fit, because it only works as intended on uncacheable write-combining memory. This old article from Intel claims that:

Future generations of Intel processors may contain optimizations and enhancements for streaming loads, such as increased utilization of the streaming load buffers and support for additional memory types, creating even more opportunities for software developers to increase the performance and energy-efficiency of their applications.

However, I am unaware of any such processor today. I have read elsewhere that a processor can simply ignore this hint for write-back memory and treat the instruction as an ordinary movdqa. So is there any way to load from regular write-back memory without going through the cache hierarchy on x86 processors, even if it is only possible on Haswell and later models? I'd also appreciate any information on whether this will be possible in the future.

Rajiv asked Apr 11 '26 21:04

1 Answer

Yes, you can use MOVNTI to store values directly to memory without them touching the cache.

The latency of a MOVNTI is about 400 cycles (on Skylake).
However, if you're just storing values, you care little about latency and much more about reciprocal throughput, which is one cycle per MOVNTI.

Note that you need to perform an SFENCE or MFENCE after you are done with the stores.

According to my experimentation with MOVNTI (in the context of a ZeroMem routine), it is worth the effort if you're writing more than 512 KB.
The exact threshold will depend critically on the cache size, etc.

The non-temporal property only applies to writes, not to reads!
In fact, I don't know of any NT-mov variant that works non-temporally when reading data.

However, if you are doing a read-modify-write loop, it makes little sense to use non-temporal moves.
You also need to take the locality of your node structure into account.
It likely looks something like this:

left, right: pointer_to_node  (8 bytes each, aligned on a 32-byte boundary)
data: integer                 (4 bytes)
...

If this is so, reading the left/right node pointer will suck the data along with it into the 32-byte(*) cache line.
Doing an NT-mov on just the data does not help here: it has already been sucked in when the other node fields were read, and is thus already in the cache.

The fact that compilers align data structures on cache-friendly boundaries ensures that the maximum amount of node data gets hoovered into the cache with every node-pointer access.

(*) cache line size is processor dependent.

Johan answered Apr 13 '26 14:04


