 

Cost of a sub-optimal cacheline prefetch

What is the cost of a late prefetch done with a __builtin_prefetch(..., 1) intrinsic (prefetch in preparation for a write)? That is, a prefetch that does not arrive in the L1 cache before the demand load or write that requires it?

For example

void foo(std::uint8_t* line) {
    __builtin_prefetch(line + std::hardware_constructive_interference_size, 1);
    auto next_line = calculate_address_of_next_line(line);
    auto result = transform(line);
    write(next_line, result);
}

In this case, if the cost of transform is lower than the prefetch latency, will this code end up being less efficient than if there were no prefetch? The Wikipedia article on cache prefetching talks about an optimal stride for a for loop, but does not mention the impact of a sub-optimal prefetch in that scenario (e.g., what would happen if k were too low?).

Does this get pipelined enough that a suboptimal prefetch does not matter? I am only considering Intel x86 (processors around the time of Broadwell maybe) for the purposes of this question.

asked Feb 22 '19 by Curious


1 Answer

Let's call the type of prefetch you are referring to a late prefetch: where the prefetch does not occur sufficiently before the demand load or store that uses the same cache line to fully hide the latency of the cache miss. This is as opposed to a too-early prefetch, where the prefetch happens so far away from the demand access that it is evicted from at least some levels of the cache before the access occurs.

Compared to not doing the prefetch at all, the cost of such a late prefetch is likely very small, zero or negative.

Let's focus on the negative part: i.e., the scenario where the prefetch helps even though it is late. If I understand your question correctly, you consider a prefetch that doesn't arrive before the load that needs it "missed" or ineffective. That is not the case, however: as soon as the prefetch request starts, the clock starts ticking toward the completion of the memory access, and that work is not lost if the demand load occurs before it completes. For example, if your memory access takes 100 ns but the demand access occurs only 20 ns after the prefetch, the prefetch is "too late" in the sense that the full 100 ns latency wasn't hidden, but the 20 ns spent on the prefetch is still useful: it reduced the demand access latency to about 80 ns.

That is, late prefetch isn't a binary condition: it ranges from just a little bit late (e.g., a prefetch issued 90 ns before an access with a latency of 100 ns) to really late (almost immediately before the consuming access). In most scenarios, even fairly late prefetching probably helps, assuming memory latency was a bottleneck for your algorithm in the first place.

Costs

Let's consider now the case of a totally useless prefetch (i.e., issued immediately before the access, so the access could have been issued in its place had the prefetch not existed) - what is the cost? In most realistic scenarios the costs are probably very small: an extra instruction to handle, some small additional pressure on the AGUs, and perhaps a small amount of wasted effort when matching up the subsequent access with the in-flight prefetch2.

Since the assumption is that prefetching is employed because of misses to the outer levels of cache or DRAM, and that the work in the transform function is significant enough to hide some of the latency, the relative cost of this one additional instruction is likely to be very small.

Of course, this is all under the assumption that the additional prefetch is a single instruction. In some cases, you may have had to organize your code somewhat to allow prefetching or perform some duplicate calculations to allow prefetching at the appropriate place. In that case, the cost side could be correspondingly higher.

M and E States

Finally, there is an additional consideration for write accesses and prefetches with write intent: in some scenarios even a totally useless prefetch (i.e., one issued immediately before the first access) is useful - when the first access is a read.

If a given line is first read, then later written, the core may get the line in the E(xclusive) coherence state and then, on the first write, need to make another roundtrip to some level of the cache to get it in the M(odified) state. Using a prefetch with write intent before the first access would avoid this second roundtrip because the line would be brought in in the M state the first time. The effect of this optimization is tough to quantify in general, not least because writes are usually buffered and don't form part of a dependence chain (outside of store forwarding).


2 I use the deliberately vague term "wasted effort" here because it isn't really clear whether this has a performance or power cost, or is just some additional work that doesn't add to the operation latency. One possible cost is that the load that triggers the initial L1 miss has a special status and can receive its result without making another roundtrip to L1. In the scenario of a prefetch followed immediately by a load, the load presumably doesn't get that special status, which may slightly increase the cost. However, this question is about stores, not loads.

answered Oct 01 '22 by BeeOnRope