Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating a 64-byte read PCIe TLP from an x86 CPU

Tags:

x86

pci-e

When writing data to a PCIe device, it is possible to use a write-combining mapping to hint the CPU that it should generate 64-byte TLPs towards the device.

Is it possible to do something similar for reads? Somehow hint the CPU to read an entire cache line or a larger buffer instead of reading one word at a time?

like image 381
haggai_e Avatar asked Mar 06 '23 14:03

haggai_e


1 Answers

Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).

It says that NT loads will pull a whole cache-line of data from WC memory into a LFB:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

Use AVX2 _mm256_stream_load_si256() or the SSE4.1/AVX1 128-bit version.

Fill-buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache-line back to back, then store to regular memory.

If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).


Note that _mm256_stream_load_si256() is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is prefetchnta, but that's a totally different beast.

like image 70
Peter Cordes Avatar answered Mar 23 '23 06:03

Peter Cordes