Missing latency in instruction tables

I am currently looking at Agner Fog's instruction tables to get an idea of the latencies of common instructions.

I hope I haven't missed the answer to this question in the documents, but can anyone explain why some instructions have no latency entry?

For example, the latency for the PEXT instruction with operands r,r,m is left blank for Skylake.

How should a missing latency be interpreted, and why is it difficult to get the latency in the first place (if that is the case)?

Goaler444 asked Oct 15 '22 11:10


1 Answer

IDK why Agner leaves some cells blank in his spreadsheet. I think these are all entered by hand, because there have been at least a couple of fairly clear typos, e.g. 5 instead of 0.5 for the throughput of something (a memory-source vinserti128 or something, IIRC).

The interpretation is that there's zero info beyond what you can infer from how CPUs usually work, i.e. there's usually a separate load uop feeding the ALU uop, and it's usually the same ALU uop as with a register source. But some instructions can use a broadcast load: e.g. Skylake vpsrld with a memory-source shift count (the low element applies to all elements) looks like it uses a broadcast-load uop instead of its usual ALU shuffle to feed a variable-shift uop (like vpsrlvd, which is 1 uop for p01).

For multi-uop instructions with multiple inputs, Agner still only lists 1 latency number. That's not a complete picture: sometimes the last uop only needs one of the inputs directly and the other input has to pass through an earlier uop first, so the latency from one input to the result is longer than from the other. e.g. he lists vpsrld (2 uops for p01 p5 on SKL) as 1c throughput / 1c latency. It's obviously impossible for both inputs to have 1c latency to the result. Presumably Agner measured the data input -> output latency, with the broadcast of the shift count running off the critical path. (I'm inferring what the p5 uop is doing from three facts: it's p5 only, which is the shuffle port; SKL has 1-uop variable-count shifts; and the p5 uop isn't needed with a shift count from memory. The obvious conclusion is that it's a broadcast shuffle or load.)
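To make the "one number isn't enough" point concrete, here is a minimal sketch that models an instruction as a small graph of uops and computes the latency from each input operand to the result as the longest path through that graph. The uop names, the graph shape and the 1-cycle latencies are assumptions chosen to mirror the vpsrld guess above. They are not measured values or anything Agner's tables state.

```python
# Hypothetical uop graph for a 2-uop shift such as `vpsrld xmm, xmm, xmm`.
# Each uop maps to (latency in cycles, list of inputs). An input is either
# an instruction operand ("data", "count") or the name of an earlier uop.
# The 1-cycle latencies are illustrative assumptions, not measurements.
UOPS = {
    "broadcast_count": (1, ["count"]),                    # assumed p5 shuffle uop
    "shift":           (1, ["data", "broadcast_count"]),  # assumed p01 shift uop
}
RESULT_UOP = "shift"


def latency_from(operand, uop=RESULT_UOP, uops=UOPS):
    """Longest path in cycles from `operand` to the output of `uop`.

    Returns None if `uop` does not depend on `operand` at all.
    """
    lat, inputs = uops[uop]
    best = None
    for src in inputs:
        if src == operand:
            upstream = 0                      # operand feeds this uop directly
        elif src in uops:
            upstream = latency_from(operand, src, uops)  # goes via an earlier uop
        else:
            upstream = None                   # a different operand
        if upstream is not None:
            best = upstream if best is None else max(best, upstream)
    return None if best is None else best + lat


if __name__ == "__main__":
    for op in ("data", "count"):
        print(f"{op} -> result: {latency_from(op)} cycle(s)")
```

Under these assumptions the data input reaches the result in 1 cycle while the shift count takes 2, so a single "1c latency" cell can only describe one of the two paths.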


To get more complete latency data, see https://www.uops.info/table.html

It has a full latency breakdown for pext r64, r64, m64:

  • Measurements: Latencies:
    • Latency operand 2 → 1: 3
    • Latency operand 3 → 1 (address): 8
    • Latency operand 3 → 1 (memory): ≤7
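These numbers fit the usual "load uop feeding an ALU uop" model. The sketch below only does the arithmetic on the figures quoted above. The assumption (mine, not something the table states) is that the ALU uop costs the same as in the register-source path, so whatever is left of the address -> result latency is attributed to the load. The dictionary keys and function names are made up for illustration.

```python
# Latencies quoted above for `pext r64, r64, m64` (cycles).
MEASURED = {
    "operand2_to_result": 3,    # register source -> destination
    "address_to_result": 8,     # address register -> destination
    "memory_to_result_max": 7,  # listed only as an upper bound ("<=7")
}


def implied_load_latency(measured=MEASURED):
    """Cycles left over for the load uop, ASSUMING the ALU uop has the same
    latency as the register-source path (operand 2 -> 1)."""
    return measured["address_to_result"] - measured["operand2_to_result"]


def chain_cycles(n_instructions, path, measured=MEASURED):
    """Rough cycle count for a dependency chain of `n_instructions` that all
    depend on each other through the same operand `path`."""
    return n_instructions * measured[path]


if __name__ == "__main__":
    print("implied load latency:", implied_load_latency())
    print("100 pext chained via register source:",
          chain_cycles(100, "operand2_to_result"))
    print("100 pext chained via address:",
          chain_cycles(100, "address_to_result"))
```

The practical point is that which operand carries the loop dependency matters: in this model a chain through the address register costs 8 cycles per instruction, versus 3 through the register source.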

So far they cover mostly Intel CPUs (but also Zen), but the data comes from automated testing that measures every input to every output separately. They also list IACA data. For each form of each instruction, there's a link to the detailed test results.

They're also more careful with the uop breakdown for multi-uop instructions, e.g. movbe r64, m64 isn't 2p0156 + p23, it's p06 p15 p23 (like bswap r64, which Agner does get correct).

Peter Cordes answered Nov 14 '22 22:11