I understand Port I/O at the hardware-abstraction level (i.e. the CPU asserts a pin indicating to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address-bus model), but I'm not really sure how it's implemented microarchitecturally on modern CPUs, and in particular how a Port I/O operation appears on the ring bus.
Firstly, where does the IN/OUT instruction get allocated to: the reservation station or the load/store buffer? My initial thought was that it would be allocated in the load/store buffer, and that the memory scheduler recognises it and sends it to the L1d with an indication that it is a port-mapped operation. A line fill buffer is allocated and the request gets sent to L2 and then onto the ring. I'm guessing the message on the ring carries some port-mapped indicator which only the system agent accepts; the system agent then checks its internal components and relays the port-mapped request to the appropriate one, e.g. the PCIe root bridge would pick up CF8h and CFCh. I'm guessing the DMI controller is fixed to pick up all the standardised ports that will appear on the PCH, such as the one for the legacy DMA controller.
Yes, I assume the message over the ring bus has some kind of tag that flags it as being to I/O space, not a physical memory address, and that the system agent sorts this out.
If anyone knows more details, that might be interesting, but this simple mental model is probably fine.
I don't know how port I/O turns into PCIe messages, but I think PCIe devices can have I/O ports in I/O space, not just MMIO.
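To make the CF8h/CFCh example concrete, here is a minimal sketch of legacy PCI configuration mechanism #1, the pair of ports the question expects the PCIe root bridge to claim. This assumes ring 0 on x86; the outl/inl wrappers are hand-rolled inline-asm helpers, not a library API:

    /* Legacy PCI configuration mechanism #1: OUT/IN to ports 0xCF8/0xCFC
     * are claimed by the PCIe root complex in the system agent, not by
     * anything on the PCH. */
    #include <stdint.h>

    static inline void outl(uint16_t port, uint32_t val)
    {
        __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint32_t inl(uint16_t port)
    {
        uint32_t val;
        __asm__ volatile ("inl %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }

    /* Read a 32-bit register from the config space of bus/dev/func. */
    static uint32_t pci_config_read(uint8_t bus, uint8_t dev,
                                    uint8_t func, uint8_t offset)
    {
        uint32_t addr = (1u << 31)              /* enable bit           */
                      | ((uint32_t)bus  << 16)
                      | ((uint32_t)dev  << 11)
                      | ((uint32_t)func << 8)
                      | (offset & 0xFC);        /* dword-aligned offset */

        outl(0xCF8, addr);  /* CONFIG_ADDRESS: decoded by the root complex */
        return inl(0xCFC);  /* CONFIG_DATA: the actual config-space access */
    }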
IN/OUT are pretty close to serializing (but not officially defined using that term, for some reason; see How many memory barriers instructions does an x86 CPU have?). They do drain the store buffer before executing, and they are full memory barriers.
the reservation station or the load/store buffer?
Both. For normal loads/stores, the front-end allocates a load buffer entry for a load, or a store buffer entry for a store, and issues the uop into the ROB and RS.
For example, when the RS dispatches a store-address or store-data uop to port 4 (store-data) or p2/p3 (load or store-address), that execution unit will use the store-buffer entry as the place where it writes the data, or where it writes the address.
Having the store-buffer entry allocated by the issue/allocate/rename logic means that either store-address or store-data can execute first, whichever has its inputs ready first, and each can free its space in the RS after completing successfully. The ROB entry stays allocated until the store retires. The store buffer entry stays allocated until some time after that, when the store commits to L1d cache. (Or, for a store to uncacheable memory, until it commits to an LFB or similar, to be sent out through the memory hierarchy, where the system agent will pick it up if it's to an MMIO region.)
Obviously IN/OUT are micro-coded as multiple uops, and all of those uops are allocated in the ROB and reservation station as they issue from the front-end, like any other uop. (Well, some of them might not need a back-end execution unit, in which case they'd only be allocated in the ROB in an already-executed state; e.g. the uops for lfence are like this on Skylake.)
I'd assume they use the normal store buffer / load buffer mechanism for communicating off-core, but since they're more or less serializing there's no real performance implication to how they're implemented. (Later instructions can't start executing until after the "data phase" of the I/O transaction, and they drain the store buffer before executing.)
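As a concrete illustration of that store-buffer drain: a driver that fills a descriptor with plain stores and then rings a doorbell with OUT needs no explicit fence in between. The device, descriptor layout, and port number below are all hypothetical; this is a sketch of the idiom, not any real driver:

    #include <stdint.h>

    static inline void outb(uint16_t port, uint8_t val)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    /* Hypothetical descriptor that the device reads via DMA. */
    volatile uint32_t dma_descriptor[4];

    void kick_device(void)
    {
        dma_descriptor[0] = 0xDEADBEEF;  /* plain store; may sit in the store buffer */

        /* No mfence/sfence is needed here: OUT waits for all previous
         * instructions to complete and for the store buffer to drain, so
         * the descriptor update is globally visible before the doorbell
         * write reaches the device. */
        outb(0x60 /* hypothetical doorbell port */, 1);
    }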
The execution of the IN and OUT instructions depends on the operating mode of the processor. In real mode, no permissions need to be checked to execute the instructions. In all other modes, the IOPL field of the FLAGS register and the I/O permission map associated with the current hardware task need to be checked to determine whether the IN/OUT instruction is allowed to execute. In addition, the IN/OUT instructions have serialization properties that are stronger than LFENCE but weaker than a fully serializing instruction. According to Section 8.2.5 of the Intel manual Volume 3:
Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions (the IN and OUT instructions) can be used to impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.
This description suggests that an IN/OUT instruction completely blocks the allocation stage of the pipeline until all previous instructions are executed and the store buffer and WCBs are drained, and that only then does the IN/OUT instruction retire. To implement these serialization properties and to perform the necessary operating-mode and permission checks, the IN/OUT instruction needs to be decoded into many uops. For more information on how such an instruction can be implemented, refer to: What happens to software interrupts in the pipeline?.
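For illustration, the protected-mode permission check described above (CPL against IOPL, then the TSS I/O permission bitmap) can be sketched in C as follows. The function and field names are invented; this is a rendering of the architectural rule, not how the hardware is structured:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of the IN/OUT permission rule in protected mode: if CPL <= IOPL
     * the access is allowed outright; otherwise every byte of the access must
     * have a clear bit in the TSS I/O permission bitmap, and any port byte
     * beyond the bitmap's limit is denied (the real CPU raises #GP). */
    bool io_access_allowed(unsigned cpl, unsigned iopl,
                           const uint8_t *io_bitmap,    /* from the current TSS   */
                           unsigned io_bitmap_limit,    /* valid bytes in the map */
                           uint16_t port, unsigned size /* 1, 2 or 4 */)
    {
        if (cpl <= iopl)
            return true;

        for (unsigned i = 0; i < size; i++) {
            unsigned p = port + i;
            if (p / 8 >= io_bitmap_limit)
                return false;
            if (io_bitmap[p / 8] & (1u << (p % 8)))
                return false;   /* a set bit means access denied */
        }
        return true;
    }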
Older versions of the Intel optimization manual did provide latency and throughput numbers for the IN and OUT instructions. All of them seem to say that the worst-case latency is 225 cycles and the throughput is exactly 40 cycles per instruction. However, these numbers don't make much sense to me, because I think the latency depends on the I/O device being read from or written to. And because these instructions are basically serialized, the latency essentially determines the throughput.
I've tested the in al, 80h instruction on Haswell. According to @MargaretBloom, it's safe to read a byte from port 0x80 (which according to osdev.org is mapped to some DMA controller register). Here is what I found:

- The instruction is counted as a single retired load uop (MEM_UOPS_RETIRED.ALL_LOADS). It's also counted as a load uop that misses the L1D. However, it's not counted as a load uop that hits the L1D or misses or hits the L2 or L3 caches.
- According to L1D_PEND_MISS.PENDING_CYCLES, the I/O load request seems to be allocated in an LFB for one cycle.
- When I add an IMUL instruction that is dependent on the result of the in instruction, the total execution time does not change. This suggests that the in instruction does not completely block the allocation stage until all of its uops are retired, and that it may overlap with later instructions, in contrast to my interpretation of the manual.

I've tested the out dx, al instruction on Haswell for ports 0x3FF, 0x2FF, 0x3EF, and 0x2EF. The distribution of uops is as follows: p0:10.9, p1:15.2, p2:1, p3:1, p4:1, p5:11.3, p6:25.3, and finally p7:1. That's a total of 66.7 uops per instruction. The throughput of out to 0x2FF, 0x3EF, and 0x2EF is 1880c. The throughput of out to 0x3FF is 6644.7c. The out instruction is not counted as a retired store.
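For reference, a rough user-space version of this kind of experiment might look like the sketch below, assuming Linux on x86 with root privileges (for ioperm). Since in is close to serializing, a simple back-to-back loop measures throughput and, approximately, latency at the same time:

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/io.h>   /* ioperm(), glibc on x86 Linux */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        if (ioperm(0x80, 1, 1)) { perror("ioperm"); return 1; }

        enum { N = 100000 };
        uint8_t v;
        uint64_t start = rdtsc();
        for (int i = 0; i < N; i++)
            __asm__ volatile ("inb $0x80, %0" : "=a"(v));  /* in al, 80h */
        uint64_t end = rdtsc();

        printf("~%.1f cycles per in al, 80h\n", (double)(end - start) / N);
        return 0;
    }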
Once the I/O load or store request reaches the system agent, it can determine what to do with the request by consulting its system I/O mapping table. This table depends on the chipset. Some I/O ports are mapped statically while others are mapped dynamically. See, for example, Section 4.2 of the Intel 100 Series Chipset datasheet, which is used with Skylake processors. Once the request is completed, the system agent sends a response back to the processor so that it can fully retire the I/O instruction.
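As a purely illustrative model of such a mapping, one can picture a table-driven decode like the sketch below. The port ranges are well-known fixed assignments, but the struct, destination names, and fall-through rule are invented for illustration (Intel platforms subtractively decode unclaimed I/O ports down DMI to the PCH; see the datasheet for the real, chipset-specific map):

    #include <stdint.h>

    /* Invented destinations for illustration only. */
    typedef enum { DEST_ROOT_COMPLEX, DEST_DMI } io_dest_t;

    struct io_range { uint16_t base, last; io_dest_t dest; const char *what; };

    static const struct io_range io_map[] = {
        { 0x0CF8, 0x0CFF, DEST_ROOT_COMPLEX, "PCI config address/data" },
        { 0x0000, 0x001F, DEST_DMI,          "legacy DMA controller"   },
        { 0x0080, 0x0080, DEST_DMI,          "POST/diagnostic port"    },
    };

    static io_dest_t route_io_port(uint16_t port)
    {
        for (unsigned i = 0; i < sizeof io_map / sizeof io_map[0]; i++)
            if (port >= io_map[i].base && port <= io_map[i].last)
                return io_map[i].dest;
        return DEST_DMI;  /* unclaimed ports fall through to the PCH */
    }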