
MMIO read/write latency

I found that my MMIO read/write latency is unreasonably high. I hope someone can give me some suggestions.

In kernel space, I wrote a simple program to read a 4-byte value from a PCIe device's BAR0. The device is an Intel 10G PCIe NIC plugged into a PCIe x16 slot on my Xeon E5 server. I use rdtsc to measure the time between the beginning and the end of the MMIO read. A code snippet looks like this:

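vaddr = ioremap_nocache(0xf8000000, 128); // 0xf8000000 is BAR0 of the device
rdtscl(init);        // low 32 bits of the TSC before the read
ret = readl(vaddr);  // 4-byte MMIO read from BAR0
rmb();               // read memory barrier
rdtscl(end);         // low 32 bits of the TSC after the read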

I'm expecting the elapsed time (end - init) to be less than 1 us; after all, the data traversing the PCIe link should take only a few nanoseconds. However, my test results show at least 5.5 us for a single MMIO read from the PCIe device. I'm wondering whether this is reasonable. I changed my code to remove the memory barrier (rmb), but I still get around 5 us of latency.
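For reference, here is a minimal sketch of how the two TSC snapshots can be turned into a time. It is not part of the original measurement code: the helper names and the `tsc_khz` parameter are hypothetical, and the sketch assumes a constant-rate TSC whose frequency is known.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit TSC snapshots, such as the low
 * halves that rdtscl() stores. Unsigned subtraction stays correct if
 * the 32-bit counter wraps once between the two snapshots. */
static uint32_t tsc_delta32(uint32_t init, uint32_t end)
{
    return end - init;
}

/* Convert a cycle count to nanoseconds.
 * tsc_khz is a hypothetical parameter: the TSC frequency in kHz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
    assert(tsc_khz > 0);
    return cycles * 1000000ULL / tsc_khz;
}
```

As an illustration, if the TSC ran at 2.0 GHz (an assumed figure, not one measured on this server), a delta of 11000 cycles would give `cycles_to_ns(11000, 2000000)`, which is 5500 ns, or 5.5 us.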

This paper discusses PCIe latency measurement and reports that it is usually less than 1 us: www.cl.cam.ac.uk/~awm22/.../miller2009motivating.pdf Do I need any special configuration, for example in the kernel or on the device, to get lower MMIO access latency? Has anyone had experience doing this before?

William Tu asked Jul 21 '13 09:07



1 Answer

5 us is great! Run the measurement in a loop and look at the statistics, and you might find much, much larger values.
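A minimal sketch of the statistics side, assuming the per-read latencies (in cycles or nanoseconds) have already been collected into an array by the measurement loop. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Summary of a set of latency samples. */
struct latency_stats {
    uint64_t min;
    uint64_t median;
    uint64_t max;
};

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns min, median and max.
 * For an even count, the median is the upper of the two middle values. */
static struct latency_stats summarize(uint64_t *samples, size_t n)
{
    struct latency_stats s;

    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    s.min = samples[0];
    s.median = samples[n / 2];
    s.max = samples[n - 1];
    return s;
}
```

Comparing the median with the maximum is what exposes the outliers: a single slow read barely moves the median but dominates the maximum.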

There are several reasons for this. First, BARs are usually non-cacheable and non-prefetchable; check yours using pci_resource_flags(). If the BAR is marked cacheable, then cache coherency (the process of ensuring that all CPUs have the same value cached) might be one issue.
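In the kernel, pci_resource_flags(pdev, 0) returns the flag word for BAR0. The same word is also visible from user space as the third column of each line in the device's sysfs `resource` file. The following is a rough sketch of decoding one such line, assuming the usual "start end flags" hex layout. The struct, the function name and the `RES_*` macros are hypothetical; the bit values mirror IORESOURCE_MEM and IORESOURCE_PREFETCH from the kernel's ioport.h.

```c
#include <assert.h>
#include <stdio.h>

#define RES_MEM      0x00000200ULL /* mirrors IORESOURCE_MEM */
#define RES_PREFETCH 0x00002000ULL /* mirrors IORESOURCE_PREFETCH */

/* Decoded view of one line of a sysfs "resource" file. */
struct bar_flags {
    unsigned long long size; /* bytes, 0 for an unused entry */
    int is_mem;              /* 1 if a memory (MMIO) resource */
    int prefetchable;        /* 1 if marked prefetchable */
};

/* Parses "start end flags" (three hex numbers).
 * Returns 0 on success, -1 if the line does not match. */
static int parse_resource_line(const char *line, struct bar_flags *out)
{
    unsigned long long start, end, flags;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3)
        return -1;
    out->size = flags != 0 ? end - start + 1 : 0;
    out->is_mem = (flags & RES_MEM) != 0;
    out->prefetchable = (flags & RES_PREFETCH) != 0;
    return 0;
}
```

A line whose flags decode to a memory resource that is not prefetchable matches the "usual" case described above.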

Secondly, reading I/O is always a non-posted operation. The CPU has to stall until it gets permission to communicate on the data bus, and then stall a bit more until the data arrives on that bus. This bus is made to appear like memory but in fact it is not. The stall may be a non-interruptible busy wait, but it is non-productive nevertheless. So I would expect the worst-case latency to be much higher than 5 us, even before you start to consider task preemption.

toomanychushki answered Nov 14 '22 00:11
