I have a small linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end with a duration that is a few microseconds. I am trying to measure whether this is possible with the intel rdtscp instruction using an <code>ndelay()</code> call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it has a very large delay that appears as if something else is preempting the code despite running inside a spinlock which should be disabling interrupts. This is run using a stock Ubuntu 64 bit kernel (4.4.0-112) with no extra realtime or low latency patches. Here is some example code that replicates this behavior. This is written as a handler for a <code>/proc</code> filesystem entry for easy testing, but I have only shown the function that actually computes the delays: <pre class="prettyprint"><code>#define ITERATIONS 50000 #define SKIPITER 10 DEFINE_SPINLOCK(timer_lock); static int timing_test_show(struct seq_file *m, void *v) { uint64_t i; uint64_t first, start, stop, delta, max=0, min=1000000; uint64_t avg_ticks; uint32_t a, d, c; unsigned long flags; int above30k=0; __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx"); first = a | (((uint64_t)d)<<32); for (i=0; i<ITERATIONS; i++) { spin_lock_irqsave(&timer_lock, flags); __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx"); start = a | (((uint64_t)d)<<32); ndelay(1000); __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx"); stop = a | (((uint64_t)d)<<32); spin_unlock_irqrestore(&timer_lock, flags); if (i < SKIPITER) continue; delta = stop-start; if (delta < min) min = delta; if (delta > max) max = delta; if (delta > 30000) above30k++; } seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k); avg_ticks = (stop - first) / ITERATIONS; seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks); return 0; } </code></pre> Then if I run: <pre class="prettyprint"><code># cat /proc/timing_test min: 4176 max: 58248 above30k: 56 Average total ticks/iteration: 4365 </code></pre> This is on a 3.4 GHz sandy bridge generation Core i7. The ~4200 ticks of the TSC is about right for a little over 1 microsecond delay. About 0.1% of the time I see delays about 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks. These delays appear too long to be a single cache miss, even to DRAM. So I think it either has to be several cache misses, or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes of this to see if they are something we can eliminate or if we have to move to a custom processor/FPGA solution. Things I have tried: <ul> <li>I considered if this could be caused by cache misses. I don't think that could be the case since I ignore the first few iterations which should load the cache. I have verified by examining disassembly that there are no memory operations between the two calls to rdtscp, so I think the only possible cache misses are for the instruction cache.</li> <li>Just in case, I moved the spin_lock calls around the outer loop. Then it shouldn't be possible to have any cache misses after the first iteration. However, this made the problem worse.</li> <li>I had heard that the SMM interrupt is unmaskable and mostly transparent and could cause unwanted preemption. However, you can read the SMI interrupt count with <code>rdmsr</code> on <code>MSR_SMI_COUNT</code>. I tried adding that before and after and there are no SMM interrupts happening while my code is executing.</li> <li>I understand there are also inter-processor interrupts in SMP systems that may interrupt, but I looked at /proc/interrupts before and after and don't see enough of them to explain this behavior.</li> <li>I don't know if <code>ndelay()</code> takes into account variable clock speed, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.</li> <li>I booted with nopti to disable page table isolation in case that is causing problems.</li> </ul>

Another thing that I have just noticed is that it is unclear what <code>ndelay()</code> does. Maybe you should show it so as non-trivial problems may be lurking inside it. For example, I've observed once that my piece of a kernel driver code was still preempted when it had a memory leak inside it, so as soon as it hit some watermark limit, it was put aside even if it disabled interrupts.

Looking for cause of unexpected preemption in linux kernel module

Tags:

c

linux-kernel

real-time

smp

I have a small linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end with a duration that is a few microseconds. I am trying to measure whether this is possible with the intel rdtscp instruction using an ndelay() call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it has a very large delay that appears as if something else is preempting the code despite running inside a spinlock which should be disabling interrupts. This is run using a stock Ubuntu 64 bit kernel (4.4.0-112) with no extra realtime or low latency patches.

Here is some example code that replicates this behavior. This is written as a handler for a /proc filesystem entry for easy testing, but I have only shown the function that actually computes the delays:

#define ITERATIONS 50000
#define SKIPITER 10
DEFINE_SPINLOCK(timer_lock);
static int timing_test_show(struct seq_file *m, void *v) 
{
  uint64_t i;
  uint64_t first, start, stop, delta, max=0, min=1000000;
  uint64_t avg_ticks;
  uint32_t a, d, c;
  unsigned long flags;
  int above30k=0;

  __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
  first = a | (((uint64_t)d)<<32);
  for (i=0; i<ITERATIONS; i++) {
    spin_lock_irqsave(&timer_lock, flags);
    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    start = a | (((uint64_t)d)<<32);
    ndelay(1000);
    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    stop = a | (((uint64_t)d)<<32);
    spin_unlock_irqrestore(&timer_lock, flags);
    if (i < SKIPITER) continue;
    delta = stop-start;
    if (delta < min) min = delta;
    if (delta > max) max = delta;
    if (delta > 30000) above30k++;
  }
  seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k);
  avg_ticks = (stop - first) / ITERATIONS;
  seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks);
  return 0;
}

Then if I run:

# cat /proc/timing_test
min: 4176 max: 58248 above30k: 56
Average total ticks/iteration: 4365

This is on a 3.4 GHz sandy bridge generation Core i7. The ~4200 ticks of the TSC is about right for a little over 1 microsecond delay. About 0.1% of the time I see delays about 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks.

These delays appear too long to be a single cache miss, even to DRAM. So I think it either has to be several cache misses, or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes of this to see if they are something we can eliminate or if we have to move to a custom processor/FPGA solution.

Things I have tried:

I considered if this could be caused by cache misses. I don't think that could be the case since I ignore the first few iterations which should load the cache. I have verified by examining disassembly that there are no memory operations between the two calls to rdtscp, so I think the only possible cache misses are for the instruction cache.
Just in case, I moved the spin_lock calls around the outer loop. Then it shouldn't be possible to have any cache misses after the first iteration. However, this made the problem worse.
I had heard that the SMM interrupt is unmaskable and mostly transparent and could cause unwanted preemption. However, you can read the SMI interrupt count with rdmsr on MSR_SMI_COUNT. I tried adding that before and after and there are no SMM interrupts happening while my code is executing.
I understand there are also inter-processor interrupts in SMP systems that may interrupt, but I looked at /proc/interrupts before and after and don't see enough of them to explain this behavior.
I don't know if ndelay() takes into account variable clock speed, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.
I booted with nopti to disable page table isolation in case that is causing problems.

889

asked Feb 05 '18 22:02

Evan

1 Answers

Another thing that I have just noticed is that it is unclear what ndelay() does. Maybe you should show it so as non-trivial problems may be lurking inside it.

For example, I've observed once that my piece of a kernel driver code was still preempted when it had a memory leak inside it, so as soon as it hit some watermark limit, it was put aside even if it disabled interrupts.

189

answered Oct 13 '22 06:10

Grigory Rechistov

Related questions
                            
                                is it possible in C or C++ to create a function inside another?
                            
                                What's the difference between the types - int * and int *[100] in C?
                            
                                What is the most violent way that an application can terminate itself (linux)
                            
                                Why does "++x || ++y && ++z" calculate "++x" first, even though operator "&&" has higher precedence than "||"
                            
                                Is exit status observable behavior?
                            
                                Memcached rubygem + Rlibmemcached argument error with memcache_mget()
                            
                                FFmpeg How to write video to a file
                            
                                C Pointer not getting set
                            
                                Understanding discrepancy between POSIX and Linux/glibc sched_* functions
                            
                                Disable warning/error for default warnings
                            
                                How much does the C standard library extensibility affect C++ programs?
                            
                                Why c++ standard support function strftime but not strptime?
                            
                                ARM Deliberately Bloating Compiled Code?
                            
                                Proper way to get the preferred size of Windows controls
                            
                                Reuse code memory for data
                            
                                Is there any workaround to "reserve" a cache fraction?
                            
                                Memory not requested being allocated
                            
                                Visual Studio 2015 "stealing" the application's console

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With