
Tracking down mysterious high-priority thread suspend inside the kernel

Description

I'm working on an embedded Linux system (using kernel 3.4 and bionic, Android-like) that is running on a multi-core ARMv7a SoC. We have a user-space thread that is basically servicing events coming out of the kernel. The events are generated from an IRQ and must be reacted to by user-space with very low latency.

A thread runs with SCHED_FIFO priority 0. It is the only priority 0 thread in the system. Approximate code for the thread:

    while (1)
    {
        struct pollfd fds[1];
        fds[0].fd = fd;
        fds[0].events = POLLIN|POLLRDNORM|POLLPRI;

        int ret = poll(fds, 1, reallyLongTimeout);
        FTRACE("poll() exit");
        if (ret > 0)
        {
            // notify worker threads of pending events
        }
    }
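For context, a minimal sketch of how such a thread might be set up and what the FTRACE() marker macro could look like (hypothetical helper names, not the real code; this assumes "priority 0" refers to the kernel-internal prio seen in the traces, i.e. the highest user-space SCHED_FIFO priority):

    #include <fcntl.h>
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: give the calling thread the top SCHED_FIFO priority,
     * which ftrace reports as prio=0. */
    static void make_rt(void)
    {
        struct sched_param sp;
        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }

    /* Sketch of FTRACE(): write an annotation to trace_marker so that
     * user-space events show up interleaved with kernel trace events. */
    static void FTRACE(const char *msg)
    {
        static int marker_fd = -1;
        if (marker_fd < 0)
            marker_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
        if (marker_fd >= 0)
            write(marker_fd, msg, strlen(msg));
    }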

Generally we get very good latency (the thread makes a full round-trip back into poll() within the same millisecond the IRQ happened), but randomly we see delays into the tens of milliseconds that wreck everything. After tracing it all over the place I have come to the conclusion that the delay happens after the IRQ fires and before the poll() system call returns, because the thread puts itself to sleep. Then some time later it somehow gets woken up by some unknown force and everything continues on.

I suspected some other IRQ, but after enabling sched:*, irq:*, and timer:* tracing I had to rule it out. I had some difficulty porting the syscalls:* tracers to the ARM kernel. The syscalls tracers work, but if I also enable sched:* I get all sorts of panics inside the ring_buffer code.

After inserting some custom trace points into sys_poll() I've come to the uncomfortable conclusion that my thread falls asleep after sys_poll() returned, but before it re-emerges in user space.
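(For reference, annotations like the ones below can be produced by something as simple as trace_printk() calls around the sleep in do_poll() -- a sketch only; my actual trace points are implemented differently. count and timed_out are do_poll() locals in the 3.4-era fs/select.c:)

        /* sketch: annotations around the sleep in do_poll(), fs/select.c */
        trace_printk("do_poll: going to sleep, count = %d, timed_out = %d\n",
                     count, timed_out);
        if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))
            timed_out = 1;
        trace_printk("do_poll: woke up, count = %d, timed_out = %d\n",
                     count, timed_out);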

Here's an annotated trace with my custom trace points in fs/select.c:

 <my thread>-915   [001] ...1    17.589394: custom: do_poll:786 - calling do_pollfd
 <my thread>-915   [001] ...1    17.589399: custom: do_poll:794 - failed, no events
 <my thread>-915   [001] ...1    17.589402: custom: do_poll:823 - going to sleep, count = 0, timed_out = 0

.... // everything going OK, then the IRQ happens, which runs a tasklet:

 <random proc>-834 [000] d.h1    17.616541: irq_handler_entry: irq=17 name=hwblock handler=hw_block_process_irq
 <random proc>-834 [000] d.h1    17.616569: softirq_raise: vec=6 [action=TASKLET]
 <random proc>-834 [000] d.h1    17.616570: irq_handler_exit: irq=17 ret=handled
 <random proc>-834 [000] ..s2    17.616627: softirq_entry: vec=6 [action=TASKLET]

.... // the tasklet signals the wait queue of the poll, which wakes up my thread:

 <my thread>-915   [001] ...1    17.616827: custom: do_poll:826 - woke up, count = 0, timed_out = 0
 <my thread>-915   [001] ...1    17.616833: custom: do_poll:772 - start of loop
 <my thread>-915   [001] ...1    17.616840: custom: do_poll:786 - calling do_pollfd
 <my thread>-915   [001] ...1    17.616852: custom: do_poll:788 - success, event!
 <my thread>-915   [001] ...1    17.616859: custom: do_poll:810 - bailing, count = 1, timed_out = 0
 <my thread>-915   [001] ...1    17.616862: custom: do_sys_poll:880 - before free_wait()
 <my thread>-915   [001] ...1    17.616867: custom: do_sys_poll:882 - before __put_user()
 <my thread>-915   [001] ...1    17.616872: custom: sys_poll:940 - do_sys_poll - exit

.... // the tasklet exits, and my thread appears to be about to return to user space

 <random proc>-834 [000] .Ns2    17.616922: softirq_exit: vec=6 [action=TASKLET]


.... // wait wait, why is my thread going back to sleep, and what was it doing for 75us?

 <my thread>-915   [001] d..3    17.616947: sched_stat_wait: comm=<another thread> pid=1165 delay=1010000 [ns]
 <my thread>-915   [001] ...2    17.616957: sched_switch: prev_comm=<my thread> prev_pid=915 prev_prio=0 prev_state=S ==> next_comm=<another thread> next_pid=1165 next_prio=120

.... // everything running on for 20ms as if nothing is wrong, then my thread suddenly gets woken up.
.... // nothing pid 947 is doing should have any effect on <my thread>

<random proc>-947  [000] d..4    17.636087: sched_wakeup: comm=<my thread> pid=915 prio=0 success=1 target_cpu=001
<random proc>-1208 [001] ...2    17.636212: sched_switch: prev_comm=<random proc> prev_pid=1208 prev_prio=120 prev_state=R ==> next_comm=<my thread> next_pid=915 next_prio=0
<my thread>-915    [001] ...1    17.636713: tracing_mark_write: poll() exit

So somewhere my thread is becoming TASK_INTERRUPTIBLE and then voluntarily walking into the scheduler, and then... waking up for apparently no reason 20ms later.

The occurrence of this seems at least somewhat timing dependent, and various attempts to observe it often make it harder to reproduce.

Questions

  1. Any ideas what's causing this?
  2. Any suggestions on easy ways of finding out where my thread is falling asleep?
  3. Any suggestions on easy ways of finding out why my thread wakes up?
  4. I've considered somehow adapting unwind_backtrace() to generate a single string I can stuff into each trace_sched_switch() call, but that seems a little daunting. Anything simpler along the same lines? (A rough sketch of what I have in mind follows this list.)
  5. Any idea why tracing syscalls:* and sched:* makes it explode with an unhandled page fault inside the ring buffer code where it needs to move the tail? It seems like it's dereferencing a user-space pointer (based on number similarity) but a different one each time.
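(Re question 4: a rough sketch of the sort of helper I have in mind, built on the generic stack-trace helpers rather than unwind_backtrace() itself -- unwind_backtrace_to_str() is a made-up name here, and this assumes CONFIG_STACKTRACE and CONFIG_KALLSYMS:)

    #include <linux/kernel.h>
    #include <linux/stacktrace.h>

    /* Sketch: capture the current kernel stack and format it into a single
     * string that can be passed along with a tracepoint. */
    static void unwind_backtrace_to_str(char *buf, size_t len)
    {
        unsigned long entries[16];
        struct stack_trace trace = {
            .entries     = entries,
            .max_entries = ARRAY_SIZE(entries),
            .skip        = 2,   /* skip this helper's own frames */
        };
        unsigned int i;
        size_t pos = 0;

        save_stack_trace(&trace);
        for (i = 0; i < trace.nr_entries && pos < len; i++)
            pos += scnprintf(buf + pos, len - pos, "%pS<-",
                             (void *)entries[i]);
    }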

Things I have already tried and checked

  1. It's not a normal IRQ that runs too long or something with disabled interrupts. Tracing with irq:* shows that. It might be some sort of a TrustZone NMI but somehow I doubt it.

  2. It shouldn't be a RT throttling/timeslicing of any sort, because:

    a) sched_rt_runtime_us=10000 and sched_rt_period_us=10000

    b) The thread has a fairly low duty cycle (<30ms per second, in 60-80 events per second)

  3. It's probably not an artifact of having tracing or writing to /sys/kernel/debug/tracing/trace_marker from user space -- it happens without that macro and with tracing disabled (even compiled out of the kernel). Also the code related to that in trace.c and ring_buffer.c seems to be mostly lock-free.

  4. There's nothing else with priority 0, and it's not getting pre-empted but rather seems to willingly unschedule itself.

  5. I put a panic() at the top of syscall_trace() to make sure I'm not accidentally falling into one of the tracing/auditing paths on my way out of sys_poll(). It didn't fire, so that's not it.

Thank you very much in advance

Update #1

I gave up on finding something easy and implemented an unwind_backtrace_to_str() function that lets me instrument various tracepoints with backtrace information. After adding backtraces to trace_sched_switch() and trace_sched_wake() I managed to isolate several causes of latency, with the main two being:

  • Priority inversion due to mm->mmap_sem being taken by some other thread in the same process doing a fork()/mmap()/munmap() and therefore not available during futex_wait() or tracing_mark_write() for the RT thread. This one can mostly be avoided by restructuring some code and using vfork() instead of fork() in some places.

  • Failure to run the newly woken task when sched_wake() is called from a different source CPU than the CPU the task needs to run on. This seems to be a bigger problem. I traced it through the scheduler, and it seems that in the bad case wake_up_process() calls try_to_wake_up(), which winds up calling ttwu_queue(), which is where things get interesting.

Inside ttwu_queue() we do not enter the 'if' because cpus_share_cache() always returns true for any of our cores (that sounds right, shared L2). That means it simply calls ttwu_do_activate() for the task and exits. ttwu_do_activate() seems to only put the task on the right run queue and mark it as TASK_RUNNING, but does not have any SMP handling.
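For reference, the 3.4-era ttwu_queue() looks roughly like this (paraphrased from kernel/sched/core.c, comments mine):

    static void ttwu_queue(struct task_struct *p, int cpu)
    {
        struct rq *rq = cpu_rq(cpu);

    #if defined(CONFIG_SMP)
        /* Only queue the wakeup to the remote CPU (and send it an IPI)
         * when the waking CPU and the target CPU do not share a cache. */
        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
            sched_clock_cpu(cpu);   /* sync clocks across CPUs */
            ttwu_queue_remote(p, cpu);
            return;
        }
    #endif

        raw_spin_lock(&rq->lock);
        ttwu_do_activate(rq, p, 0);
        raw_spin_unlock(&rq->lock);
    }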

I added the following after p->state = TASK_RUNNING; in ttwu_do_wakeup()

#ifdef CONFIG_SMP
      if (task_cpu(p) != smp_processor_id())
          smp_send_reschedule(task_cpu(p));
#endif

and it fixes the problem by forcing the target CPU to run the scheduler. However, I suspect this isn't how it's supposed to work, and even if this is a real bug then there's probably a more refined fix. I checked the latest kernel (3.14) and the code in core.c looks pretty much the same.

Any ideas why this is? Why does it not call ttwu_queue_remote() when cpus_share_cache() returns true? So what if they share cache -- I can see how that's relevant for a migration decision, but why should it matter for whether the wakeup is done locally or remotely? Maybe our cpus_share_cache() should be returning false? That function doesn't seem to be well documented (or I'm not looking in the right places).

asked May 29 '14 by Yuriy Romanenko


2 Answers

Just a wild guess, since there is no answer yet: you say the system is multi-core. Do you pin the user thread to the same core where the interrupt occurs? And does the interrupt occur on one specific core only? I suspect a situation where the user thread runs on one core but the interrupt happens on another and cannot be handled there immediately (the thread not sleeping yet?). A data race could then let the thread fall asleep, e.g. just before the interrupt handler publishes the data the thread polls for, so it stays suspended until the next system interrupt (e.g. a timer).

So, try to assign the interrupt and the thread to the same core in order to serialize them and avoid potential data races.
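(A minimal sketch of pinning the polling thread to one core, assuming the C library exposes sched_setaffinity() and cpu_set_t; the interrupt itself would be pinned separately through its /proc/irq/<N>/smp_affinity mask:)

    #define _GNU_SOURCE
    #include <sched.h>

    /* Sketch: restrict the calling thread to a single CPU so it runs on
     * the same core that services the hardware interrupt.  With pid 0,
     * sched_setaffinity() applies to the calling thread. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }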

In response to update #1:

It looks like I was right about the data race between the cores, since raising an IRQ on the target core fixes the issue. I'd guess the kernel avoids doing this either to spare the extra reschedule IRQs and scheduling overhead for the sake of a very rare case, or because the wakeup can be done faster with ordinary synchronization when the cache is shared.

There is some synchronization there that looks like the right direction, but apparently it misses something. I'd try running a reproducer on a different architecture and kernel version to understand whether this is a general bug or specific to your platform/kernel. I hope it's not a missing memory barrier on the p->on_cpu load/store...

Anyway, returning to your specific problem: if you cannot or don't want to run a custom kernel build with your hot-fix, my thread-affinity suggestion still stands.

Additionally, if you cannot pin the interrupt to one particular core, you may want to run such a polling thread on each core (each explicitly pinned to its core) so that at least one of the threads gets the event immediately after the IRQ. Of course, this puts an additional synchronization burden on the user-thread code.

answered Nov 06 '22 by Anton


We wound up going with the following fixes:

  • The smp_send_reschedule(task_cpu(p)); change in the scheduler mentioned above, to allow cross-CPU preemption. I'll follow up with a maintainer to see if it's the correct fix.

  • Implementing get_user_pages_fast() for our platform so that it does not take mmap_sem when it doesn't have to. This removed the contention between mmap()/munmap() and futex_wait().

  • Switching to vfork() + execve() in a couple of places in userspace code where fork() was unnecessary (a minimal sketch of the pattern follows this list). This removed the contention between mmap()/munmap() and calls that spawn other processes.
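(A sketch of the vfork() + execve() pattern; the command path and arguments are placeholders, and after vfork() the child must only exec or _exit:)

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Sketch: spawn a helper process without duplicating the parent's
     * address space the way fork() does, so mmap_sem is not held for as
     * long.  The child may only call execv()/_exit() after vfork(). */
    static int spawn_helper(void)
    {
        pid_t pid = vfork();
        if (pid == 0) {
            char *const argv[] = { "toolbox", "ls", NULL };
            execv("/system/bin/toolbox", argv);    /* placeholder command */
            _exit(127);                            /* exec failed */
        }
        if (pid < 0)
            return -1;

        int status;
        waitpid(pid, &status, 0);
        return status;
    }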

It seems like everything is running smoothly now.

Thank you for all your help.

answered Nov 06 '22 by Yuriy Romanenko