My system is a CentOS 6.3
(running Kernel version 2.6.32-279.el6.x86_64
).
I have a loadable kernel module which is a driver that manages a PCIe card.
If I manually insert the driver using insmod
while the OS is up and running, the driver loads successfully and is operational.
However, if I try to install the driver using rpm and then reboot the system, during startup the OS gets stuck spitting out the following "soft lockup" message for ALL the CPU cores, except for one core that is in "soft lockup" in one of the threads created by my driver.
BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
I searched the net quite a bit for info on this kernel msg / bug, and there are quite a bit of posts about it, none on what causes it or how to debug. Any help with the following questions would really be appreciated:
I am not able to log into the system, I think it's because all the cores are in a "soft lockup" state, and hence cannot trigger a kernel dump from shell prompt. I enabled SysRq, and tried to trigger a kernel dump with SysRq key combo, but no luck. It seems the system is not responding to keyboard (not even responding to CapsLock button). Any suggestions on how I can trigger a kernel dump in this circumstance?
I can imagine the possibly of my driver thread causing "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
From browsing the net, the "migration" thread is used to move tasks from one cpu to another. Can someone please help me understand what this thread exact does? And how it can be affected by other threads, if at all.
A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run. The watchdog daemon will send an non-maskable interrupt (NMI) to all CPUs in the system who, in turn, print the stack traces of their currently running tasks.
A soft lockup is the symptom of a task or kernel thread using and not releasing a CPU for a longer period of time than allowed. The technical reason behind a soft lock involves CPU interrupts and nmi-watchdog. For each online CPU on the system, a watchdog process gets created.
CPU Lockup Lockup is broadly defined as the symptom of a function or task using the CPU and not releasing it for a period of time. The lockup behavior is more often caused by an application use case and occurs during firmware code development, engineering evaluation, or at production programming.
I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.
It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.
Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.
Personally, I upgraded my microcode by booting into Windows and running a BIOS Update. You can check if the micrcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo
before and after the update.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With