Segmentation fault itself is hanging

Tags: c++, c, linux

I have had some problems with a server today, and I have now boiled it down to this: the server is not able to get rid of processes that segfault.

After a process segfaults, it just keeps hanging instead of getting killed.

Here is a test program that should cause the error Segmentation fault (core dumped):

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
 char *buf;
 buf = malloc(1<<31);
 fgets(buf, 1024, stdin);
 printf("%s\n", buf);
 return 1;
}

Compile and set permissions with gcc segfault.c -o segfault && chmod +x segfault.

Running this (and pressing enter once) on the problematic server causes it to hang. I also ran it on another server with the same kernel version (and mostly the same packages), and there it segfaults and quits as expected.

Here are the last few lines after running strace ./segfault on both of the servers.

Bad server

"\n", 1024)                     = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
# It hangs here....

Working server

"\n", 1024)                     = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
root@server { ~ }# echo $?
139
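
(For reference: 139 is how the shell reports termination by a signal, 128 + 11 where 11 is SIGSEGV. A minimal sketch of how a parent process can decode this itself, assuming the ./segfault binary from above is in the current directory; you still have to press enter once, since the child reads from stdin.)

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: run the crashing test program from above */
        execl("./segfault", "./segfault", (char *)NULL);
        _exit(127);  /* only reached if exec fails */
    }

    int status;
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status))
        printf("child killed by signal %d\n", WTERMSIG(status)); /* 11 == SIGSEGV */
    else if (WIFEXITED(status))
        printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}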

When the process hangs (after it has segfaulted), this is how it looks.

I am not able to ^C it:

root@server { ~ }# ./segfault

^C^C^C

Entry from ps aux

root 22944 0.0 0.0 69700 444 pts/18 S+ 15:39 0:00 ./segfault

cat /proc/22944/stack

[<ffffffff81223ca8>] do_coredump+0x978/0xb10
[<ffffffff810850c7>] get_signal_to_deliver+0x1c7/0x6d0
[<ffffffff81013407>] do_signal+0x57/0x6c0
[<ffffffff81013ad9>] do_notify_resume+0x69/0xb0
[<ffffffff8160bbfc>] retint_signal+0x48/0x8c
[<ffffffffffffffff>] 0xffffffffffffffff

Another odd thing is that I am unable to attach strace to a hanging segfaulted process. Doing so actually causes it to get killed.

root@server { ~ }# strace -p 1234
Process 1234 attached
+++ killed by SIGSEGV (core dumped) +++

ulimit -c 0 is set, and ulimit -c, ulimit -H -c, and ulimit -S -c all show the value 0.
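
For completeness, the same limits can also be read from inside a C program with getrlimit; a minimal sketch that only inspects RLIMIT_CORE:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    /* rlim_cur is the soft limit, rlim_max the hard limit;
       0 means "do not write a core file" */
    printf("core soft limit: %llu\n", (unsigned long long)rl.rlim_cur);
    printf("core hard limit: %llu\n", (unsigned long long)rl.rlim_max);
    return 0;
}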

  • Kernel version: 3.10.0-229.14.1.el7.x86_64
  • Distro-version: Red Hat Enterprise Linux Server release 7.1 (Maipo)
  • Running in VMware

The server is working as it should on everything else.

Update: Shutting down abrt (systemctl stop abrtd.service) fixed the problem, both for processes that were already hung after core-dumping and for new core-dumping processes. Starting abrt up again did not bring the problem back.

Update 2016-01-26: We got a problem that looked similar, but not quite the same. The same initial test code as above was hanging. The output of cat /proc/<pid>/maps was

00400000-00401000 r-xp 00000000 fd:00 13143328                           /root/segfault
00600000-00601000 r--p 00000000 fd:00 13143328                           /root/segfault
00601000-00602000 rw-p 00001000 fd:00 13143328                           /root/segfault
7f6c08000000-7f6c08021000 rw-p 00000000 00:00 0
7f6c08021000-7f6c0c000000 ---p 00000000 00:00 0
7f6c0fd5b000-7f6c0ff11000 r-xp 00000000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c0ff11000-7f6c10111000 ---p 001b6000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10111000-7f6c10115000 r--p 001b6000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10115000-7f6c10117000 rw-p 001ba000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10117000-7f6c1011c000 rw-p 00000000 00:00 0
7f6c1011c000-7f6c1013d000 r-xp 00000000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c10330000-7f6c10333000 rw-p 00000000 00:00 0
7f6c1033b000-7f6c1033d000 rw-p 00000000 00:00 0
7f6c1033d000-7f6c1033e000 r--p 00021000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c1033e000-7f6c1033f000 rw-p 00022000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c1033f000-7f6c10340000 rw-p 00000000 00:00 0
7ffc13b5b000-7ffc13b7c000 rw-p 00000000 00:00 0                          [stack]
7ffc13bad000-7ffc13baf000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

However, the smaller C code to trigger a segfault (int main(void){*(volatile char*)0=0;}) did cause a segfault and did not hang...

asked Nov 12 '15 by xeor


1 Answer

WARNING - this answer contains a number of suppositions based on the incomplete information to hand. Hopefully it is still useful though!

Why does the segfaulting process appear to hang?

As the stack trace shows, the kernel is busy creating a core dump of the crashed process.

But why does this take so long? A likely explanation is that the method you are using to create the segfaults is resulting in the process having a massive virtual address space.

As pointed out in the comments by M.M., the outcome of the expression 1<<31 is undefined by the C standards (on platforms where int is 32 bits wide, shifting 1 into the sign bit overflows a signed int), so it is difficult to say what actual value is being passed to malloc, but based on the subsequent behavior I am assuming it is a large number.

Note that for malloc to succeed it is not necessary for you to actually have this much RAM in your system - the kernel will expand the virtual size of your process, but physical RAM will only be allocated when your program actually accesses it.
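
A rough sketch of that distinction, assuming a Linux system where /proc/self/status is readable: the program requests a few GiB without touching the memory, and VmSize (virtual size) jumps while VmRSS (resident memory) barely changes. (If overcommit is disabled or the heuristic refuses the request, malloc simply returns NULL and the sketch reports it.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmSize and VmRSS lines from /proc/self/status. */
static void print_mem_lines(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("--- %s ---\n", label);
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
}

int main(void)
{
    print_mem_lines("before malloc");

    /* Request 4 GiB; written as (size_t)4 << 30 to avoid the 1<<31 overflow issue. */
    char *buf = malloc((size_t)4 << 30);
    if (!buf) {
        puts("malloc failed");
        return 1;
    }
    print_mem_lines("after malloc (untouched)");

    free(buf);
    return 0;
}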

I believe the call to malloc succeeds, or at least returns, because you state that it segfaults after you press enter, so after the call to fgets.

In any case, the segfault is leading the kernel to perform a core dump. If the process has a large virtual size, that could take a long time, especially if the kernel decides to dump all pages, even those that have never been touched by the process. I am not sure whether it does that, but if it did, and if there was not enough RAM in the system, it would have to start swapping pages in and out of memory in order to write them to the core dump. This would generate a high I/O load, which could make the process appear unresponsive (and would degrade overall system performance).
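
If the sheer size of such an allocation turns out to be the culprit, one possible application-level mitigation (not something the test program above does) is to ask the kernel to leave the region out of core dumps with madvise(MADV_DONTDUMP), available since Linux 3.4; a minimal sketch using an anonymous mmap:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = (size_t)4 << 30;   /* 4 GiB of anonymous memory */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Exclude this region from any core dump; on kernels older than 3.4
       madvise fails with EINVAL and core dump behaviour is unchanged. */
    if (madvise(buf, len, MADV_DONTDUMP) != 0)
        perror("madvise(MADV_DONTDUMP)");

    /* ... use buf ... */
    munmap(buf, len);
    return 0;
}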

You may be able to verify some of this by looking in the abrtd dump directory (possibly /var/tmp/abrt, or check /etc/abrt/abrt.conf) where you may find the core dumps (or perhaps partial core dumps) that have been created.

If you are able to reproduce the behavior, then you can check:

  • /proc/[pid]/maps to see the address space map of the process and check whether it really is large (a rough sketch for totalling the mapped ranges follows this list)
  • Use a tool like vmstat to see whether the system is swapping, how much I/O is going on, and how much time is being spent in I/O wait
  • If you had sar running then you may be able to see similar information even for the period prior to restarting abrtd.
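
A rough sketch for the first point, totalling the mapped ranges of a given pid by parsing /proc/[pid]/maps (illustration only, with minimal error handling):

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/maps", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* Each line starts with "start-end" addresses in hex; sum their sizes. */
    unsigned long long start, end, total = 0;
    char line[512];
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%llx-%llx", &start, &end) == 2)
            total += end - start;
    }
    fclose(f);

    printf("total mapped: %llu MiB\n", total >> 20);
    return 0;
}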

Why is a core dump created, even though ulimit -c is 0?

According to this bug report, abrtd will trigger collection of a core dump regardless of ulimit settings.

Why did this not start happening again when abrtd was started up once more?

There are a couple of possible explanations for that. For one thing, it would depend on the amount of free RAM in the system. It might be that a single core dump of a large process would not take that long, and not be perceived as hanging, if there is enough free RAM and the system is not pushed to swap.

If in your initial experiments you had several processes in this state, then the symptoms would be far worse than is the case when just getting a single process to misbehave.

Another possibility is that the configuration of abrtd had been altered but the service not yet reloaded, so that when you restarted it, it began using the new configuration, perhaps changing its behavior.

It is also possible that a yum update had updated abrtd, but not restarted it, so that when you restarted it, the new version was running.

answered Oct 20 '22 by harmic