UPDATES on Oct 25: Now I found out what's causing the problem. 1) The child process kills itself, that's why strace/perf/auditctl cannot track it down. 2) The JNI call to create a process is triggered from a Java thread. When the thread eventually dies, it's also destroying the process that it creates. 3) In my code to fork and execve() a child process, I have the code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); My fault that I didn't pay special attention to this flag before b/c it's considered as a BEST PRACTICE for my other projects where child process is forked from the main thread. 4) If I comment out this line, the problem is gone. The original purpose is to kill the child process when the parent process is gone. Even w/o this flag, it's still the correct behavior. Seems like the ubuntu box default behavior. 5) Finally found it's a kernel bug, fixed in kernel version 3.4.0, my ubuntu box from AWS is kernel version 3.13.0-29-generic. There are a couple of useful links to the issues: a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit. c) https://bugzilla.kernel.org/show_bug.cgi?id=43300 UPDATES on Oct 15: Thanks so much for all the suggestions. I am investigating from one area of the system to another area. It's hard 2 find a reason. I am wondering 2 things. 1) why are powerful tools such as strace, auditctl and perf script not able to track down who caused the kill? 2) Is +++ killed by SIGKILL +++ really means its killed from signal? ORIGINAL POST I have a long running C process launched from a Java application server in Ubuntu 12 through the JNI interface. The reason I use JNI interface to start a process instead of through Java's process builder is b/c of the performance reasons. It's very inefficient for java process builder to do IPC especially b/c extra buffering introduces very long delay. Periodically it is terminated by SIGKILL mysteriously. The way I found out is through strace, which says: "+++ killed by SIGKILL +++" I checked the following: <ol> <li>It's not a crash.</li> <li>It's not a OOM. Nothing in dmesg. My process uses only 3.3% of 1Gbytes of memory.</li> <li>Java layer didn't kill the process. I put a log in the JNI code if the code terminates the process, but no log was written to indicate that.</li> <li>It's not a permission issue. I tried to run as sudo or a different user, both cases causes the process to be killed.</li> <li>If I run the process locally in a shell, everything works fine. What's more, in my C code for my long-running process, I ignore the signal SIGHUP. Only when it's running as a child process of Java server, it gets killed.</li> <li>The process is very CPU intensive. It's using 30% of the CPU. There are lots of voluntary context switch and nonvoluntary_ctxt_switches.</li> <li>(NEW UPDATE) One IMPORTANT thing very likely related to why my process is killed. If the process do some heavy lifting, it won't be killed, however, sometimes it's doing little CPU intensive work. When that happens, after a while, roughly 1 min, it is killed. It's status is always S(Sleeping) instead of R(Running). It seems that the OS decides to kill the process if it was idle most of the time, and not kill the process if it was busy.</li> <li>I suspect Java's GC is the culprit, however, Java will NEVER garbage collect a singleton object associated with JNI. (My JNI object is tied to that singleton).</li> </ol> I am puzzled by the reason why it's terminated. Does anyone has a good suggestion how to track it down? p.s. <ol> <li> On my ubuntu limit -a result is: <pre class="prettyprint"><code>core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 7862 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 65535 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 7862 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited </code></pre> I tried to increase the limits, and still does not solve the issue. <pre class="prettyprint"><code>core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 65535 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) unlimited real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited </code></pre> </li> <li> Here is proc status when I run cat /proc/$$$/status <pre class="prettyprint"><code>Name: mimi_coso State: S (Sleeping) Tgid: 2557 Ngid: 0 Pid: 2557 PPid: 2229 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 256 Groups: 0 VmPeak: 146840 kB VmSize: 144252 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 36344 kB VmRSS: 34792 kB VmData: 45728 kB VmStk: 136 kB VmExe: 116 kB VmLib: 23832 kB VmPTE: 292 kB VmSwap: 0 kB Threads: 1 SigQ: 0/7862 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000004 SigIgn: 0000000000011001 SigCgt: 00000001c00064ee CapInh: 0000000000000000 CapPrm: 0000001fffffffff CapEff: 0000001fffffffff CapBnd: 0000001fffffffff Seccomp: 0 Cpus_allowed: 7fff Cpus_allowed_list: 0-14 Mems_allowed: 00000000,00000001 Mems_allowed_list: 0 voluntary_ctxt_switches: 16978 nonvoluntary_ctxt_switches: 52120 </code></pre> </li> <li> strace shows: <pre class="prettyprint"><code>$ strace -p 22254 -s 80 -o /tmp/debug.lighttpd.txt read(0, "SGI\0\1\0\0\0\1\0c\0\0\0\t\0\0T\1\2248\0\0\0\0'\1\0\0(\0\0"..., 512) = 113 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224L\0\0\0\0/\377\373\222D\231\214"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \272\1\0\0\t\0\1\253\1\243\273\0\0\0\0'\1\0\0\0\0\0\1\242"..., 454) = 454 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254Z\0\0\0\0/\377\373R\4\0\17\21!"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0)\3\0\0\t\0\3\32\1\224`\0\0\0\0'\1\0\0\310\0\0"..., 512) = 512 read(0, "\344u\233\16\257\341\315\254\272\300\351\302\324\263\212\351\225\365\1\241\225\3+\276J\273\37R\234R\362z"..., 512) = 311 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224f\0\0\0\0/\377\373\222d[\210"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 %!\0\0\t\0\0+\1\243\335\0\0\0\0\27\0\0\0\0\1B\300\36"..., 8497) = 8497 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254t\0\0\0\0/\377\373R\4\0\17\301\31"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224\200\0\0\0\0/\377\373\222d/\200"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254\216\0\0\0\0/\377\373R\4\0\17\361+"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0\221\0\0\0\t\0\0\202\1\224\210\0\0\0\0'\1\0\0P\0\0"..., 512) = 159 read(0, unfinished ...) +++ killed by SIGKILL +++ </code></pre> </li> </ol>

Finally I figured out the reason why. The child process kills itself and it's a linux kernel bug. Details: 1) The child process kills itself, that's why strace/perf/auditctl cannot track it down. 2) The JNI call to create a process is triggered from a Java thread. When the thread eventually dies, it's also destroying the process that it creates. 3) In my code to fork and execve() a child process, I have the code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); I didn't pay special attention to this flag before b/c it's considered as a BEST PRACTICE for my other projects where child process is forked from the main thread. 4) If I comment out this line, the problem is gone. The original purpose is to kill the child process when the parent process is gone. Even w/o this flag, it's still the correct behavior. Seems like the ubuntu box default behavior. 5) From this article, https://bugzilla.kernel.org/show_bug.cgi?id=43300. it's a kernel bug, fixed in kernel version 3.4.0, my ubuntu box from AWS is kernel version 3.13.0-29-generic. My machine configuration: ===>Ubuntu 14.04 LTS ===>3.13.0-29-generic Some useful links to the issues: a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit c) https://bugzilla.kernel.org/show_bug.cgi?id=43300

Who sends a SIGKILL to my process mysteriously on ubuntu server

Tags:

UPDATES on Oct 25:

Now I found out what's causing the problem.

1) The child process kills itself, that's why strace/perf/auditctl cannot track it down.

2) The JNI call to create a process is triggered from a Java thread. When the thread eventually dies, it's also destroying the process that it creates.

3) In my code to fork and execve() a child process, I have the code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); My fault that I didn't pay special attention to this flag before b/c it's considered as a BEST PRACTICE for my other projects where child process is forked from the main thread.

4) If I comment out this line, the problem is gone. The original purpose is to kill the child process when the parent process is gone. Even w/o this flag, it's still the correct behavior. Seems like the ubuntu box default behavior.

5) Finally found it's a kernel bug, fixed in kernel version 3.4.0, my ubuntu box from AWS is kernel version 3.13.0-29-generic.

There are a couple of useful links to the issues:

a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit.

c) https://bugzilla.kernel.org/show_bug.cgi?id=43300

UPDATES on Oct 15:

Thanks so much for all the suggestions. I am investigating from one area of the system to another area. It's hard 2 find a reason.

I am wondering 2 things. 1) why are powerful tools such as strace, auditctl and perf script not able to track down who caused the kill?

2) Is +++ killed by SIGKILL +++ really means its killed from signal?

ORIGINAL POST

I have a long running C process launched from a Java application server in Ubuntu 12 through the JNI interface. The reason I use JNI interface to start a process instead of through Java's process builder is b/c of the performance reasons. It's very inefficient for java process builder to do IPC especially b/c extra buffering introduces very long delay.

Periodically it is terminated by SIGKILL mysteriously. The way I found out is through strace, which says: "+++ killed by SIGKILL +++"

I checked the following:

It's not a crash.
It's not a OOM. Nothing in dmesg. My process uses only 3.3% of 1Gbytes of memory.
Java layer didn't kill the process. I put a log in the JNI code if the code terminates the process, but no log was written to indicate that.
It's not a permission issue. I tried to run as sudo or a different user, both cases causes the process to be killed.
If I run the process locally in a shell, everything works fine. What's more, in my C code for my long-running process, I ignore the signal SIGHUP. Only when it's running as a child process of Java server, it gets killed.
The process is very CPU intensive. It's using 30% of the CPU. There are lots of voluntary context switch and nonvoluntary_ctxt_switches.
(NEW UPDATE) One IMPORTANT thing very likely related to why my process is killed. If the process do some heavy lifting, it won't be killed, however, sometimes it's doing little CPU intensive work. When that happens, after a while, roughly 1 min, it is killed. It's status is always S(Sleeping) instead of R(Running). It seems that the OS decides to kill the process if it was idle most of the time, and not kill the process if it was busy.
I suspect Java's GC is the culprit, however, Java will NEVER garbage collect a singleton object associated with JNI. (My JNI object is tied to that singleton).

I am puzzled by the reason why it's terminated. Does anyone has a good suggestion how to track it down?

p.s.

On my ubuntu limit -a result is:

core file size          (blocks, -c) 0 data seg size           (kbytes, -d) unlimited scheduling priority             (-e) 0 file size               (blocks, -f) unlimited pending signals                 (-i) 7862 max locked memory       (kbytes, -l) 64 max memory size         (kbytes, -m) unlimited open files                      (-n) 65535 pipe size            (512 bytes, -p) 8 POSIX message queues     (bytes, -q) 819200 real-time priority              (-r) 0 stack size              (kbytes, -s) 8192 cpu time               (seconds, -t) unlimited max user processes              (-u) 7862 virtual memory          (kbytes, -v) unlimited file locks                      (-x) unlimited

I tried to increase the limits, and still does not solve the issue.

core file size          (blocks, -c) 0 data seg size           (kbytes, -d) unlimited scheduling priority             (-e) 0 file size               (blocks, -f) unlimited pending signals                 (-i) unlimited max locked memory       (kbytes, -l) unlimited max memory size         (kbytes, -m) unlimited open files                      (-n) 65535 pipe size            (512 bytes, -p) 8 POSIX message queues     (bytes, -q) unlimited real-time priority              (-r) 0 stack size              (kbytes, -s) 8192 cpu time               (seconds, -t) unlimited max user processes              (-u) unlimited virtual memory          (kbytes, -v) unlimited file locks                      (-x) unlimited

Here is proc status when I run cat /proc/$$$/status

Name:   mimi_coso State:  S (Sleeping) Tgid:   2557 Ngid:   0 Pid:    2557 PPid:   2229 TracerPid:  0 Uid:    0   0   0   0 Gid:    0   0   0   0 FDSize: 256 Groups: 0 VmPeak:   146840 kB VmSize:   144252 kB VmLck:         0 kB VmPin:         0 kB VmHWM:     36344 kB VmRSS:     34792 kB VmData:    45728 kB VmStk:       136 kB VmExe:       116 kB VmLib:     23832 kB VmPTE:       292 kB VmSwap:        0 kB Threads:    1 SigQ:   0/7862 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000004 SigIgn: 0000000000011001 SigCgt: 00000001c00064ee CapInh: 0000000000000000 CapPrm: 0000001fffffffff CapEff: 0000001fffffffff CapBnd: 0000001fffffffff Seccomp:    0 Cpus_allowed:   7fff Cpus_allowed_list:  0-14 Mems_allowed:   00000000,00000001 Mems_allowed_list:  0 voluntary_ctxt_switches:    16978 nonvoluntary_ctxt_switches: 52120

strace shows:

$ strace -p 22254 -s 80 -o /tmp/debug.lighttpd.txt read(0, "SGI\0\1\0\0\0\1\0c\0\0\0\t\0\0T\1\2248\0\0\0\0'\1\0\0(\0\0"..., 512) = 113 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224L\0\0\0\0/\377\373\222D\231\214"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \272\1\0\0\t\0\1\253\1\243\273\0\0\0\0'\1\0\0\0\0\0\1\242"..., 454) = 454 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254Z\0\0\0\0/\377\373R\4\0\17\21!"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0)\3\0\0\t\0\3\32\1\224`\0\0\0\0'\1\0\0\310\0\0"..., 512) = 512 read(0, "\344u\233\16\257\341\315\254\272\300\351\302\324\263\212\351\225\365\1\241\225\3+\276J\273\37R\234R\362z"..., 512) = 311 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224f\0\0\0\0/\377\373\222d[\210"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 %!\0\0\t\0\0+\1\243\335\0\0\0\0\27\0\0\0\0\1B\300\36"..., 8497) = 8497 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254t\0\0\0\0/\377\373R\4\0\17\301\31"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224\200\0\0\0\0/\377\373\222d/\200"..., 512) = 448 sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107 write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254\216\0\0\0\0/\377\373R\4\0\17\361+"..., 237) = 237 read(0, "SGI\0\1\0\0\0\1\0\221\0\0\0\t\0\0\202\1\224\210\0\0\0\0'\1\0\0P\0\0"..., 512) = 159 read(0,  unfinished ...)  +++ killed by SIGKILL +++

229

asked Oct 09 '14 18:10

Howy

2 Answers

Assuming that you have root access on your machine, you can enable audit on kill(2) syscall to gather such information.

root # auditctl -a exit,always -F arch=b64 -S kill -F a1=9 root # auditctl -l LIST_RULES: exit,always arch=3221225534 (0xc000003e) a1=9 (0x9) syscall=kill  root # sleep 99999 & [2] 11688 root # kill -9 11688  root # ausearch -sc kill time->Tue Oct 14 00:38:44 2014 type=OBJ_PID msg=audit(1413272324.413:441376): opid=11688 oauid=52872 ouid=0 oses=20 ocomm="sleep" type=SYSCALL msg=audit(1413272324.413:441376): arch=c000003e syscall=62 success=yes exit=0 a0=2da8 a1=9 a2=0 a3=0 items=0 ppid=6107 pid=6108 auid=52872 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsg id=0 tty=pts2 ses=20 comm="bash" exe="/bin/bash" key=(null)

The other way is to set up kernel tracing which may be an over-kill when audit system can do same work.

198

answered Sep 21 '22 01:09

Gary Luo

Finally I figured out the reason why.

The child process kills itself and it's a linux kernel bug.

Details:

1) The child process kills itself, that's why strace/perf/auditctl cannot track it down.

2) The JNI call to create a process is triggered from a Java thread. When the thread eventually dies, it's also destroying the process that it creates.

3) In my code to fork and execve() a child process, I have the code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); I didn't pay special attention to this flag before b/c it's considered as a BEST PRACTICE for my other projects where child process is forked from the main thread.

5) From this article, https://bugzilla.kernel.org/show_bug.cgi?id=43300. it's a kernel bug, fixed in kernel version 3.4.0, my ubuntu box from AWS is kernel version 3.13.0-29-generic.

My machine configuration: ===>Ubuntu 14.04 LTS ===>3.13.0-29-generic

Some useful links to the issues:

a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit

c) https://bugzilla.kernel.org/show_bug.cgi?id=43300

answered Sep 24 '22 01:09

Howy

Related questions
                            
                                Stop Iframe from preventing scrolling of parent document?
                            
                                Parentheses in django if statement
                            
                                AUDIO_OUTPUT_FLAG_FAST denied by client
                            
                                "tput: No value for $TERM and no -T specified " error logged by CRON process
                            
                                Does Spring @Autowired inject beans by name or by type?
                            
                                Is there a way to STOP not TERMINATE instances using auto-scaling in AWS?
                            
                                How do you set a strftime with javascript?
                            
                                Replicate pull to refresh in XCTest UI testing
                            
                                Removing the left padding on an Android EditText
                            
                                Android Test Module (Gradle Plugin 1.3) doesn't work: "debug-classes not found"
                            
                                What does REPEATED field in Google Bigquery mean?
                            
                                Connection to Office 365 by EWS API

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Who sends a SIGKILL to my process mysteriously on ubuntu server

Tags:

Howy

People also ask

2 Answers

Gary Luo

Howy

Recent Activity

Donate For Us