I've got some code which fork()s, calls setsid() in the child, and starts some processing. If any of the children quit (waitpid(-1, 0)), I kill all the child process groups:
child_pids = []
for child_func in child_functions:
    pid = fork()
    if pid == 0:
        setsid()
        child_func()
        exit()
    else:
        child_pids.append(pid)

waitpid(-1, 0)
for child_pid in child_pids:
    try:
        killpg(child_pid, SIGTERM)
    except OSError as e:
        if e.errno != 3: # 3 == no such process
            print "Error killing %s: %s" %(child_pid, e)
However, occasionally the call to killpg will fail with “operation not permitted”:
Error killing 22841: [Errno 1] Operation not permitted
Why might this be happening?
A complete, working example:
from signal import SIGTERM
from sys import exit
from time import sleep
from os import *

def slow():
    fork()
    sleep(10)

def fast():
    sleep(1)

child_pids = []
for child_func in [fast, slow, slow, fast]:
    pid = fork()
    if pid == 0:
        setsid()
        child_func()
        exit(0)
    else:
        child_pids.append(pid)

waitpid(-1, 0)
for child_pid in child_pids:
    try:
        killpg(child_pid, SIGTERM)
    except OSError as e:
        print "Error killing %s: %s" %(child_pid, e)
Which yields:
$ python killpg.py
Error killing 23293: [Errno 3] No such process
Error killing 23296: [Errno 1] Operation not permitted
I added some debugging too (slightly modified source). It happens when you try to kill a process group whose leader has already exited and is sitting in zombie (Z) state. It's also easily repeatable with just [fast, fast].
$ python so.py
spawned pgrp 6035
spawned pgrp 6036
Reaped pid: 6036, status: 0
6035 6034 6035 Z (Python)
6034 521 6034 S+ python so.py
6037 6034 6034 S+ sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
6039 6037 6034 R+ grep -i python
killing pg 6035
Error killing 6035: [Errno 1] Operation not permitted
6035 6034 6035 Z (Python)
6034 521 6034 S+ python so.py
6040 6034 6034 S+ sh -c ps -e -o pid,ppid,pgid,state,command | grep -i python
6042 6040 6034 S+ grep -i python
killing pg 6036
Error killing 6036: [Errno 3] No such process
Not sure how to deal with that. Maybe you can put the waitpid in a while loop to reap all terminated child processes, and then proceed with killpg()ing the rest.
But the answer to your question is that you're getting EPERM because you're not allowed to killpg a zombie process group leader (at least on Mac OS).
Also, this is verifiable outside python. If you put a sleep in there, find the pgrp of one of those zombies, and attempt to kill its process group, you also get EPERM:
$ kill -TERM -6115
-bash: kill: (-6115) - Operation not permitted
I also confirmed that this doesn't happen on Linux.
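A minimal sketch of the reaping suggestion above, written for Python 3 (the snippets in this thread are Python 2). The helper names spawn, reap_exited, kill_groups and run_demo are made up for illustration, and the sleep durations are arbitrary. The idea is to collect every child that has already exited before signalling, so that no group consisting only of a zombie leader is left to trigger EPERM. Groups that have vanished then report ESRCH, which is ignored:

```python
import os
import signal
import time


def spawn(child_func):
    """Fork a child that becomes a session (and process group) leader."""
    pid = os.fork()
    if pid == 0:
        try:
            os.setsid()
            child_func()
        finally:
            os._exit(0)  # never fall back into the parent's code
    return pid


def reap_exited(child_pids):
    """Reap, without blocking, every listed child that has already exited."""
    for pid in child_pids:
        try:
            os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            pass  # already reaped, e.g. by an earlier waitpid(-1, 0)


def kill_groups(pgids, sig=signal.SIGTERM):
    """Signal each process group; return the ones that no longer existed."""
    gone = []
    for pgid in pgids:
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:  # ESRCH: the group has already disappeared
            gone.append(pgid)
    return gone


def run_demo():
    # One short-lived child and one long-lived child (durations are arbitrary).
    pids = [spawn(lambda: time.sleep(0.5)), spawn(lambda: time.sleep(10))]
    os.waitpid(-1, 0)       # block until the first child quits
    reap_exited(pids)       # clear any other zombies before signalling
    gone = kill_groups(pids)
    os.waitpid(pids[1], 0)  # collect the child we just terminated
    return pids, gone


if __name__ == "__main__":
    print(run_demo())
```

Reaping frees the pid, so in principle it could be reused by an unrelated process before killpg runs. For a short-lived script that window is usually negligible, but it is worth keeping in mind.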
You apparently can't kill a process group that consists of zombies. When a process exits, it becomes a zombie until someone calls waitpid on it. Typically, init will take ownership of children whose parents have died, to avoid orphan zombie children.
So, the process is still "around" in some sense, but it gets no CPU time and ignores any kill commands sent directly to it. If a process group consists entirely of zombies, however, the behaviour appears to be that killing the process group throws EPERM instead of silently failing. Note that killing a process group containing non-zombies still succeeds.
Example program demonstrating this:
import os
import time

res = os.fork()
if res:
    time.sleep(0.2)
    pgid = os.getpgid(res)
    print pgid

    while 1:
        try:
            print os.kill(-pgid, 9)
        except Exception, e:
            print e
            break

    print 'wait', os.waitpid(res, 0)

    try:
        print os.kill(-pgid, 9)
    except Exception, e:
        print e

else:
    os.setpgid(0, 0)
    while 1:
        pass
The output looks like
56621
None
[Errno 1] Operation not permitted
wait (56621, 9)
[Errno 3] No such process
The parent kills the child with SIGKILL, then tries again. The second time, it gets EPERM, so it waits for the child (reaping it and destroying its process group). So, the third kill produces ESRCH as expected.
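The same experiment can be written with the specific exception classes Python 3 raises for these errno values (ProcessLookupError for ESRCH, PermissionError for EPERM) instead of a catch-all except. This is a sketch only: signal_group and zombie_demo are hypothetical names, the 0.2 second pause is assumed to be long enough for the child to exit, and the first result depends on the platform. Going by the reports in this thread, macOS refuses with EPERM while Linux accepts the signal.

```python
import os
import signal
import time


def signal_group(pgid, sig=signal.SIGTERM):
    """Send sig to a process group and report the outcome as a string."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:  # errno.ESRCH: no such process group
        return "gone"
    except PermissionError:     # errno.EPERM: reported above for zombie-only groups
        return "not permitted"
    return "signalled"


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)  # child becomes leader of its own process group
        finally:
            os._exit(0)       # exit at once; it stays a zombie until reaped
    time.sleep(0.2)             # crude wait for the child to exit
    before = signal_group(pid)  # the group holds only a zombie at this point
    os.waitpid(pid, 0)          # reap the child, which removes the group
    after = signal_group(pid)
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

Catching only these two exceptions lets any other OSError propagate, so an unexpected failure is not silently mistaken for a dead group.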