
How can I reproduce a zombie process with bash as PID 1 in Docker?

I have a Docker container that runs bash at PID1 which in turn runs a long-running (complex) service that sometimes produces zombie processes parented to the bash at PID1. These zombies are seemingly never reaped.

I'm trying to reproduce this issue in a minimal container so that I can test mitigations, such as using a proper init as PID1 rather than bash.

However, I have been unable to reproduce the zombie processes. The bash at PID1 seems to reap children, even those it inherited from another process.

Here is what I tried:

docker run -d ubuntu:14.04 bash -c \
  'bash -c "start-stop-daemon --background --start --pidfile /tmp/sleep.pid --exec /bin/sleep -- 30; sleep 300"'

My expectation was that start-stop-daemon would double-fork to create a process parented to the bash at PID1, then exec into sleep 30, and when the sleep exits I expected the process to remain as a zombie. The sleep 300 simulates a long-running service.

However, bash reaps the process, and I can observe that by running strace on the bash process (from the host machine running docker):

$ sudo strace -p 2051
strace: Process 2051 attached
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9
wait4(-1,

I am running docker 1.11.1-rc1, though I have the same experience with docker 1.9.

$ docker --version
Docker version 1.11.1-rc1, build c90c70c
$ uname -r
4.4.8-boot2docker

Given that strace shows bash reaping (orphaned) children, is bash a suitable PID1 in a docker container? What else might be causing the zombies I'm seeing in the more complex container? How can I reproduce?

Edit:

I managed to attach strace to a bash PID1 on one of the live containers exhibiting the problem.

Process 20381 attached
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11185
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 11191
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 11203
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11155
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11151
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11152
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11154
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 11332
...

Not sure exactly what all those exiting processes are, but none of the PIDs match those of the few defunct zombie processes that were shown by docker exec $id ps aux | grep defunct.

Maybe the trick is to catch it in action and see what wait4() returns on a process that remains a zombie...

Patrick asked May 04 '16 08:05




1 Answer

I also wanted to verify whether my Jenkins slave containers can generate zombies.

Since my images run the scl binary, which in turn starts the Java JNLP client, I ran the following in the Jenkins slave Groovy script console:

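// Start a background sleep that is detached from the bash that launched it
def process = new ProcessBuilder("bash", '-c', 'sleep 10 </dev/null &>/dev/null & disown').redirectErrorStream(true).start()
println process.inputStream.text
// List processes (re-run after the sleep exits to check for <defunct> entries)
println "ps -ef".execute().text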

Zombies were generated. That was with scl ending up as PID 1.

Then I looked at your question and decided to try out bash. My first attempt was changing ENTRYPOINT to this:

bash -c "/usr/bin/scl enable rh-ror42 -- /usr/local/bin/run-jnlp-client $1 $2" --

Then, looking at the ps output, I realized that PID 1 was not bash but still the scl binary. Finally I changed the command to:

bash -c "/usr/bin/scl enable rh-ror42 -- /usr/local/bin/run-jnlp-client $1 $2 ; ls" --

That is, I added an arbitrary second command after the scl command. And voila: bash became PID 1 and no more zombies were generated.
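A quick way to see the difference outside of my images is the sketch below. It assumes a Linux /proc filesystem and a bash built with the usual single-command optimization for -c; the variable names are just for illustration. Because bash expands $$ before running anything, the command reports the name of whatever process now holds bash's original PID.

```shell
#!/usr/bin/env bash
# One simple command: bash is expected to exec it directly,
# so the PID that used to be bash now belongs to cat.
single=$(bash -c 'cat /proc/$$/comm')

# Two commands: bash has to stay around to run the second one,
# so the PID still belongs to bash while cat runs as its child.
multi=$(bash -c 'cat /proc/$$/comm; true')

echo "single command: $single"
echo "two commands:   $multi"
```

With the optimization in place I would expect the first line to name cat and the second to name bash. The details can differ between bash versions, so treat it as a check rather than a guarantee.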

Looking at your example, I see that you run bash -c with more than one command, so your test bed behaves like my last command. In your work containers, however, you probably run bash -c with a single command, and in that case bash appears to be clever enough to exec the command directly instead of forking a child for it. If so, bash is not actually PID 1 in the containers that generate zombies, contrary to what you expect.
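To illustrate why that matters, here is a minimal sketch (not taken from my containers) of what happens when the parent is replaced by a program that never calls wait(). It assumes Linux /proc; the temporary file and variable names are hypothetical. The inner bash forks a short sleep, records its PID, and then execs a long sleep, which stands in for a service that does not reap children.

```shell
#!/usr/bin/env bash
pidfile=$(mktemp)

# "$0" inside the -c script is the pidfile path passed as the first argument.
# After the exec, the parent of the short sleep is `sleep 10`, not bash.
bash -c 'sleep 1 & echo "$!" > "$0"; exec sleep 10' "$pidfile" &
parent=$!

# Give the short-lived child time to exit.
sleep 2
child=$(cat "$pidfile")

# In /proc/<pid>/stat the field after "(comm)" is the state; Z means zombie.
stat_line=$(cat "/proc/$child/stat")
rest=${stat_line##*) }
child_state=${rest%% *}
echo "child $child (parent $parent) state: $child_state"

# Clean up: once the parent dies, the zombie is re-parented and reaped.
kill "$parent" 2>/dev/null || true
wait "$parent" 2>/dev/null || true
rm -f "$pidfile"
```

If the child shows state Z here, that is the same situation as a non-reaping binary sitting at PID 1, except that in a container nothing above PID 1 is left to clean up.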

Perhaps you can run ps -ef inside your existing work containers and verify whether my guess is correct.
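If ps is not installed in the image, something along these lines can give the same information from /proc. This is only a sketch, assuming a Linux /proc and bash inside the container; the variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Name of the process at PID 1 in the current PID namespace.
pid1_name=$(cat /proc/1/comm)
echo "PID 1 is: $pid1_name"

# Walk all processes and print each zombie with its parent PID.
zombie_count=0
for stat in /proc/[0-9]*/stat; do
  # A process may vanish between the glob and the read; skip it if so.
  line=$(cat "$stat" 2>/dev/null) || continue
  pid=${line%% *}
  rest=${line##*) }    # drop "pid (comm) ", leaving "state ppid ..."
  state=${rest%% *}
  rest=${rest#* }
  ppid=${rest%% *}
  if [ "$state" = "Z" ]; then
    echo "zombie pid=$pid ppid=$ppid"
    zombie_count=$((zombie_count + 1))
  fi
done
echo "zombies found: $zombie_count"
```

Saved as a file, it could be fed to a container with docker exec -i $id bash < script.sh (the file name is hypothetical). A ppid of 1 together with a PID 1 name other than bash would support the guess above.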

akostadinov answered Sep 28 '22 06:09