
Can we run multi-process program in docker?

I have some code that uses multiprocessing, like this:

from multiprocessing import Pool

pool = Pool(processes=100)
result = []

for job in job_list:
    result.append(
        pool.apply_async(handle_job, (job,))  # args must be a tuple, hence the trailing comma
    )
pool.close()
pool.join()

This program does heavy calculations on a very big data set, so we need multiple processes to handle the jobs concurrently and improve performance.

I have been told that, to the host system, one Docker container is just one process. So I am wondering how my multiprocessing code will be handled in Docker.

Below are my concerns:

  1. Since the container is just one process, will my multiprocessing code be reduced to multithreading within that process?

  2. Will performance suffer? The reason I use multiprocessing is to get the jobs done concurrently for better performance.

Asked Jul 22 '16 by Kramer Li

1 Answer

I suspect much of the confusion comes from thinking of containers as a lightweight VM. Instead, think of Linux containers as a way to run a process with some settings for namespaces and cgroups.

One of those namespaces is the pid namespace. Inside a new pid namespace, the first process appears as pid 1. From within that namespace you cannot see processes in sibling namespaces or in the host's namespace. From the host's root pid namespace, however, you can see all processes, including those inside every child namespace.

When you fork a new process, it inherits the same namespaces and cgroups, so it gets a new pid within the pid namespace, allowing you to run multiple processes just like any other Linux environment. Inside the container, you can run a ps command (assuming it's included in your image) and see multiple processes running:

$ docker run -it --rm busybox /bin/sh
/ # sleep 30s &
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    7 root      0:00 sleep 30s
    8 root      0:00 ps -ef

The advice to run only a single process per container is not aimed at multi-process or multi-threaded apps; it is aimed at people treating the container as a lightweight VM. They spawn multiple applications that have no hard dependency on each other, such as a web server, a database, and a mail server. When this is done, there are a couple of key issues:

  • Container logs become unusable. Either they are cluttered with multiple processes all writing to the same stdout/stderr, or they are empty because the logs are written to the container filesystem instead, where they are often lost.
  • Error handling is problematic. If the mail server has an error, should the database be shut down and restarted to try to correct the issue? And if you don't kill the whole container, how do you know the mail server is down?

In short, the design of managing containers assumes one application per container, and if you break that assumption, you get to keep both pieces when the tooling doesn't support your use case.

A few words of caution:

  • Once pid 1 exits, your container ends, regardless of whether your forked processes are still running or not. This means all processes are killed and reaped.
  • Typically on Linux, when a parent process dies without waiting on its child pids, the resulting zombie processes are eventually reaped by the init process running as pid 1. This reaping does not cross the pid namespace boundary, so if you fork child processes, make sure the pid 1 inside the container waits on them to clean them up. A common pid 1 process for this task is tini ("init" spelled backwards). There's even a flag to have docker run this for you (--init).
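The reaping point can be illustrated with a small Python sketch (an illustration, not from the original answer; it assumes a Unix-like system, since os.fork is not available on Windows). The parent forks a child and then waits on it, which is exactly the duty pid 1 inside a container takes on for orphaned processes:

```python
import os

# Fork a child that exits immediately. Until the parent calls waitpid,
# the exited child lingers in the process table as a zombie.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit right away
else:
    # parent: reap the child so no zombie is left behind
    reaped, status = os.waitpid(pid, 0)
    print("reaped child", reaped, "exited:", os.WIFEXITED(status))
```

If your main process (pid 1 in the container) forks workers, it must perform this wait itself, or you can delegate it to an init such as tini via `docker run --init`.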
Answered Oct 08 '22 by BMitch