Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python subprocess returns wrong exit code

I wrote a script to launch a number of processes (simple unit tests) to run in parallel. It will do N jobs with num_workers parallel processes at a time.

My first implementation ran the processes in batches of num_workers and seemed to be working fine (I used the false command here to test the behavior)

import subprocess

errors = 0
num_workers = 10
N = 100
i = 0

while i < N:
    processes = []
    for j in range(i, min(i+num_workers, N)):
        p = subprocess.Popen(['false'])
        processes.append(p)

    [p.wait() for p in processes]
    exit_codes = [p.returncode for p in processes]

    errors += sum(int(e != 0) for e in exit_codes)
    i += num_workers

print(f"There were {errors}/{N} errors")

However, the tests do not take equal amounts of time so I was sometimes waiting for a slow test to finish. Thus I rewrote it to keep assigning tasks as they finished

import subprocess
import os


errors = 0
num_workers = 40
N = 100
assigned = 0
completed = 0
processes = set()

while completed < N:
    if assigned < N:
        p = subprocess.Popen(['false'])
        processes.add((assigned, p))
        assigned += 1
    if len(processes) >= num_workers or assigned == N:
        os.wait()

    for i, p in frozenset(processes):
        if p.poll() is not None:
            completed += 1
            processes.remove((i, p))
            err = p.returncode
            print(i, err)
            if err != 0:
                errors += 1

print(f"There were {errors}/{N} errors")

However this produces wrong results for the last few processes. For instance, in the example above it produces 98/100 errors instead of 100. I checked and this has nothing to do with concurrency; the 2 latest jobs were returning with exit code 0 for some reason.

Why does this happen?

like image 309
Jose Javier Gonzalez Ortiz Avatar asked Nov 08 '22 10:11

Jose Javier Gonzalez Ortiz


1 Answers

The problem is with os.wait(). It doesn't only wait for a child process to exit: it also returns the pid and "exit status indication" of that child, as the documentation says. This requires waiting until the child process has terminated; but once the child has terminated, its return code is no longer available for poll. Here's a simple test to reproduce the problem:

false_runner.py

import os
import subprocess
p = subprocess.Popen(['false'], stderr=subprocess.DEVNULL)
pid, retcode = os.wait()
print("From os.wait: {}".format(retcode))
print("From popen object before poll: {}".format(p.returncode))
p.poll()
print("From popen object after poll: {}".format(p.returncode))

output

njv@organon:~/tmp$ python false_runner.py
From os.wait: 256
From Popen object before poll: None
From Popen object after poll: 0

The source code for _internal_poll, called by Popen.poll, makes it clear what's happening here: when Popen tries to call _waitpid on its child process's pid, it gets ChildProcessError: [Errno 10] No child processes, and assigns itself a returncode of 0, because there's no way to determine the child process's return code at this point.

The reason this happens only for the last couple of subprocesses in your example is because os.wait is only getting called for the or assigned == N case, and only once or twice, because your subprocess is so fast. If you slow it down a bit, you'll get more random behavior.

As for a fix: I'd probably just replace os.wait() with a sleep.

like image 61
Nathan Vērzemnieks Avatar answered Nov 15 '22 06:11

Nathan Vērzemnieks