I am using the subprocess module to start a process from Python. I want to be able to access the output (stdout, stderr) as soon as it is written/buffered.
For example, imagine I want to run a Python file called counter.py via a subprocess. The contents of counter.py are as follows:
import sys

for index in range(10):
    # Write data to standard out.
    sys.stdout.write(str(index))
    # Push buffered data to disk.
    sys.stdout.flush()
The parent process responsible for executing the counter.py example is as follows:
import subprocess

cmd = ['python', 'counter.py']
process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
Using the counter.py example I can access the data before the process has completed. This is great! This is exactly what I want. However, removing the sys.stdout.flush() call prevents the data from being accessed at the time I want it. This is bad! This is exactly what I don't want. My understanding is that the flush() call forces the data to be written to disk, and that before the data is written to disk it exists only in a buffer. Remember, I want to be able to run just about any process. I do not expect the process to perform this kind of flushing, but I still expect the data to be available in real time (or close to it). Is there a way to achieve this?
A quick note about the parent process. You may notice I am using bufsize=1 for line buffering. I was hoping this would cause a flush to disk for every line but it doesn't seem to work that way. How does this argument work?
You will also notice I am using subprocess.PIPE. This is because it appears to be the only value which produces IO objects between the parent and child processes. I have come to this conclusion by looking at the Popen._get_handles method in the subprocess module (I'm referring to the Windows definition here). There are two important variables, c2pread and c2pwrite, which are set based on the stdout value passed to the Popen constructor. For instance, if stdout is not set, the c2pread variable is not set. This is also the case when using file descriptors and file-like objects. I don't really know whether this is significant or not but my gut instinct tells me I would want both read and write IO objects for what I am trying to achieve - this is why I chose subprocess.PIPE. I would be very grateful if someone could explain this in more detail. Likewise, if there is a compelling reason to use something other than subprocess.PIPE I am all ears.
import time
import subprocess
import threading
import Queue


class StreamReader(threading.Thread):
    """
    Threaded object used for reading process output stream (stdout, stderr).
    """

    def __init__(self, stream, queue, *args, **kwargs):
        super(StreamReader, self).__init__(*args, **kwargs)
        self._stream = stream
        self._queue = queue
        # Event used to terminate thread. This way we will have a chance to
        # tie up loose ends.
        self._stop = threading.Event()

    def stop(self):
        """
        Stop thread. Call this function to terminate the thread.
        """
        self._stop.set()

    def stopped(self):
        """
        Check whether the thread has been terminated.
        """
        return self._stop.isSet()

    def run(self):
        while True:
            # Flush buffered data (not sure this actually works?)
            self._stream.flush()
            # Read available data.
            for line in iter(self._stream.readline, b''):
                self._queue.put(line)
            # Breather.
            time.sleep(0.25)
            # Check whether thread has been terminated.
            if self.stopped():
                break


cmd = ['python', 'counter.py']
process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
)
stdout_queue = Queue.Queue()
stdout_reader = StreamReader(process.stdout, stdout_queue)
stdout_reader.daemon = True
stdout_reader.start()

# Read standard out of the child process whilst it is active.
while True:
    # Attempt to read available data.
    try:
        line = stdout_queue.get(timeout=0.1)
        print '%s' % line
    # If data was not read within time out period. Continue.
    except Queue.Empty:
        # No data currently available.
        pass
    # Check whether child process is still active.
    if process.poll() is not None:
        # Process is no longer active.
        break

# Process is no longer active. Nothing more to read. Stop reader thread.
stdout_reader.stop()
Here I am performing the logic which reads standard out from the child process in a thread. This allows for the scenario in which the read blocks until data is available. Instead of waiting for some potentially long period of time, we check whether data becomes available within a timeout period, and continue looping if it does not.
I have also tried another approach using a kind of non-blocking read. This approach uses the ctypes module to access Windows system calls. Please note that I don't fully understand what I am doing here - I have simply tried to make sense of some example code I have seen in other posts. In any case, the following snippet doesn't solve the buffering issue. My understanding is that it's just another way to combat a potentially long read time.
import os
import subprocess
import ctypes
import ctypes.wintypes
import msvcrt

cmd = ['python', 'counter.py']
process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
)


def read_output_non_blocking(stream):
    data = ''
    available_bytes = 0
    c_read = ctypes.c_ulong()
    c_available = ctypes.c_ulong()
    c_message = ctypes.c_ulong()
    fileno = stream.fileno()
    handle = msvcrt.get_osfhandle(fileno)
    # Read available data.
    buffer_ = None
    bytes_ = 0
    status = ctypes.windll.kernel32.PeekNamedPipe(
        handle,
        buffer_,
        bytes_,
        ctypes.byref(c_read),
        ctypes.byref(c_available),
        ctypes.byref(c_message),
    )
    if status:
        available_bytes = int(c_available.value)
    if available_bytes > 0:
        data = os.read(fileno, available_bytes)
        print data
    return data


while True:
    # Read standard out for child process.
    stdout = read_output_non_blocking(process.stdout)
    print stdout
    # Check whether child process is still active.
    if process.poll() is not None:
        # Process is no longer active.
        break
Comments are much appreciated.
Cheers
At issue here is buffering by the child process. Your subprocess code already works as well as it could, but if you have a child process that buffers its output then there is nothing that subprocess pipes can do about this.
I cannot stress this enough: the buffering delays you see are the responsibility of the child process, and how it handles buffering has nothing to do with the subprocess module.
You already discovered this; this is why adding sys.stdout.flush() in the child process makes the data show up sooner; the child process uses buffered I/O (a memory cache to collect written data) before sending it down the sys.stdout pipe [1].
Python automatically uses line-buffering when sys.stdout is connected to a terminal; the buffer flushes whenever a newline is written. When using pipes, sys.stdout is not connected to a terminal and a fixed-size buffer is used instead.
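As a rough illustration of that last point, here is a minimal sketch (assuming Python 3; the names CHILD_CODE and child_sees_terminal are made up for this example) in which the child simply reports whether its own stdout is a terminal when the parent uses subprocess.PIPE:

```python
import subprocess
import sys

# Child program: report whether its stdout is attached to a terminal.
CHILD_CODE = "import sys; print(sys.stdout.isatty())"


def child_sees_terminal():
    """Run the child with stdout redirected to a pipe and return its report."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The child is attached to a pipe, not a terminal, so this prints "False".
    print(child_sees_terminal())
```

Because the child sees a pipe rather than a terminal, it chooses the fixed-size buffer on its own; nothing the parent passes to Popen changes that decision.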
Now, the Python child process can be told to handle buffering differently; you can set an environment variable or use a command-line switch to alter how it uses buffering for sys.stdout (and sys.stderr and sys.stdin). From the Python command line documentation:
-u
    Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. [...]
PYTHONUNBUFFERED
    If this is set to a non-empty string it is equivalent to specifying the -u option.
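A minimal sketch of how this could look from the parent side, assuming Python 3 and a child that never calls flush(). The child source in CHILD_CODE and the helper read_with_timestamps are hypothetical names for this example only:

```python
import os
import subprocess
import sys
import time

# Child program: writes one number every 0.5 seconds and never flushes.
CHILD_CODE = (
    "import sys, time\n"
    "for index in range(3):\n"
    "    sys.stdout.write(str(index) + '\\n')\n"
    "    time.sleep(0.5)\n"
)


def read_with_timestamps(extra_args=(), env=None):
    """Run the child; return a (seconds_since_start, line) pair per line read."""
    start = time.time()
    process = subprocess.Popen(
        [sys.executable] + list(extra_args) + ["-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        env=env,
    )
    arrivals = []
    # readline returns as soon as the child has handed a full line to the OS.
    for line in iter(process.stdout.readline, b""):
        arrivals.append((time.time() - start, line.strip()))
    process.stdout.close()
    process.wait()
    return arrivals


if __name__ == "__main__":
    for label, args in (("default", []), ("with -u", ["-u"])):
        print(label)
        for seconds, line in read_with_timestamps(args):
            print("  %.2fs %r" % (seconds, line))
```

With -u the lines should arrive roughly half a second apart; without it they typically all arrive together when the child exits. Passing env=dict(os.environ, PYTHONUNBUFFERED="1") instead of the -u switch should have the same effect.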
If you are dealing with child processes that are not Python processes and you experience buffering issues with those, you'll need to look at the documentation of those processes to see if they can be switched to use unbuffered I/O, or be switched to more desirable buffering strategies.
One thing you could try is to use the script -c command to provide a pseudo-terminal to a child process. This is a Unix tool, however, and is probably not available on Windows.
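The same pseudo-terminal idea can be sketched from Python itself with the standard pty module. This is only an illustration under the assumption of a POSIX system (the pty module is not available on Windows) and Python 3; run_under_pty and CHILD_CODE are hypothetical names:

```python
import os
import pty
import subprocess
import sys

# Child program: report whether its stdout is attached to a terminal.
CHILD_CODE = "import sys; print(sys.stdout.isatty())"


def run_under_pty(command):
    """Run command with stdout attached to a pseudo-terminal; return its output."""
    master_fd, slave_fd = pty.openpty()
    process = subprocess.Popen(command, stdout=slave_fd)
    # The parent no longer needs the slave end; the child has its own copy.
    os.close(slave_fd)
    chunks = []
    while True:
        try:
            data = os.read(master_fd, 1024)
        except OSError:
            # On Linux, reading the master fails once the child side is closed.
            break
        if not data:
            break
        chunks.append(data)
    os.close(master_fd)
    process.wait()
    return b"".join(chunks)


if __name__ == "__main__":
    # The child now believes it is talking to a terminal.
    print(run_under_pty([sys.executable, "-c", CHILD_CODE]).strip())
```

A child that picks its buffering mode based on whether stdout is a terminal will then use line buffering, at the cost of terminal side effects such as newline translation in the output.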
[1] It should be noted that when flushing a pipe, no data is 'written to disk'; all data remains entirely in memory here. I/O buffers are just memory caches to get the best performance out of I/O by handling data in larger chunks. Only if you have a disk-based file object would fileobj.flush() cause it to push any buffers to the OS, which usually means that data is indeed written to disk.
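To make the note above concrete, here is a small single-process sketch, assuming Python 3 on a POSIX system (select.select does not work on pipes under Windows). The helper names pipe_has_data and demonstrate_flush are made up for the example:

```python
import os
import select


def pipe_has_data(read_fd):
    """Return True if a read on read_fd would return data without blocking."""
    readable, _, _ = select.select([read_fd], [], [], 0)
    return bool(readable)


def demonstrate_flush():
    read_fd, write_fd = os.pipe()
    # A buffered writer, comparable to a child's sys.stdout when it is piped.
    writer = os.fdopen(write_fd, "wb")
    writer.write(b"hello")
    # The bytes still sit in the writer's in-memory buffer at this point.
    before_flush = pipe_has_data(read_fd)
    writer.flush()
    # flush() handed the bytes to the OS pipe; no disk is involved.
    after_flush = pipe_has_data(read_fd)
    received = os.read(read_fd, 1024)
    writer.close()
    os.close(read_fd)
    return before_flush, after_flush, received


if __name__ == "__main__":
    print(demonstrate_flush())
```

The reader only sees the data after the writer flushes, which is the same situation the parent process is in when the child buffers its output.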
Expect has a command called 'unbuffer' that will disable buffering for any command:
http://expect.sourceforge.net/example/unbuffer.man.html