I'm trying to call a process on a file after part of it has been read. For example:
import subprocess

with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
    header = a.readline()
    subprocess.call(['sort'], stdin=a, stdout=b)
This works fine if I don't read anything from a before calling subprocess.call, but if I read anything from it first, the subprocess doesn't see any input. This is with Python 2.7.3. I can't find anything in the documentation that explains this behaviour, and a (very) brief glance at the subprocess source didn't reveal a cause.
By default, subprocess.run() lets the subprocess inherit the standard input (stdin) of our Python program unchanged. For example, on a Linux or macOS system, the cat - command outputs exactly what it receives from stdin.
To write to a Python subprocess's stdin, we can use the communicate method. We call Popen with the command we want to run as a list and set stdin, stdout, and stderr to PIPE, so that each stream is connected to a pipe we can write to or read from. communicate then sends the input and collects the output.
To start a new process, in other words a subprocess, in Python, you use the Popen constructor. Its first argument is the program to start, usually given as a list of the program name and its arguments. Keyword arguments such as stdin, stdout, and stderr select the files or pipes connected to the child's standard streams.
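As a rough illustration of the Popen plus communicate pattern described above, here is a minimal Python 3 sketch. The child command is an assumption made for portability: a small inline Python program that upper-cases its input stands in for a real filter such as sort or cat, and run_filter is a hypothetical helper name.

```python
import subprocess
import sys

# Hypothetical stand-in for a filter program: upper-cases whatever arrives on stdin.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"]

def run_filter(data):
    """Send `data` (bytes) to the child's stdin and return what it wrote to stdout."""
    proc = subprocess.Popen(CHILD, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # communicate() writes the data, closes the child's stdin, and waits for exit.
    out, _ = proc.communicate(data)
    return out

result = run_filter(b"hello\n")
```

Here all of the input travels through a pipe owned by Python, so no file object is shared with the child.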
If you open the file unbuffered then it works:
import subprocess

with open('in.txt', 'rb', 0) as a, open('out.txt', 'w') as b:
    header = a.readline()
    rc = subprocess.call(['sort'], stdin=a, stdout=b)
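The same idea as a self-contained Python 3 sketch that can be run without an existing in.txt. The temporary file, the sample data, the helper name sort_body, and the inline Python child used in place of the sort binary are all assumptions made for the example.

```python
import os
import subprocess
import sys
import tempfile

# Assumed stand-in for `sort`: a child Python process that sorts its stdin lines.
SORT_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.writelines(sorted(sys.stdin.buffer))"]

def sort_body(in_path, out_path):
    """Read the header line in Python, then let the child sort the remaining lines."""
    # buffering=0 is only allowed in binary mode; readline() then consumes
    # just the header, so the descriptor is left at the start of line 2.
    with open(in_path, "rb", 0) as src, open(out_path, "wb") as dst:
        header = src.readline()
        rc = subprocess.call(SORT_CMD, stdin=src, stdout=dst)
    return header, rc

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.txt")
out_path = os.path.join(workdir, "out.txt")
with open(in_path, "wb") as f:
    f.write(b"name\ncherry\napple\nbanana\n")

header, rc = sort_body(in_path, out_path)
with open(out_path, "rb") as f:
    sorted_body = f.read()
```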
The subprocess module works at the file descriptor level (the low-level unbuffered I/O of the operating system). It may work with os.pipe(), socket.socket(), pty.openpty(), or anything with a valid .fileno() method if the OS supports it.
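To illustrate the file descriptor point, the following Python 3 sketch for POSIX systems hands the read end of an os.pipe() to the child as its stdin. The inline Python child that copies stdin to stdout is an assumed stand-in for cat.

```python
import os
import subprocess
import sys

# Assumed stand-in for `cat`: copies stdin to stdout.
CAT_CMD = [sys.executable, "-c",
           "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

read_fd, write_fd = os.pipe()

# A plain integer file descriptor is accepted for stdin.
proc = subprocess.Popen(CAT_CMD, stdin=read_fd, stdout=subprocess.PIPE)
os.close(read_fd)                  # the parent no longer needs the read end

os.write(write_fd, b"via pipe\n")  # unbuffered write straight to the descriptor
os.close(write_fd)                 # closing the write end gives the child EOF

out, _ = proc.communicate()
```

Because both sides use the descriptor directly, there is no user-space buffer that could hold back data from the child.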
It is not recommended to mix buffered and unbuffered I/O on the same file.
On Python 2, file.flush() causes the output to appear, e.g.:
import subprocess
# 2nd
with open(__file__) as file:
    header = file.readline()
    file.seek(file.tell())  # synchronize (for io.open and Python 3)
    file.flush()            # synchronize (for C stdio-based file on Python 2)
    rc = subprocess.call(['cat'], stdin=file)
The issue can be reproduced without the subprocess module, using os.read():
#!/usr/bin/env python
# 2nd
import os
with open(__file__) as file:  #XXX fully buffered text file EATS INPUT
    file.readline()  # ignore header line
    os.write(1, os.read(file.fileno(), 1<<20))
If the buffer size is small then the rest of the file is printed:
#!/usr/bin/env python
# 2nd
import os
bufsize = 2  #XXX MAY EAT INPUT
with open(__file__, 'rb', bufsize) as file:
    file.readline()  # ignore header line
    os.write(2, os.read(file.fileno(), 1<<20))
It eats more input if the size of the first line is not evenly divisible by bufsize.
The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes, around 4KB.
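The effect can also be observed on Python 3 by asking the OS where the descriptor is positioned after a single readline(). This is a minimal sketch: the temporary file, its contents, and the helper name fd_offset_after_readline are assumptions, and the file is kept tiny so that it fits into the read buffer whatever the default buffer size is.

```python
import os
import tempfile

def fd_offset_after_readline(path, buffering):
    """Return the OS-level offset of the descriptor after one Python readline()."""
    with open(path, "rb", buffering) as f:
        f.readline()
        # lseek with offset 0 from SEEK_CUR reports the position without moving it.
        return os.lseek(f.fileno(), 0, os.SEEK_CUR)

content = b"header\nline 2\nline 3\n"
fd, path = tempfile.mkstemp()
os.write(fd, content)
os.close(fd)

buffered_offset = fd_offset_after_readline(path, -1)   # default buffering
unbuffered_offset = fd_offset_after_readline(path, 0)  # unbuffered
os.remove(path)
```

With default buffering the whole small file has already been pulled into Python's buffer, so the descriptor sits at the end of the file and a child process inheriting it would find nothing left to read. With buffering=0 the descriptor stops right after the header line.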
file.tell() reports the position at the beginning of the 2nd line for all buffer sizes. Using next(file) instead of file.readline() leads to file.tell() being around 5K on my machine on Python 2 due to the read-ahead buffer bug (io.open() gives the expected 2nd line position).
Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with the default stdio-based file objects. It works with the open() functions from the io and _pyio modules on Python 2 and with the default open (also io-based) on Python 3.
Trying the io and _pyio modules on Python 2 and Python 3 with and without file.flush() produces various results. It confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.
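One way to avoid the mix altogether, which is not part of the answer above and is offered only as a sketch, is to keep all reading on the buffered Python side and pass the remainder to the child through a pipe with the input argument of subprocess.run (Python 3.5+). The helper name sort_after_header, the sample data, and the inline Python child standing in for sort are assumptions. Note that this reads the rest of the file into memory.

```python
import os
import subprocess
import sys
import tempfile

# Assumed stand-in for `sort`: a child Python process that sorts its stdin lines.
SORT_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.writelines(sorted(sys.stdin.buffer))"]

def sort_after_header(in_path, out_path):
    """Read header and body through the same buffered file, pipe the body to the child."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        header = src.readline()  # buffered read, as in the question
        rest = src.read()        # drain the remainder through the same buffer
        done = subprocess.run(SORT_CMD, input=rest, stdout=dst)
    return header, done.returncode

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")
with open(src_path, "wb") as f:
    f.write(b"name\npear\nfig\nkiwi\n")

piped_header, piped_rc = sort_after_header(src_path, dst_path)
with open(dst_path, "rb") as f:
    piped_body = f.read()
```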