How can I find out whether a call to sys.stdin.readline() (or, more generally, readline() on any file-descriptor-based file object) is going to block?
This comes up when I am writing a line-based text filter program in python; that is, the program repeatedly reads a line of text from input, maybe transforms it, and then writes it to output.
I'd like to implement a reasonable output buffering strategy. My criteria are:
(1) Fast: don't make an unreasonably large number of write/flush calls to the OS when processing many lines in bulk.
(2) Well-behaved: never sit blocked waiting for input while withholding buffered output that a downstream reader may be waiting for.
So unbuffered output is no good: it violates (1), with far too many writes to the OS. Line-buffered output is no good either: it still violates (1), since flushing to the OS after each of a million lines in a bulk job is wasteful. And default (block-buffered) output is no good because it violates (2): it withholds output inappropriately whenever output goes to a file or pipe.
I think a good solution, for most cases, would be: "flush sys.stdout whenever (its buffer is full or) sys.stdin.readline() is about to block". Can that be implemented?
(Note, I don't claim this strategy is perfect for all cases. For example, it's probably not ideal in cases where the program is cpu-bound; in that case it may be wise to flush more often, to avoid withholding output while doing long computations.)
For concreteness, let's say I'm implementing unix's "cat -n" program in python.
(Actually "cat -n" is smarter than line-at-a-time; that is, it knows how to read and write part of a line before the full line has been read; but, for this example, I'm going to implement it line-at-a-time anyway.)
Here's a line-buffered implementation (well-behaved, but violates criterion (1), i.e. it's unreasonably slow since it flushes too much):
#!/usr/bin/python
# cat-n.linebuffered.py
import sys

num_lines_read = 0
while True:
    line = sys.stdin.readline()
    if line == '': break
    num_lines_read += 1
    sys.stdout.write("%d: %s" % (num_lines_read, line))  # line keeps its trailing '\n'
    sys.stdout.flush()
Here's a default-buffered implementation (fast, but violates criterion (2), i.e. unfriendly output withholding):
#!/usr/bin/python
# cat-n.defaultbuffered.py
import sys

num_lines_read = 0
while True:
    line = sys.stdin.readline()
    if line == '': break
    num_lines_read += 1
    sys.stdout.write("%d: %s" % (num_lines_read, line))
And here's what I'd like to write, if only the test in the marked line existed (fast and well-behaved):

#!/usr/bin/python
import sys

num_lines_read = 0
while True:
    if sys_stdin_readline_is_about_to_block():  # <--- How do I implement this??
        sys.stdout.flush()
    line = sys.stdin.readline()
    if line == '': break
    num_lines_read += 1
    sys.stdout.write("%d: %s" % (num_lines_read, line))
So the question is: is it possible to implement sys_stdin_readline_is_about_to_block()?
I'd like an answer that works in both python2 and python3. I've looked into each of the following techniques, but nothing has panned out so far.
Use select([sys.stdin], [], [], 0) to find out whether reading from sys.stdin will block. (This does not work when sys.stdin is a buffered file object, for at least one and possibly two reasons: (1) it will wrongly say "will not block" if a partial line is ready to read from the underlying input pipe; (2) it will wrongly say "will block" if sys.stdin's buffer contains a full input line but the underlying pipe is not ready for additional reading... I think.)
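For reference, here's that probe as a function (a sketch; per the caveats above, its answer describes the underlying descriptor, not sys.stdin's Python-level buffer):

import select, sys

def fd_readable_now(f):
    # Zero timeout: return immediately. This asks the OS about the
    # underlying file descriptor only; it knows nothing about data
    # already sitting in f's Python-level buffer.
    rl, _, _ = select.select([f], [], [], 0)
    return bool(rl)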
Non-blocking I/O, using os.fdopen(sys.stdin.fileno(), 'r') and fcntl with O_NONBLOCK. (I could not get this to work with readline() in any python version: in python2.7, it loses input whenever a partial line comes in; in python3, it seems to be impossible to distinguish between "would block" and end-of-input. ??)
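For reference, the fcntl setup I mean is (a sketch):

import fcntl, os, sys

fd = sys.stdin.fileno()
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
# From here on, a raw os.read(fd, n) returns b'' at end-of-input and
# raises EAGAIN (BlockingIOError in python3) when no data is available;
# the trouble described above shows up one layer higher, in the
# buffered readline().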
asyncio. (It's not clear to me how much of this is available in python2, and I don't think it works with sys.stdin; however, I'd still be interested in an answer that worked only when reading from a pipe returned from subprocess.Popen().)
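For what it's worth, on unix sys.stdin can at least be wired into asyncio in python3, along these lines (a sketch that only demonstrates the wiring, not the flush-before-blocking test):

import asyncio, sys

async def cat_n_async():
    loop = asyncio.get_event_loop()
    reader = asyncio.StreamReader()
    # Attach sys.stdin (a pipe/tty, unix only) to an asyncio StreamReader.
    await loop.connect_read_pipe(
        lambda: asyncio.StreamReaderProtocol(reader), sys.stdin)
    num_lines_read = 0
    while True:
        line = await reader.readline()  # b'' at end of input
        if not line:
            break
        num_lines_read += 1
        sys.stdout.write("%d: %s" % (num_lines_read, line.decode()))

asyncio.get_event_loop().run_until_complete(cat_n_async())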
Create a thread to do the readline() loop and pass each line to the main program via a queue.Queue; the main program can then poll the queue before reading each line from it, and whenever it sees it's about to block, flush stdout first. (I tried this, and actually got it working, see below, but it's horribly slow, much slower than line buffering.)
Note that this doesn't strictly answer the question "how to tell whether sys.stdin.readline() is going to block", but it manages to implement the desired buffering strategy anyway. It's too slow, though.
#!/usr/bin/python
# cat-n.threaded.py
import sys
import threading
try:
    import queue            # python3
except ImportError:
    import Queue as queue   # python2

def iter_with_abouttoblock_cb(callable, sentinel, abouttoblock_cb, qsize=100):
    # Child will send each item through q to the parent.
    q = queue.Queue(qsize)
    def child_fun():
        for item in iter(callable, sentinel):
            q.put(item)
        q.put(sentinel)
    child = threading.Thread(target=child_fun)
    # The child thread normally runs until it sees the sentinel,
    # but we mark it daemon so that it won't prevent the parent
    # from exiting prematurely if it wants.
    child.daemon = True
    child.start()
    while True:
        try:
            item = q.get(block=False)
        except queue.Empty:
            # q is empty; call abouttoblock_cb before blocking.
            abouttoblock_cb()
            item = q.get(block=True)
        if item == sentinel:
            break  # do *not* yield the sentinel
        yield item
    child.join()

num_lines_read = 0
for line in iter_with_abouttoblock_cb(sys.stdin.readline,
                                      sentinel='',
                                      abouttoblock_cb=sys.stdout.flush):
    num_lines_read += 1
    sys.stdout.write("%d: %s" % (num_lines_read, line))
The following commands (in bash on linux) show the expected buffering behavior: "defaultbuffered" buffers too aggressively, whereas "linebuffered" and "threaded" buffer just right. (Note that the | cat at the end of the pipeline is there to make python block-buffer by default instead of line-buffering, since stdout is then not a tty.)
for which in defaultbuffered linebuffered threaded; do
    for python in python2.7 python3.5; do
        echo "$python cat-n.$which.py:"
        (echo z; echo -n a; sleep 1; echo b; sleep 1; echo -n c; sleep 1; echo d; echo x; echo y; echo z; sleep 1; echo -n e; sleep 1; echo f) | $python cat-n.$which.py | cat
    done
done
Output:
python2.7 cat-n.defaultbuffered.py:
[... pauses 5 seconds here. Bad! ...]
1: z
2: ab
3: cd
4: x
5: y
6: z
7: ef
python3.5 cat-n.defaultbuffered.py:
[same]
python2.7 cat-n.linebuffered.py:
1: z
[... pauses 1 second here, as expected ...]
2: ab
[... pauses 2 seconds here, as expected ...]
3: cd
4: x
5: y
6: z
[... pauses 2 seconds here, as expected ...]
7: ef
python3.5 cat-n.linebuffered.py:
[same]
python2.7 cat-n.threaded.py:
[same]
python3.5 cat-n.threaded.py:
[same]
And the following commands test the speed (in bash on linux):
for which in defaultbuffered linebuffered threaded; do
    for python in python2.7 python3.5; do
        echo -n "$python cat-n.$which.py: "
        timings=$(time (yes 01234567890123456789012345678901234567890123456789012345678901234567890123456789 | head -1000000 | $python cat-n.$which.py >| /tmp/REMOVE_ME) 2>&1)
        echo $timings
    done
done
/bin/rm /tmp/REMOVE_ME
Output:
python2.7 cat-n.defaultbuffered.py: real 0m1.490s user 0m1.191s sys 0m0.386s
python3.5 cat-n.defaultbuffered.py: real 0m1.633s user 0m1.007s sys 0m0.311s
python2.7 cat-n.linebuffered.py: real 0m5.248s user 0m2.198s sys 0m2.704s
python3.5 cat-n.linebuffered.py: real 0m6.462s user 0m3.038s sys 0m3.224s
python2.7 cat-n.threaded.py: real 0m25.097s user 0m18.392s sys 0m16.483s
python3.5 cat-n.threaded.py: real 0m12.655s user 0m11.722s sys 0m1.540s
To reiterate, I'd like a solution that never blocks while holding buffered output (both "linebuffered" and "threaded" are good in this respect), and that is also fast: that is, comparable in speed to "defaultbuffered".
You certainly can use select: this is what it's for, and its performance is good for a small number of file descriptors. You have to implement the line buffering/breaking yourself so you can detect whether there's more input available after buffering (what turns out to be) a partial line.
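That buffering/breaking amounts to a carry pattern, something like this sketch (the full loop below inlines the same idea using find):

def split_lines(buf):
    # Carry pattern: everything before the last '\n' is complete lines;
    # the tail is a partial line to hold until more input (or EOF).
    parts = buf.split('\n')
    return parts[:-1], parts[-1]

# usage, per chunk of input: lines, carry = split_lines(carry + chunk)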
You can do all the buffering yourself (which is reasonable, since select operates at the level of file descriptors), or you can set stdin to be non-blocking and use file.read() or BufferedReader.read() (depending on your Python version) to consume whatever is available. You must use non-blocking input regardless of buffering if your input might be an Internet socket, since common implementations of select can spuriously indicate readable data from a socket. (The Python 2 version raises IOError with EAGAIN in that case; the Python 3 version returns None.)
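Concretely, the version difference looks something like this (a sketch, assuming stdin's descriptor already has O_NONBLOCK set, as in the implementation below):

import errno, sys

def read_available_py2():
    try:
        return sys.stdin.read()        # '' at end of input
    except IOError as e:
        if e.errno != errno.EAGAIN:
            raise
        return None                    # no data right now

def read_available_py3():
    return sys.stdin.buffer.read()     # b'' at end of input, None if it would block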
(os.fdopen doesn't help here, since it doesn't create a new file descriptor for fcntl to use. On some systems, you can open /dev/stdin with O_NONBLOCK.)
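That is, something like this sketch (assuming the OS provides /dev/stdin, as Linux does):

import os

# Opening /dev/stdin creates a new open file description, so setting
# O_NONBLOCK here doesn't disturb the descriptor shared with the parent
# shell, the way fcntl on the inherited fd 0 would.
fd = os.open("/dev/stdin", os.O_RDONLY | os.O_NONBLOCK)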
A Python 2 implementation based on the default (buffered) file.read():
import sys, os, select, fcntl, errno

fcntl.fcntl(sys.stdin.fileno(), fcntl.F_SETFL, os.O_NONBLOCK)
rfs = [sys.stdin.fileno()]
xfs = rfs + [sys.stdout.fileno()]
buf = ""
lnum = 0
timeout = None
rd = True
while rd:
    rl, _, xl = select.select(rfs, (), xfs, timeout)
    if xl: raise IOError  # "exception" occurred (TCP OOB data?)
    if rl:
        try: rd = sys.stdin.read()  # read whatever we have
        except IOError as e:        # spurious readiness?
            if e.errno != errno.EAGAIN: raise  # die on other errors
        else: buf += rd
        nl0 = 0                     # previous newline
        while True:
            nl = buf.find('\n', nl0)
            if nl < 0:
                buf = buf[nl0:]     # hold partial line for "processing"
                break
            lnum += 1
            print "%d: %s" % (lnum, buf[nl0:nl])
            timeout = 0
            nl0 = nl + 1
    else:  # no input yet
        sys.stdout.flush()
        timeout = None
if buf: sys.stdout.write("%d: %s" % (lnum + 1, buf))  # write any partial last line
For just cat -n, we could write out partial lines as soon as we get them, but this code holds on to them, to represent processing the whole line at once.
On my (unimpressive) machine, your yes test takes "real 0m2.454s user 0m2.144s sys 0m0.504s".
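For completeness, here is a rough python3 port of the same loop (my sketch, not thoroughly tested): the structure is identical, but it reads bytes from sys.stdin.buffer, which returns None when a non-blocking read has no data; it assumes UTF-8 input and drops the OOB-data check for brevity.

import sys, os, select, fcntl

fcntl.fcntl(sys.stdin.fileno(), fcntl.F_SETFL, os.O_NONBLOCK)
rfs = [sys.stdin.fileno()]
buf = b""
lnum = 0
timeout = None
while True:
    rl, _, _ = select.select(rfs, (), (), timeout)
    if rl:
        try:
            rd = sys.stdin.buffer.read()  # bytes; b'' at EOF; None if it would block
        except BlockingIOError:           # tolerate either would-block behavior
            rd = None
        if rd == b"":                     # end of input
            break
        if rd:
            buf += rd
        nl0 = 0                           # previous newline
        while True:
            nl = buf.find(b"\n", nl0)
            if nl < 0:
                buf = buf[nl0:]           # hold partial line
                break
            lnum += 1
            sys.stdout.write("%d: %s\n" % (lnum, buf[nl0:nl].decode()))
            timeout = 0                   # poll again before deciding to flush
            nl0 = nl + 1
    else:                                 # no input yet: flush before blocking
        sys.stdout.flush()
        timeout = None
if buf:
    sys.stdout.write("%d: %s" % (lnum + 1, buf.decode()))  # partial last line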