How do you tell whether sys.stdin.readline() is going to block?

Tags: python

How can I find out whether a call to sys.stdin.readline() (or, more generally, readline() on any file descriptor based file object) is going to block?

This comes up when I am writing a line-based text filter program in python; that is, the program repeatedly reads a line of text from input, maybe transforms it, and then writes it to output.

I'd like to implement a reasonable output buffering strategy. My criteria are:

  1. It should be efficient when processing millions of lines in bulk: mostly buffer the output, with occasional flushes.
  2. It should never block on input while holding buffered output.

So, unbuffered output is no good, because it violates (1) (too many writes to the OS). And line-buffered output is no good, because it still violates (1) (it doesn't make sense to flush the output to the OS on each of a million lines in bulk). And default-buffered output is no good, because it violates (2) (it will withhold output inappropriately if output is to a file or pipe).

I think a good solution, for most cases, would be: "flush sys.stdout whenever (its buffer is full or) sys.stdin.readline() is about to block". Can that be implemented?

(Note, I don't claim this strategy is perfect for all cases. For example, it's probably not ideal in cases where the program is cpu-bound; in that case it may be wise to flush more often, to avoid withholding output while doing long computations.)

For definiteness, let's say I'm implementing unix's "cat -n" program in python.

(Actually "cat -n" is smarter than line-at-a-time; that is, it knows how to read and write part of a line before the full line has been read; but, for this example, I'm going to implement it line-at-a-time anyway.)

Line-buffered implementation

(well-behaved, but violates criterion (1), i.e. it's unreasonably slow since it flushes too much):

#!/usr/bin/python
# cat-n.linebuffered.py
import sys
num_lines_read = 0
while True:
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  sys.stdout.write("%d: %s" % (num_lines_read, line))  # line already ends with '\n'
  sys.stdout.flush()

Default-buffered implementation

(fast but violates criterion (2), i.e. unfriendly output withholding)

#!/usr/bin/python
# cat-n.defaultbuffered.py
import sys
num_lines_read = 0
while True:
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  sys.stdout.write("%d: %s" % (num_lines_read, line))  # line already ends with '\n'

Desired implementation:

#!/usr/bin/python
import sys
num_lines_read = 0
while True:
  if sys_stdin_readline_is_about_to_block():  # <--- How do I implement this??
    sys.stdout.flush()
  line = sys.stdin.readline()
  if line == '': break
  num_lines_read += 1
  sys.stdout.write("%d: %s" % (num_lines_read, line))  # line already ends with '\n'

So the question is: is it possible to implement sys_stdin_readline_is_about_to_block()?

I'd like an answer that works in both python2 and python3. I've looked into each of the following techniques, but nothing has panned out so far.

  • Use select([sys.stdin],[],[],0) to find out whether reading from sys.stdin will block. (This does not work when sys.stdin is a buffered file object, for at least one and possibly two reasons: (1) it will wrongly say "will not block" if a partial line is ready to read from the underlying input pipe, (2) it will wrongly say "will block" if sys.stdin's buffer contains a full input line but the underlying pipe is not ready for additional reading... I think).

  • Non-blocking io, using os.fdopen(sys.stdin.fileno(), 'r') and fcntl with O_NONBLOCK (I could not get this to work with readline() in any python version: in python2.7, it loses input whenever a partial line comes in; in python3, it seems to be impossible to distinguish between "would block" and end-of-input. ??)

  • asyncio (It's not clear to me what of this is available in python2; and I don't think it works with sys.stdin; however, I'd still be interested in an answer that worked only when reading from a pipe returned from subprocess.Popen()).

  • Create a thread to do the readline() loop and pass each line to the main program via a queue.Queue; then the main program can poll the queue before reading each line from it, and whenever it sees it's about to block, flush stdout first. (I tried this, and actually got it working, see below, but it's horribly slow, much slower than line buffering.)
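
The second select() failure mode in the list above can be reproduced with a small sketch. It assumes a pipe created with os.pipe() as a stand-in for stdin, and that the first readline() pulls everything currently in the pipe into the file object's buffer, which is what common implementations do in practice:

```python
import os
import select

# A pipe stands in for stdin here (an assumption for the demo).
read_fd, write_fd = os.pipe()
reader = os.fdopen(read_fd, "r")        # buffered file object, like sys.stdin
os.write(write_fd, b"first\nsecond\n")  # two complete lines arrive together

first = reader.readline()               # in practice, pulls both lines into the buffer
ready, _, _ = select.select([reader], [], [], 0)

# The pipe itself is now empty, so select reports "not ready" ...
print("select says ready: %s" % bool(ready))
# ... yet the next readline() returns at once from the buffer.
second = reader.readline()
print("buffered line: %r" % second)

os.close(write_fd)
reader.close()
```

select only sees the file descriptor, not the buffer layered on top of it, so its answer says nothing about whether readline() on the file object would block.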

Threaded implementation:

Note that this doesn't strictly answer the question "how to tell whether sys.stdin.readline() is going to block", but it manages to implement the desired buffering strategy anyway. It's too slow, though.

#!/usr/bin/python
# cat-n.threaded.py
try:
  import queue           # Python 3
except ImportError:
  import Queue as queue  # Python 2
import sys
import threading
def iter_with_abouttoblock_cb(callable, sentinel, abouttoblock_cb, qsize=100):
  # child will send each item through q to parent.
  q = queue.Queue(qsize)
  def child_fun():
    for item in iter(callable, sentinel):
      q.put(item)
    q.put(sentinel)
  child = threading.Thread(target=child_fun)
  # The child thread normally runs until it sees the sentinel,
  # but we mark it daemon so that it won't prevent the parent
  # from exiting prematurely if it wants.
  child.daemon = True
  child.start()
  while True:
    try:
      item = q.get(block=False)
    except queue.Empty:
      # q is empty; call abouttoblock_cb before blocking
      abouttoblock_cb()
      item = q.get(block=True)
    if item == sentinel:
      break  # do *not* yield sentinel
    yield item
  child.join()

num_lines_read = 0
for line in iter_with_abouttoblock_cb(sys.stdin.readline,
                                      sentinel='',
                                      abouttoblock_cb=sys.stdout.flush):
  num_lines_read += 1
  sys.stdout.write("%d: %s" % (num_lines_read, line))

Verifying buffering behavior:

The following commands (in bash on linux) show the expected buffering behavior: "defaultbuffered" buffers too aggressively, whereas "linebuffered" and "threaded" buffer just right.

(Note that the | cat at the end of the pipeline is to make python block-buffer instead of line-buffer by default.)

for which in defaultbuffered linebuffered threaded; do
  for python in python2.7 python3.5; do
    echo "$python cat-n.$which.py:"
    (echo z; echo -n a; sleep 1; echo b; sleep 1; echo -n c; sleep 1; echo d; echo x; echo y; echo z; sleep 1; echo -n e; sleep 1; echo f) | $python cat-n.$which.py | cat
  done
done

Output:

python2.7 cat-n.defaultbuffered.py:
[... pauses 5 seconds here. Bad! ...]
1: z
2: ab
3: cd
4: x
5: y
6: z
7: ef
python3.5 cat-n.defaultbuffered.py:
[same]
python2.7 cat-n.linebuffered.py:
1: z
[... pauses 1 second here, as expected ...]
2: ab
[... pauses 2 seconds here, as expected ...]
3: cd
4: x
5: y
6: z
[... pauses 2 seconds here, as expected ...]
7: ef
python3.5 cat-n.linebuffered.py:
[same]
python2.7 cat-n.threaded.py:
[same]
python3.5 cat-n.threaded.py:
[same]

Timings:

(in bash on linux):

for which in defaultbuffered linebuffered threaded; do
  for python in python2.7 python3.5; do
    echo -n "$python cat-n.$which.py:  "
    timings=$(time (yes 01234567890123456789012345678901234567890123456789012345678901234567890123456789 | head -1000000 | $python cat-n.$which.py >| /tmp/REMOVE_ME) 2>&1)
    echo $timings
  done
done
/bin/rm /tmp/REMOVE_ME

Output:

python2.7 cat-n.defaultbuffered.py:  real 0m1.490s user 0m1.191s sys 0m0.386s
python3.5 cat-n.defaultbuffered.py:  real 0m1.633s user 0m1.007s sys 0m0.311s
python2.7 cat-n.linebuffered.py:  real 0m5.248s user 0m2.198s sys 0m2.704s
python3.5 cat-n.linebuffered.py:  real 0m6.462s user 0m3.038s sys 0m3.224s
python2.7 cat-n.threaded.py:  real 0m25.097s user 0m18.392s sys 0m16.483s
python3.5 cat-n.threaded.py:  real 0m12.655s user 0m11.722s sys 0m1.540s

To reiterate, I'd like a solution that never blocks while holding buffered output (both "linebuffered" and "threaded" are good in this respect), and that is also fast: that is, comparable in speed to "defaultbuffered".

asked Oct 19 '18 13:10 by Don Hatch


1 Answer

You certainly can use select: this is what it’s for, and its performance is good for a small number of file descriptors. You have to implement the line buffering/breaking yourself so you can detect whether there’s more input available after buffering (what turns out to be) a partial line.

You can do all the buffering yourself (which is reasonable, since select operates at the level of file descriptors), or you can set stdin to be non-blocking and use file.read() or BufferedReader.read() (depending on your Python version) to consume whatever is available. You must use non-blocking input regardless of buffering if your input might be an Internet socket, since common implementations of select can spuriously indicate readable data from a socket. (When no data is actually available, the Python 2 file.read() raises IOError with EAGAIN; the Python 3 BufferedReader.read() returns None.)

(os.fdopen doesn't help here, since it doesn't create a new file descriptor for fcntl to use. On some systems, you can open /dev/stdin with O_NONBLOCK.)
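
As a rough illustration of the Python 3 behavior described above, here is a minimal sketch (Python 3 only, binary mode). It assumes a pipe from os.pipe() in place of stdin and shows that BufferedReader.read() gives three distinguishable results on a non-blocking descriptor: None when nothing is available, the bytes when something is, and b"" at end of input.

```python
import fcntl
import os

# A pipe stands in for stdin (an assumption for the demo).
read_fd, write_fd = os.pipe()

# Add O_NONBLOCK to the existing flags on the read end.
flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

reader = os.fdopen(read_fd, "rb")  # io.BufferedReader in Python 3

nothing_yet = reader.read()        # no data, writer still open: would block
os.write(write_fd, b"partial")
some_data = reader.read()          # returns whatever is available right now
os.close(write_fd)
at_eof = reader.read()             # writer closed: end of input

print(nothing_yet, some_data, at_eof)
reader.close()
```

This addresses the "would block" versus end-of-input ambiguity only for the binary layer; a text-mode file object wrapped around the same descriptor may behave differently.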

A Python 2 implementation based on the default (buffered) file.read():

import sys,os,select,fcntl,errno

fcntl.fcntl(sys.stdin.fileno(),fcntl.F_SETFL,os.O_NONBLOCK)

rfs=[sys.stdin.fileno()]
xfs=rfs+[sys.stdout.fileno()]

buf=""
lnum=0
timeout=None
rd=True
while rd:
  rl,_,xl=select.select(rfs,(),xfs,timeout)
  if xl: raise IOError          # "exception" occurred (TCP OOB data?)
  if rl:
    try: rd=sys.stdin.read()    # read whatever we have
    except IOError as e:        # spurious readiness?
      if e.errno!=errno.EAGAIN: raise # die on other errors
    else: buf+=rd
    nl0=0                       # previous newline
    while True:
      nl=buf.find('\n',nl0)
      if nl<0:
        buf=buf[nl0:]           # hold partial line for "processing"
        break
      lnum+=1
      print "%d: %s"%(lnum,buf[nl0:nl])
      timeout=0
      nl0=nl+1
  else:                         # no input yet
    sys.stdout.flush()
    timeout=None

if buf: sys.stdout.write("%d: %s"%(lnum+1,buf)) # write any partial last line

For just cat -n, we could write out partial lines as soon as we get them, but this holds on to them to represent processing the whole line at once.

On my (unimpressive) machine, your yes test takes "real 0m2.454s user 0m2.144s sys 0m0.504s".
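
The other option mentioned above, doing all the buffering yourself on the file descriptor, might look like the following sketch. It is not the code that was timed; the function name number_lines and its parameters are made up for illustration, and it assumes the input is a pipe or file (not a socket, where select can report readiness spuriously and a blocking os.read could stall). The bytes formatting needs Python 3.5 or later, or Python 2.

```python
import os
import select

def number_lines(in_fd, out, chunk_size=65536):
    """Number lines read from in_fd; flush `out` only before a blocking wait.

    Hypothetical helper: `out` is any binary file-like object.
    """
    buf = b""
    lnum = 0
    while True:
        # Poll without waiting; if no input is ready, flush and then block.
        ready, _, _ = select.select([in_fd], [], [], 0)
        if not ready:
            out.flush()
            select.select([in_fd], [], [])
        chunk = os.read(in_fd, chunk_size)  # bypasses Python-level buffering
        if not chunk:                       # empty read means end of input
            break
        buf += chunk
        lines = buf.split(b"\n")
        buf = lines.pop()                   # keep the trailing partial line
        for line in lines:
            lnum += 1
            out.write(b"%d: %s\n" % (lnum, line))
    if buf:                                 # unterminated last line
        out.write(b"%d: %s" % (lnum + 1, buf))
    out.flush()

# Possible use in Python 3:
#   number_lines(sys.stdin.fileno(), sys.stdout.buffer)
```

Because the input is read with os.read rather than through sys.stdin, select's view of the descriptor and the program's view of pending data stay consistent, which is what the buffered file object prevents.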

answered Oct 23 '22 23:10 by Davis Herring