How to read a CSV file from a stream and process each line as it is written?

Tags: python, stream, csv, line-by-line

I would like to read a CSV file from standard input and process each row as it arrives. My CSV-writing code writes rows one by one, but my reader waits for the stream to be terminated before iterating over the rows. Is this a limitation of the csv module? Am I doing something wrong?

My reader code:

import csv
import sys
import time

reader = csv.reader(sys.stdin)
for row in reader:
    print "Read: (%s) %r" % (time.time(), row)

My writer code:

import csv
import sys
import time

writer = csv.writer(sys.stdout)
for i in range(8):
    writer.writerow(["R%d" % i, "$" * (i+1)])
    sys.stdout.flush()
    time.sleep(0.5)

Output of python test_writer.py | python test_reader.py:

Read: (1309597426.3) ['R0', '$']
Read: (1309597426.3) ['R1', '$$']
Read: (1309597426.3) ['R2', '$$$']
Read: (1309597426.3) ['R3', '$$$$']
Read: (1309597426.3) ['R4', '$$$$$']
Read: (1309597426.3) ['R5', '$$$$$$']
Read: (1309597426.3) ['R6', '$$$$$$$']
Read: (1309597426.3) ['R7', '$$$$$$$$']

As you can see, all the print statements are executed at the same time, but I expected a 500 ms gap between them.

420

asked Jul 02 '11 09:07

muhuk

1 Answer

As the Python 2 documentation for file.next() says,

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.

And you can see by looking at the implementation of the csv module (line 784) that csv.reader calls the next() method of the underlying iterator (via PyIter_Next).
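To illustrate that point, here is a minimal sketch (written for Python 3; logged_lines is a hypothetical helper made up for this example, not part of the csv module). It suggests that csv.reader pulls just one line from the iterator it is given for each simple row, so any read-ahead comes from that iterator rather than from csv.reader itself. Rows with quoted fields that span several lines would need more than one pull.

```python
import csv


def logged_lines(lines, log):
    """Hypothetical helper: yield lines one at a time, recording each pull."""
    for line in lines:
        log.append(("pulled", line))
        yield line


log = []
source = logged_lines(["R0,$\n", "R1,$$\n"], log)

# csv.reader accepts any iterator that yields strings, not only file objects.
for row in csv.reader(source):
    log.append(("row", row))

# Pulls and rows alternate: each row is produced right after its line is read.
for event in log:
    print(event)
```

If the events alternate between "pulled" and "row", the reader is not looking ahead on its own.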

So if you really want unbuffered reading of CSV files, you need to convert the file object (here sys.stdin) into an iterator whose next() method actually calls readline() instead. This can easily be done using the two-argument form of the iter function. So change the code in test_reader.py to something like this:

for row in csv.reader(iter(sys.stdin.readline, '')):
    print("Read: ({}) {!r}".format(time.time(), row))

For example,

$ python test_writer.py | python test_reader.py
Read: (1388776652.964925) ['R0', '$']
Read: (1388776653.466134) ['R1', '$$']
Read: (1388776653.967327) ['R2', '$$$']
Read: (1388776654.468532) ['R3', '$$$$']
[etc]
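The two-argument form of iter can also be tried out without a pipe. The following is a small sketch, assuming Python 3 and using io.StringIO as a stand-in for sys.stdin (that substitution is only for illustration). iter(callable, sentinel) keeps calling the callable and stops as soon as it returns the sentinel, which for readline() is the empty string at end of file.

```python
import csv
import io

# Stand-in for sys.stdin so the example runs on its own (an assumption made
# for illustration; the code above reads from sys.stdin).
stream = io.StringIO("R0,$\nR1,$$\n")

# iter(callable, sentinel): call stream.readline() repeatedly and stop when
# it returns '' (which readline() does once the stream is exhausted).
lines = iter(stream.readline, '')

rows = list(csv.reader(lines))
print(rows)
```

Because each item comes from a separate readline() call, a line can be handed to csv.reader as soon as it is available instead of waiting for a larger block to fill.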

Can you explain why you need unbuffered reading of CSV files? There might be a better solution to whatever it is you are trying to do.

125

answered Sep 24 '22 21:09

Gareth Rees
