
Processing only non-blank lines

Tags:

python

I've got the following snippet of code:

def send(self, queue, fd):
    for line in fd:
        data = line.strip()
        if data:
            queue.write(json.loads(data))

Which of course works just fine, but I wonder sometimes if there is a "better" way to write that construct where you only act on non-blank lines.

The challenge is that this should use the iterative nature of the for loop over 'fd' and be able to handle files in the 100+ MB range.

UPDATE - In your haste to get points for this question you're ignoring an important part, which is memory usage. For instance the expression:

 non_blank_lines = (line.strip() for line in fd if line.strip())

is going to buffer the whole file into memory, not to mention calling strip() twice. That will work for small files, but fails when you've got 100+ MB of data (or, once in a while, 100 GB).

Part of the challenge is that the following works, but is soup to read:

from itertools import ifilter, imap  # Python 2 only

for line in ifilter(lambda l: l, imap(lambda l: l.strip(), fd)):
    queue.write(json.loads(line))

Looking for magic, folks!

FINAL UPDATE: PEP-289 is very useful for my own better understanding of the difference between [] and () with iterators involved.

543

asked Dec 03 '12 17:12

koblas

1 Answer

There's nothing wrong with the code as written; it's readable and efficient.

An alternative approach would be to write it as a generator expression:

def send(self, queue, fd):
    non_blank_lines = (line.strip() for line in fd if line.strip())
    for line in non_blank_lines:
        queue.write(json.loads(line))

This approach can be beneficial (terser) if you are passing the lines to a function that can consume an iterable, e.g. by unpacking them into Python 3's print:

non_blank_lines = (line.strip() for line in fd if line.strip())
with open('foo', 'w') as out:  # file= expects a writable file object, not a path
    print(*non_blank_lines, file=out)

To do away with the multiple calls to strip(), chain together generator expressions:

stripped_lines = (line.strip() for line in fd)
non_blank_lines = (line for line in stripped_lines if line)
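Dropped back into the original function, the chained version might look like the sketch below. `FakeQueue` and the `io.StringIO` input are hypothetical stand-ins added here so the snippet runs on its own, and `send` is written as a plain function rather than a method; your real `queue` and `fd` would take their place.

```python
import io
import json


class FakeQueue:
    """Hypothetical stand-in for the real queue; it only records what is written."""

    def __init__(self):
        self.items = []

    def write(self, item):
        self.items.append(item)


def send(queue, fd):
    # Strip each line exactly once, lazily, as the file is iterated.
    stripped_lines = (line.strip() for line in fd)
    # Drop the lines that are empty after stripping.
    non_blank_lines = (line for line in stripped_lines if line)
    for line in non_blank_lines:
        queue.write(json.loads(line))


# Example input with blank and whitespace-only lines mixed in.
fd = io.StringIO('{"a": 1}\n\n   \n{"b": 2}\n')
queue = FakeQueue()
send(queue, fd)
print(queue.items)
```

Neither generator does any work until the for loop asks for the next line, so only one line is held at a time.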

Note that generator expressions are evaluated lazily, one item at a time, so they will not buffer the file in memory, as detailed in PEP 289.

For a more in-depth look at this approach, and some performance benchmarks, take a look at this set of notes.
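If you want to convince yourself of the laziness, one rough way is to count how many lines have actually been pulled from the source. The `counting_lines` helper and the sample list below are made up for this illustration; a real file object behaves the same way when iterated.

```python
def counting_lines(lines, counter):
    """Yield each line while counting how many have been read so far."""
    for line in lines:
        counter["read"] += 1
        yield line


counter = {"read": 0}
source = counting_lines(["first\n", "\n", "second\n", "third\n"], counter)

stripped_lines = (line.strip() for line in source)
non_blank_lines = (line for line in stripped_lines if line)

# Building the generator expressions reads nothing from the source.
read_before = counter["read"]

# Pulling one result reads only as far as the first non-blank line.
first = next(non_blank_lines)
read_after_first = counter["read"]

# The next result has to skip the blank line, so two more lines are read.
second = next(non_blank_lines)
read_after_second = counter["read"]

print(read_before, first, read_after_first, second, read_after_second)
```

A list comprehension in square brackets would instead read every line up front, which is the [] versus () difference the question's final update refers to.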

Finally, note that rstrip() will outperform strip() if you don't need the full behaviour of strip().
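For the JSON case in the question, rstrip() is probably enough, assuming one JSON document per line: whitespace-only lines still come out empty, and json.loads accepts leading whitespace. The sketch below checks that and times both calls with timeit. The sample line is invented, and timings depend on your machine and data, so no numbers are claimed here.

```python
import json
import timeit

# Hypothetical sample line with leading and trailing whitespace.
sample = '   {"a": 1}   \n'

# Whitespace-only lines become empty (falsy) with either method.
blank_with_strip = "  \t \n".strip()
blank_with_rstrip = "  \t \n".rstrip()

# json.loads ignores the leading whitespace that rstrip() leaves behind.
parsed_strip = json.loads(sample.strip())
parsed_rstrip = json.loads(sample.rstrip())

# Time both calls; results vary by machine and by input.
strip_time = timeit.timeit(lambda: sample.strip(), number=100_000)
rstrip_time = timeit.timeit(lambda: sample.rstrip(), number=100_000)
print(f"strip:  {strip_time:.4f}s")
print(f"rstrip: {rstrip_time:.4f}s")
```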

146

answered Sep 19 '22 07:09

cmh
