itertools.takewhile within a generator function - why is it evaluated once only?

Tags: python, generator, itertools

I have a text file like this:

11
2
3
4

11

111

Using Python 2.7, I want to turn it into a list of lists of lines, where line breaks divide items in the inner list and empty lines divide items in the outer list. Like so:

[["11","2","3","4"],["11"],["111"]]

And for this purpose, I wrote a generator function that would yield the inner lists one at a time once passed an open file object:

def readParag(fileObj):
    currentParag = []
    for line in fileObj:
        stripped = line.rstrip()
        if len(stripped) > 0: currentParag.append(stripped)
        elif len(currentParag) > 0:
            yield currentParag
            currentParag = []

That works fine, and I can call it from within a list comprehension, producing the desired result. However, it subsequently occurred to me that I might be able to do the same thing more concisely using itertools.takewhile (with a view to rewriting the generator function as a generator expression, but we'll leave that for now). This is what I tried:

from itertools import takewhile

def readParag(fileObj):
    yield [ln.rstrip() for ln in takewhile(lambda line: line != "\n", fileObj)]

In this case, the resulting generator yields only one result (the expected first one, i.e. ["11","2","3","4"]). I had hoped that calling its next method again would cause it to evaluate takewhile(lambda line: line != "\n", fileObj) again on the remainder of the file, thus leading it to yield another list. But no: I got a StopIteration instead. So I surmised that the takewhile expression was being evaluated once only, at the time when the generator object was created, and not each time I called the resultant generator object's next method.

This supposition made me wonder what would happen if I called the generator function again. The result was that it created a new generator object that also yielded a single result (the expected second one, i.e. ["11"]) before throwing a StopIteration back at me. So in fact, writing this as a generator function effectively gives the same result as if I'd written it as an ordinary function and returned the list instead of yielding it.
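The behaviour described above can be reproduced without a file. The following is only a minimal sketch: it assumes the sample lines shown earlier, uses a plain iterator (the made-up name lines) as a stand-in for the open file object, and calls the built-in next() so that it runs on both Python 2.7 and Python 3.

```python
from itertools import takewhile

def read_parag(file_obj):
    # Same shape as the attempt above: one yield statement, not inside a loop.
    yield [ln.rstrip() for ln in takewhile(lambda line: line != "\n", file_obj)]

# Hypothetical stand-in for the open file: an iterator over the sample lines.
lines = iter(["11\n", "2\n", "3\n", "4\n", "\n", "11\n", "\n", "111\n"])

gen = read_parag(lines)
first = next(gen)  # ['11', '2', '3', '4']

try:
    next(gen)
    exhausted = False
except StopIteration:
    # The function body has finished, so there is nothing left to yield.
    exhausted = True

# A fresh generator continues from where the shared iterator stopped.
second = next(read_parag(lines))  # ['11']
```

Note that takewhile also consumes the blank line that ends each paragraph, which is why the second call starts cleanly at the next paragraph.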

I guess I could solve this problem by creating my own class to use instead of a generator (as in John Millikin's answer to this question). But the point is that I was hoping to write something more concise than my original generator function (possibly even a generator expression). Can somebody tell me what I'm doing wrong, and how to get it right?

835

asked Aug 07 '12 19:08

Westcroft_to_Apse

1 Answer

What you're trying to do is a perfect job for groupby:

from itertools import groupby

def read_parag(filename):
    with open(filename) as f:
        for k,g in groupby((line.strip() for line in f), bool):
            if k:
                yield list(g)

which will give:

>>> list(read_parag('myfile.txt'))
[['11', '2', '3', '4'], ['11'], ['111']]
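To see what groupby is doing here, it may help to run the same grouping on an in-memory list. This is an illustrative sketch that assumes the sample lines from the question; the names sample_lines, runs and paragraphs are made up for the example. With bool as the key function, consecutive non-empty lines form runs keyed True, and blank lines form runs keyed False, which the filter then drops.

```python
from itertools import groupby

# Hypothetical stand-in for the file contents from the question.
sample_lines = ["11\n", "2\n", "3\n", "4\n", "\n", "11\n", "\n", "111\n"]

stripped = (line.strip() for line in sample_lines)

# Each (key, group) pair is a run of consecutive lines with the same truthiness.
runs = [(key, list(group)) for key, group in groupby(stripped, bool)]
# runs == [(True, ['11', '2', '3', '4']), (False, ['']),
#          (True, ['11']), (False, ['']), (True, ['111'])]

# Keep only the runs of non-empty lines.
paragraphs = [group for key, group in runs if key]
# paragraphs == [['11', '2', '3', '4'], ['11'], ['111']]
```

Because blank lines are grouped into runs, several empty lines in a row still count as a single separator.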

Or in one line:

[list(g) for k,g in groupby((line.strip() for line in open('myfile.txt')), bool) if k]
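As for why the takewhile attempt stops after one list: the body of a generator function runs once per generator object, so a single yield that is not inside a loop produces a single value. If takewhile is still preferred, one possible sketch is to wrap it in a loop. It assumes that paragraphs are separated by exactly one blank line consisting of just "\n"; the name read_parag_takewhile and the in-memory list are invented for illustration.

```python
from itertools import takewhile

def read_parag_takewhile(file_obj):
    lines = iter(file_obj)
    while True:
        # takewhile consumes lines up to and including the next blank line.
        parag = [ln.rstrip() for ln in takewhile(lambda line: line != "\n", lines)]
        if not parag:
            # An empty result means the input is exhausted
            # (or that two blank lines appeared in a row).
            return
        yield parag

# Hypothetical stand-in for the file contents from the question.
sample = ["11\n", "2\n", "3\n", "4\n", "\n", "11\n", "\n", "111\n"]
result = list(read_parag_takewhile(sample))
# result == [['11', '2', '3', '4'], ['11'], ['111']]
```

This version stops early at the first pair of consecutive blank lines, which is one reason the groupby approach is likely the more robust choice.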

164

answered Nov 15 '22 14:11

Rik Poggi
