
Optimized way to count the number of lines with conditions

Tags:

python

I have seen that a fast way to count the number of lines in a file is this:

n_lines=sum(1 for line in open(myfile))

I would like to know whether it is possible to put conditions inside the sum call, to get something like this:

n_lines=sum(1 for line in open(PATHDIFF) if line=='\n' break if line.startswith('#') continue)

Thank you in advance.

SOCKet asked Aug 30 '25 18:08


2 Answers

You can, with certain restrictions. The argument you are passing to sum is a generator expression, and its if clause takes a filter expression, not statements such as break or continue. You can combine your conditions into a single expression like this:

n_lines=sum(1 for line in open(PATHDIFF)
                if line != '\n' and not line.startswith('#'))

However, this doesn't stop the iteration when it reaches the first blank line; it continues to read through the file to the end. To avoid that, you can use itertools.takewhile, which reads from the file iterator only until it encounters the first blank line.

from itertools import takewhile
n_lines = sum(1 for line in takewhile(lambda x: x != '\n',
                                      open(PATHDIFF))
                   if not line.startswith('#'))
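
To make the difference between the two versions concrete, here is a small sketch that runs both on the same input. The sample text and the function names count_filtered and count_until_blank are made up for illustration, and io.StringIO stands in for an open file:

```python
from io import StringIO
from itertools import takewhile

# Hypothetical sample: a comment, two data lines, a blank line, then more lines.
SAMPLE = "# header\na\nb\n\nc\n# trailing comment\nd\n"


def count_filtered(lines):
    """Count non-blank, non-comment lines across the whole input."""
    return sum(1 for line in lines
               if line != "\n" and not line.startswith("#"))


def count_until_blank(lines):
    """Count non-comment lines, stopping at the first blank line."""
    head = takewhile(lambda x: x != "\n", lines)
    return sum(1 for line in head if not line.startswith("#"))


# StringIO yields lines one at a time, like a file object does.
print(count_filtered(StringIO(SAMPLE)))     # a, b, c and d are counted
print(count_until_blank(StringIO(SAMPLE)))  # only a and b are counted
```

The first function skips the blank line and keeps going; the second stops there, which is closer to what the break in the question was meant to do.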

You can also use itertools.filterfalse (named ifilterfalse in Python 2) to fill the same role as the condition clause of the generator expression.

from itertools import takewhile, filterfalse
n_lines = sum(1 for line in filterfalse(lambda x: x.startswith('#'),
                                        takewhile(lambda x: x != '\n',
                                                  open(PATHDIFF))))

Of course, now your code starts to look like you are writing in Scheme or Lisp. The generator expression is a little easier to read, but the itertools module is useful for building up modified iterators that you can pass around as distinct objects.
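
As a rough sketch of what "passing iterators around as distinct objects" can look like, each step below is wrapped in its own small function. The helper names (until_blank, without_comments, count_lines) and the sample list are assumptions for illustration, and the code uses itertools.filterfalse, the Python 3 name for ifilterfalse:

```python
from itertools import filterfalse, takewhile


def until_blank(lines):
    """Yield lines up to, but not including, the first blank line."""
    return takewhile(lambda x: x != "\n", lines)


def without_comments(lines):
    """Drop lines that start with '#'."""
    return filterfalse(lambda x: x.startswith("#"), lines)


def count_lines(lines):
    """Count the items of any iterable without storing them."""
    return sum(1 for _ in lines)


# Hypothetical input standing in for an open file.
sample = ["# header\n", "a\n", "b\n", "\n", "c\n"]

# Each stage is a lazy iterator object that can be built, stored, or handed on.
pipeline = without_comments(until_blank(iter(sample)))
print(count_lines(pipeline))
```

Nothing is read until count_lines starts pulling items through the pipeline.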


On a different topic, you should always make sure you close any files you open, which means not using anonymous file handles in your iterators. The cleanest way to do this is to use a with statement:

with open(PATHDIFF) as f:
    n_lines = sum(1 for line in f if line != '\n' and not line.startswith('#'))

The other examples can be similarly modified; just replace open(PATHDIFF) with f where it occurs.
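
For instance, the takewhile version might look like this once it is wrapped in a with statement. This is only a sketch: the function name count_until_blank and the temporary file are there so the snippet can be run on its own.

```python
import os
import tempfile
from itertools import takewhile


def count_until_blank(path):
    """Count non-comment lines before the first blank line in the file."""
    with open(path) as f:
        # f replaces the anonymous open(PATHDIFF); it is closed on exit.
        return sum(1 for line in takewhile(lambda x: x != "\n", f)
                   if not line.startswith("#"))


# Write a small throwaway file so there is something to count.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# header\na\nb\n\nc\n")
    path = tmp.name

try:
    demo_count = count_until_blank(path)
    print(demo_count)
finally:
    os.remove(path)
```

The sum is fully computed inside the with block, so the file is still open while the generator is being consumed.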

chepner answered Sep 02 '25 08:09


In fact there's a fast way (borrowing from funcy) to compute the length of an iterator without building a list in memory. Note that it does consume the iterator:

Example:

from collections import deque
from itertools import count  # Python 3: the built-in zip replaces izip


def ilen(seq):
    counter = count()
    # zip advances counter once per item; maxlen=0 discards the pairs
    deque(zip(seq, counter), maxlen=0)  # (consume at C speed)
    return next(counter)


def lines(filename):
    with open(filename, 'r') as f:
        return ilen(
            None for line in f
            if line != "\n" and not line.startswith("#")
        )


nlines = lines("file.txt")
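
A quick way to check the counting trick without a file on disk is to feed it an in-memory iterator. This is a sketch for Python 3 (the built-in zip in place of izip), and the sample lines are made up:

```python
from collections import deque
from itertools import count


def ilen(iterable):
    """Count the items of an iterable by exhausting it."""
    counter = count()
    # zip pulls from the iterable first, so counter advances once per item.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)


# Hypothetical lines standing in for a file.
lines_iter = iter(["# header\n", "a\n", "\n", "b\n"])
kept = (line for line in lines_iter
        if line != "\n" and not line.startswith("#"))

n = ilen(kept)
leftover = list(lines_iter)  # what remains in the source after counting
print(n, leftover)
```

Here leftover comes back empty, which shows that counting this way uses up the underlying iterator.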
James Mills answered Sep 02 '25 09:09