
Why doesn't Python's len(readlines) equal Bash's 'wc -l' command?

Tags: python, bash

For some large file, in Python:

lines_a = len(fa.readlines())
print(lines_a)

And for Bash (on Mac):

wc -l

the results are different!

What could be the reason?

Andy Yuan asked Dec 13 '22 22:12


2 Answers

wc -l prints the number of newline characters in its input. In other words, its definition of a "line" requires the line to end with a newline. This definition comes from POSIX.

This definition of a line can yield surprising behavior if the last line in your file does not end with a newline. Although text editors and pagers display such a line just fine, wc will not count it as a line. For example:

$ printf 'foo\nbar\n' | wc -l
2
$ printf 'foo\nbar' | wc -l
1

Python's readlines() method, on the other hand, is designed to return the data in the file so that it can be perfectly reconstructed. For that reason, it returns each line with its trailing newline, and the last non-empty line as-is (with or without a trailing newline). For the above example, it returns the lists ["foo\n", "bar\n"] and ["foo\n", "bar"] respectively, both of length two:

$ printf 'foo\nbar\n' | python -c 'import sys; print(len(sys.stdin.readlines()))'
2
$ printf 'foo\nbar' | python -c 'import sys; print(len(sys.stdin.readlines()))'
2
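
If the goal is to get the same number as wc -l from Python, one option is to count newline bytes instead of lines. The following is only a minimal sketch: the names count_newlines and write_temp, the chunk size, and the temporary files are my own choices for illustration and are not part of the question.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1 << 20):
    # Count newline bytes, which is what `wc -l` reports.
    total = 0
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def write_temp(data):
    # Hypothetical demo helper: write bytes to a temp file and return its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# The same two inputs as the printf examples above.
for data in (b"foo\nbar\n", b"foo\nbar"):
    path = write_temp(data)
    with open(path, "rb") as f:
        n_lines = len(f.readlines())
    print(data, "newlines:", count_newlines(path), "readlines:", n_lines)
    os.remove(path)
```

For the input without a final newline, this reports 1 newline but 2 lines, which matches the difference between wc -l and readlines() described above.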
user4815162342 answered Dec 16 '22 10:12


I ran into a similar problem while working on a machine translation task. The line count may be off because the file was not opened in 'b' (binary) mode. Try:

with open('some file', 'rb') as f:
    print(len(f.readlines()))

This should give the same number as wc -l, provided the file ends with a newline.
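
As far as I can tell, the mode matters because Python 3 text mode uses universal newlines by default, so a lone '\r' also ends a line, whereas binary mode splits only on b'\n', the byte that wc -l counts. A small sketch with made-up data (the sample bytes and the temporary file are assumptions for illustration):

```python
import os
import tempfile

# Made-up sample: a stray carriage return inside the first line.
data = b"foo\rbar\nbaz\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Text mode (default newline handling): '\r', '\n' and '\r\n' all end a line.
with open(path, "r", encoding="utf-8") as f:
    text_lines = f.readlines()

# Binary mode: only b'\n' ends a line.
with open(path, "rb") as f:
    binary_lines = f.readlines()

os.remove(path)

print(len(text_lines), text_lines)
print(len(binary_lines), binary_lines)
```

Here text mode sees three lines while binary mode sees two, and the data contains two newline bytes. Stray '\r' characters are not unusual in scraped or parallel text corpora, which may be why this shows up in machine translation data.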

Zhen Yang answered Dec 16 '22 10:12