
Why is readline() much slower than readlines() in Python?

Tags:

python

In an interview, I was asked why readline() is much slower than readlines() in Python.

I answered that readline() has to be called many times, once per line, and that this costs more.

I don't know whether my answer is correct.

If my answer is correct, what exactly is the extra cost?

Thanks!

PS: I know the difference between readline() and readlines().

readlines() reads the whole file in one call, while readline() reads only one line per call.

I hope somebody can explain what happens at the file I/O level when the file is read from disk.

Apache asked Apr 08 '14 16:04



1 Answer

Just for fun, I wrote a bunch of functions that iterate over a file and put each line into a list:

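#!/usr/bin/python

def readlines():
    # A single call returns every line of the file as a list.
    with open("sorted_output.txt") as f:
        lines = f.readlines()

def readline():
    # One f.readline() call per line; it returns '' at end of file.
    with open("sorted_output.txt") as f:
        line = f.readline()
        lines = []
        while line:
            lines.append(line)
            line = f.readline()

def iterate():
    # Iterate over the file object, appending each line explicitly.
    with open("sorted_output.txt") as f:
        lines = []
        for line in f:
            lines.append(line)

def comprehension():
    # Same iteration, but the list is built by a comprehension.
    with open("sorted_output.txt") as f:
        lines = [line for line in f]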

Here is how each of them performed on a file with 69,073 lines, using Python 2.6 (note that these results may differ on newer versions of Python):

dano@hostname:~> python -mtimeit -s 'import test' 'test.readline()'
10 loops, best of 3: 78.3 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.readlines()'
10 loops, best of 3: 21.6 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.comprehension()'
10 loops, best of 3: 23.6 msec per loop
dano@hostname:~> python -mtimeit -s 'import test' 'test.iterate()'
10 loops, best of 3: 33.3 msec per loop

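So readlines() is the fastest here, though building the list with a list comprehension almost matches it. My guess is that the speed differences between the approaches are mostly the result of the high overhead of function calls in Python: the more calls an approach makes from Python code, the slower it is. readlines() makes a single call, the comprehension makes no explicit method calls, iterate() calls lines.append() once per line, and readline() calls both f.readline() and lines.append() once per line. There may be other factors as well. Hopefully someone more knowledgeable than me can comment on that.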
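To check whether the same ordering holds on a current interpreter, a sketch along these lines may help. It is not the benchmark used above: it assumes Python 3, generates its own throwaway file instead of sorted_output.txt, and the helper names (make_sample_file, via_readlines, via_readline, via_iteration, benchmark) are made up for illustration. The functions return their lists so the results can be compared.

```python
import os
import tempfile
import timeit


def make_sample_file(num_lines=10000):
    """Write a throwaway text file and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w") as f:
        for i in range(num_lines):
            f.write("line %d\n" % i)
    return path


def via_readlines(path):
    # One method call returns the whole list.
    with open(path) as f:
        return f.readlines()


def via_readline(path):
    # One f.readline() call and one append() call per line.
    lines = []
    with open(path) as f:
        line = f.readline()
        while line:
            lines.append(line)
            line = f.readline()
    return lines


def via_iteration(path):
    # Iterating over the file object; no explicit per-line method calls.
    with open(path) as f:
        return [line for line in f]


def benchmark(path, repeat=3, number=5):
    """Return the best average time per call, in seconds, for each approach."""
    results = {}
    for func in (via_readlines, via_readline, via_iteration):
        timings = timeit.repeat(lambda: func(path), repeat=repeat, number=number)
        results[func.__name__] = min(timings) / number
    return results


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        for name, seconds in benchmark(sample).items():
            print("%-15s %.3f msec per call" % (name, seconds * 1000))
    finally:
        os.remove(sample)
```

The absolute numbers will depend on the machine, the file, and the Python version, so only the relative ordering is worth comparing.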

In addition to performance, one other important consideration when deciding which of these methods to use is memory cost. readlines() reads the entire file into memory at once. With a huge file, that can cause serious performance problems or crash the program altogether. In those cases you'd want to loop over the file as iterate() does, since iterating over a file object only reads one line into memory at a time (iterate() itself still stores every line because it appends to a list). If you're just doing some kind of processing on each line and then throwing it away, this is usually the way to go, even if it is slightly slower than readlines(), because you don't take the same memory hit. Of course, if your goal in the end is to store the entire file in a Python list, you're going to pay that memory cost anyway, so readlines() will work fine.
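As a rough illustration of the "process each line and throw it away" case, the sketch below computes one value over all lines without ever building a list. The function name longest_line_length is hypothetical, and an in-memory io.StringIO object stands in for a real file so the example is self-contained; it assumes Python 3.

```python
import io


def longest_line_length(lines):
    """Return the length of the longest line, keeping only one line at a time."""
    longest = 0
    for line in lines:
        # Each line is examined and then discarded; nothing is accumulated.
        longest = max(longest, len(line.rstrip("\n")))
    return longest


# io.StringIO behaves like an open text file for the purposes of iteration.
sample = io.StringIO("short\na much longer line\nmid\n")
print(longest_line_length(sample))
```

With a real file the call would look like longest_line_length(f) inside a with open("sorted_output.txt") as f: block, and memory use stays roughly constant regardless of the file's size.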

dano answered Sep 18 '22 17:09
