
Is there a difference between : "file.readlines()", "list(file)" and "file.read().splitlines(True)"?

What is the difference between :

with open("file.txt", "r") as f:
    data = list(f)

Or :

with open("file.txt", "r") as f:
    data = f.read().splitlines(True)

Or :

with open("file.txt", "r") as f:
    data = f.readlines()

They seem to produce exactly the same output. Is one better (or more Pythonic) than the others?

Bermuda asked Jul 23 '18 13:07


4 Answers

In the 3 cases, you're using a context manager to read a file. This file is a file object.

File Object

An object exposing a file-oriented API (with methods such as read() or write()). Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams. The canonical way to create a file object is by using the open() function. https://docs.python.org/3/glossary.html#term-file-object

list

with open("file.txt", "r") as f:
    data = list(f)

This works because a file object is iterable: iterating over it yields one line at a time. Converting it to a list works roughly like this:

data = [line for line in f]  # keeps pulling lines until the file raises StopIteration
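To spell out what that iteration does, here is a minimal sketch that builds the list by hand with next() and compares it with list(f). The temporary file and its contents are made up for the illustration; they are not part of the question.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

# Roughly what list(f) does, written out step by step.
manual = []
with open(path, "r") as f:
    iterator = iter(f)            # a file object is its own iterator
    while True:
        try:
            manual.append(next(iterator))
        except StopIteration:     # raised once the end of the file is reached
            break

with open(path, "r") as f:
    via_list = list(f)

os.remove(path)
print(manual == via_list)
```

Each element keeps its trailing newline, which is why list(f) and f.readlines() give the same result.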

readlines method

with open("file.txt", "r") as f:
    data = f.readlines()

The method readlines() reads until EOF using readline() and returns a list containing the lines.

Difference with list:

  1. You can pass an optional size hint to limit how much is read: fileObject.readlines(sizehint). The hint is a size in bytes (or characters), not a number of lines.

  2. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
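As a rough illustration of the size hint, the sketch below reads part of a file with readlines(20) and then the remainder with a second call. The file contents and the value 20 are arbitrary choices for the example, and the exact number of lines returned for a given hint can vary, so the code only checks that whole lines come back and nothing is lost.

```python
import os
import tempfile

# Hypothetical sample file with 1000 short lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(1000)))
    path = tmp.name

with open(path, "r") as f:
    everything = f.readlines()      # no hint: read up to EOF

with open(path, "r") as f:
    some = f.readlines(20)          # hint: stop once roughly 20 characters are read
    rest = f.readlines()            # continues from where the first call stopped

os.remove(path)
print(len(some), len(rest), len(everything))
```

Only whole lines are returned, so the two partial reads concatenated give back the full list.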

read

For this method, see the related question: When should I ever use file.read() or file.readlines()?

madjaoue answered Oct 03 '22 14:10


All three of your options produce the same end result, but nonetheless, one of them is definitely worse than the other two: doing f.read().splitlines(True).

The reason this is the worst option is that it requires the most memory. f.read() reads the file content into memory as a single (possibly huge) string object. Calling .splitlines(True) on that string then creates the list of individual lines, and only after that is the string holding the file's entire content released and its memory freed. So, at the moment of peak memory use (just before the memory for the big string is freed) this approach needs enough memory to store the entire content of the file twice: once as a string, and once as a list of strings.

By contrast, doing list(f) or f.readlines() will read a line from disk, add it to the result list, then read the next line, and so on. So the whole file content is never duplicated in memory, and the peak memory use will thus be about half that of the .splitlines(True) approach. These approaches are thus superior to using .read() and .splitlines(True).
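One way to check this on your own machine is to measure peak allocations with the standard tracemalloc module. The sketch below is only an assumption about how such a measurement could be set up: the helper name peak_bytes, the file size, and the contents are invented for the example, and the numbers it prints will differ between Python versions and platforms.

```python
import os
import tempfile
import tracemalloc

# Hypothetical test file: 50,000 lines of filler text.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"this is line number {i}\n" for i in range(50_000)))
    path = tmp.name

def peak_bytes(loader, path):
    """Return (peak traced memory in bytes, loaded lines) for one loading strategy."""
    tracemalloc.start()
    with open(path, "r") as f:
        data = loader(f)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, data

loaders = {
    "list(f)": list,
    "f.readlines()": lambda f: f.readlines(),
    "f.read().splitlines(True)": lambda f: f.read().splitlines(True),
}

peaks = {}
results = {}
for name, loader in loaders.items():
    peaks[name], results[name] = peak_bytes(loader, path)
    print(f"{name:28} peak = {peaks[name]:>12,} bytes")

os.remove(path)
```

tracemalloc only sees allocations made through Python's allocator, so treat the figures as a relative comparison, not as the process's real memory footprint.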

As for list(f) vs f.readlines(), there's no concrete advantage to either of them over the other; the choice between them is a matter of style and taste.

Mark Amery answered Oct 17 '22 04:10


Explicit is better than implicit, so I prefer:

with open("file.txt", "r") as f:
    data = f.readlines()

But, when it is possible, the most pythonic is to use the file iterator directly, without loading all the content to memory, e.g.:

with open("file.txt", "r") as f:
    for line in f:
        my_function(line)
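
As a small, self-contained sketch of that pattern, the code below counts the non-blank lines of a file while holding only one line in memory at a time. The sample file and the counting task are made up for the illustration.

```python
import os
import tempfile

# Hypothetical sample file with some blank lines in it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\n\ngamma\n")
    path = tmp.name

non_blank = 0
with open(path, "r") as f:
    for line in f:              # lines are read lazily, one per iteration
        if line.strip():        # skip lines that contain only whitespace
            non_blank += 1

os.remove(path)
print(non_blank)
```

No list of lines is ever built here, so memory use stays roughly constant however large the file is.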

Gelineau answered Oct 17 '22 03:10


TL;DR

Assuming you need a list of lines to manipulate afterwards, your three proposed solutions are all syntactically valid. There is no better (or more Pythonic) solution, especially since they are all recommended by the official Python documentation. So, choose the one you find the most readable and be consistent with it throughout your code. If performance is a deciding factor, see my timeit analysis below.


Here is the timeit comparison (10000 loops, about 20 lines in test.txt):

import timeit

def foo():
    with open("test.txt", "r") as f:
        data = list(f)

def foo1():
    with open("test.txt", "r") as f:
        data = f.read().splitlines(True)

def foo2():
    with open("test.txt", "r") as f:
        data = f.readlines()

print(timeit.timeit(stmt=foo, number=10000))
print(timeit.timeit(stmt=foo1, number=10000))
print(timeit.timeit(stmt=foo2, number=10000))

>>>> 1.6370758459997887
>>>> 1.410844805999659
>>>> 1.8176437409965729

I tried it with different numbers of loops and lines, and f.read().splitlines(True) always seems to perform a bit better than the other two.
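If you want to repeat this comparison yourself, the sketch below is one possible setup. It creates its own test file and uses timeit.repeat, keeping the fastest of several runs to reduce noise. The file size, loop counts, and function names are assumptions for the example, and no particular ranking is implied: results depend on the machine, the file size, and the Python version.

```python
import os
import tempfile
import timeit

# Hypothetical test file of about 20 lines, similar in size to the one described above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"sample line {i}\n" for i in range(20)))
    path = tmp.name

def use_list():
    with open(path, "r") as f:
        return list(f)

def use_splitlines():
    with open(path, "r") as f:
        return f.read().splitlines(True)

def use_readlines():
    with open(path, "r") as f:
        return f.readlines()

timings = {}
for func in (use_list, use_splitlines, use_readlines):
    # Best of 3 runs of 200 calls each; the minimum is the least noisy figure.
    timings[func.__name__] = min(timeit.repeat(stmt=func, repeat=3, number=200))
    print(f"{func.__name__:16} {timings[func.__name__]:.4f} s")

os.remove(path)
```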

Now, syntactically speaking, all of your examples seem to be valid. Refer to the official Python documentation for more information.

According to it, if your goal is to read lines from a file, you can loop over the file object:

for line in f:
    ...

The documentation states that this is memory efficient, fast, and leads to simple code. It would be another good alternative in your case if you don't need to manipulate the lines in a list.

EDIT

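Note that the True argument to splitlines (keepends) is not optional if you want the same output as the other two approaches. By default, splitlines() drops the line endings, whereas list(f) and f.readlines() keep them. Omit the argument only if you want the lines without their trailing newlines.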
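A quick check of what the keepends argument does, using io.StringIO as a stand-in for an open text file (an assumption made to keep the sketch self-contained). The second half shows a related edge case that is not discussed in the answers above: str.splitlines treats a few extra characters, such as the form feed "\x0c", as line boundaries, while file iteration does not.

```python
import io

text = "a\nb\n"

without_ends = text.splitlines()        # keepends defaults to False
with_ends = text.splitlines(True)       # keepends=True keeps the "\n"
from_file = io.StringIO(text).readlines()

print(without_ends)
print(with_ends)
print(from_file)

# Edge case: a form feed inside a line.
odd = "a\x0cb\n"
odd_split = odd.splitlines(True)            # splits at the form feed too
odd_file = io.StringIO(odd).readlines()     # splits only at the newline

print(odd_split)
print(odd_file)
```

For ordinary text files the three approaches agree, provided keepends is True.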

My personal recommendation

I don't want to make this answer too opinion-based, but I think it is worth saying that performance should not be your deciding factor until it actually becomes a problem for you, especially since all of these forms are allowed and recommended in the official Python documentation I linked.

So, my advice is:

First, pick the most logical one for your particular case. Then, among the remaining candidates, choose the one you find the most readable and be consistent with it throughout your code.

scharette answered Oct 17 '22 05:10