Why is opening and iterating over a file handle over twice as fast in Python 2 vs Python 3?

Tags: python, file, python-3.x, parsing, python-2.7

I can't work out why it's so much faster to parse this file in Python 2.7 than in Python 3.6. I've seen this pattern independently on both macOS and Arch Linux. Can others replicate it? Is there an explanation?

Warning: the code snippet writes a ~2GB file

Timings:

$ python2 test.py 
5.01580309868
$ python3 test.py 
10.664075019994925

Code for test.py:

import os

SEQ_LINE = 'ATCGN'* 80 + '\n'

if not os.path.isfile('many_medium.fa'):
    with open('many_medium.fa', 'w') as out_f:
        for i in range(1000000):
            out_f.write('>{}\n'.format(i))
            for _ in range(5):
                out_f.write(SEQ_LINE)

from timeit import timeit

def f():
    with open('many_medium.fa') as f:
        for line in f:
            pass

print(timeit('f()', setup='from __main__ import f', number=5))

asked Sep 27 '18 21:09

Chris_Rands

1 Answer

Because in Python 2, the standard open() call creates a far simpler file object than the Python 3 open() call does. The Python 3 open call is the same thing as io.open(), and the same framework is available on Python 2.

To make this a fair comparison, you'd have to add the following line to the top of your test:

from io import open

With that change, the timings on Python 2 go from 5.5 seconds to 37 seconds. Compared to that figure, the 11 seconds Python 3 takes on my system to run the test really is much, much faster.
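
The claim that the Python 3 open() is the same thing as io.open() can be checked directly. A minimal sketch, assuming a Python 3 interpreter (on Python 2 the comparison would come out False, since the built-in open() there is the older file type):

```python
import io

# On Python 3 the built-in open and io.open are the same function object,
# so "from io import open" changes nothing there.
same_function = open is io.open
print(same_function)
```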

So what is happening here? The io library offers much more functionality than the old Python 2 file object:

File objects returned by open() consist of up to 3 layers of composed functionality, allowing you to control buffering and text handling.
Support for non-blocking I/O streams.
A consistent interface across a wide range of streams.
Much more control over the universal newline translation feature.
Full Unicode support.

That extra functionality comes at a performance price.
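
The "up to 3 layers" mentioned above can be made visible by inspecting the objects that open() returns. This is only an illustrative sketch for CPython 3; the scratch file and the variable names are made up for the example:

```python
import os
import tempfile

# Write a tiny scratch file so there is something to open.
fd, path = tempfile.mkstemp(suffix=".fa")
with os.fdopen(fd, "w") as out_f:
    out_f.write(">0\nATCGN\n")

# Text mode: a text layer wraps a buffer, which wraps the raw file.
with open(path) as text_f:
    text_layers = [
        type(text_f).__name__,             # decoding and newline handling
        type(text_f.buffer).__name__,      # buffering
        type(text_f.buffer.raw).__name__,  # raw OS-level file
    ]

# Binary mode: no text layer, only the buffer and the raw file.
with open(path, "rb") as bin_f:
    binary_layers = [type(bin_f).__name__, type(bin_f.raw).__name__]

# Unbuffered binary mode: just the raw file object.
with open(path, "rb", buffering=0) as raw_f:
    raw_layers = [type(raw_f).__name__]

os.remove(path)

print(text_layers)
print(binary_layers)
print(raw_layers)
```

Each additional layer does extra work per read, which is where the cost in text mode comes from.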

In Python 2, by contrast, your test reads byte strings, newlines are always translated to \n, and the file object the code is working with is pretty close to the OS-supplied file primitive, with all the downsides. In Python 3, you usually want to process data from files as text, so opening a file in text mode gives you a file object that decodes the binary data to Unicode str objects.

So how can you make things go 'faster' on Python 3? That depends on your specific use case, but you have some options:

For text-mode files, disable universal newline handling, especially when handling a file that uses line endings that differ from the platform standard. Set the newline parameter to the expected newline character sequence, like \n. Binary mode only supports \n as line separator.
Process the file as binary data, and don't decode to str. Alternatively, decode to Latin-1, a straight one-on-one mapping from byte to codepoint. This is an option when your data is ASCII-only too, where Latin-1 omits an error check on the bytes being in the range 0-127 rather than 0-255.
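
To show what each of these options actually hands back, here is a small sketch using a scratch file with mixed line endings. The file contents and variable names are invented for the example; only the open() parameters come from the options above:

```python
import os
import tempfile

# Scratch file with one Windows-style and one Unix-style line ending.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out_f:
    out_f.write(b"ATCGN\r\nATCGN\n")

# Default text mode: universal newlines, so \r\n is translated to \n.
with open(path, encoding="ascii") as in_f:
    universal = list(in_f)

# newline='\n': lines end only on \n and nothing is translated.
with open(path, encoding="ascii", newline="\n") as in_f:
    untranslated = list(in_f)

# Binary mode: bytes objects, split on \n only, no decoding at all.
with open(path, "rb") as in_f:
    raw_lines = list(in_f)

# Latin-1: every byte maps to exactly one code point, so decoding cannot fail.
with open(path, encoding="latin-1", newline="\n") as in_f:
    latin1_lines = list(in_f)

os.remove(path)

print(universal)
print(untranslated)
print(raw_lines)
print(latin1_lines)
```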

When using mode='rb', Python 3 can easily match the Python 2 timings; the test takes only 5.05 seconds on my system, using Python 3.7.

Using latin-1 as the codec vs. UTF-8 (the usual default) makes only a small difference; UTF-8 can be decoded very efficiently. But it could make a difference for other codecs. You generally want to set the encoding parameter explicitly, and not rely on the default encoding used.
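
If you want to compare these variants on your own machine, something along these lines could work. It is a rough sketch, not the benchmark used for the figures above: it writes a much smaller scratch file than the ~2GB one in the question, the helper read_all and the variant labels are made up for illustration, and the absolute numbers will differ per system.

```python
import os
import tempfile
from timeit import timeit

SEQ_LINE = "ATCGN" * 80 + "\n"

# A small stand-in for many_medium.fa (2000 records instead of 1000000).
fd, path = tempfile.mkstemp(suffix=".fa")
with os.fdopen(fd, "w", encoding="ascii", newline="\n") as out_f:
    for i in range(2000):
        out_f.write(">{}\n".format(i))
        for _ in range(5):
            out_f.write(SEQ_LINE)

def read_all(**open_kwargs):
    """Iterate over every line of the scratch file, discarding the lines."""
    with open(path, **open_kwargs) as in_f:
        for line in in_f:
            pass

# Keyword arguments for open() for each variant being compared.
variants = {
    "text, utf-8": dict(mode="r", encoding="utf-8"),
    "text, latin-1, newline='\\n'": dict(mode="r", encoding="latin-1", newline="\n"),
    "binary": dict(mode="rb"),
}

timings = {}
for label, kwargs in variants.items():
    timings[label] = timeit(lambda: read_all(**kwargs), number=20)
    print("{:<32} {:.4f}s".format(label, timings[label]))

os.remove(path)
```

With a file this small the operating system's cache dominates, so treat the output as a relative comparison only.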

answered Sep 27 '22 20:09

Martijn Pieters
