I can't work out why it's so much faster to parse this file in Python 2.7 than in Python 3.6. I've found this pattern both on macOS and Arch-Linux independently. Can others replicate it? Any explanation?
Warning: the code snippet writes a ~2GB file
Timings:
$ python2 test.py
5.01580309868
$ python3 test.py
10.664075019994925
Code for test.py:

import os
SEQ_LINE = 'ATCGN' * 80 + '\n'
if not os.path.isfile('many_medium.fa'):
    with open('many_medium.fa', 'w') as out_f:
        for i in range(1000000):
            out_f.write('>{}\n'.format(i))
            for _ in range(5):
                out_f.write(SEQ_LINE)

from timeit import timeit

def f():
    with open('many_medium.fa') as f:
        for line in f:
            pass

print(timeit('f()', setup='from __main__ import f', number=5))
Because in Python 2, the standard open() call creates a far simpler file object than the Python 3 open() call does. The Python 3 open() call is the same thing as io.open(), and the same framework is available on Python 2.
To make this a fair comparison, you'd have to add the following line to the top of your test:
from io import open
With that change, the timings on Python 2 go from 5.5 seconds to 37 seconds. Compared to that figure, the 11 seconds Python 3 takes on my system to run the test really is much, much faster.
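As a quick check of the claim that the two calls are the same thing, the following minimal sketch (meant to be run on Python 3; the variable name is only illustrative) compares the built-in open with io.open:

```python
import io

# On Python 3 the built-in open() and io.open() are the same object,
# so there is no separate "simple" file type to fall back on.
same_function = io.open is open
print(same_function)
```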
So what is happening here? The io library offers much more functionality than the old Python 2 file object: the file objects returned by open() consist of up to 3 layers of composed functionality, allowing you to control buffering and text handling. That extra functionality comes at a performance price.
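The layering can be inspected directly. This is a small sketch on Python 3, using a throwaway temporary file with made-up content; the layer descriptions in the comments are my own summary rather than anything from the original test:

```python
import os
import tempfile

# Write a tiny throwaway file (hypothetical content) to inspect.
with tempfile.NamedTemporaryFile('w', suffix='.fa', delete=False) as tmp:
    tmp.write('>0\nATCGN\n')
    path = tmp.name

try:
    with open(path) as f:
        text_layer = f               # decodes bytes to str, handles newlines
        buffered_layer = f.buffer    # reads the file in larger chunks
        raw_layer = f.buffer.raw     # thin wrapper around the OS file descriptor
        layer_names = [type(layer).__name__
                       for layer in (text_layer, buffered_layer, raw_layer)]
finally:
    os.remove(path)

print(layer_names)
```

On CPython this should list a text wrapper, a buffered reader and a raw file object, in that order.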
But your Python 2 test reads byte strings, newlines are always translated to \n, and the file object the code is working with is pretty close to the OS-supplied file primitive, with all the downsides. In Python 3, you usually want to process data from files as text, so opening a file in text mode gives you a file object that decodes the binary data to Unicode str objects.
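To make the difference concrete, here is a minimal Python 3 sketch that reads the same line in text mode and in binary mode. The file content and variable names are assumptions chosen for illustration:

```python
import os
import tempfile

# Create a small sample file with known bytes.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'>0\nATCGN\n')
    path = tmp.name

try:
    # Text mode: each line is decoded to a Unicode str.
    with open(path, 'r', encoding='ascii') as f:
        text_line = next(f)
    # Binary mode: each line is returned as raw bytes, with no decoding step.
    with open(path, 'rb') as f:
        bytes_line = next(f)
finally:
    os.remove(path)

print(type(text_line).__name__, repr(text_line))
print(type(bytes_line).__name__, repr(bytes_line))
```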
So how can you make things go 'faster' on Python 3? That depends on your specific use case, but you have some options:
- Set the newline parameter to the expected newline character sequence, like \n. Binary mode only supports \n as line separator.
- Process the file as binary data rather than decoding it to str. Alternatively, decode to Latin-1, a straight one-on-one mapping from byte to codepoint. This is an option when your data is ASCII-only too, where Latin-1 omits an error check on the bytes being in the range 0-127 rather than 0-255.

When using mode='rb', Python 3 can easily match the Python 2 timings; the test only takes 5.05 seconds on my system, using Python 3.7.
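The options above can be compared side by side with a sketch like the following. It uses a much smaller sample file than the one in the question, and the helper names (write_sample, count_lines, VARIANTS) are hypothetical. Absolute timings will vary by machine and Python version, so treat the output as a rough indication only:

```python
import os
import tempfile
from timeit import timeit

SEQ_LINE = 'ATCGN' * 80 + '\n'

def write_sample(path, records=2000):
    # Same layout as the file in the question, just far smaller.
    with open(path, 'w', encoding='ascii', newline='\n') as out_f:
        for i in range(records):
            out_f.write('>{}\n'.format(i))
            for _ in range(5):
                out_f.write(SEQ_LINE)

def count_lines(path, **open_kwargs):
    # Iterate over the file line by line, as the original test does.
    with open(path, **open_kwargs) as f:
        return sum(1 for _ in f)

VARIANTS = {
    'text, defaults': {},
    "text, newline='\\n'": {'newline': '\n'},
    'text, latin-1': {'encoding': 'latin-1', 'newline': '\n'},
    'binary': {'mode': 'rb'},
}

fd, sample_path = tempfile.mkstemp(suffix='.fa')
os.close(fd)
try:
    write_sample(sample_path)
    counts = {}
    for label, kwargs in VARIANTS.items():
        counts[label] = count_lines(sample_path, **kwargs)
        seconds = timeit(lambda: count_lines(sample_path, **kwargs), number=5)
        print('{:<22} {:.4f}s'.format(label, seconds))
finally:
    os.remove(sample_path)
```

Every variant sees the same number of lines; only the amount of work done per line differs.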
Using latin-1 as the codec vs. UTF-8 (the usual default) makes only a small difference; UTF-8 can be decoded very efficiently. But it could make a difference for other codecs. You generally want to set the encoding parameter explicitly, and not rely on the default encoding used.
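The one-on-one mapping that Latin-1 provides, and the range check that ASCII needs, can be illustrated with a short sketch (variable names are illustrative only):

```python
# Every possible byte value, 0 through 255.
data = bytes(range(256))

# Latin-1 maps each byte straight to the code point with the same number,
# so decoding can never fail.
as_latin1 = data.decode('latin-1')
one_to_one = all(ord(ch) == b for ch, b in zip(as_latin1, data))

# ASCII has to reject any byte above 127, which is the extra check
# that Latin-1 skips.
try:
    data.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(one_to_one, ascii_ok)
```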