I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple one-line command that counts the number of lines in a huge CSV file:
$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds
$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds
Python 2.7.12 took 15 seconds; Python 3.5.2 took 66 seconds. I expected some difference, but why is it so large? What changed in Python 3 that makes it so much slower at this kind of task? And is there a faster way to count lines in Python 3?
My CPU is an Intel(R) Core(TM) i5-3570 @ 3.40GHz.
The size of huge.csv
is 18.1 GB and it contains 101253515 lines.
To be clear, I don't need to count the lines of a big file at any cost; this is just a particular case where Python 3 is much slower. I am actually developing a script in Python 3 that deals with big CSV files, and some operations don't involve the csv
library. I know I could write the script in Python 2 and the speed would be acceptable, but I would like to know how to write a similar script in Python 3. That is why I am interested in what makes Python 3 slower in my example and how it can be improved by "honest" Python approaches.
The sys.stdin
object is a bit more complicated in Python 3 than it was in Python 2. For example, by default, reading from sys.stdin
in Python 3 decodes the input as Unicode text, so it fails on bytes that aren't valid UTF-8:
$ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <genexpr>
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
Note that Python 2 doesn't have any problem with that input. So, as you can see, Python 3's sys.stdin
does more work under the hood. I'm not sure whether this fully accounts for the performance loss, but you can investigate further by trying sys.stdin.buffer
under Python 3:
import sys
print(sum(1 for _ in sys.stdin.buffer))
Note that .buffer
doesn't exist in Python 2. I've done some tests and I don't see any real difference in performance between Python 2's sys.stdin
and Python 3's sys.stdin.buffer
, but YMMV.
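If raw line-counting speed is the goal, iterating line by line can be skipped entirely: read fixed-size binary chunks and count the newline bytes with bytes.count, which runs in C and avoids both per-line object creation and UTF-8 decoding. A minimal sketch (the count_lines helper and the buffer size are my own, not from the original post):

```python
import io
import sys

def count_lines(stream, bufsize=1 << 16):
    """Count newline bytes by reading fixed-size binary chunks.

    Delegates the search for b'\\n' to bytes.count (implemented in C)
    instead of iterating over decoded lines.
    """
    total = 0
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        total += chunk.count(b"\n")
    return total

# For stdin, pass the binary layer: count_lines(sys.stdin.buffer)
# Demo on an in-memory stream:
print(count_lines(io.BytesIO(b"a\nb\nc\n")))  # 3
```

Note that this counts newline characters, so a final line without a trailing newline is not counted; the one-liners in the question have the same behavior for a file ending in a newline.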
EDIT Here are some results on my machine: Ubuntu 16.04, i7 CPU, 8 GiB RAM. First, some C code as a baseline for comparison:
#include <unistd.h>

int main(void) {
    char buffer[4096];
    size_t total = 0;
    while (1) {
        /* read(2) returns 0 at EOF and -1 on error */
        ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (result <= 0) {
            break;
        }
        total += result;
    }
    return 0;
}
Now the file size:
$ ls -s --block-size=M | grep huge2.txt
10898M huge2.txt
and tests:
# a.out is the compiled C program above (note that it skips the final print)
$ time cat huge2.txt | ./a.out
real 0m20.607s
user 0m0.236s
sys 0m10.600s
$ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889
real 1m24.268s
user 1m20.216s
sys 0m8.724s
$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
898773889
real 1m19.734s
user 1m14.432s
sys 0m11.940s
$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889
real 2m0.326s
user 1m56.148s
sys 0m9.876s
So the file I used was a bit smaller and my times were longer (it seems you have a faster machine, and I didn't have the patience for larger files :D). Anyway, Python 2's sys.stdin
and Python 3's sys.stdin.buffer
are quite similar in my tests. Python 3's sys.stdin
is way slower. And all of them are far behind the C code (which has almost zero user time).
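For completeness, the C loop above can be mirrored almost directly in Python with os.read on the raw file descriptor, which keeps the per-iteration work down to one syscall plus a length check. The drain_fd helper below is a hypothetical sketch, not code from the benchmark:

```python
import os

def drain_fd(fd=0, bufsize=4096):
    """Read a file descriptor to EOF in fixed-size chunks,
    mirroring the C read() loop; returns total bytes consumed."""
    total = 0
    while True:
        data = os.read(fd, bufsize)  # one read(2) syscall per iteration
        if not data:                 # empty bytes object means EOF
            break
        total += len(data)
    return total
```

Called as drain_fd(0), it consumes stdin just as the C program does; most of the remaining cost is interpreter overhead per iteration, which is why larger bufsize values tend to help.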