 

Why is Python3 much slower than Python2 on my task?

I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple one-line command that counts the lines in a huge CSV file.

$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds

$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds

Python 2.7.12 took 15 seconds, Python 3.5.2 took 66 seconds. I expected some difference, but why is it so huge? What is new in Python 3 that makes it so much slower at this kind of task? Is there a faster way to count lines in Python 3?

My CPU is Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.

The size of huge.csv is 18.1 GB and it contains 101253515 lines.

To be clear, I don't need to count the lines of a big file at any cost. This is just one particular case where Python 3 is much slower. I am actually developing a Python 3 script that deals with big CSV files, and some of its operations can't use the csv library. I know I could write the script in Python 2 and the speed would be acceptable, but I would like to know how to write a similar script in Python 3. That is why I am interested in what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.

Fomalhaut asked Nov 07 '17 12:11


1 Answer

The sys.stdin object is a bit more complicated in Python 3 than it was in Python 2. For example, by default reading from sys.stdin in Python 3 decodes the input into Unicode, so it fails on bytes that are not valid in the expected encoding (UTF-8 here):

$ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <genexpr>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Note that Python 2 has no problem with that input. So, as you can see, Python 3's sys.stdin does more work under the hood. I'm not sure this is exactly what is responsible for the performance loss, but you can investigate it further by trying sys.stdin.buffer under Python 3:

import sys
print(sum(1 for _ in sys.stdin.buffer))

Note that .buffer doesn't exist in Python 2. I've done some tests and I don't see any real difference in performance between Python 2's sys.stdin and Python 3's sys.stdin.buffer, but YMMV.
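To make the text-versus-bytes difference concrete, here is a minimal sketch that doesn't need a real pipe. io.BytesIO stands in for the binary stream behind stdin and io.TextIOWrapper for the decoding layer that sys.stdin adds on top; the sample bytes and the count_lines helper are made up for illustration.

```python
import io


def count_lines(stream):
    """Count lines by iterating the stream, as in the one-liners above."""
    return sum(1 for _ in stream)


# Made-up sample: two lines, the second starting with a byte (0xf8)
# that is not valid UTF-8.
data = b"a,b\n\xf8,c\n"

# Binary stream (what sys.stdin.buffer gives you): no decoding happens.
print(count_lines(io.BytesIO(data)))

# Text stream with strict UTF-8 decoding: the invalid byte raises.
try:
    count_lines(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# Text stream with latin-1: every byte maps to a character, so decoding
# cannot fail, but each line is still decoded into a str.
print(count_lines(io.TextIOWrapper(io.BytesIO(data), encoding="latin-1")))
```

The binary variant hands back bytes objects, so any later processing has to work on bytes (or decode only the fields it actually needs).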

EDIT: Here are some results on my machine: Ubuntu 16.04, i7 CPU, 8 GiB RAM. First some C code (as a baseline for comparison):

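#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    char buffer[4096];
    size_t total = 0;  /* bytes read so far; never printed */
    for (;;) {
        ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (result <= 0) {
            break;  /* 0 means end of input, -1 means a read error */
        }
        total += (size_t)result;
    }
    return 0;
}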

Now the file size:

$ ls -s --block-size=M | grep huge2.txt 
10898M huge2.txt

And the tests:

# a.out is the C program above; it only reads stdin in 4096-byte blocks (no line counting, no final print)
$ time cat huge2.txt | ./a.out

real    0m20.607s
user    0m0.236s
sys     0m10.600s


$ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    1m24.268s
user    1m20.216s
sys     0m8.724s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
898773889

real    1m19.734s
user    1m14.432s
sys     0m11.940s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    2m0.326s
user    1m56.148s
sys     0m9.876s

So the file I used was a bit smaller and the times were longer (it seems that you have a better machine, and I didn't have the patience for larger files :D). Anyway, Python 2's sys.stdin and Python 3's sys.stdin.buffer are quite similar in my tests, and Python 3's sys.stdin is way slower. All of them are way behind the C code (which has almost 0 user time).
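If only the line count is needed, one option that follows the same idea as the C loop is to read fixed-size binary chunks and count newline bytes instead of iterating line by line. This is a different technique from the one-liners timed above and was not part of those tests, so treat it as a sketch to benchmark yourself; count_newlines and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import io


def count_newlines(stream, chunk_size=1 << 20):
    # Read a binary stream in fixed-size chunks and count newline bytes.
    # A last line without a trailing newline is NOT counted, unlike
    # sum(1 for _ in stream).
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of input
            break
        total += chunk.count(b"\n")
    return total


# io.BytesIO stands in for the real pipe in this demonstration.
sample = io.BytesIO(b"a,b\nc,d\ne,f\n")
print(count_newlines(sample, chunk_size=4))  # prints 3
```

To use it on a pipe under Python 3, pass sys.stdin.buffer as the stream, e.g. print(count_newlines(sys.stdin.buffer)) after import sys.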

freakish answered Sep 21 '22 17:09