python read() from stdout much slower than reading line by line (slurping?)

I have a Python subprocess.Popen call that runs an executable and pipes its output to the subprocess's stdout.

In cases where the stdout data is relatively small (~2k lines), the performance of reading line by line and reading the whole output as a chunk (stdout.read()) is comparable, with stdout.read() being slightly faster.

Once the data gets to be larger (say 30k+ lines), the performance for reading line by line is significantly better.

This is my comparison script:

import subprocess
import time

# line-by-line read (executable is defined elsewhere)
proc = subprocess.Popen(executable, stdout=subprocess.PIPE)
tmp = []
tic = time.clock()
for line in iter(proc.stdout.readline, b''):
    tmp.append(line)
print("line by line = %.2f" % (time.clock() - tic))

# slurped read
proc = subprocess.Popen(executable, stdout=subprocess.PIPE)
tic = time.clock()
fullFile = proc.stdout.read()
print("slurped = %.2f" % (time.clock() - tic))

And these are the results for a read of ~96k lines (about 50 MB on disk):

line by line = 5.48
slurped = 153.03

I am unclear why the performance difference is so extreme. My expectation was that the read() version would be faster than storing the results line by line. Of course, I did expect line by line to win in the practical case where there is significant per-line processing that could be done during the read.

Can anyone give me insight into the read() performance cost?

asked Jan 27 '14 by PaulD

2 Answers

This is not just Python; reading character by character is always slower than reading whole lines or big chunks, because you pay the per-call overhead on every single byte.

Consider these two simple C programs:

[readchars.c]

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

int main(void) {
        FILE* fh = fopen("largefile.txt", "r");
        if (fh == NULL) {
                perror("Failed to open file largefile.txt");
                exit(1);
        }

        int c;

        /* read the file one character at a time */
        c = fgetc(fh);
        while (c != EOF) {
                c = fgetc(fh);
        }

        fclose(fh);
        return 0;
}

[readlines.c]

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

int main(void) {
        FILE* fh = fopen("largefile.txt", "r");
        if (fh == NULL) {
                perror("Failed to open file largefile.txt");
                exit(1);
        }

        /* read the file one line (up to 120 chars) at a time */
        char* s = malloc(120);
        while (fgets(s, 120, fh) != NULL) {
                /* nothing to do with the line for this benchmark */
        }

        free(s);
        fclose(fh);

        return 0;
}

And their results (YMMV; largefile.txt was a ~200 MB text file):

$ gcc readchars.c -o readchars
$ time ./readchars            
./readchars  1.32s user 0.03s system 99% cpu 1.350 total
$ gcc readlines.c -o readlines
$ time ./readlines            
./readlines  0.27s user 0.03s system 99% cpu 0.300 total
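
The same effect shows up at the Python level: even though the I/O layer buffers the file, making one call per byte is dramatically slower than iterating by line. A rough sketch of the comparison, assuming a largefile.txt like the one used above:

import time

def read_by_char(path):
    # one read() call per byte; the per-call overhead dominates
    with open(path, "rb") as fh:
        while fh.read(1):
            pass

def read_by_line(path):
    # iterating the file object pulls whole buffered lines per call
    with open(path, "rb") as fh:
        for _line in fh:
            pass

for fn in (read_by_char, read_by_line):
    tic = time.time()
    fn("largefile.txt")  # assumed test file, as in the C programs
    print("%s: %.2f s" % (fn.__name__, time.time() - tic))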
answered by Dan Keder

Try adding a bufsize option to your Popen call and see if it makes a difference:

proc=subprocess.Popen(executable, bufsize=-1, stdout=subprocess.PIPE)

Popen includes an option to set the buffer size used when reading from the pipe. bufsize defaults to 0, which means unbuffered; 1 means line buffered; any other positive value means a buffer of approximately that size; and a negative value means the system default, which usually means fully buffered.

The Python docs include this note:

Note: if you experience performance issues, it is recommended that you try to enable buffering by setting bufsize to either -1 or a large enough positive value (such as 4096).
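
As a quick check, here is a minimal sketch that times the slurped read with the pipe unbuffered versus the system default; executable stands in for the same command as in the question:

import subprocess
import time

executable = ["./your_program"]  # placeholder for the command from the question

for bufsize in (0, -1):  # 0 = unbuffered pipe, -1 = system default (fully buffered)
    proc = subprocess.Popen(executable, bufsize=bufsize, stdout=subprocess.PIPE)
    tic = time.time()
    data = proc.stdout.read()
    proc.wait()
    print("bufsize=%2d: %d bytes in %.2f s" % (bufsize, len(data), time.time() - tic))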

answered by Kevin