My task is very simple: Read and parse a large file in C++ on Linux. There are two ways:
Parse byte by byte.
while (/*...*/) {
    ... = fgetc(...);
    /* do something with the char */
}
Parse buffer by buffer.
while (/*...*/) {
    char buffer[SOME_LARGE_NUMBER];
    fread(buffer, SOME_LARGE_NUMBER, 1, ...);
    /* parse the buffer */
}
Now, parsing byte by byte is easier for me (no check for how full the buffer is, etc.). However, I have heard that reading in large chunks is more efficient.
What is the philosophy? Is "optimal" buffering a task of the kernel, so the data is already buffered by the time I call fgetc()? Or should I handle the buffering myself to get the best efficiency?
Also, apart from all philosophy: What's the reality on Linux here?
Regardless of the performance or underlying buffering of fgetc(), calling a function for every single byte you require, versus having a decent-sized buffer to iterate over, is overhead that the kernel cannot help you with.
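For what it's worth, the buffered loop does not have to be much more complicated than the byte-by-byte version. Here is a minimal sketch (the file name and the 64 KiB chunk size are placeholders I chose for illustration) that reads in chunks with fread() and still processes one byte at a time in an inner loop; because the element size is 1, the return value of fread() is exactly the number of bytes that landed in the buffer, so partial reads at end of file need no special handling:

#include <cstddef>
#include <cstdio>

int main() {
    std::FILE *fp = std::fopen("large_file.bin", "rb"); // hypothetical file name
    if (!fp) return 1;

    char buffer[64 * 1024]; // chunk size chosen arbitrarily for illustration
    unsigned long total = 0;

    for (;;) {
        // With element size 1, fread() returns the number of bytes read,
        // so a short read at EOF is handled naturally.
        std::size_t n = std::fread(buffer, 1, sizeof buffer, fp);
        if (n == 0) break; // EOF or error

        for (std::size_t i = 0; i < n; ++i) {
            // do something with buffer[i]; here we just count bytes
            ++total;
        }
    }

    std::fclose(fp);
    std::printf("read %lu bytes\n", total);
    return 0;
}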
I did some quick and dirty timings for my local system (obviously YMMV).
I chose a ~200k file and summed each byte. I did this 20000 times, alternating every 1000 cycles between reading with fgetc() and reading with fread(). I timed each block of 1000 cycles as a single lump. I compiled a release build with optimisations turned on.
The fgetc() loop variant was consistently 45x slower than the fread() loop.
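I don't have the original benchmark to hand, but a rough reconstruction of the methodology described above (sum every byte of the file, once with an fgetc() loop and once with an fread() loop, timing each batch of cycles) might look like the sketch below; the file name, chunk size, and cycle count are assumptions:

#include <cstddef>
#include <cstdio>
#include <chrono>

// Sum every byte using fgetc().
static unsigned long sum_fgetc(const char *path) {
    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return 0;
    unsigned long sum = 0;
    int c;
    while ((c = std::fgetc(fp)) != EOF)
        sum += static_cast<unsigned char>(c);
    std::fclose(fp);
    return sum;
}

// Sum every byte using fread() into a local buffer.
static unsigned long sum_fread(const char *path) {
    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return 0;
    unsigned long sum = 0;
    char buffer[64 * 1024];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, fp)) > 0)
        for (std::size_t i = 0; i < n; ++i)
            sum += static_cast<unsigned char>(buffer[i]);
    std::fclose(fp);
    return sum;
}

int main() {
    const char *path = "test_200k.bin"; // ~200k test file, name is a placeholder
    const int cycles = 1000;            // one timed "lump", as described above

    auto t0 = std::chrono::steady_clock::now();
    unsigned long s1 = 0;
    for (int i = 0; i < cycles; ++i) s1 += sum_fgetc(path);
    auto t1 = std::chrono::steady_clock::now();
    unsigned long s2 = 0;
    for (int i = 0; i < cycles; ++i) s2 += sum_fread(path);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("fgetc: %lld ms (sum %lu)\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(), s1);
    std::printf("fread: %lld ms (sum %lu)\n",
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), s2);
    return 0;
}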
After prompting in the comments, I also compared getc() and tried varying the size of the stdio buffer. Neither made a noticeable difference in performance.
The stdio buffer is not part of the kernel; it lives in user space. You can, however, control the size of that buffer with setvbuf() (setbuf() only installs a buffer of the default size BUFSIZ). When the buffer runs empty, the stdio library refills it by issuing the read() system call.
So in terms of switching between kernel and user space, it does not matter whether you use fgetc() or fread().
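If you want to experiment with the user-space stdio buffer yourself, a minimal sketch looks like this (the file name and the 1 MiB buffer size are assumptions of mine, not anything the answers above prescribe); note that even with a huge stdio buffer, each fgetc() call still pays the per-byte function-call overhead, so only the frequency of read() syscalls changes:

#include <cstdio>

int main() {
    std::FILE *fp = std::fopen("large_file.bin", "rb"); // hypothetical file name
    if (!fp) return 1;

    // Ask stdio for a 1 MiB fully-buffered stream buffer; this must be done
    // before the first read on the stream. setvbuf() returns 0 on success.
    static char buf[1 << 20];
    if (std::setvbuf(fp, buf, _IOFBF, sizeof buf) != 0)
        std::fprintf(stderr, "setvbuf failed, using default buffering\n");

    unsigned long sum = 0;
    int c;
    while ((c = std::fgetc(fp)) != EOF)
        sum += static_cast<unsigned char>(c);

    std::fclose(fp);
    std::printf("sum = %lu\n", sum);
    return 0;
}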