What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.
What I'd suggest is that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach the end of the file. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster the read will be.
Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred KB, not a few thousand bytes (like I said: bigger buffer = faster file read).
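Something like this minimal sketch (the file name and buffer size are just placeholders; the buffer is a multiple of 30 so the sampling stride stays aligned across refills):

#include <stdio.h>

#define STEP 30
#define BUFSIZE (STEP * 10000)   /* ~293KB, a multiple of STEP */

int main(void) {
    FILE *in = fopen("data.bin", "rb");   /* placeholder file name */
    if (!in) { perror("fopen"); return 1; }

    static char buf[BUFSIZE];
    size_t got;
    /* For a regular file, fread only returns short at end of file, so
       as long as BUFSIZE is a multiple of STEP the 30-byte stride
       carries over correctly from one refill to the next. */
    while ((got = fread(buf, 1, BUFSIZE, in)) > 0) {
        for (size_t j = 0; j < got; j += STEP) {
            unsigned char b = (unsigned char) buf[j];
            (void) b;   /* ... process the sampled byte here ... */
        }
    }
    fclose(in);
    return 0;
}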
Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness and (b) a desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.
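For example, something along these lines (size to taste; the 8000MB test below needs at least that much):

$ dd if=/dev/urandom of=rand1.data bs=1M count=8192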
#include <stdio.h>
#include <stdlib.h>

const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    int total = 0;
#if SEEK
    /* fseek variant: read one byte, then seek forward step-1 bytes */
    long long i = 0;
    char buf[1];
    while (i < size) {
        fread(buf, 1, 1, in);
        total += (unsigned char) buf[0];
        fseek(in, step - 1, SEEK_CUR);
        i += step;
    }
#endif
#ifdef BUFSZ
    /* buffered variant: read BUFSZ bytes at a time, sample every step-th */
    long long i = 0;
    char buf[BUFSZ];
    while (i < size) {
        fread(buf, BUFSZ, 1, in);
        i += BUFSZ;
        for (int j = 0; j < BUFSZ; j += step)
            total += (unsigned char) buf[j];
    }
#endif
    printf("%d\n", total);
}
Results:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s
Summary:
I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.
At this point I'm pulling fseek out of the ring so the other two can continue:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s
1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s
Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.
Now, the version with the 32KB buffer is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32-byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.
Moral of the story: standard I/O buffering doesn't cut it in my implementation; as the questioner says, the performance of fseek is atrocious. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.
incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a good job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it were the latter then fseek would be better than it is. It could also be that standard I/O is doing the read-ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).
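If you want to poke at that hypothesis yourself, one cheap experiment is to hand stdio a large buffer via setvbuf before running the fseek loop: if fseek really does discard the buffer on every call, the timing should barely move. A minimal sketch, assuming the same rand1.data file and the 20MB size from the first test:

#include <stdio.h>

int main(void) {
    const long long size = 1024LL * 1024 * 20;  /* 20MB, as in the first test */
    const int step = 32;

    FILE *in = fopen("rand1.data", "rb");  /* adjust the path as needed */
    if (!in) { perror("fopen"); return 1; }

    /* setvbuf must be called before the first read on the stream;
       _IOFBF requests full buffering with a 32KB buffer. */
    setvbuf(in, NULL, _IOFBF, 32 * 1024);

    int total = 0;
    char buf[1];
    for (long long i = 0; i < size; i += step) {
        if (fread(buf, 1, 1, in) != 1) break;
        total += (unsigned char) buf[0];
        fseek(in, step - 1, SEEK_CUR);
    }
    printf("%d\n", total);
    fclose(in);
    return 0;
}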