What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.
What I'd suggest is that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach the end of the file. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster the read will be.
Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred KB, not a few thousand bytes (like I said: bigger buffer = faster file read).
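Something like this minimal sketch (the file name and buffer size are just placeholders; the buffer is a multiple of 30 so the sampling stride stays aligned across refills):

#include <stdio.h>

#define STEP 30
#define BUFSIZE (STEP * 10000)   /* ~293KB, a multiple of STEP */

int main(void) {
    FILE *in = fopen("data.bin", "rb");   /* placeholder file name */
    if (!in) { perror("fopen"); return 1; }

    static char buf[BUFSIZE];
    size_t got;
    /* For a regular file, fread only returns short at end of file, so
       as long as BUFSIZE is a multiple of STEP the 30-byte stride
       carries over correctly from one refill to the next. */
    while ((got = fread(buf, 1, BUFSIZE, in)) > 0) {
        for (size_t j = 0; j < got; j += STEP) {
            unsigned char b = (unsigned char) buf[j];
            (void) b;   /* ... process the sampled byte here ... */
        }
    }
    fclose(in);
    return 0;
}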
Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness and (b) a desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.
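For example, something along these lines (size to taste; the 8000MB test below needs at least that much):

$ dd if=/dev/urandom of=rand1.data bs=1M count=8192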
#include <stdio.h>
#include <stdlib.h>

const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    int total = 0;
#if SEEK
    /* fseek variant: read one byte, then seek forward step-1 bytes */
    long long i = 0;
    char buf[1];
    while (i < size) {
        fread(buf, 1, 1, in);
        total += (unsigned char) buf[0];
        fseek(in, step - 1, SEEK_CUR);
        i += step;
    }
#endif
#ifdef BUFSZ
    /* buffered variant: read BUFSZ bytes at a time, sample every step-th */
    long long i = 0;
    char buf[BUFSZ];
    while (i < size) {
        fread(buf, BUFSZ, 1, in);
        i += BUFSZ;
        for (int j = 0; j < BUFSZ; j += step)
            total += (unsigned char) buf[j];
    }
#endif
    printf("%d\n", total);
}
Results:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s
Summary:
I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.
At this point I'm pulling fseek out of the ring so the other two can continue:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s
1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:
$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s
Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.
Now, the version with the 32KB buffer is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32-byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.
Moral of the story: standard I/O buffering doesn't cut it in my implementation; as the questioner says, the performance of fseek is atrocious. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.
incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a good job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it were the latter then fseek would be better than it is. It could also be that standard I/O is doing the read-ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).
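If you want to poke at that hypothesis yourself, one cheap experiment is to hand stdio a large buffer via setvbuf before running the fseek loop: if fseek really does discard the buffer on every call, the timing should barely move. A minimal sketch, assuming the same rand1.data file and the 20MB size from the first test:

#include <stdio.h>

int main(void) {
    const long long size = 1024LL * 1024 * 20;  /* 20MB, as in the first test */
    const int step = 32;

    FILE *in = fopen("rand1.data", "rb");  /* adjust the path as needed */
    if (!in) { perror("fopen"); return 1; }

    /* setvbuf must be called before the first read on the stream;
       _IOFBF requests full buffering with a 32KB buffer. */
    setvbuf(in, NULL, _IOFBF, 32 * 1024);

    int total = 0;
    char buf[1];
    for (long long i = 0; i < size; i += step) {
        if (fread(buf, 1, 1, in) != 1) break;
        total += (unsigned char) buf[0];
        fseek(in, step - 1, SEEK_CUR);
    }
    printf("%d\n", total);
    fclose(in);
    return 0;
}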