Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to read every 30th byte of large binary file?

Tags:

What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.

like image 848
K_T Avatar asked Mar 06 '10 23:03

K_T


People also ask

How much space does binary take up?

Each integer is usually around 4 bytes of storage. So if you are storing the number in binary in the text file, and the binary equivalent is 1101110111010101, there are 16 integers in that binary number. 16 * 4 = 64. So your number will take up about 64 bytes of storage.

Why binary file is faster?

Answer: A binary file is usually very much smaller than a text file that contains an equivalent amount of data. I/O with smaller files is faster, too, since there are fewer bytes to move.


2 Answers

What I'd suggest is that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach the eof. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster it'll be.

Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred kb's, not a few thousand bytes (like I said - bigger buffer = faster file read).

like image 110
Cam Avatar answered Oct 25 '22 18:10

Cam


Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness, (b) desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

#include <stdio.h> #include <stdlib.h>  const long long size = 1024LL*1024*MEGS; const int step = 32;  int main() {     FILE *in = fopen("/cygdrive/c/rand1.data", "rb");     int total = 0;     #if SEEK         long long i = 0;         char buf[1];         while (i < size) {             fread(buf, 1, 1, in);             total += (unsigned char) buf[0];             fseek(in, step - 1, SEEK_CUR);             i += step;         }     #endif     #ifdef BUFSZ         long long i = 0;         char buf[BUFSZ];         while (i < size) {             fread(buf, BUFSZ, 1, in);             i += BUFSZ;             for (int j = 0; j < BUFSZ; j += step)                  total += (unsigned char) buf[j];         }     #endif     printf("%d\n", total); } 

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2 83595817  real    0m1.391s user    0m0.030s sys     0m0.030s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2 83595817  real    0m0.172s user    0m0.108s sys     0m0.046s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2 83595817  real    0m0.031s user    0m0.030s sys     0m0.015s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2 83595817  real    0m0.141s user    0m0.140s sys     0m0.015s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2 83595817  real    0m20.797s user    0m1.733s sys     0m9.140s 

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2 -117681741  real    0m33.437s user    0m0.749s sys     0m1.562s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2 -117681741  real    0m6.078s user    0m5.030s sys     0m0.484s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2 -117681741  real    0m1.141s user    0m0.280s sys     0m0.500s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2 -117681741  real    0m6.094s user    0m4.968s sys     0m0.640s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2 -117681741  real    0m1.140s user    0m0.171s sys     0m0.640s 

1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2 -938074821  real    3m25.515s user    0m5.155s sys     0m12.640s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2 -938074821  real    3m59.015s user    1m11.061s sys     0m10.999s  $ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2 -938074821  real    3m42.423s user    0m5.577s sys     0m14.484s 

Ignore the first of those three, it benefited from the first 1000MB of the file already being in RAM.

Now, the version with the 32KB is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32 byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.

Moral of the story: standard I/O buffering doesn't cut it in my implementation, the performance of fseek is atrocious as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it was the latter then fseek would be better than it is. Actually, it could be that the standard I/O is doing the read ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).

like image 34
4 revs Avatar answered Oct 25 '22 18:10

4 revs