I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.
Things I've tried, which have more or less the same performance:
How can I randomly access this data really fast?
I want to get to less than a couple of seconds for the query.
Random-access files permit nonsequential, or random, access to a file's contents. To access a file randomly, you open the file, seek to a particular location, and read from or write to the file at that position.
"Random-access file" is a term used to describe a file, or set of files, that is accessed directly rather than by reading through other data first. Hard drives access file contents directly, whereas tape drives commonly access them sequentially.
Binary files also usually have faster read and write times than text files, because a binary image of the record is stored directly from memory to disk (or vice versa). In a text file, everything has to be converted back and forth to text, and this takes time.
The purpose of the linecache module in Python's standard library is to facilitate random access to any text file, although the module is used mainly by Python's traceback module to generate error traces. Lines that have been read are held in a cache, which saves time when the same lines are read repeatedly.
The answer is basically "no".
A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to move the disk head. 16000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; e.g. mmap() is going to make no difference.
Welcome to the physical world, software person :-). You must improve the locality of your operations.
First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.
Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)
Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.
But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)
[edit]
@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so. Meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range where SSDs would be too expensive.)
[edit 2]
As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can help to improve this by using a good file system (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate to inform the kernel you intend to populate a large file).
Well, the speed you can achieve for this largely depends on the total number of read operations you perform in order to extract the 16 kB that makes up the payload for your new file.
Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.
Since you're saying the access pattern is random, you're also not likely to benefit from any readahead that the operating system may decide to use; you can, if you so choose, switch that off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file, or via madvise() if you've chosen to mmap() it. But that'll only gain you anything if you're performing big reads (and you know a big readahead would be nonsense). For small reads, it's exclusively the seek time that determines the total.
Assuming you need N random reads and you've got an M msec seek time, it'll take at least N * M milliseconds to perform the data extraction (if you've got the disk to yourself ...). There is no way to break this barrier.
Edit: A few things on mitigating strategies:
As mentioned by several people, the key to approaching this problem is to minimize seeks. There are several strategies for this:

- Issue independent reads concurrently (if read operation N+1 doesn't depend on what read operation N did, then you can issue both at the same time). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for the most efficient seeking.
- Batch the reads into vectored / list I/O (preadv() would come to mind), to the same effect.
- Query the file's physical layout via FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and make the read-merging decision based on that (there's no point issuing a large "nonseeking" read if it actually crosses a physical block boundary and the filesystem turns it into two).

In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid-state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).