
Random access of a large binary file

I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.

Things I've tried, which have more or less the same performance:

  1. Converting the file to the HDF5 format and using the C interface (slow).
  2. Writing a small C program to fseek() through the file (slow).

How can I randomly access this data really fast?

I want to get to less than a couple of seconds for the query.

asked Jul 11 '11 by Genausactly



2 Answers

The answer is basically "no".

A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to physically move the disk head. 16,000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; mmap(), for instance, won't help.

Welcome to the physical world, software person :-). You must improve the locality of your operations.

First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.

Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)
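For instance, here is a minimal sketch of the sort-and-merge idea; the file name, offsets, and record size are all illustrative, and MERGE_GAP is just the ~1 MB trade-off point mentioned above, which you should tune by benchmarking:

```c
/* Sketch: sort offsets, then coalesce reads whose gap is under MERGE_GAP.
 * Compile with -D_FILE_OFFSET_BITS=64 so off_t can handle a 12 GB file. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define MERGE_GAP (1 << 20)  /* ~1 MB: about one seek's worth of data */
#define RECORD_SIZE 1        /* bytes wanted at each offset (example) */

static int cmp_off(const void *a, const void *b) {
    off_t x = *(const off_t *)a, y = *(const off_t *)b;
    return (x > y) - (x < y);
}

int main(void) {
    off_t offsets[] = { 7340032, 1048576, 1049000, 123456789 }; /* hypothetical */
    size_t n = sizeof offsets / sizeof offsets[0];
    int fd = open("big.bin", O_RDONLY);  /* file name is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    qsort(offsets, n, sizeof offsets[0], cmp_off);

    for (size_t i = 0; i < n; ) {
        size_t j = i;
        /* extend the run while the next offset is within MERGE_GAP */
        while (j + 1 < n && offsets[j + 1] - offsets[j] < MERGE_GAP)
            j++;
        size_t span = (size_t)(offsets[j] - offsets[i]) + RECORD_SIZE;
        char *buf = malloc(span);
        if (!buf || pread(fd, buf, span, offsets[i]) < 0)
            perror("read");
        /* ... copy the wanted bytes out of buf at offsets[i..j] - offsets[i] ... */
        free(buf);
        i = j + 1;
    }
    close(fd);
    return 0;
}
```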

Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.

But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)

[edit]

@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so, meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range, where SSDs would be too expensive.)

[edit 2]

As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can improve the odds by using a good filesystem (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate() to inform the kernel that you intend to populate a large file).
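A minimal sketch of that hint at creation time; the file name and the 12 GB size are illustrative:

```c
/* Sketch: reserve the full file size up front so the filesystem can try
 * to allocate it contiguously. Compile with -D_FILE_OFFSET_BITS=64. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.bin", O_WRONLY | O_CREAT, 0644); /* name illustrative */
    if (fd < 0) { perror("open"); return 1; }

    int err = posix_fallocate(fd, 0, 12LL * 1024 * 1024 * 1024);
    if (err != 0)  /* returns an errno value directly, not -1/errno */
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return 0;
}
```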

answered by Nemo

Well, the speed you can achieve for this largely depends on the total number of read operations you perform to extract the 16 kB that make up the payload of your new file.

Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.

Since you're saying the access pattern is random, you're also unlikely to benefit from any readahead the operating system may decide to use; you can, if you so choose, switch it off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file (a length of 0 applies the advice to the whole file), or madvise() if you've chosen to mmap() it. But that'll only gain you anything if you're performing big reads (where you know a big readahead would be nonsense); for small reads, it's exclusively the seek time that'll determine the total.
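A minimal sketch of disabling readahead this way (the file name is illustrative):

```c
/* Sketch: advise the kernel that access will be random, suppressing
 * readahead. A length of 0 applies from the offset to end of file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.bin", O_RDONLY);  /* file name is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (err != 0)  /* returns an errno value directly, not -1/errno */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    /* ... perform the random pread() calls here ... */
    close(fd);
    return 0;
}
```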

Assuming you need N random reads and your disk has an M ms seek time, the extraction will take at least N * M milliseconds (if you've got the disk to yourself ...). There is no way to break this barrier.

Edit: A few things on mitigating strategies:

As mentioned by several people, the key to approaching this problem is to minimize seeks. There are several strategies for this:

  1. Issue asynchronous reads if you can (i.e. if read operation N+1 doesn't depend on what read operation N did, then you can issue both concurrently). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for the most efficient seeking; see the POSIX AIO sketch after this list.
  2. If you know all the positions in advance, then perform scatter-gather I/O (the UN*X preadv() would come to mind), to the same effect. (Note that preadv() scatters one contiguous file range into multiple buffers; for several distinct offsets you still need one call, or one asynchronous request, per offset.)
  3. Query your filesystem and/or block device for the best / minimum blocksize; how to do this is system-dependent, see e.g. statvfs() or even ioctl_list. If you know that, you might possibly use the technique mentioned by Nemo (merge two small reads within the "optimal" block size into a single large read, needing no seek).
  4. Possibly even use query interfaces like FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and perform a decision on read merging based on that (there's no point issuing a large "nonseeking" read if actually that crosses a physical block boundary and the filesystem turns it into two).
  5. If you build up the positions to read from over a comparatively long time, then reading (asynchronously) while you're still computing future read offsets will also help to hide seek latency, since you're putting compute cycles / wait time to good use.
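As a sketch of point 1, here is what queuing several reads at once might look like with POSIX AIO, so the kernel/driver can complete them in whatever order minimizes head movement. The file name, offsets, and chunk size are illustrative; on Linux, link with -lrt:

```c
/* Sketch: submit NREADS reads concurrently and wait for all of them.
 * Compile with -D_FILE_OFFSET_BITS=64; link with -lrt on Linux. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NREADS 4
#define CHUNK  4096

int main(void) {
    int fd = open("big.bin", O_RDONLY);  /* file name is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    off_t offsets[NREADS] = { 4096, 1048576, 7340032, 123456789 }; /* hypothetical */
    struct aiocb cbs[NREADS];
    const struct aiocb *list[NREADS];

    /* Queue all reads without waiting in between. */
    for (int i = 0; i < NREADS; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = malloc(CHUNK);
        cbs[i].aio_nbytes = CHUNK;
        cbs[i].aio_offset = offsets[i];
        if (aio_read(&cbs[i]) != 0) { perror("aio_read"); return 1; }
        list[i] = &cbs[i];
    }

    /* Block until every request has completed, then collect results. */
    for (int i = 0; i < NREADS; i++) {
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, NREADS, NULL);
        if (aio_return(&cbs[i]) < 0) perror("aio_return");
        free((void *)cbs[i].aio_buf);
    }
    close(fd);
    return 0;
}
```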

In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).

answered by FrankH.