I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.
Things I've tried, which have more or less the same performance:
How can I randomly access this data really fast?
I want to get to less than a couple of seconds for the query.
Random-access files permit nonsequential, or random, access to a file's contents. To access a file randomly, you open the file, seek to a particular location, and read from or write to the file at that position.
"Random-access file" is a term used to describe a file, or set of files, that is accessed directly rather than by reading through other data first. Hard drives access file contents directly, whereas tape drives commonly access them sequentially.
Binary files also usually have faster read and write times than text files, because a binary image of the record is stored directly from memory to disk (or vice versa). In a text file, everything has to be converted back and forth to text, and this takes time.
The purpose of the linecache module in Python's standard library is to facilitate random access to any text file, although the module is used mainly by Python's traceback module to generate error traces. Lines that have been read are held in a cache, which saves time when the same lines are read repeatedly.
The answer is basically "no".
A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to move the disk head. 16000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; e.g. mmap() is going to make no difference.
Welcome to the physical world, software person :-). You must improve the locality of your operations.
First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.
Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)
Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.
But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)
[edit]
@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so. Meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range where SSDs would be too expensive.)
[edit 2]
As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can help to improve this by using a good file system (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate to inform the kernel you intend to populate a large file).
Well, the speed you can achieve for this largely depends on the total number of read operations you perform in order to extract the 16 kB that makes up the payload for your new file.
Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.
Since you're saying the access pattern is random, you're also not likely to benefit from any readahead that the operating system may decide to use; you can, if you so choose, switch that off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file, or via madvise() if you've chosen to mmap() it. But that'll only gain you anything if you're performing big reads (and you know a big readahead would be nonsense). For small reads, it's exclusively the seek time that determines the total.
Assuming you need N random reads and you've got an M msec seek time, it'll take at least N * M milliseconds to perform the data extraction (if you've got the disk to yourself ...). There is no way to break this barrier.
Edit: A few things on mitigating strategies:
As mentioned by several people, the key to approaching this problem is to minimize seeks. There are several strategies for this:

- Issue independent reads concurrently (if read operation N+1 doesn't depend on what read operation N did, then you can issue both at the same time). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for the most efficient seeking.
- Batch the reads into vectored / list I/O (preadv() would come to mind), to the same effect.
- Query the file's physical layout via FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and make the read-merging decision based on that (there's no point issuing a large "nonseeking" read if it actually crosses a physical block boundary and the filesystem turns it into two).

In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid-state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).