
Fastest way to do many small, blind writes on a huge file (in C++)?

I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.

To elaborate, I have a pair of lists of (key, record number) tuples sorted by key for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is what is the fastest way to actually do this?
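For concreteness, the merge step over the two sorted lists looks roughly like this (a minimal sketch; the struct names and the strict 1:1 assumption are mine):

#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyRef { uint64_t key; uint64_t recno; };        // (key, record number)
struct Joined { uint64_t key; uint64_t rec_a, rec_b; }; // (key, record number A, record number B)

// Merge-join two lists that are already sorted by key, assuming a 1:1 mapping.
std::vector<Joined> MergeJoin(const std::vector<KeyRef>& a,
                              const std::vector<KeyRef>& b)
{
    std::vector<Joined> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].key < b[j].key)      ++i;          // key only in A, skip
        else if (b[j].key < a[i].key) ++j;          // key only in B, skip
        else {                                      // keys match: emit a joined tuple
            out.push_back({a[i].key, a[i].recno, b[j].recno});
            ++i; ++j;
        }
    }
    return out;
}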

Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. This has the benefit of greatly increasing the chances that the appropriate records will be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it will incur extremely high syscall overhead.
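For comparison, here is a minimal sketch of the latter approach, i.e. reading whole chunks, patching the pointers in memory, and writing the chunks back, assuming the updates for a cluster have already been sorted by record number. The record size, pointer offset, and chunk size are placeholder values, the file is assumed to be open in binary read/write mode, and error handling is omitted:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

constexpr std::size_t kRecordSize   = 128;       // fixed record length (placeholder)
constexpr std::size_t kPtrOffset    = 64;        // byte offset of the pointer field (placeholder)
constexpr uint64_t    kChunkRecords = 1 << 16;   // records per chunk (~8 MB with these numbers)

struct Update { uint64_t recno; uint64_t pointer; };   // write 'pointer' into record 'recno'

// Apply updates (pre-sorted by record number) by reading whole chunks,
// patching them in memory, and writing them back, instead of issuing one
// seek + small write per record.
void ApplyUpdates(std::fstream& f, const std::vector<Update>& updates)
{
    std::vector<char> buf(kRecordSize * kChunkRecords);
    std::size_t i = 0;
    while (i < updates.size()) {
        uint64_t baseRec = (updates[i].recno / kChunkRecords) * kChunkRecords;
        std::streamoff pos = static_cast<std::streamoff>(baseRec * kRecordSize);

        f.seekg(pos);
        f.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = f.gcount();
        f.clear();   // the last chunk may be short; clear eof/fail so seekp works

        // Patch every pending update that falls inside this chunk.
        while (i < updates.size() && updates[i].recno < baseRec + kChunkRecords) {
            std::size_t off = static_cast<std::size_t>(updates[i].recno - baseRec) * kRecordSize
                              + kPtrOffset;
            std::memcpy(buf.data() + off, &updates[i].pointer, sizeof(uint64_t));
            ++i;
        }

        f.seekp(pos);
        f.write(buf.data(), got);
    }
}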

While I'd love a portable solution (even if it involves a dependency on a widely used library, such as Boost), modern Windows and Linux are the only must-haves, so I can make use of OS-specific APIs (e.g. CreateFile hints or scatter/gather I/O). However, this can involve a lot of work to even try out, so I'm wondering if anyone can tell me if it's likely worth the effort.

asked Jul 09 '10 by Trevor Robinson


2 Answers

It looks like you can solve this with an appropriate data structure. You've got three constraints:

  • Access time must be reasonably quick
  • Data must be kept sorted
  • You are on a spinning disk

B+ Trees were created specifically to address the kind of workload you're dealing with here. There are several links to implementations in the linked Wikipedia article.

Essentially, a B+ tree is a generalization of a binary search tree in which many keys are held together in a single node, sized to a disk block. That way, instead of having to seek around for every comparison, the B+ tree loads only one block at a time, and each node keeps enough information around to know which block it will need next in a search.
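A minimal sketch of that idea, with each node sized to one disk page so a single read pulls in a whole group of keys (the 4 KB page size and the field names are assumptions of mine):

#include <cstddef>
#include <cstdint>

// One on-disk B+ tree node, packed to fill a 4 KB page so that a single disk
// read pulls in a whole group of keys instead of one binary-tree node.
constexpr std::size_t kPageSize = 4096;
constexpr std::size_t kFanout   = (kPageSize - 16) / (2 * sizeof(uint64_t));  // 255 entries

struct BPlusNode {
    uint32_t count;               // number of keys currently in use
    uint32_t is_leaf;             // 1 = leaf node, 0 = internal node
    uint64_t next_leaf;           // leaves are chained for in-order range scans
    uint64_t keys[kFanout];       // sorted keys
    uint64_t children[kFanout];   // child page numbers, or record numbers in a leaf
};
static_assert(sizeof(BPlusNode) <= kPageSize, "node must fit in one page");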

EDIT: If you need to sort by more than one item, you can do something like:


+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
      ||      ^     |     ^    |          ^
      |\------/     |     |    |          |
      \-------------------/    |          |
                    |          |          |
                    \----------+----------/

I.e. you have separate B+ trees for each key, and a separate list of records, pointers to which are stored in the B+ trees.
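A possible fixed-size header for the layout sketched above might look like this (the field names are purely illustrative):

#include <cstdint>

// Fixed-size header at the start of the file, pointing at the two B+ tree
// roots and at the packed record area from the diagram above.
struct FileHeader {
    char     magic[8];         // file format identifier / version
    uint64_t root_by_a;        // page number of the root of the B+ tree keyed by A
    uint64_t root_by_b;        // page number of the root of the B+ tree keyed by B
    uint64_t record_area;      // byte offset of the fixed-length record area
    uint64_t record_size;      // size of one record in bytes
    uint64_t record_count;     // number of records stored
};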

answered by Billy ONeal


I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. [...] it seems that it will incur extremely high syscall overhead.

You can use memory-mapped access to the file to avoid the syscall overhead: mmap() on *NIX, and CreateFileMapping() on Windows.

Split the file logically into blocks, e.g. 32 MB each. If something needs to be changed in a block, mmap() it, modify the data, optionally msync() if desired, munmap() it, and then move to the next block.
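A minimal sketch of that loop on *NIX (the 32 MB block size is the one suggested above; on Windows the equivalent calls are CreateFileMapping()/MapViewOfFile(); error handling is mostly omitted):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

constexpr std::size_t kBlockSize = 32u << 20;   // 32 MB per mapped block

// Hypothetical helper: write a 64-bit pointer at absolute byte offset 'off'
// of the file behind 'fd', by mapping only the 32 MB block that contains it.
// Assumes the block lies entirely within the file and that the build uses
// 64-bit file offsets (e.g. -D_FILE_OFFSET_BITS=64 on 32-bit Linux).
// In real code you would batch all updates that fall into one block before
// unmapping it, rather than mapping once per update.
bool PatchPointer(int fd, uint64_t off, uint64_t value)
{
    uint64_t    block_start = (off / kBlockSize) * kBlockSize;   // page-aligned
    std::size_t in_block    = static_cast<std::size_t>(off - block_start);

    void* p = mmap(nullptr, kBlockSize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, static_cast<off_t>(block_start));
    if (p == MAP_FAILED)
        return false;

    std::memcpy(static_cast<char*>(p) + in_block, &value, sizeof(value));

    // msync(p, kBlockSize, MS_ASYNC);   // optional, as noted above
    munmap(p, kBlockSize);
    return true;
}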

That is what I would try first. The OS will automatically read whatever needs to be read (on first access to the data), and it will queue the I/O however it likes.

The important thing to keep in mind is that real I/O isn't that fast. The performance-limiting factors for random access are (1) the number of I/Os per second (IOPS) the storage can handle and (2) the number of disk seeks. (Usual IOPS is in the hundreds; usual seek latency is 3-5 ms.) A storage device might, for example, read/write at 50 MB/s: one continuous 50 MB block in one second. But if you try to patch a 50 MB file byte by byte, the seek times will simply kill the performance. Up to some limit, it is OK to read more and write more, even if you only need to update a few bytes.
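As a rough back-of-envelope comparison using the figures above (the count of 10 million updates is just an assumed example):

    10,000,000 random single-record updates at ~200 IOPS  ->  ~50,000 s  (about 14 hours)
    one sequential pass over a 4 GB file at 50 MB/s       ->  4096 MB / 50 MB/s  =  ~80 s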

Another limit to observe is the OS's maximum size for a single I/O operation: it depends on the setup, but most OSs will split I/O requests larger than 128 KB. The limit can be changed, and it works best when it is kept in sync with the corresponding limit of the storage.

Also keep the storage itself in mind. Many people forget that there is often only one storage device underneath: starting a crapload of threads doesn't help I/O unless you actually have multiple storage devices. Even a single CPU/core can easily saturate a RAID10 with its roughly 800 read IOPS and 400 write IOPS limits. (But a dedicated thread per storage device at least theoretically makes sense.)

Hope that helps. Other people here often mention Boost.Asio, which I have no experience with, but it is worth checking out.

P.S. Frankly, I would love to hear other (more informative) responses to your question. I've been in this boat several times already, yet never had the chance to really get to the bottom of it. Books/links/etc. related to I/O optimization (regardless of platform) are welcome ;)

answered by Dummy00001