
Linux splice() + kernel AIO when writing to disk

With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers and it is possible to get fine grained notification when data is actually flushed to disk. However, it requires data to be held in user space buffers for io_prep_pwrite().

With splice(), it is possible to move data directly to disk from kernel space buffers (pipes) without ever having to copy it. However, splice() returns as soon as the data is queued and does not wait for the actual writes to the disk.

The goal is to move data from sockets to disk without copying it around while getting confirmation that it has been flushed out. How can the two approaches above be combined?

If I combine splice() with O_SYNC, I expect splice() to block, so multiple threads would be needed to mask the latency. Alternatively, one could use the asynchronous io_prep_fsync()/io_prep_fdsync(), but these wait for all of the file's data to be flushed, not for a specific write. Neither is perfect.

What would be required is a combination of splice() with kernel AIO, allowing zero copy and asynchronous confirmation of writes, such that a single event driven thread can move data from sockets to the disk and get confirmations when required, but this doesn't seem to be supported. Is there a good workaround / alternative approach?

asked Nov 27 '13 10:11 by jop


1 Answer

You can't use splice() if you need confirmation of the writes.

AIO is available from userspace, but if you were doing this inside the kernel it might come down to finding out which bios (block I/O structures) are generated and waiting for those.

Block I/O structure:

  • http://www.makelinux.net/books/lkd2/ch13lev1sec3

If you want to use AIO, you will need to use io_getevents():

  • http://man7.org/linux/man-pages/man2/io_getevents.2.html
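As a rough sketch of that submit-and-wait cycle, the code below issues one write through kernel AIO and collects its completion with io_getevents(). It uses the raw syscalls and fills the `struct iocb` by hand (roughly what libaio's `io_prep_pwrite()` does) so that it does not depend on libaio being installed. The helper name `aio_write_confirmed` and the `tag` parameter are hypothetical. It assumes the caller opened `fd` with `O_DIRECT|O_SYNC` if durability is wanted, in which case the buffer, length, and offset must also meet the device's alignment requirements (for example via `posix_memalign()`).

```c
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Thin wrappers over the raw syscalls; libaio exposes the same operations. */
static long sys_io_setup(unsigned nr, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbs) {
    return syscall(SYS_io_submit, ctx, n, iocbs);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *tmo) {
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, tmo);
}
static long sys_io_destroy(aio_context_t ctx) {
    return syscall(SYS_io_destroy, ctx);
}

/* Hypothetical helper: submit one write tagged with `tag`, then wait for the
 * completion event carrying that tag. Returns the byte count reported by the
 * kernel, a negative errno value reported in the event, or -1 if setup or
 * submission fails. An event loop would keep one context alive and poll it
 * instead of creating a context per write. */
static long aio_write_confirmed(int fd, const void *buf, size_t len,
                                off_t off, uint64_t tag) {
    aio_context_t ctx = 0;
    if (sys_io_setup(8, &ctx) < 0) return -1;

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_data = tag; /* echoed back in io_event.data */

    struct iocb *list[1] = { &cb };
    long res = -1;
    if (sys_io_submit(ctx, 1, list) == 1) {
        struct io_event ev;
        long n;
        do {
            n = sys_io_getevents(ctx, 1, 1, &ev, NULL); /* block for 1 event */
        } while (n < 0 && errno == EINTR);
        if (n == 1 && ev.data == tag) res = (long)ev.res;
    }
    sys_io_destroy(ctx);
    return res;
}
```

The per-request tag is what gives the fine-grained notification: each completion event identifies which specific write finished.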

Here are some examples of how to perform AIO:

  • http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

If you do it from userspace and use msync(), it is still uncertain whether the data is actually on spinning rust yet.

msync() docs:

  • http://man7.org/linux/man-pages/man2/msync.2.html

You might have to soften your expectations in order to make the design more robust, because it can be very expensive to be sure that the writes have physically reached the disk.

The 'highest' typical standard for write assurance in the face of something like power loss is a journal that records each operation that modifies the storage. The journal itself is append-only, and you can see whether entries are complete when you play it back. The very last journal entry may not be complete, so something may still be lost.
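To make the journal idea concrete, here is a small sketch assuming a made-up record layout of `[length][payload][checksum]`. The function names, the layout, and the toy checksum are all illustrative assumptions (a real journal would use something like CRC32 and typically open the file with `O_APPEND`). Replay stops at the first record that is truncated or fails its checksum, which is how an incomplete final entry is detected.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy checksum for illustration only; not suitable for real integrity checks. */
static uint32_t checksum(const uint8_t *p, uint32_t n) {
    uint32_t s = 0;
    for (uint32_t i = 0; i < n; i++) s = s * 31u + p[i];
    return s;
}

/* Append one record at the current file offset and flush it.
 * Returns 0 on success, -1 on error. */
static int journal_append(int fd, const void *data, uint32_t len) {
    uint32_t sum = checksum(data, len);
    if (write(fd, &len, sizeof len) != (ssize_t)sizeof len) return -1;
    if (write(fd, data, len) != (ssize_t)len) return -1;
    if (write(fd, &sum, sizeof sum) != (ssize_t)sizeof sum) return -1;
    return fdatasync(fd);
}

/* Replay from the start of the file and return the number of complete
 * records. A torn or corrupt record ends the replay. */
static int journal_replay(int fd) {
    uint8_t buf[4096]; /* illustrative upper bound on payload size */
    off_t off = 0;
    int count = 0;
    for (;;) {
        uint32_t len, sum;
        if (pread(fd, &len, sizeof len, off) != (ssize_t)sizeof len) break;
        if (len > sizeof buf) break;
        if (pread(fd, buf, len, off + 4) != (ssize_t)len) break;
        if (pread(fd, &sum, sizeof sum, off + 4 + (off_t)len)
                != (ssize_t)sizeof sum) break;
        if (sum != checksum(buf, len)) break;
        off += 8 + (off_t)len;
        count++;
    }
    return count;
}
```

After a crash, everything counted by `journal_replay()` can be trusted as complete; whatever follows the last valid record is discarded.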

answered Nov 08 '22 20:11 by canolucas