Hypothetically, suppose I want to perform sequential writing to a potentially very large file.
If I mmap() a gigantic region and madvise(MADV_SEQUENTIAL) on that entire region, then I can write to the memory in a relatively efficient manner. This I have gotten to work just fine.
Now, in order to free up various OS resources as I am writing, I occasionally perform a munmap() on small chunks of memory that have already been written to. My concern is that munmap() and msync() will block my thread, waiting for the data to be physically committed to disk. I cannot slow down my writer at all, so I need to find another way.
Would it be better to use madvise(MADV_DONTNEED) on the small, already-written chunk of memory? I want to tell the OS to write that memory to disk lazily, and not to block my calling thread.
The manpage on madvise() has this to say, which is rather ambiguous:
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the
application is finished with the given range, so the kernel can free
resources associated with it.) Subsequent accesses of pages in this
range will succeed, but will result either in re-loading of the memory
contents from the underlying mapped file (see mmap(2)) or
zero-fill-on-demand pages for mappings without an underlying file.
For your own good, stay away from MADV_DONTNEED. Linux will not take this as a hint to throw pages away after writing them back, but to throw them away immediately. This is not considered a bug, but a deliberate decision.
Ironically, the reasoning is that the functionality of a non-destructive MADV_DONTNEED is already given by msync(MS_INVALIDATE|MS_ASYNC). MS_ASYNC, on the other hand, does not start I/O (in fact, it does nothing at all, following the reasoning that dirty page writeback works fine anyway), fsync always blocks, and sync_file_range may block if you exceed some obscure limit and is considered "extremely dangerous" by the documentation, whatever that means.
Either way, you must msync(MS_SYNC), or fsync (both blocking), or sync_file_range (possibly blocking) followed by fsync, or you will lose data with MADV_DONTNEED. If you cannot afford to possibly block, you have no choice, sadly, but to do this in another thread.
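The "do this in another thread" advice could look roughly like the sketch below. It assumes Linux, POSIX threads (older toolchains may need -pthread when linking) and a page-aligned chunk inside a MAP_SHARED file mapping. The names flush_job, flush_worker, flush_chunk_async and flush_example are made up for illustration, and spawning one thread per chunk is only the simplest possible hand-off; a real writer would more likely feed a queue served by one long-lived thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical job descriptor for one already-written chunk. */
struct flush_job {
    void  *addr;    /* page-aligned start of the chunk */
    size_t len;     /* chunk length in bytes */
    int    result;  /* 0 on success, -1 if msync() or madvise() failed */
};

/* Runs on the helper thread, so blocking here does not stall the writer. */
static void *flush_worker(void *arg)
{
    struct flush_job *job = arg;

    /* Write the dirty pages back first; this is the call that may block. */
    job->result = msync(job->addr, job->len, MS_SYNC);

    /* Only after writeback succeeded, let the kernel drop the pages. */
    if (job->result == 0)
        job->result = madvise(job->addr, job->len, MADV_DONTNEED);
    return NULL;
}

/* Start the flush; returns 0 on success, like pthread_create(). */
static int flush_chunk_async(struct flush_job *job, pthread_t *tid)
{
    assert((uintptr_t)job->addr % (uintptr_t)sysconf(_SC_PAGESIZE) == 0);
    return pthread_create(tid, NULL, flush_worker, job);
}

/* Example: write one page through a shared mapping, flush it in the
 * background, then check that the data is visible through the file
 * descriptor. Returns 0 on success, -1 on any failure. */
static int flush_example(void)
{
    char path[] = "/tmp/flush_exampleXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int ok = -1;
    if (ftruncate(fd, (off_t)page) == 0) {
        char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memset(p, 'x', page);

            struct flush_job job = { p, page, -1 };
            pthread_t tid;
            if (flush_chunk_async(&job, &tid) == 0) {
                /* A real writer would keep writing here and join later. */
                pthread_join(tid, NULL);
                char c = 0;
                if (job.result == 0
                    && pread(fd, &c, 1, (off_t)page - 1) == 1 && c == 'x')
                    ok = 0;
            }
            munmap(p, page);
        }
    }
    close(fd);
    return ok;
}
```

The job structure has to stay alive until the helper thread is done with it, which is why the example joins before leaving the scope that owns it.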
For recent Linux kernels (just tested on Linux 5.4), MADV_DONTNEED works as expected when the mapping is NOT private, e.g. mmap without the MAP_PRIVATE flag. I'm not sure what the behavior is on earlier versions of the Linux kernel.
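The difference between shared and private mappings can be checked with a small experiment. This is a minimal sketch, assuming Linux and a writable /tmp; the helper names make_test_file and byte_after_dontneed are hypothetical. It modifies a byte through the mapping, calls madvise(MADV_DONTNEED), and reports what the next read of that byte returns.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unlinked temporary file holding one page of 'O' bytes.
 * Returns an open descriptor, or -1 on error. */
static int make_test_file(void)
{
    char path[] = "/tmp/dontneed_demoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = malloc(page);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    memset(buf, 'O', page);
    ssize_t n = write(fd, buf, page);
    free(buf);
    if (n != (ssize_t)page) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the first page of fd with MAP_SHARED or MAP_PRIVATE, overwrite its
 * first byte with 'N', call madvise(MADV_DONTNEED), and return the byte
 * seen on the next access, or -1 on error. */
static int byte_after_dontneed(int fd, int map_flags)
{
    assert(map_flags == MAP_SHARED || map_flags == MAP_PRIVATE);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'N';  /* dirty the page through the mapping */
    int seen = -1;
    if (madvise(p, page, MADV_DONTNEED) == 0)
        seen = (unsigned char)p[0];  /* this access faults the page back in */

    munmap(p, page);
    return seen;
}
```

Calling byte_after_dontneed(fd, MAP_PRIVATE) first and byte_after_dontneed(fd, MAP_SHARED) second on the same descriptor should show the private write being discarded (the old file byte comes back) while the shared write survives. The experiment only shows that the shared data is still in the page cache; it says nothing about whether it has reached the disk.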
From release 4.15 of the Linux man-pages project's madvise manpage:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
Linux 4.5 added a new flag, MADV_FREE, with the same behavior as on BSD systems: it just marks pages as available to be freed if needed, but doesn't free them immediately, making it possible to reuse the memory range without incurring the cost of faulting the pages in again.
For why MADV_DONTNEED on a private mapping may result in zero-filled pages upon future access, watch Bryan Cantrill's rant, as mentioned in the comments on @Damon's answer. Spoiler: it comes from Tru64 UNIX.
As already mentioned, MADV_DONTNEED is not your friend. Since Linux 5.4, you can use MADV_COLD to tell the kernel it should page out that memory when there is memory pressure. This seems to be exactly what is wanted in this situation.
Read more here: https://lwn.net/Articles/793462/
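A minimal sketch of the MADV_COLD approach, assuming Linux and a libc whose headers may or may not define MADV_COLD yet. The wrapper name mark_chunk_cold is hypothetical. On kernels older than 5.4 the madvise() call is expected to fail, so the caller should treat a failure as "hint not applied" rather than as a fatal error.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hint that [addr, addr + len) is unlikely to be touched again soon.
 * addr is assumed to be page-aligned. Returns 0 on success, -1 with
 * errno set otherwise. The hint is non-destructive: the contents of
 * the range are left intact either way. */
static int mark_chunk_cold(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef MADV_COLD
    return madvise(addr, len, MADV_COLD);
#else
    /* Headers predate MADV_COLD: report the hint as unsupported. */
    errno = ENOSYS;
    return -1;
#endif
}
```

The writer could call mark_chunk_cold() on each chunk as soon as it has finished writing it. Unlike the MADV_DONTNEED route, no msync() is needed beforehand to keep the data safe, since the pages are only made more likely to be reclaimed.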