
How to implement or emulate MADV_ZERO?

I would like to be able to zero out a range of a file memory-mapping without invoking any io (in order to efficiently sequentially overwrite huge files without incurring any disk read io).

Calling std::memset(ptr, 0, length) causes pages to be read from disk if they are not already in memory, even when entire pages are overwritten, which thrashes disk performance.

I would like to be able to do something like madvise(ptr, length, MADV_ZERO) which would zero out the range (similar to FALLOC_FL_ZERO_RANGE) in order to cause zero fill page faults instead of regular io page faults when accessing the specified range.

Unfortunately, MADV_ZERO does not exist, even though the corresponding flag FALLOC_FL_ZERO_RANGE does exist in fallocate and can be used with fwrite to achieve a similar effect, though without instant cross-process coherency.
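To make the fallocate route concrete, here is a minimal sketch, assuming Linux with glibc. The names zero_file_range and zero_file_range_demo are hypothetical helpers, not part of any API. The sketch falls back to writing zeros when the file system rejects FALLOC_FL_ZERO_RANGE, and that fallback, unlike the fallocate call with FALLOC_FL_KEEP_SIZE, would extend the file if the range reached past its end.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Zero the byte range [offset, offset + len) of fd without reading it first.
 * Returns 0 on success, -1 on failure with errno set. */
int zero_file_range(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
                  offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;

    /* Fallback for file systems without FALLOC_FL_ZERO_RANGE: write zeros. */
    static const char zeros[1 << 16];
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof zeros ? (size_t)len : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        offset += n;
        len -= n;
    }
    return 0;
}

/* Hypothetical self-check: fill a temporary file with 0xAA, zero the middle
 * 4 KiB, and verify what is read back. Returns 0 if the result is as expected. */
int zero_file_range_demo(void)
{
    char path[] = "/tmp/zero_range_XXXXXX";
    char buf[12288];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */

    memset(buf, 0xAA, sizeof buf);
    int ok = pwrite(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf
          && zero_file_range(fd, 4096, 4096) == 0
          && pread(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf;
    for (size_t i = 0; ok && i < sizeof buf; i++) {
        char expected = (i >= 4096 && i < 8192) ? 0 : (char)0xAA;
        ok = buf[i] == expected;
    }
    close(fd);
    return ok ? 0 : -1;
}
```

Calling zero_file_range(fd, offset, len) on the descriptor behind the mapping is the intended use; whether other processes' mappings see the zeros immediately is exactly the coherency concern raised above, and this sketch does not address it.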

One possible alternative, I would guess, is to use MADV_REMOVE. However, as far as I understand, that can cause file fragmentation and also blocks other operations until it completes, which makes me unsure of its long-term performance implications. My experience with Windows is that the similar FSCTL_SET_ZERO_DATA command can incur significant performance spikes when invoked.
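For reference, a minimal sketch of what the MADV_REMOVE route looks like from user space, assuming Linux and a writable MAP_SHARED file mapping. The names punch_mapped_range and punch_mapped_range_demo are hypothetical. The range must be page-aligned, and the call fails on file systems that cannot punch holes. It says nothing about the fragmentation or latency concerns above.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Punch a hole under a page-aligned range of a writable MAP_SHARED mapping.
 * Later reads of the range see zeros. Returns 0 on success, -1 with errno set. */
int punch_mapped_range(void *addr, size_t len)
{
    return madvise(addr, len, MADV_REMOVE);
}

/* Hypothetical self-check on a temporary file.
 * Returns 1 if the hole was punched and reads as zeros,
 *         0 if MADV_REMOVE is not supported on this file system,
 *        -1 on an unexpected failure. */
int punch_mapped_range_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 4 * page;
    char path[] = "/tmp/madv_remove_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd); /* the mapping keeps the file alive */
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);
    int result = 0;
    if (punch_mapped_range(p + page, 2 * page) == 0) {
        result = 1;
        for (size_t i = 0; i < len; i++) {
            int in_hole = i >= page && i < 3 * page;
            if (p[i] != (in_hole ? 0x00 : 0xAA))
                result = -1; /* hole not zero, or neighbours damaged */
        }
    }
    munmap(p, len);
    return result;
}
```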

My question is how one could implement or emulate MADV_ZERO for shared mappings, preferably in user mode?

1. /dev/zero/

I have read it being suggested to simply read /dev/zero into the selected range. Though I am not quite sure what "reading into the range" means and how to do it. Is it like a fread from /dev/zero into the memory range? Not sure how that would avoid a regular page fault on access?

For Linux, simply read /dev/zero into the selected range. The kernel already optimises this case for anonymous mappings.

If doing it in general turns out to be too hard to implement, I
propose MADV_ZERO should have this effect: exactly like reading
/dev/zero into the range, but always efficient.

EDIT: Following the thread a bit further it turns out that it will actually not work.

It does not do tricks when you are dealing with a shared mapping.
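As for what "reading /dev/zero into the range" means mechanically: presumably a plain read(2) from /dev/zero with the mapped range as the destination buffer. A minimal sketch follows, with read_zero_into and read_zero_into_demo as hypothetical names. The demo uses an anonymous mapping only to stay self-contained; as noted above, nothing here avoids ordinary page faults on a shared file mapping.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fill dst[0 .. len) with zeros by reading from /dev/zero.
 * Returns 0 on success, -1 on failure with errno set. */
int read_zero_into(void *dst, size_t len)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *p = dst;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) { /* error, or an unexpected end of file */
            if (n == 0)
                errno = EIO;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}

/* Hypothetical self-check: dirty an anonymous mapping, zero it through
 * /dev/zero, and verify. Returns 0 if every byte ends up zero. */
int read_zero_into_demo(void)
{
    size_t len = 1 << 16;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);
    int ok = read_zero_into(p, len) == 0;
    for (size_t i = 0; ok && i < len; i++)
        ok = p[i] == 0;
    munmap(p, len);
    return ok ? 0 : -1;
}
```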

2. MADV_REMOVE

One guess at implementing it in Linux (i.e. not in the user application, which is what I would prefer) would be to copy MADV_REMOVE, i.e. madvise_remove, and modify it to use FALLOC_FL_ZERO_RANGE instead of FALLOC_FL_PUNCH_HOLE. However, I am a bit over my head in guessing this, especially as I don't quite understand what the code around vfs_fallocate is doing:

// madvise.c
static long madvise_remove(...)
{
  ...
  /*
   * Filesystem's fallocate may need to take i_mutex.  We need to
   * explicitly grab a reference because the vma (and hence the
   * vma's reference to the file) can go away as soon as we drop
   * mmap_sem.
   */
  get_file(f); // Increment ref count.
  up_read(&current->mm->mmap_sem); // Release a read lock? Why?
  error = vfs_fallocate(f,
            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, // FALLOC_FL_ZERO_RANGE?
            offset, end - start);
  fput(f); // Decrement ref count.
  down_read(&current->mm->mmap_sem); // Acquire read lock. Why?
  return error;
}

asked Aug 31 '15 23:08 by ronag


1 Answer

You probably cannot do what you want (in user space, without hacking the kernel). Notice that writing zero pages might not incur physical disk IO because of the page cache.

You might want to replace a file segment with a hole in a sparse file (though this is not exactly what you want), but some file systems (e.g. VFAT) don't support holes or sparse files. See lseek(2) with SEEK_HOLE, and ftruncate(2).
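A minimal sketch of probing for holes with lseek(2) and SEEK_HOLE, assuming Linux with glibc; first_hole and first_hole_demo are hypothetical names. On a file system that does not track holes, the whole file is reported as data and the only "hole" is the implicit one at end of file.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Offset of the first hole at or after `from`, or -1 with errno set. */
off_t first_hole(int fd, off_t from)
{
    return lseek(fd, from, SEEK_HOLE);
}

/* Hypothetical self-check: create a 1 MiB file with ftruncate(2) and no
 * writes, then ask where its first hole starts.
 * Returns 0 if the answer is plausible (offset 0, or end of file),
 *         1 if SEEK_HOLE is not supported here,
 *        -1 on an unexpected failure. */
int first_hole_demo(void)
{
    const off_t size = 1 << 20;
    char path[] = "/tmp/seek_hole_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }

    errno = 0;
    off_t hole = first_hole(fd, 0);
    int unsupported = hole < 0 && errno == EINVAL;
    close(fd);

    if (unsupported)
        return 1;
    return (hole == 0 || hole == size) ? 0 : -1;
}
```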

answered Nov 04 '22 09:11 by Basile Starynkevitch