Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fallocate vs posix_fallocate

Tags:

c++

c

posix

I am debating which function to use between posix_fallocate and fallocate. posix_fallocate writes a file right away (initializes the characters to NULL). However, fallocate does not change the file size (when using FALLOC_FL_KEEP_SIZE flag). Based on my experimentation, it seems that fallocate does not write NULL or zero characters to the file.

Can someone please comment based on your experience? Thanks for your time.

like image 720
Bill Avatar asked Dec 28 '12 00:12

Bill


3 Answers

I take it you didn't look at the documentation that says

   The mode argument determines the operation to be performed on the given range.
   Currently only one flag is supported for mode:

   FALLOC_FL_KEEP_SIZE
          This flag allocates and initializes to zero the disk space within the
          range specified by offset and len.  After a successful call, subsequent
          writes into this range are guaranteed not to fail because of lack of
          disk space.  Preallocating zeroed blocks beyond the end of the file is
          useful for optimizing append workloads.  Preallocating blocks does not
          change the file size (as reported by stat(2)) even if it is less than
          offset+len.

   If FALLOC_FL_KEEP_SIZE flag is not specified in mode, the default behavior is
   almost same as when this flag is specified.  The only difference is that on
   success, the file size will be changed if offset + len is greater than the
   file size.  This default behavior closely resembles the behavior of the
   posix_fallocate(3) library function, and is intended as a method of optimally
   implementing that function.

The man page for posix_fallocate() doesn't appear to have the same thing mentioned, but instead, looking at the source here, it seems to write each block of the file (line 88).

man fallocate man posix_fallocate

like image 45
Mats Petersson Avatar answered Nov 16 '22 00:11

Mats Petersson


Having files that take up more storage space than their displayed length is not usual, so unless you have a good reason for doing that (e.g. you want to use the file length to keep track of how far a download got, for the purpose of resuming it), best to use the default fallocate(2) behaviour. (without FALLOC_FL_KEEP_SIZE). This is the same semantics as posix_fallocate(3).

The man page for fallocate(2) even says that its default behaviour (no flags) is intended as an optimal way of implementing posix_fallocate(3), and points to that as a portable way to allocate space.

The original question says something about writing zeros to the file. None of these calls write anything but metadata. If you read from space that's been preallocated but not yet written, you'll get zeros (not whatever was in that disk space previously, that would be a big security hole). You can only read up to the end of a file (the length, set by fallocate, ftruncate, or various other ways), so if you have a zero-length file and fallocate with FALLOC_FL_KEEP_SIZE, then you can't read anything. Nothing to do with preallocation, just file size semantics.

So if you're fine with the POSIX semantics, use it, because it's more portable. Every GNU/Linux system will support posix_fallocate(3), but so will some other systems.

However, thanks to POSIX semantics, it's not that simple. If you use it on a filesystem that doesn't support preallocation, it will still succeed, but do so by falling back to actually writing a zero in every block of the file.

Test program:

#include <fcntl.h>
int main() {
    int fd = open("foo", O_RDWR|O_CREAT, 0666);
    if (fd < 0) return 1;
    return posix_fallocate(fd, 0, 400000);
}

on XFS

$ strace ~/src/c/falloc
...
open("foo", O_RDWR|O_CREAT, 0666) = 3
fallocate(3, 0, 0, 400000)              = 0
exit_group(0)                           = ?

on a fat32 flash drive:

open("foo", O_RDWR|O_CREAT, 0666) = 3
fallocate(3, 0, 0, 400000)              = -1 EOPNOTSUPP (Operation not supported)
fstat(3, {st_mode=S_IFREG|0755, st_size=400000, ...}) = 0
fstatfs(3, {f_type="MSDOS_SUPER_MAGIC", f_bsize=65536, f_blocks=122113, f_bfree=38274, f_bavail=38274, f_files=0, f_ffree=0, f_fsid={2145, 0}, f_namelen=1530, f_frsize=65536}) = 0
pread(3, "\0", 1, 6783)                 = 1
pwrite(3, "\0", 1, 6783)                = 1
pread(3, "\0", 1, 72319)                = 1
pwrite(3, "\0", 1, 72319)               = 1
pread(3, "\0", 1, 137855)               = 1
pwrite(3, "\0", 1, 137855)              = 1
pread(3, "\0", 1, 203391)               = 1
pwrite(3, "\0", 1, 203391)              = 1
pread(3, "\0", 1, 268927)               = 1
pwrite(3, "\0", 1, 268927)              = 1
pread(3, "\0", 1, 334463)               = 1
pwrite(3, "\0", 1, 334463)              = 1
pread(3, "\0", 1, 399999)               = 1
pwrite(3, "\0", 1, 399999)              = 1
exit_group(0)                           = ?

It does avoid the reads if the file wasn't yet that long, but writing every block is still horrible.

If you want something simple, I'd still just go with posix_fallocate. There's a FreeBSD man page for it, and it's specified by POSIX, so every POSIX-compliant system provides it. The one drawback is that it will be horrible with glibc on a filesystem that doesn't support preallocation. See for example https://plus.google.com/+AaronSeigo/posts/FGtXM13QuhQ. For a program that works with large files, (e.g. torrents), this could be really bad.

You can thank POSIX semantics for requiring glibc to do this, as it doesn't define an error code for "the filesystem doesn't support preallocation". http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html. It also guarantees that if the call succeeds, subsequent writes into the allocated region won't fail due to lack of disk space. So the posix design doesn't provide a way to handle the case where the caller cares about efficiency / performance / fragmentation, rather than disk space guarantees. This forces the POSIX implementation to do the read-write loop, rather than leaving that as an option for callers that need a disk-space guarantee. Thanks POSIX...

I don't know whether non-GNU implementations of posix_fallocate similarly fall back to extremely slow read-write behaviour when the filesystem doesn't support preallocation. (FreeBSD, Solaris?). Apparently OS X (Darwin) doesn't implement posix_fallocate, unless it's very recent.

If you're looking to support preallocation across a lot of platforms, but without falling back to read-then-write if the OS has a way to just attempt preallocation, you have to use whatever platform-specific method is available. e.g. check out https://github.com/arvidn/libtorrent/blob/master/src/file.cpp

search for file::set_size. It has several ifdeffed blocks depending on what the compile target supports, starting with windows code to load DLLs and do stuff there, then fcntl F_PREALLOCATE, or fcntl F_ALLOCSP64, then Linux fallocate(2), then falls back to using posix_fallocate. Also, found this 2007 list post for OS X Darwin: http://lists.apple.com/archives/darwin-dev/2007/Dec/msg00040.html

like image 62
6 revs, 2 users 99% Avatar answered Nov 15 '22 23:11

6 revs, 2 users 99%


At least one bit of information is from the fallocate(2) man page:

int fallocate(int fd, int mode, off_t offset, off_t len);

DESCRIPTION
   This is a nonportable, Linux-specific system call.

Though the system call documentation does not say it, the fallocate(1) program man page says:

As of the Linux Kernel v2.6.31, the fallocate system call is supported
by the btrfs, ext4, ocfs2, and xfs filesystems.

This makes sense to me, as the NTFS, FAT, CDFS, and most other common file systems do not have an internal mechanism on disk to support the call. I presume support for those would be buffered by the kernel and the setting would not persist across system boots.

like image 44
wallyk Avatar answered Nov 15 '22 23:11

wallyk