
Atomicity of `write(2)` to a local filesystem


Apparently POSIX states that

Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply. -- POSIX.1-2008

and

If two threads each call [the write() function], each call shall either see all of the specified effects of the other call, or none of them. -- POSIX.1-2008

My understanding of this is that when the first process issues a write(handle, data1, size1) and the second process issues write(handle, data2, size2), the writes can occur in any order but the data1 and data2 must be both pristine and contiguous.

But running the following code gives me unexpected results.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Report the failing call via perror() and abort. */
    static void die(const char *s)
    {
      perror(s);
      abort();
    }

    int main(void)
    {
      unsigned char buffer[3];
      const char *filename = "/tmp/atomic-write.log";
      int fd, i, j;
      pid_t pid;

      unlink(filename);
      /* XXX Adding O_APPEND to the flags cures it. Why? */
      fd = open(filename, O_CREAT|O_WRONLY/*|O_APPEND*/, 0644);
      if (fd < 0)
        die("open failed");
      /* Fork 10 children that all inherit fd, i.e. the same open file description. */
      for (i = 0; i < 10; i++) {
        pid = fork();
        if (pid < 0)
          die("fork failed");
        else if (! pid) {
          /* Child: build a record of the form "-X\n" and write it 1000 times. */
          j = 3 + i % (sizeof(buffer) - 2);
          memset(buffer, i % 26 + 'A', sizeof(buffer));
          buffer[0] = '-';
          buffer[j - 1] = '\n';
          for (i = 0; i < 1000; i++)
            if (write(fd, buffer, j) != j)
              die("write failed");
          exit(0);
        }
      }
      /* Parent: reap all children. */
      while (wait(NULL) != -1)
        /* NOOP */;
      exit(0);
    }

I ran this on Linux and on Mac OS X 10.7.4. Checking the output with grep -a '^[^-]\|^..*-' /tmp/atomic-write.log shows that some writes are not contiguous or overlap (Linux), or are plainly corrupted (Mac OS X).
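For readers without grep at hand, the same check can be sketched in C. This is only an illustration of what the pattern tests: a line counts as damaged if it is non-empty and does not start with '-', or if a '-' appears anywhere after its first byte. The function name count_damaged_lines is hypothetical and not part of the original program.

```c
#include <stddef.h>

/* Count the lines that grep -a '^[^-]\|^..*-' would print:
 * non-empty lines not starting with '-', or lines with a '-'
 * somewhere after the first byte. */
size_t count_damaged_lines(const char *data, size_t len)
{
    size_t damaged = 0;
    size_t start = 0;

    while (start < len) {
        /* Find the end of the current line (newline or end of data). */
        size_t end = start;
        while (end < len && data[end] != '\n')
            end++;

        int bad = 0;
        if (end > start && data[start] != '-')
            bad = 1;                       /* matches ^[^-] */
        for (size_t k = start + 1; k < end && !bad; k++)
            if (data[k] == '-')
                bad = 1;                   /* matches ^..*- */

        damaged += bad;
        start = end + 1;                   /* skip the newline */
    }
    return damaged;
}
```

A clean log consisting only of records such as "-A\n" yields 0; two records fused into "-A-B\n" count as one damaged line.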

Adding the flag O_APPEND in the open(2) call fixes this problem. Nice, but I do not understand why. POSIX says

O_APPEND If set, the file offset shall be set to the end of the file prior to each write.

but this is not the problem here. My sample program never calls lseek(2), and the processes share the same file description and thus the same file offset.
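The shared offset can be observed directly. The following is a minimal sketch, not part of the original program: the function name offset_after_child_write and the path passed to it are hypothetical. A forked child writes 3 bytes through the inherited descriptor, and the parent, which never wrote or seeked, then asks for its own offset.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's file offset after a forked child wrote 3 bytes
 * through the inherited descriptor, or -1 on error. */
off_t offset_after_child_write(const char *path)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: fd refers to the same open file description as in the parent. */
        _exit(write(fd, "abc", 3) == 3 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    /* The parent never wrote or seeked, yet its offset has moved. */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return pos;
}
```

Because the offset lives in the open file description rather than in the descriptor, the call returns 3 in the parent.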

I have already read similar questions on Stack Overflow, but they still do not fully answer my question.

Atomic write on file from two process does not specifically address the case where the processes share the same file description (as opposed to the same file).

How does one programmatically determine if “write” system call is atomic on a particular file? says that

The write call as defined in POSIX has no atomicity guarantee at all.

But as cited above it does have some. And what’s more, O_APPEND seems to trigger this atomicity guarantee although it seems to me that this guarantee should be present even without O_APPEND.

Can you explain this behaviour further?

kmkaplan asked May 18 '12 10:05


2 Answers

man 2 write on my system sums it up nicely:

Note that not all file systems are POSIX conforming.

Here is a quote from a recent discussion on the ext4 mailing list:

Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call. This may cause read() to return data mixed from several different writes, which I do not think it is good approach. We might argue that application doing this is broken, but actually this is something we can easily do on filesystem level without significant performance issues, so we can be consistent. Also POSIX mentions this as well and XFS filesystem already has this feature.

This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

NPE answered Oct 10 '22 05:10


Edit: Updated Aug 2017 with latest changes in OS behaviours.

Firstly, O_APPEND or the equivalent FILE_APPEND_DATA on Windows means that increments of the maximum file extent (file "length") are atomic under concurrent writers. This is guaranteed by POSIX, and Linux, FreeBSD, OS X and Windows all implement it correctly. Samba also implements it correctly; NFS before v5 does not, as it lacks the wire format capability to append atomically. So if you open your file append-only, concurrent writes will not tear with respect to one another on any major OS unless NFS is involved.
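The effect of O_APPEND on where a write lands can be seen in a single process. This is a small sketch with a hypothetical function name and path: it writes 4 bytes, rewinds the offset to 0, and writes 2 more bytes. It shows only the "seek to end before each write" rule, not concurrency.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes 4 bytes, rewinds to offset 0, then writes 2 more bytes on a
 * descriptor opened with O_APPEND. Returns the final file size, or -1
 * on error. */
off_t size_after_rewound_append(const char *path)
{
    struct stat st;
    off_t size = -1;
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, "AAAA", 4) == 4 &&
        lseek(fd, 0, SEEK_SET) == 0 &&   /* rewind: ignored by the next write */
        write(fd, "BB", 2) == 2 &&       /* lands at end of file, not at 0 */
        fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}
```

With O_APPEND the file ends up 6 bytes long ("AAAABB"). Without the flag, the second write would overwrite the first two bytes and the size would stay at 4.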

This says nothing about whether reads will ever see a torn write though, and on that POSIX says the following about atomicity of read() and write() to regular files:

All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links ... [many functions] ... read() ... write() ... If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them. [Source]

and

Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. [Source]

but conversely:

This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control. [Source]

A safe interpretation of all three of these requirements would suggest that all writes overlapping an extent in the same file must be serialised with respect to one another and to reads such that torn writes never appear to readers.

A less safe, but still allowed interpretation could be that reads and writes only serialise with each other between threads inside the same process, and between processes writes are serialised with respect to reads only (i.e. there is sequentially consistent i/o ordering between threads in a process, but between processes i/o is only acquire-release).
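The "concurrency control" that the last POSIX quote recommends could, for cooperating processes, look like the sketch below. It is one possible approach rather than anything the standard prescribes: the helper name locked_write is hypothetical, and it takes a POSIX advisory record lock over the whole file around each write. These locks are owned per process, so they serialise forked children sharing a descriptor, but they do not exclude threads within one process and have no effect on writers that skip the lock.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to fd while holding an exclusive advisory lock on the
 * whole file. Returns the result of write(), or -1 if locking fails. */
ssize_t locked_write(int fd, const void *buf, size_t len)
{
    struct flock lk = {0};
    lk.l_type = F_WRLCK;     /* exclusive lock */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;            /* 0 means "through end of file" */

    /* Block until the lock is granted (may fail with EINTR on a signal). */
    if (fcntl(fd, F_SETLKW, &lk) < 0)
        return -1;

    ssize_t n = write(fd, buf, len);

    lk.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &lk) < 0)
        return -1;
    return n;
}
```

In the question's test program, replacing write(fd, buffer, j) with locked_write(fd, buffer, j) would be the natural place to try it.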

So how do popular OSes and filesystems perform on this? As the author of the proposed Boost.AFIO, an asynchronous filesystem and file i/o C++ library, I decided to write an empirical tester. The results are as follows for many threads in a single process.


No O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = 1 byte until and including 10.0.10240, from 10.0.14393 at least 1Mb, probably infinite as per the POSIX spec.

Linux 4.2.6 with ext4: update atomicity = 1 byte

FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.

O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = until and including 10.0.10240 up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in. Since 10.0.14393, at least 1Mb, probably infinite as per the POSIX spec.

Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite as per the POSIX spec. Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes. XFS certainly used to have custom locking, but it looks like recent Linux has finally fixed this problem in ext4.

FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.


So in summary, FreeBSD with ZFS and very recent Windows with NTFS are POSIX conforming. Very recent Linux with ext4 is POSIX conforming only with O_DIRECT.

You can see the raw empirical test results at https://github.com/ned14/afio/tree/master/programs/fs-probe. Note we test for torn offsets only on 512 byte multiples, so I cannot say if a partial update of a 512 byte sector would tear during the read-modify-write cycle.
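To give a feel for how such a probe can work, here is a much simplified sketch. It is not the fs-probe code: the names count_torn_reads and PROBE_BLOCK, the 4096-byte block size, and the single writer thread are all assumptions made for illustration. One thread keeps overwriting a block with a single repeated byte value while the caller reads the block back and counts reads that contain more than one value. Unlike the tester described above, it compares every byte rather than only 512-byte multiples.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

#define PROBE_BLOCK 4096   /* assumed block size for this sketch */

struct probe {
    int fd;
    atomic_int stop;
};

/* Writer thread: overwrite the block with one repeated byte value per write. */
static void *probe_writer(void *arg)
{
    struct probe *p = arg;
    unsigned char buf[PROBE_BLOCK];
    unsigned char fill = 0;

    while (!atomic_load(&p->stop)) {
        memset(buf, fill++, sizeof buf);
        if (pwrite(p->fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf)
            break;
    }
    return NULL;
}

/* Returns how many of `reads` full-block reads saw a mix of byte values
 * (a torn write), or -1 if the probe could not be set up. */
long count_torn_reads(const char *path, long reads)
{
    struct probe p;
    unsigned char buf[PROBE_BLOCK];
    pthread_t writer;
    long torn = 0;

    p.fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (p.fd < 0)
        return -1;
    atomic_init(&p.stop, 0);

    /* Make sure the block exists before the first read. */
    memset(buf, 0, sizeof buf);
    if (pwrite(p.fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf ||
        pthread_create(&writer, NULL, probe_writer, &p) != 0) {
        close(p.fd);
        return -1;
    }

    for (long r = 0; r < reads; r++) {
        if (pread(p.fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf)
            continue;                      /* skip short or failed reads */
        for (size_t k = 1; k < sizeof buf; k++)
            if (buf[k] != buf[0]) {
                torn++;
                break;
            }
    }

    atomic_store(&p.stop, 1);
    pthread_join(writer, NULL);
    close(p.fd);
    return torn;
}
```

The count is timing dependent and varies from run to run. A result of 0 does not prove atomicity, whereas any non-zero result shows that a reader observed a partially applied write. On some toolchains the program needs to be built with -pthread.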

Niall Douglas answered Oct 10 '22 04:10