What's the correct way to durably rename a file in a POSIX file system? Specifically wondering about fsyncs on the directories. (If this depends on the OS/FS, I'm asking about Linux and ext3/ext4).
Note: there are other questions on StackOverflow about durable renames, but AFAICT they don't address fsync-ing the directories (which is what matters to me - I'm not even modifying file data).
I currently have (in Python):
import os
dstdirfd = os.open(dstdirpath, os.O_DIRECTORY | os.O_RDONLY)
os.rename(os.path.join(srcdirpath, filename), os.path.join(dstdirpath, filename))
os.fsync(dstdirfd)
os.close(dstdirfd)
Specific questions:
Thanks in advance.
Unfortunately Dave’s answer is wrong.
Not every POSIX system even has durable storage. And if one does, it is still “allowed” for that storage to be hosed after a system crash. On such systems a no-op fsync() makes sense, and POSIX explicitly allows it. It is also legal for the file to be recoverable in the old directory, the new directory, both, or any other location. POSIX makes no guarantees about system crashes or file system recovery.
The real question should be:
How to do a durable rename on systems which support that through the POSIX API?
You need to fsync() both the source and the destination directory, because the minimum those fsync()s are supposed to do is persist what the source or destination directory should look like.
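As a rough illustration of the advice above, here is a minimal Python sketch that renames a file and then fsyncs both directories. The helper name durable_rename and the choice to sync the destination directory before the source are assumptions of this sketch, not something POSIX or the question prescribes. It also assumes a Linux-like system where a directory can be opened read-only and passed to fsync().

```python
import os


def durable_rename(src_dir, dst_dir, filename):
    """Move filename from src_dir to dst_dir, then fsync both directories.

    Hypothetical helper for illustration only. Whether the rename actually
    survives a crash still depends on the file system and the hardware.
    """
    src_path = os.path.join(src_dir, filename)
    dst_path = os.path.join(dst_dir, filename)

    # Open both directories first so a failure here happens before the rename.
    src_fd = os.open(src_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        dst_fd = os.open(dst_dir, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.rename(src_path, dst_path)
            # Persist the new entry, then the removal of the old entry.
            # The order is a choice made by this sketch.
            os.fsync(dst_fd)
            os.fsync(src_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Calling durable_rename("/some/src", "/some/dst", "data.bin") would move data.bin between the two (hypothetical) directories. If the file's contents were also modified, the file itself would need its own fsync() before the rename, which this sketch does not do.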
Does a fsync(destdirfd) also implicitly fsync the source directory?
Or might I end up with the file showing up in both directories after a power cycle (“crash”), i.e. it's impossible to guarantee a durably atomic move operation?
If I fsync the source directory instead of the destination directory, will that also implicitly fsync the destination directory?
Are there any useful related testing/debugging/learning tools (fault injectors, introspection tools, mock filesystems, etc.)?
For a real crash, no. By the way, a real crash goes beyond what the kernel can see. The hardware might reorder writes (and fail to write everything), corrupting the filesystem. Ext4 is better prepared for this, because it enables write barriers (a mount option) by default (ext3 does not) and can detect corruption with journal checksums (also a mount option).
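Since barriers and journal checksums are mount options, it can help to look at how the file system holding a directory is actually mounted. The following is a small Linux-only sketch that reads /proc/mounts; the function name mount_options and its mounts_file parameter are made up for this example. Note that options left at their default are often not listed, so a missing "barrier" entry does not by itself mean barriers are off.

```python
import os


def mount_options(path, mounts_file="/proc/mounts"):
    """Return (mount_point, fs_type, options) for the mount containing path.

    Hypothetical helper: picks the longest mount point that is a prefix of
    the resolved path. Returns None if nothing matches.
    """
    real = os.path.realpath(path)
    best = None
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue
            # /proc/mounts escapes spaces in mount points as \040.
            mount_point = fields[1].replace("\\040", " ")
            prefix = mount_point.rstrip("/") + "/"
            if real == mount_point or real.startswith(prefix):
                # ">=" lets a later mount on the same point win.
                if best is None or len(mount_point) >= len(best[0]):
                    best = (mount_point, fields[2], fields[3].split(","))
    return best
```

For example, mount_options(dstdirpath) returns a tuple whose last element is the list of options, which can then be checked for entries such as "nobarrier" or "journal_checksum".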
And for learning: find out if both changes are somehow linked in the journal! :-P
POSIX defines that the rename function must be atomic.
So if you rename(A, B), under no circumstances should you ever see a state with the file in both directories or neither directory. There will always be exactly one, no matter what you do with fsync() or whether the system crashes.
But that doesn't solve the problem of making sure the rename() operation is durable. POSIX answers this question:
If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.
So if you fsync() a directory, pending rename operations must be transferred to disk by the time this returns. fsync() of either directory should be sufficient because atomicity of the rename() operation would require that both directories' changes be synced atomically.
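The quoted guarantee only applies when _POSIX_SYNCHRONIZED_IO is defined. As a hedged sketch, Python can ask the system at run time through os.sysconf; the function name synchronized_io_supported is an invention of this example. A positive result only means the system claims support for the option. It says nothing about what the file system or the disk does underneath.

```python
import os


def synchronized_io_supported():
    """Report whether the system claims _POSIX_SYNCHRONIZED_IO support.

    Returns True or False when the value can be queried, or None when this
    platform's Python does not expose the name or the query fails.
    """
    if "SC_SYNCHRONIZED_IO" not in os.sysconf_names:
        return None
    try:
        value = os.sysconf("SC_SYNCHRONIZED_IO")
    except (OSError, ValueError):
        return None
    # sysconf reports -1 for an unsupported option.
    return value > 0
```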
Finally, in contrast to the claim in the blog post mentioned in another answer, the rationale for this explains the following:
The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
A system that claimed to be POSIX compliant and that considered it correct behavior (i.e. not a bug or hardware failure) to complete an fsync() and not persist those changes across a system crash would have to be deliberately misrepresenting itself with respect to the spec.
(updated with additional info re: Linux-specific vs. portable behavior)