Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

rename() atomicity and NFS?

Tags:

linux

atomic

nfs

With reference to: Is rename() atomic?

I'm asking something similar, but not quite the same, because what I want to know is is it safe to rely on the atomicty of rename() when using NFS?

Here's a scenario I'm dealing with - I have an 'index' file that must always be present.

So:

  • Client creates a new file
  • Client renames new file over 'old' index file.

Separate client:

  • Reads index file
  • refers to on disk structure based on the index.

This is making the assumption that rename() being atomic means - there will always be an 'index' file (although, it might be an out of date version, because caching and timing)

However the problem I'm hitting is this - that this is happening on NFS - and working - but several of my NFS clients are occasionally reporting "ENOENT" - no such file or directory. (e.g. in hundreds operations happening at 5m intervals, we get this error every couple of days).

So what I'm hoping is whether someone can enlighten me - should it actually be impossible to get 'ENOENT' in this scenario?

The reason I'm asking is this entry in RFC 3530:

The RENAME operation must be atomic to the client.

I'm wondering if that means just the client issuing the rename, and not the client viewing the directory? (I'm ok with a cached/out of date directory structure, but the point of this operation is that this file will always be 'present' in some form)

Sequence of operations (from the client performing the write operation) is:

21401 14:58:11 open("fleeg.ext", O_RDWR|O_CREAT|O_EXCL, 0666) = -1 EEXIST (File exists) <0.000443>
21401 14:58:11 open("fleeg.ext", O_RDWR) = 3 <0.000547>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000012>
21401 14:58:11 fadvise64(3, 0, 572, POSIX_FADV_RANDOM) = 0 <0.000008>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=1, len=1}) = 0 <0.001994>
21401 14:58:11 open("fleeg.ext.i", O_RDWR|O_CREAT, 0666) = 4 <0.000538>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 close(4)                 = 0 <0.000011>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=572, ...}) = 0 <0.000007>
21401 14:58:11 open("fleeg.ext.i", O_RDONLY) = 4 <0.000577>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 42, POSIX_FADV_RANDOM) = 0 <0.000006>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000007>
21401 14:58:11 read(4, "\3PAX\1\0\0O}\270\370\206\20\225\24\22\t\2\0\203RD\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000552>
21401 14:58:11 close(4)                 = 0 <0.000013>
21401 14:58:11 fcntl(3, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=466, len=68}) = 0 <0.001418>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21@\203\244I\240\333\272\252d\316\261\3770\361#\222\200\313\224&J\253\5\354\217-\256LA\345\253"..., 38, 534) = 38 <0.000010>
21401 14:58:11 pread(3, "\21\"\30\361\241\223\271\256\317\302\363\262F\276]\260\241-x\227b\377\205\356\252\236\211\37\17.\216\364"..., 68, 466) = 68 <0.000010>
21401 14:58:11 pread(3, "\21\302d\344\327O\207C]M\10xxM\377\2340\0319\206k\201N\372\332\265R\242\313S\24H"..., 62, 300) = 62 <0.000011>
21401 14:58:11 pread(3, "\21\362cv'\37\204]\377q\362N\302/\212\255\255\370\200\236\350\2237>7i`\346\271Cy\370"..., 104, 362) = 104 <0.000010>
21401 14:58:11 pwrite(3, "\21\302\3174\252\273.\17\v\247\313\324\267C\222P\303\n~\341F\24oh/\300a\315\n\321\31\256"..., 127, 572) = 127 <0.000012>
21401 14:58:11 pwrite(3, "\21\212Q\325\371\223\235\256\245\247\\WT$\4\227\375[\\\3263\222\0305\0\34\2049A;2U"..., 68, 699) = 68 <0.000009>
21401 14:58:11 pwrite(3, "\21\262\20Kc(!.\350\367i\253hkl~\254\335H\250.d\0036\r\342\v\242\7\255\214\31"..., 38, 767) = 38 <0.000009>
21401 14:58:11 fsync(3)                 = 0 <0.001007>
21401 14:58:11 fstat(3, {st_mode=S_IFREG|0600, st_size=805, ...}) = 0 <0.000009>
21401 14:58:11 open("fleeg.ext.i.tmp", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4 <0.001813>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000007>
21401 14:58:11 fadvise64(4, 0, 0, POSIX_FADV_RANDOM) = 0 <0.000007>
21401 14:58:11 write(4, "\3PAX\1\0\0qT2\225\226\20\225\24\22\t\2\0\205;D\0\0\0\0\17\r\0\2\0\n"..., 42) = 42 <0.000012>
21401 14:58:11 stat("fleeg.ext.i", {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000011>
21401 14:58:11 fchmod(4, 0100600)       = 0 <0.002517>
21401 14:58:11 fstat(4, {st_mode=S_IFREG|0600, st_size=42, ...}) = 0 <0.000008>
21401 14:58:11 close(4)                 = 0 <0.000011>
21401 14:58:11 rename("fleeg.ext.i.tmp", "fleeg.pax.i") = 0 <0.001201>
21401 14:58:11 close(3)                 = 0 <0.000795>
21401 14:58:11 munmap(0x7f1475cce000, 4198400) = 0 <0.000177>
21401 14:58:11 munmap(0x7f14760cf000, 4198400) = 0 <0.000173>
21401 14:58:11 futex(0x7f147cbcb908, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000010>
21401 14:58:11 exit_group(0)            = ?
21401 14:58:11 +++ exited with 0 +++

NB - Paths and files renamed in the above for consistency. fleeg.ext is the data file, and fleeg.ext.i is the index. During this process - the fleeg.ext.i file is being overwritten (by the .tmp file), which is why the belief is that there should always be a file at that path (either the old one, or the new that's just overwritten it).

On the reading client the PCAP looks like LOOKUP NFS call is what's failing:

124   1.375777  10.10.41.35 -> 10.10.41.9   NFS 226   LOOKUP    fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
125   1.375951   10.10.41.9 -> 10.10.41.35  NFS 186 5347  LOOKUP  0775 Directory  V3 LOOKUP Reply (Call In 124) Error: NFS3ERR_NOENT
126   1.375975  10.10.41.35 -> 10.10.41.9   NFS 226   LOOKUP    fleeg.ext.i V3 LOOKUP Call, DH: 0x6fbbff3a/fleeg.ext.i
127   1.376142   10.10.41.9 -> 10.10.41.35  NFS 186 5347  LOOKUP  0775 Directory  V3 LOOKUP Reply (Call In 126) Error: NFS3ERR_NOENT
like image 607
Sobrique Avatar asked Dec 28 '16 12:12

Sobrique


2 Answers

I think the problem is not in the RENAME not being atomic, but in the fact that OPENing a file via NFS is not atomic.

NFS uses Filehandles; in order to do something to a file, a client first obtains a Filehandle through a LOOKUP, then the obtained Filehandle is used to perform the other requests. A minimum of two datagram is required, and the time between them can, in particular circumstances, be quite "large".

What is happening to you, I suppose, is that a client (client1) performs a LOOKUP; just after that, the LOOKUPed file gets erased as a result of RENAME (by client2); the Filehandle client1 has is no more valid, because it refers to an inode, not to a named path.

The reason for all this is that NFS aims to be stateless. More info in this PDF: http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

In pages 6 and 8 this behaviour is well explained.

like image 172
linuxfan says Reinstate Monica Avatar answered Oct 14 '22 14:10

linuxfan says Reinstate Monica


Should it actually be impossible to get ENOENT in this scenario?

It is quite possible. The RFC 3530 says:

The operation is required to be atomic to the client.

That most likely means it must be atomic to the client invoking this operation, not all clients.

And further on it says:

If the target directory already contains an entry with the name... the existing target is removed before the rename occurs.

This is the reason other clients get ENOENT sometimes.

In other words, rename is not atomic on NFS.

like image 27
Maxim Egorushkin Avatar answered Oct 14 '22 14:10

Maxim Egorushkin