I'm writing a stress test suite for distributed file systems accessed over NFS.
In some cases, when one process deletes a file while another process attempts to read from it, I get a "Stale file handle" error (errno 116).
Is that kind of error expected and acceptable in such a race condition?
The test works as follows:
The file exists, as a successful stat operation shows:
controller_debug.log.2:2016-10-26 15:02:30,156;INFO - [LG-E27A-LNX:0xa]: finished 640522b4d94c453ea545cb86568320ca, result: success | stat | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.156
Process 0x1 on client CLIENT-A completed a successful delete:
controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-A:0x1]: finished 5f5dfe6a06de495f851745a78857eec1, result: success | delete | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.161
3 milliseconds later, process 0xb on client CLIENT-B failed a "read" op due to "Stale file handle":
controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-B:0xb]: finished e84e2064ead042099310af1bd44821c0, result: failed | read | /mnt/DIRSPLIT-node0.b27-1/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | [errno:116] | Stale file handle | 142 | data: {} | 2016/10/26 15:02:30.160
controller_debug.log.2:2016-10-26 15:02:30,164;ERROR - Operation read FAILED UNEXPECTEDLY on File JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 due to Stale file handle
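For reference, here is a simplified sketch of the racing operations. The path and worker names are illustrative only, not the actual test code, and in the real run each worker is a separate process on its own NFS client (CLIENT-A and CLIENT-B) mounting the same export:

import errno
import os

TEST_FILE = "/mnt/nfs/testdir/somefile"  # hypothetical path on the NFS mount


def delete_worker(path):
    """Runs on CLIENT-A: confirm the file exists, then delete it."""
    os.stat(path)    # "result: success | stat" in the log above
    os.unlink(path)  # "result: success | delete"


def read_worker(path):
    """Runs on CLIENT-B: read via a file handle cached from an earlier LOOKUP."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as exc:
        if exc.errno == errno.ESTALE:  # errno 116, "Stale file handle"
            print("read failed: Stale file handle")
            return None
        raise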
Thanks
A stale file handle is refreshed when the process reopens the file: reopening re-resolves the pathname and obtains a new handle for the file's new inode, if one exists. In most cases the process must do this itself; otherwise it may have to be restarted.
What causes an NFS stale file handle error? Any change to a file's underlying inode, disk device, or inode generation on the NFS server invalidates the file handle the client has cached.
A filehandle becomes stale whenever the file or directory referenced by the handle is removed by another host, while your client still holds an active reference to the object.
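As a rough illustration of the "reopen to refresh" idea (a sketch in Python, not part of any NFS client code): retrying the open causes the pathname to be looked up again, so the client gets a fresh handle if a file still exists at that path, or a plain ENOENT if it was deleted.

import errno


def read_with_estale_retry(path, retries=1):
    """Read `path`; on ESTALE, close and reopen so the pathname is
    re-resolved and a new file handle is fetched from the server."""
    for attempt in range(retries + 1):
        try:
            with open(path, "rb") as f:  # the `with` block closes the stale descriptor
                return f.read()
        except OSError as exc:
            if exc.errno == errno.ESTALE and attempt < retries:
                continue  # the next open() triggers a fresh LOOKUP
            raise  # retries exhausted, or a different error (e.g. ENOENT)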
This is totally expected. The NFS specification is explicit about the use of file handles after an object (be it a file or a directory) has been deleted; Section 4 addresses this directly. For example:
The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.
This is such a common problem that it even has its own entry in section A.10 of the NFS FAQ, which lists this as one common cause of ESTALE errors:
The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
The expected resolution is that your client app must close and reopen the file to see what has happened. Or, as the FAQ says:
... to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
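For a stress test, then, one reasonable policy is to classify ESTALE (or ENOENT, if the path is re-resolved) on a read as an expected outcome whenever another worker is known to have deleted the file in the same window, and to flag it as a real failure otherwise. A sketch with assumed names, not part of the original harness:

import errno

# Errors a reader may legitimately see when the file was deleted underneath it.
ACCEPTABLE_READ_ERRNOS = {errno.ESTALE, errno.ENOENT}


def classify_read_result(exc, concurrently_deleted):
    """Return 'success', 'expected-failure', or 'unexpected-failure'."""
    if exc is None:
        return "success"
    if (isinstance(exc, OSError)
            and exc.errno in ACCEPTABLE_READ_ERRNOS
            and concurrently_deleted):
        return "expected-failure"
    return "unexpected-failure"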