
"Stale file handle" error when a process tries to read a file that another process has already deleted

Tags: python, linux, nfs

I'm writing a stress test suite for testing distributed file systems over NFS.

In some cases, when one process deletes a file while another process attempts to read from it, I get a "Stale file handle" error (errno 116).

Is this kind of error expected and acceptable in such a race condition?

The test works as follows:

  1. Start x client machines
  2. Each client machine runs y processes
  3. Each process can perform any of these file operations: stat/read/delete/open
  4. These file ops use standard Python calls: os.stat/read/os.remove/open
  5. All files are empty (0 bytes of data)
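For illustration only, one such operation might be run and recorded roughly as follows. This is a minimal sketch, not the actual harness: the function name `run_op` and the shape of the returned dict are assumptions made up for this example.

```python
import os

def run_op(op, path):
    """Run one file operation and report the outcome as a small dict.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    try:
        if op == "stat":
            os.stat(path)
        elif op == "read":
            with open(path, "rb") as f:
                f.read()
        elif op == "open":
            open(path, "rb").close()
        elif op == "delete":
            os.remove(path)
        else:
            raise ValueError(f"unknown op: {op}")
    except OSError as exc:
        # On NFS, a read racing a delete can surface here as errno 116 (ESTALE).
        return {"op": op, "result": "failed", "errno": exc.errno, "error": exc.strerror}
    return {"op": op, "result": "success", "errno": None, "error": None}
```

Capturing `exc.errno` alongside the message is what lets a controller later decide whether a failure such as errno 116 was an acceptable outcome of a race.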

The file exists, as a successful stat operation shows:

controller_debug.log.2:2016-10-26 15:02:30,156;INFO - [LG-E27A-LNX:0xa]: finished 640522b4d94c453ea545cb86568320ca, result: success | stat | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.156

Process 0x1 on client CLIENT-A completed a successful delete:

controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-A:0x1]: finished 5f5dfe6a06de495f851745a78857eec1, result: success | delete | /JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | data: {} | 2016/10/26 15:02:30.161

3 milliseconds later, process 0xb on client CLIENT-B failed a "read" op with "Stale file handle":

controller_debug.log.2:2016-10-26 15:02:30,164;INFO - [CLIENT-B:0xb]: finished e84e2064ead042099310af1bd44821c0, result: failed | read | /mnt/DIRSPLIT-node0.b27-1/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 | [errno:116] | Stale file handle | 142 | data: {} | 2016/10/26 15:02:30.160
controller_debug.log.2:2016-10-26 15:02:30,164;ERROR - Operation read FAILED UNEXPECTEDLY on File JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41 due to Stale file handle

Thanks

Asked Oct 26 '16 12:10 by Samuel



1 Answer

This is entirely expected. The NFS specification is clear about the use of file handles after an object (file or directory) has been deleted; Section 4 addresses this directly. For example:

The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.

This is such a common problem that it has its own entry in section A.10 of the NFS FAQ, which says one common cause of ESTALE errors is that:

The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
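For a stress test like the one in the question, this suggests treating ESTALE as a legitimate outcome of a delete/read race rather than as an unexpected failure. A minimal sketch, assuming the harness has the raised `OSError` available; the names `EXPECTED_RACE_ERRNOS` and `classify_failure` are hypothetical, and whether ENOENT should also count is a policy choice for the test author:

```python
import errno

# Hypothetical policy: errors a read may legitimately hit when it races a delete.
# errno.ESTALE is 116 on Linux, matching the "[errno:116]" in the log above.
EXPECTED_RACE_ERRNOS = {errno.ESTALE, errno.ENOENT}

def classify_failure(exc):
    """Return 'expected_race' for errors a delete/read race can produce,
    and 'unexpected' for everything else."""
    if exc.errno in EXPECTED_RACE_ERRNOS:
        return "expected_race"
    return "unexpected"
```

With a check like this, the controller could log the read failure as expected whenever a delete of the same path was in flight, and reserve "FAILED UNEXPECTEDLY" for other errors.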

The expected resolution is that your client app must close and reopen the file to see what has happened. Or, as the FAQ says:

... to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
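In Python, the close-and-reopen recovery could look roughly like this. It is a sketch under assumptions, not a definitive implementation: `read_with_reopen` and its `retries` parameter are hypothetical names, and returning `None` for a deleted file is just one possible convention.

```python
import errno

def read_with_reopen(path, retries=1):
    """Read a file, reopening it by path if the handle turns out to be stale.

    Returns the file contents, or None if the file no longer exists.
    """
    attempt = 0
    while True:
        try:
            # Opening by path lets the NFS client resolve the name again;
            # the 'with' block closes the handle even if read() fails.
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            # The reopen shows what actually happened: the file was deleted.
            return None
        except OSError as exc:
            # Retry only on ESTALE, and only while retries remain.
            if exc.errno != errno.ESTALE or attempt >= retries:
                raise
            attempt += 1
```

In the scenario from the question, the file really has been deleted, so the retry would be expected to end in `FileNotFoundError` rather than a successful read.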

Answered Oct 11 '22 20:10 by Peter Brittain