getting current file length / FileInfo.Length caching and stale information

Tags: c#, .net, file-io

I am keeping track of a folder of files and their file lengths; at least one of these files is still being written to.

I have to keep a continuously updated record of each file's length, which I use for other purposes.

The Update method is called every 15 seconds and updates a file's properties if its length differs from the length determined in the previous update.

The Update method looks something like this:

var directoryInfo = new DirectoryInfo(archiveFolder);
var archiveFiles = directoryInfo.GetFiles()
                                .OrderByDescending(f => f.CreationTimeUtc);
foreach (FileInfo fi in archiveFiles)
{
    // Check whether the file already existed in the previous update
    var origFileProps = cachedFiles.GetFileByName(fi.FullName);
    if (origFileProps != null && fi.Length == origFileProps.EndOffset)
    {
        // File length is unchanged
    }
    else
    {
        // Update the properties of this file:
        // set EndOffset of the file to the current file length
    }
}
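For context, the scheduling around this snippet looks roughly like the following; the 15-second interval is as stated above, but the class name, Start method, and Timer wiring here are just illustrative assumptions, not my actual code:

// Sketch only: schedule Update() every 15 seconds.
using System;
using System.Threading;

class ArchiveWatcher
{
    private Timer _timer;

    public void Start()
    {
        // Fire once immediately, then every 15 seconds thereafter
        _timer = new Timer(_ => Update(), null,
                           TimeSpan.Zero, TimeSpan.FromSeconds(15));
    }

    private void Update()
    {
        // ...the enumeration and comparison code shown above...
    }
}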

I am aware that DirectoryInfo.GetFiles() pre-populates many of the FileInfo properties, including Length - and this is fine as long as no caching is done between updates (cached information should not be older than 15 seconds).

I was under the assumption that each DirectoryInfo.GetFiles() call generates a new set of FileInfo objects, all populated with fresh information at that moment via the FindFirstFile/FindNextFile Win32 API. But this does not seem to be the case.

Very rarely, but inevitably, I run into situations where the reported length of a file that is being written to is not updated for 5, 10, or even 20 minutes at a time (testing is done on Windows Server 2008 x64, if that matters).

A current workaround is to call fi.Refresh() to force an update of each FileInfo. Internally this delegates to a GetFileAttributesEx Win32 API call to refresh the file information.
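In code, the workaround is a single extra call at the top of the loop - a sketch, assuming the same loop as above:

foreach (FileInfo fi in archiveFiles)
{
    // Force a re-read of the file metadata (GetFileAttributesEx under the
    // hood), bypassing whatever was captured during directory enumeration
    fi.Refresh();

    var origFileProps = cachedFiles.GetFileByName(fi.FullName);
    // ...same comparison as before, but fi.Length is now current...
}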

While the cost of forcing a refresh manually is tolerable, I would rather understand why I am getting stale information in the first place. When is the FileInfo information generated, and how does it relate to the DirectoryInfo.GetFiles() call? Is there a file I/O caching layer underneath that I don't fully grasp?

Asked Oct 19 '11 by BrokenGlass

1 Answer

Raymond Chen has now written a very detailed blog post about exactly this issue:

Why is the file size reported incorrectly for files that are still being written to?

In NTFS, file system metadata is a property not of the directory entry but rather of the file, with some of the metadata replicated into the directory entry as a tweak to improve directory enumeration performance. Functions like FindFirstFile report the directory entry, and by putting the metadata that FAT users were accustomed to getting "for free", they could avoid being slower than FAT for directory listings. The directory-enumeration functions report the last-updated metadata, which may not correspond to the actual metadata if the directory entry is stale.

Essentially it comes down to performance: the directory information gathered by DirectoryInfo.GetFiles() and the underlying FindFirstFile/FindNextFile Win32 API is cached so that directory listings in NTFS are no slower than they were in the old FAT. Accurate file size information can only be acquired by calling GetFileSize() on the file directly (in .NET, call Refresh() on the FileInfo, or construct a FileInfo from the file name directly) - or by opening and closing a stream on the file, which causes the updated file information to be propagated to the directory metadata. The latter case explains why the file size is immediately updated when the writing process closes the file.
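To make those alternatives concrete, a minimal sketch - the share flags are an assumption here, chosen because the file is presumed to still be open by the writer:

// 1. Refresh an existing FileInfo; delegates to GetFileAttributesEx:
fi.Refresh();
long length1 = fi.Length;

// 2. Construct a FileInfo from the path; its properties are populated
//    lazily on first access rather than from the enumeration snapshot:
long length2 = new FileInfo(fi.FullName).Length;

// 3. Open and close a stream on the file; as a side effect the updated
//    metadata is propagated to the directory entry:
using (var fs = new FileStream(fi.FullName, FileMode.Open,
                               FileAccess.Read, FileShare.ReadWrite))
{
    long length3 = fs.Length;
}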

This also explains why the problem seemingly did not show up on Windows Server 2003 - back then the file info was replicated more often, namely whenever the cache was flushed; this is no longer the case as of Windows Server 2008:

As for how often, the answer is a little more complicated. Starting in Windows Vista (and its corresponding Windows Server version which I don't know but I'm sure you can look up, and by "you" I mean "Yuhong Bao"), the NTFS file system performs this courtesy replication when the last handle to a file object is closed. Earlier versions of NTFS replicated the data while the file was open whenever the cache was flushed, which meant that it happened every so often according to an unpredictable schedule. The result of this change is that the directory entry now gets updated less frequently, and therefore the last-updated file size is more out-of-date than it already was.
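A rough way to observe this on Vista/Server 2008 and later - the path is a placeholder and the staleness is timing-dependent, so treat this as a sketch rather than a reliable repro:

string path = @"C:\archive\test.dat"; // placeholder

using (var writer = new FileStream(path, FileMode.Create,
                                   FileAccess.Write, FileShare.Read))
{
    writer.Write(new byte[1024], 0, 1024);
    writer.Flush(true); // data is on disk, but the directory entry may lag

    // Enumeration reads the (possibly stale) directory entry:
    var enumerated = new DirectoryInfo(@"C:\archive").GetFiles("test.dat")[0];
    Console.WriteLine("While open: " + enumerated.Length);
}

// Closing the last handle triggers the courtesy replication, so a fresh
// enumeration now reports the real size:
var afterClose = new DirectoryInfo(@"C:\archive").GetFiles("test.dat")[0];
Console.WriteLine("After close: " + afterClose.Length);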

The full article is very informative and well worth reading!

Answered Sep 17 '22 by BrokenGlass