I am keeping track of a folder of files and their file lengths; at least one of these files is still being written to. I have to keep a continuously updated record of each file's length, which I use for other purposes. The Update method is called every 15 seconds and updates a file's properties if its length differs from the length determined in the previous update.
The update method looks something like this:
var directoryInfo = new DirectoryInfo(archiveFolder);
var archiveFiles = directoryInfo.GetFiles()
                                .OrderByDescending(f => f.CreationTimeUtc);

foreach (FileInfo fi in archiveFiles)
{
    // Check whether the file already existed in the previous update
    var origFileProps = cachedFiles.GetFileByName(fi.FullName);
    if (origFileProps != null && fi.Length == origFileProps.EndOffset)
    {
        // File length is unchanged - nothing to do
    }
    else
    {
        // Update the cached properties of this file:
        // set its EndOffset to the current file length
    }
}
I am aware that DirectoryInfo.GetFiles() pre-populates many of the FileInfo properties, including Length, and that is fine as long as no caching happens between updates (cached information should not be older than 15 seconds). I was under the assumption that each DirectoryInfo.GetFiles() call generates a new set of FileInfos, all populated with fresh information right then via the FindFirstFile/FindNextFile Win32 API. But this does not seem to be the case.
Very rarely, but eventually without fail, I run into situations where the length of a file that is being written to is not updated for 5, 10, or even 20 minutes at a time (testing is done on Windows Server 2008 x64, if that matters).
My current workaround is to call fi.Refresh() to force an update of each FileInfo. Internally this seems to delegate to a GetFileAttributesEx Win32 API call to update the file information.
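Applied to the update loop above, the workaround looks roughly like this (reusing archiveFiles and the cachedFiles helper from the snippet above):

foreach (FileInfo fi in archiveFiles)
{
    // Refresh() delegates to GetFileAttributesEx, so Length is now read
    // from the file itself rather than from a cached directory entry
    fi.Refresh();

    var origFileProps = cachedFiles.GetFileByName(fi.FullName);
    if (origFileProps == null || fi.Length != origFileProps.EndOffset)
    {
        // Update the cached properties with the refreshed length
    }
}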
While the cost of forcing a refresh manually is tolerable, I would rather understand why I am getting stale information in the first place. When is the FileInfo information generated, and how does it relate to the DirectoryInfo.GetFiles() call? Is there a file I/O caching layer underneath that I don't fully grasp?
Raymond Chen has now written a very detailed blog post about exactly this issue:
Why is the file size reported incorrectly for files that are still being written to?
In NTFS, file system metadata is a property not of the directory entry but rather of the file, with some of the metadata replicated into the directory entry as a tweak to improve directory enumeration performance. Functions like FindFirstFile report the directory entry, and by putting the metadata that FAT users were accustomed to getting "for free", they could avoid being slower than FAT for directory listings. The directory-enumeration functions report the last-updated metadata, which may not correspond to the actual metadata if the directory entry is stale.
Essentially it comes down to performance: the directory information gathered from DirectoryInfo.GetFiles(), and from the FindFirstFile/FindNextFile Win32 APIs underneath, is cached so that acquiring directory information in NTFS is no slower than it was in the old FAT file system. Accurate file size information can only be acquired by calling GetFileSize() on the file directly (in .NET, call Refresh() on the FileInfo or construct a FileInfo from the file name directly), or by opening and closing a file stream, which causes the updated file information to be propagated to the directory metadata cache. The latter case explains why the file size is immediately updated when the writing process closes the file.
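For illustration, a minimal sketch of both options (the path variable is assumed; FileShare.ReadWrite lets us open a file that another process is still writing):

// Option 1: a FileInfo constructed from the path reads its attributes
// lazily, so the first access to Length queries the file itself
var freshInfo = new FileInfo(path);
long currentLength = freshInfo.Length;

// Option 2: opening and closing a stream on the file causes the
// up-to-date metadata to be propagated to the directory entry
using (var stream = new FileStream(path, FileMode.Open,
                                   FileAccess.Read, FileShare.ReadWrite))
{
    currentLength = stream.Length;
}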
This also explains why the problem seemingly did not show up on Windows Server 2003: back then the file information was replicated more often, namely whenever the cache was flushed. This is no longer the case on Windows Server 2008:
As for how often, the answer is a little more complicated. Starting in Windows Vista (and its corresponding Windows Server version which I don't know but I'm sure you can look up, and by "you" I mean "Yuhong Bao"), the NTFS file system performs this courtesy replication when the last handle to a file object is closed. Earlier versions of NTFS replicated the data while the file was open whenever the cache was flushed, which meant that it happened every so often according to an unpredictable schedule. The result of this change is that the directory entry now gets updated less frequently, and therefore the last-updated file size is more out-of-date than it already was.
The full article is very informative and well worth reading!