
How should I poll a large number of files for changes?

I'd like to poll the file system for any changed, added or removed files or sub-directories. All changes should be detected quickly, but without putting pressure on the machine. The OS is Windows Vista or later, and the watched location is a local directory.

Typically I would resort to a FileSystemWatcher, but it caused problems with other programs watching the same location (notably Windows Explorer). I have also heard that FSW is not entirely reliable, even for local folders and with a large buffer.

The main issue I have is that the number of files and directories may be very large (think 7 digits). Simply checking all files every second noticeably affected my machine.

My next idea was to check a different part of the tree each second to reduce the overall impact, possibly combined with a heuristic such as re-checking frequently changed files more often.
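
Roughly what I have in mind is something like this (just a sketch with made-up names, not tested):

    // buckets is filled once up front by distributing all directories
    // round-robin; each tick only one bucket is re-scanned.
    private readonly List<List<string>> buckets = new List<List<string>>();
    private int nextBucket;

    private void PollTick(Dictionary<string, DateTime> lastSeen, Action<string> onChanged)
    {
        foreach (string dir in buckets[nextBucket])
        {
            foreach (string file in Directory.EnumerateFiles(dir))
            {
                DateTime writeTime = File.GetLastWriteTimeUtc(file);
                DateTime previous;
                if (!lastSeen.TryGetValue(file, out previous) || previous != writeTime)
                {
                    lastSeen[file] = writeTime;
                    onChanged(file);   // changed or newly added file
                }
            }
        }
        nextBucket = (nextBucket + 1) % buckets.Count;
    }

The hot-file heuristic would be a separate, smaller set of paths that gets checked on every tick.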

I'm wondering if there are established patterns for this kind of problem, or if anyone has experience with this situation.

asked Aug 26 '11 by mafu

2 Answers

We have implemented a similar feature using C#. The FileSystemWatcher was inefficient with large directory trees.

Our alternative was FSNodes, a struct we created ourselves, based on the following Windows API calls:

    // MAX_PATH and MAX_ALTERNATE (the length of the 8.3 short name) are needed
    // by the MarshalAs attributes below; they are not predefined in C#.
    private const int MAX_PATH = 260;
    private const int MAX_ALTERNATE = 14;

    [StructLayout(LayoutKind.Sequential)]
    private struct FILETIME
    {
        public uint dwLowDateTime;
        public uint dwHighDateTime;
    };

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public FILETIME ftCreationTime;
        public FILETIME ftLastAccessTime;
        public FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = MAX_PATH)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = MAX_ALTERNATE)]
        public string cAlternate;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr FindFirstFile(
        string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool FindNextFile(
        IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);
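
For completeness, here is a minimal sketch of how these declarations can be used to walk a tree and build a snapshot of every file's size and last-write time. The FsEntry type and Scan method are just illustrative names, not the actual FSNodes code:

    // Illustrative snapshot entry; FSNodes itself stores more metadata.
    private struct FsEntry
    {
        public long Size;
        public long LastWriteTime;   // FILETIME packed into a single 64-bit value
    }

    // Recursively walks 'root' and records every file found under it.
    private static void Scan(string root, Dictionary<string, FsEntry> snapshot)
    {
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(root, "*"), out findData);
        if (handle == new IntPtr(-1))   // INVALID_HANDLE_VALUE
            return;

        try
        {
            do
            {
                if (findData.cFileName == "." || findData.cFileName == "..")
                    continue;

                string fullPath = Path.Combine(root, findData.cFileName);

                if ((findData.dwFileAttributes & FileAttributes.Directory) != 0)
                {
                    Scan(fullPath, snapshot);
                }
                else
                {
                    snapshot[fullPath] = new FsEntry
                    {
                        Size = ((long)findData.nFileSizeHigh << 32) | findData.nFileSizeLow,
                        LastWriteTime = ((long)findData.ftLastWriteTime.dwHighDateTime << 32)
                                        | findData.ftLastWriteTime.dwLowDateTime
                    };
                }
            } while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
    }

The point of using the raw API is that each WIN32_FIND_DATA already contains the size and timestamps, so no extra call per file is needed.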

What we do is static processing. We save a metadata tree on disk and compare the stored directory tree against the freshly scanned one, looking for modified files (based on the timestamp, which is faster, or on the file hash). We can also handle deleted, added and moved files, and even moved-and-modified files (again based on the file hash).
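
To illustrate the comparison step, a rough sketch over two such snapshots (illustrative only, reusing the FsEntry type from above; move detection via file hashes is omitted):

    // Compares the stored snapshot against a freshly scanned one.
    // Moved files can be found afterwards by pairing removed and added
    // entries with identical hashes; omitted here for brevity.
    private static void Diff(
        Dictionary<string, FsEntry> stored,
        Dictionary<string, FsEntry> current,
        List<string> added, List<string> removed, List<string> modified)
    {
        foreach (var kv in current)
        {
            FsEntry old;
            if (!stored.TryGetValue(kv.Key, out old))
                added.Add(kv.Key);
            else if (old.LastWriteTime != kv.Value.LastWriteTime || old.Size != kv.Value.Size)
                modified.Add(kv.Key);   // timestamp/size first; fall back to a hash if needed
        }

        foreach (string path in stored.Keys)
            if (!current.ContainsKey(path))
                removed.Add(path);
    }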

This implementation, combined with a daemon that runs it every POLL_TIME, worked well for us. Hope it helps.

answered Sep 28 '22 by Daniel Peñalba

My best guess would be to use the USN journal, provided it is a local machine, you have administrator privileges and the partitions are NTFS. The USN journal is extremely fast and reliable. It is a long topic, and this link explains everything: http://www.microsoft.com/msj/0999/journal/journal.aspx
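
For reference, a rough sketch of reading the journal from C# via DeviceIoControl, using FSCTL_QUERY_USN_JOURNAL and FSCTL_READ_USN_JOURNAL as described in the linked article. Treat it as a starting point rather than production code; error handling is mostly omitted and the volume name is a placeholder:

    // Needs: using System; using System.Runtime.InteropServices; using Microsoft.Win32.SafeHandles;
    const uint FSCTL_QUERY_USN_JOURNAL = 0x000900f4;
    const uint FSCTL_READ_USN_JOURNAL  = 0x000900bb;

    [StructLayout(LayoutKind.Sequential)]
    struct USN_JOURNAL_DATA
    {
        public ulong UsnJournalID;
        public long FirstUsn;
        public long NextUsn;
        public long LowestValidUsn;
        public long MaxUsn;
        public ulong MaximumSize;
        public ulong AllocationDelta;
    }

    [StructLayout(LayoutKind.Sequential)]
    struct READ_USN_JOURNAL_DATA
    {
        public long StartUsn;
        public uint ReasonMask;
        public uint ReturnOnlyOnClose;
        public ulong Timeout;
        public ulong BytesToWaitFor;
        public ulong UsnJournalID;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(string lpFileName, uint dwDesiredAccess,
        uint dwShareMode, IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        IntPtr lpInBuffer, int nInBufferSize, out USN_JOURNAL_DATA lpOutBuffer,
        int nOutBufferSize, out int lpBytesReturned, IntPtr lpOverlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        ref READ_USN_JOURNAL_DATA lpInBuffer, int nInBufferSize, IntPtr lpOutBuffer,
        int nOutBufferSize, out int lpBytesReturned, IntPtr lpOverlapped);

    static void DumpJournal()
    {
        // Administrator rights are required to open the volume; "C:" is a placeholder.
        using (SafeFileHandle volume = CreateFile(@"\\.\C:", 0x80000000 /*GENERIC_READ*/,
            0x00000003 /*FILE_SHARE_READ | FILE_SHARE_WRITE*/, IntPtr.Zero,
            3 /*OPEN_EXISTING*/, 0, IntPtr.Zero))
        {
            if (volume.IsInvalid)
                return;

            int bytesReturned;
            USN_JOURNAL_DATA journal;
            DeviceIoControl(volume, FSCTL_QUERY_USN_JOURNAL, IntPtr.Zero, 0,
                out journal, Marshal.SizeOf(typeof(USN_JOURNAL_DATA)),
                out bytesReturned, IntPtr.Zero);

            READ_USN_JOURNAL_DATA read = new READ_USN_JOURNAL_DATA
            {
                StartUsn = journal.FirstUsn,
                ReasonMask = 0xFFFFFFFF,   // report every change reason
                UsnJournalID = journal.UsnJournalID
            };

            IntPtr buffer = Marshal.AllocHGlobal(64 * 1024);
            try
            {
                // Return-value / GetLastError checks are omitted for brevity.
                DeviceIoControl(volume, FSCTL_READ_USN_JOURNAL, ref read,
                    Marshal.SizeOf(typeof(READ_USN_JOURNAL_DATA)),
                    buffer, 64 * 1024, out bytesReturned, IntPtr.Zero);

                // The first 8 bytes are the USN to continue from; USN_RECORDs follow.
                int offset = 8;
                while (offset < bytesReturned)
                {
                    IntPtr record = new IntPtr(buffer.ToInt64() + offset);
                    int recordLength = Marshal.ReadInt32(record);           // RecordLength
                    int fileNameLength = Marshal.ReadInt16(record, 56);     // in bytes
                    int fileNameOffset = Marshal.ReadInt16(record, 58);
                    string name = Marshal.PtrToStringUni(
                        new IntPtr(record.ToInt64() + fileNameOffset), fileNameLength / 2);

                    Console.WriteLine(name);
                    offset += recordLength;
                }
            }
            finally
            {
                Marshal.FreeHGlobal(buffer);
            }
        }
    }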

answered Sep 28 '22 by pg0xC