Validating the existence of 350 million files over a network

I have a SQL Server table with ~300,000,000 absolute UNC paths and I'm trying to (quickly) validate each one to make sure the path in the table actually exists as a file on disk.

At face value, I'm querying the table in batches of 50,000 and incrementing a counter to advance my batch as I go.

Then, I'm using a data reader object to store my current batch set and loop through the batch, checking each file with a File.Exists(path) command, like in the following example.

Problem is, I'm processing at approximately 1,000 files per second at most on a quad-core 3.4 GHz i5 with 16 GB of RAM, which is going to take days. Is there a faster way to do this?

I do have a columnstore index on the SQL Server table and I've profiled it. I get batches of 50k records in under a second, so it's not a SQL bottleneck when issuing batches to the .NET console app.

while (counter <= MaxRowNum)
{
    // BETWEEN is inclusive, so end at counter + 49999 to avoid re-checking the boundary row
    command.CommandText = "SELECT id, dbname, location FROM table WHERE id BETWEEN " + counter + " AND " + (counter + 49999).ToString();

    connection.Open();

    using (var reader = command.ExecuteReader())
    {
        var indexOfColumn1 = reader.GetOrdinal("id");
        var indexOfColumn2 = reader.GetOrdinal("dbname");
        var indexOfColumn3 = reader.GetOrdinal("location");

        while (reader.Read())
        {
            var ID = reader.GetValue(indexOfColumn1);
            var DBName = reader.GetValue(indexOfColumn2);
            var Location = reader.GetValue(indexOfColumn3);

            if (!File.Exists(Location.ToString()))
            {
                // log an entry to the logging table
            }
        }
    }

    // increment counter to grab the next batch
    counter += 50000;

    // report on progress; I realize this might be off and should be based on the last ID read
    Console.WriteLine("Last Record Processed: " + counter.ToString());
    connection.Close();
}

Console.WriteLine("Done");
Console.Read();

EDIT: Adding some additional info:

I thought about doing this all via the database itself; it's SQL Server Enterprise with 2 TB of RAM and 64 cores. The problem is that the SQL Server service account doesn't have access to the NAS paths hosting the data, so my cmdshell run via a stored procedure failed (I don't control the AD stuff). Also, the UNC paths have hundreds of thousands of individual subdirectories based on an MD5 hash of the file, so enumerating the contents of directories ends up not being useful because you may have a file 10 directories deep housing only 1 file. That's why I have to do a literal full-path match/check.

Oh, and the paths are very long in general. I actually tried loading them all into a list in memory before I realized it was the equivalent of 90 GB of data (lol, oops). I totally agree with the other comments on threading it out. The database is super fast, so I'm not worried at all there. I hadn't considered SMB chatter, though; that very well may be what I'm running up against. – JRats

Oh! And I'm also only updating the database if a file doesn't exist. If it does, I don't care, so my database runs are minimized to grabbing batches of paths. Basically, we migrated a bunch of data from slower storage to this Nimble appliance, and I was asked to make sure everything actually made it over by writing something to verify existence per file.

Threading helped quite a bit. I spread the file check over 4 threads and got my throughput up to about 3,300 records per second, which is far better, but I'm still hoping to get even quicker if I can. Is there a good way to tell whether I'm getting bound by SMB traffic? I noticed that once I tried to bump my thread count up to 4 or 5, my speed dropped to a trickle; I thought maybe I was deadlocking somewhere, but no.

Oh, and I can't do a FilesOnNetwork check for the exact reason you said: there are 3 or 4 times as many files actually hosted there compared to what I want to check. There are probably 1.5 billion files or so on that Nimble appliance.

asked Dec 05 '15 by JRats


3 Answers

Optimizing the SQL side is moot here because you are file I/O bound.

I would use Directory.EnumerateFiles to obtain a list of all files that exist. Enumerating the files in a directory should be much faster than testing each file individually.

You can even invert the problem entirely and bulk insert that file list into a database temp table so that you can do SQL based set processing right in the database.

If you want to go ahead and test each file individually, you probably should do it in parallel. It is not clear that the process is really disk bound; it might be network or CPU bound.

Parallelism will help here by overlapping multiple requests. It's the network latency, not the bandwidth, that's likely to be the problem. At DOP 1 at least one machine is idle at any given time, and there are times when both are idle.
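
For example, a minimal sketch of overlapping the existence checks with Parallel.ForEach; pathsInBatch (standing in for the locations read from the current batch) and the degree of parallelism are assumptions to tune against your share:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical sketch: overlap many File.Exists calls so per-request SMB latency is hidden.
var missing = new ConcurrentBag<string>();

Parallel.ForEach(
    pathsInBatch,                                         // assumption: the paths from the current batch
    new ParallelOptions { MaxDegreeOfParallelism = 8 },   // assumption: tune for your network
    path =>
    {
        if (!File.Exists(path))
        {
            missing.Add(path);   // collect misses, then log them to the database in one round trip
        }
    });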


"there's 3 or 4x as many files actually hosted there compared to what I want to check"

Use the dir /b command to pipe a list of all file names into a .txt file. Execute it locally on the machine that has the files if possible; if that's impossible, execute it remotely. Then use bcp to bulk insert the list into a table in the database. At that point you can do a fast existence check in a single SQL query, which will be highly optimized; you'll be getting a hash join. A sketch of the whole pipeline follows.
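
A sketch of that pipeline, with a hypothetical staging table (dbo.FilesOnDisk), placeholder server/database names, and the source table and columns taken from the question:

-- Hypothetical staging table for the dir output; the name and column size are assumptions.
CREATE TABLE dbo.FilesOnDisk (location nvarchar(4000) NOT NULL);

-- From a command prompt on (or close to) the machine hosting the files:
--   dir /b /s \\nas\share > files.txt
--   bcp MyDb.dbo.FilesOnDisk in files.txt -S myserver -T -c

-- One set-based existence check; with this shape you should get a hash join.
SELECT t.id, t.dbname, t.location
FROM dbo.[table] AS t
LEFT JOIN dbo.FilesOnDisk AS f
    ON f.location = t.location
WHERE f.location IS NULL;   -- rows whose file was not found on the share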

If you want to parallelize the dir phase of this strategy, you can write a program for that. But maybe there is no need to, and dir is fast enough despite being single-threaded.

answered Nov 15 '22 by usr


The bottleneck most likely is network traffic, or more specifically: SMB traffic. Your machine talks SMB to retrieve the file info from the network storage. SMB traffic is "chatty"; you need a few messages to check a file's existence and your permission to read it.

For what it's worth, on my network I can query the existence of about a hundred files per second over SMB, while listing 15K files recursively takes 10 seconds.

What can be faster is to retrieve the remote directory listing beforehand. This will be trivial if the directory structure is predictable, and if the storage does not contain many irrelevant files in those directories.

Then your code will look like this:

// Windows paths are case-insensitive, so compare accordingly
HashSet<string> filesOnNetwork = new HashSet<string>(
    Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories),
    StringComparer.OrdinalIgnoreCase);

foreach (var fileToCheck in filesFromDatabase)
{
    bool fileToCheckExists = filesOnNetwork.Contains(fileToCheck);
    // log fileToCheck to the logging table if it does not exist
}

This may work adversely if there are many more files on the network than you need to check, as filling and searching filesOnNetwork will become the bottleneck of your application.

answered Nov 15 '22 by CodeCaster


In your current solution, getting batches of 50,000 and opening and closing the connection serves NO purpose but to slow things down. A DataReader streams: just open it once and read the rows one at a time. Under the covers the reader fetches batches as needed; it won't try to jam the client with 300,000,000 rows when you have only read 10.
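
For instance, a minimal sketch of that single streaming pass, reusing the query and columns from the question (connectionString is an assumption, and the per-path check would be handed off to worker threads as discussed elsewhere):

using System.Data.SqlClient;
using System.IO;

// Hypothetical sketch: one connection, one command, one reader streamed end to end.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT id, dbname, location FROM table ORDER BY id", connection))
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            string location = reader.GetString(2);   // "location" is the third selected column
            // validate location here (or hand it to the blocking collection / worker threads)
        }
    }
}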

I think you are worried about optimizing the fastest step - reading from SQL.

Validating a file path is going to be the slowest step.

I like the answer from CodeCaster, but at 350 million files you are going to run into object size limits in .NET. And by reading everything into a HashSet first, the work doesn't start until that step is done.

I would use a BlockingCollection connecting two tasks:

  • enumerate files
  • write to db

The slowest step is reading the file names, so do that as fast as possible and don't interrupt it. Do it on a device close to the storage; run the program on a SAN-attached machine.

I know you are going to say the write to the DB is slow, but it only has to be faster than enumerating files. Just have a bit column for "found"; don't write the full filename to a #temp table. I'll bet dollars to donuts an (optimized) update is faster than enumerating files. Chunk your updates, say 10,000 rows at a time, to keep the round trips down, and do the update asynchronously so you can build up the next chunk while the current one is processing (see the sketch after the snippet below).

Then, at the end, you check the DB for any file that was not marked as found.

Don't go to an intermediate collection first; process the enumeration directly. This lets you start doing the work immediately and keeps memory usage down.

foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
{
    // write fileName to the blocking collection
}
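
A minimal sketch of that pipeline, assuming the table and columns from the question; the bound, the chunk size, and the MarkFound helper are hypothetical placeholders to adapt:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical sketch of the producer/consumer pipeline described above.
var foundFiles = new BlockingCollection<string>(boundedCapacity: 100000);

// Producer: enumerate the share as fast as possible and hand file names off.
var producer = Task.Run(() =>
{
    foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
    {
        foundFiles.Add(fileName);
    }
    foundFiles.CompleteAdding();
});

// Consumer: flip the "found" bit in chunks of ~10,000 rows to keep round trips down.
var consumer = Task.Run(() =>
{
    var chunk = new List<string>(10000);
    foreach (string fileName in foundFiles.GetConsumingEnumerable())
    {
        chunk.Add(fileName);
        if (chunk.Count == 10000)
        {
            MarkFound(chunk);   // hypothetical helper: UPDATE ... SET found = 1 for these locations
            chunk.Clear();
        }
    }
    if (chunk.Count > 0) MarkFound(chunk);
});

Task.WaitAll(producer, consumer);
// Finally, SELECT the rows still not marked as found - those files did not make it over.
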
answered Nov 15 '22 by paparazzo