Fast folder hashing in Windows Node

I'm building a node-webkit app that keeps a local directory in sync with a remote FTP server. To build the initial index on first run, I download an index file from the remote server that contains hashes for all the files and their folders. I then run through this list and look for matches in the user's local folder.

The total size of the remote/local folder can be over 10GB. As you can imagine, scanning 10GB worth of individual files can be pretty slow, especially on a normal HDD (not SSD).

Is there a way in Node to efficiently get a hash of a folder without looping through and hashing every individual file inside it? That way I would only need to do the expensive per-file check when the folder hash differs (which is how I do it once I have a local index to compare against the remote one).

Titan asked Jun 18 '15 22:06


1 Answer

You could walk the directory tree iteratively, stat each directory and the files it contains (without following links), and produce a hash from those stats rather than from the file contents. Here's an example:

'use strict';

// npm install siphash
var siphash = require('siphash');
// npm install walk
var walk = require('walk');

var key = siphash.string16_to_key('0123456789ABCDEF');
var walker  = walk.walk('/tmp', {followLinks: false});

walker.on('directories', directoryHandler);
walker.on('file', fileHandler);
walker.on('errors', errorsHandler); // plural
walker.on('end', endHandler);

var directories = {};
var directoryHashes = [];

function addRootDirectory(name, stats) {
    directories[name] = directories[name] || {
        fileStats: []
    };

    if(stats.file) directories[name].fileStats.push(stats.file);
    else if(stats.dir) directories[name].dirStats = stats.dir;
}

function directoryHandler(root, dirStatsArray, next) {
    addRootDirectory(root, {dir:dirStatsArray});
    next();
}

function fileHandler(root, fileStat, next) {
    addRootDirectory(root, {file:fileStat});
    next();
}

function errorsHandler(root, nodeStatsArray, next) {
    nodeStatsArray.forEach(function (n) {
        console.error('[ERROR] ' + n.name);
        console.error(n.error.message || (n.error.code + ': ' + n.error.path));
    });
    next();
}

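// Keep only fields that are stable between runs (atime changes on access).
function summarize(stat) {
    return {name: stat.name, size: stat.size, mtime: stat.mtime};
}

function byName(a, b) {
    return a.name < b.name ? -1 : (a.name > b.name ? 1 : 0);
}

function endHandler() {
    Object.keys(directories).sort().forEach(function (dir) {
        var entry = directories[dir];
        // Hash the collected stats (sorted by name), not just the directory name.
        var summary = {
            files: entry.fileStats.slice().sort(byName).map(summarize),
            dirs: (entry.dirStats || []).slice().sort(byName).map(summarize)
        };
        var hash = siphash.hash_hex(key, JSON.stringify(summary));
        directoryHashes.push({
            dir: dir,
            hash: hash
        });
    });

    console.log(directoryHashes);
}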

You would of course want to turn this into some kind of command-line app that takes arguments, and double-check that the files are processed in the same order every time (for example, by sorting the file stats by file name before hashing) so that siphash returns the same hash for the same directory state.
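To illustrate the ordering point, here is a minimal sketch. The helper name hashEntries and the sample entries are made up for the example, and it uses Node's built-in crypto module with SHA-256 instead of siphash so it runs without extra dependencies.

```javascript
'use strict';

var crypto = require('crypto');

// Hypothetical helper: hash a list of {name, size, mtime} entries so that
// the result does not depend on the order the entries were discovered in.
function hashEntries(entries) {
    var sorted = entries.slice().sort(function (a, b) {
        // Plain string comparison; the result is the same on every machine.
        return a.name < b.name ? -1 : (a.name > b.name ? 1 : 0);
    });
    return crypto.createHash('sha256')
        .update(JSON.stringify(sorted))
        .digest('hex');
}

// The same two entries, listed in a different order.
var a = hashEntries([
    {name: 'b.txt', size: 2, mtime: 200},
    {name: 'a.txt', size: 1, mtime: 100}
]);
var b = hashEntries([
    {name: 'a.txt', size: 1, mtime: 100},
    {name: 'b.txt', size: 2, mtime: 200}
]);

console.log(a === b); // true
```

Without the sort, the two calls would serialize the entries in different orders and produce different hashes even though the directory state is identical.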

This code is untested; it is only meant to show where I'd likely start with this sort of thing.

Edit: to reduce dependencies, you could use Node's built-in crypto module (require('crypto')) instead of siphash, and walk and stat the directories and files yourself if you'd like.
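A dependency-free version might look roughly like the sketch below. The function name hashDirectory and the choice of SHA-256 are assumptions for illustration, not part of the code above. It is synchronous for brevity and hashes only metadata (name, size, modification time), never file contents.

```javascript
'use strict';

var crypto = require('crypto');
var fs = require('fs');
var path = require('path');

// Hypothetical helper: returns a hex digest summarizing the metadata of
// everything under `dir`. File contents are never read.
function hashDirectory(dir) {
    var hash = crypto.createHash('sha256');
    // Sort the names so the digest does not depend on readdir order.
    fs.readdirSync(dir).sort().forEach(function (name) {
        var full = path.join(dir, name);
        var stat = fs.lstatSync(full); // lstat does not follow symlinks
        if (stat.isDirectory()) {
            // Fold the subdirectory's digest into this directory's digest.
            hash.update('d:' + name + ':' + hashDirectory(full) + '\n');
        } else {
            hash.update('f:' + name + ':' + stat.size + ':' +
                stat.mtime.getTime() + '\n');
        }
    });
    return hash.digest('hex');
}
```

Calling hashDirectory('/path/to/folder') returns one digest for the whole tree. Because it relies on size and modification time, it will miss a change that preserves both, and it will only match a remote hash if the remote index is computed the same way from the same metadata (modification times often differ after an FTP transfer), so treat it as a quick local change detector rather than a content hash.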

Matt Mullens answered Oct 10 '22 21:10
