
Fastest way to compare directory state, or hashing for fun and profit

Tags: linux, bash, php, hash

We have a PHP application, and we think it would be advantageous for the application to know whether anything in its makeup has changed since the last execution. This is mainly for managing caches and the like: our applications are sometimes accessed by people who don't remember to clear the cache after making changes. (Changing the people is the obvious answer but, alas, not really achievable.)

We've come up with this, which is the fastest we've managed to eke out: it averages 0.08 (as reported by time) on a developer machine for a typical project. We've experimented with shasum, md5 and crc32, and md5 was the fastest. We are basically md5ing the contents of every file, then md5ing that output. Security isn't a concern; we're just interested in detecting filesystem changes via a differing checksum.

time (find application/ -path '*/.svn' -prune -o -type f -print0 | xargs -0 md5 | md5)

I suppose the question is, can this be optimised any further?

(I realise that pruning .svn will have a cost, but find takes the least time of all the components, so the cost will be pretty minimal. We're testing this on a working copy at the moment.)
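
For context, here is a rough sketch of how the checksum could be used to detect a change since the last run: compute it, compare it with the value stored by the previous run, and store the new one. The function names (tree_checksum, tree_changed) and the state-file argument are hypothetical, and the md5sum/md5 switch is only an assumption about which tool is installed (md5sum on Linux, md5 on macOS/BSD).

```shell
# Pick the MD5 tool: md5sum (GNU coreutils) if present, otherwise BSD/macOS md5.
if command -v md5sum >/dev/null 2>&1; then HASH=md5sum; else HASH=md5; fi

# Print a single checksum over the contents of every file under $1 (skipping .svn).
# This is the same pipeline as in the question, wrapped in a function.
tree_checksum() {
    find "$1" -path '*/.svn' -prune -o -type f -print0 | xargs -0 "$HASH" | "$HASH"
}

# Return 0 if the tree under $1 differs from the checksum stored in file $2.
# The new checksum is always written to $2 so the next call compares against it.
# Note: a missing state file (first run) counts as "changed".
tree_changed() {
    new=$(tree_checksum "$1")
    old=$(cat "$2" 2>/dev/null)
    printf '%s\n' "$new" > "$2"
    [ "$new" != "$old" ]
}
```

Something like `tree_changed application/ /tmp/app.checksum && echo "clear the cache"` would then be the whole check. Keep the state file outside the hashed tree, otherwise writing it would itself register as a change.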

jhogendorn asked Dec 03 '10 08:12


4 Answers

Instead of going by file contents, you can use the same technique on file metadata: permissions (%m), status change time (%c) and path (%p). Note that -printf requires GNU find:

find . -name '.svn' -prune -o -type f -printf '%m%c%p' | md5sum

This is much faster than reading and hashing the contents of each file.
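
A slightly more defensive variant of the same idea, as a sketch only: it writes one record per line so adjacent fields cannot run together, prints the change time as an epoch value (%C@) instead of a formatted date, and also includes the file size (%s), which is my addition and not part of the command above. It assumes GNU find and md5sum; meta_fingerprint is a hypothetical name.

```shell
# Fingerprint a tree from file metadata only; no file contents are read.
# Assumes GNU find (for -printf) and md5sum from GNU coreutils.
meta_fingerprint() {
    # %m  = permission bits (octal)
    # %s  = size in bytes (added here; not in the original one-liner)
    # %C@ = status change time as seconds since the epoch
    # %p  = path
    find "$1" -name '.svn' -prune -o -type f -printf '%m %s %C@ %p\n' | md5sum
}
```

Two calls on an untouched tree print the same hash; editing, adding, removing or chmod-ing a file changes it. The trade-off is that a change which preserves all of this metadata would go unnoticed, which is the price of not reading the contents.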

that other guy answered Nov 09 '22 03:11


You can be notified of filesystem modifications using the inotify extension.

It can be installed with pecl:

pecl install inotify

Or manually (download, phpize && ./configure && make && make install as usual).

This is a raw binding over the Linux inotify syscalls, and is probably the fastest solution on Linux.

See this example of a simple tail implementation: http://svn.php.net/viewvc/pecl/inotify/trunk/tail.php?revision=262896&view=markup


If you want a higher-level library, or support for non-Linux systems, take a look at Lurker.

It works on any system, and can use inotify when it's available.

See the example from the README:

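use Lurker\Event\FilesystemEvent;
use Lurker\ResourceWatcher;

// Track a directory under an arbitrary id, then listen for events on that id.
$watcher = new ResourceWatcher;
$watcher->track('an arbitrary id', '/path/to/views');

$watcher->addListener('an arbitrary id', function (FilesystemEvent $event) {
    // Spaces around 'was' keep the resource and event type from running together.
    echo $event->getResource() . ' was ' . $event->getTypeString() . PHP_EOL;
});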

$watcher->start();

Arnaud Le Blanc answered Nov 09 '22 01:11


Instead of actively searching for changes, why not get notified when something changes? Have a look at PHP's FAM (File Alteration Monitor) API:

FAM monitors files and directories, notifying interested applications of changes. More information about FAM is available at http://oss.sgi.com/projects/fam/. A PHP script may specify a list of files for FAM to monitor using the functions provided by this extension. The FAM process is started when the first connection from any application to it is opened. It exits after all connections to it have been closed.

CAVEAT: requires an additional daemon on the machine and the PECL extension is unmaintained.

Gordon answered Nov 09 '22 03:11


We didn't want to use FAM, since we would need to install it on the server, and that's not always possible. Sometimes clients insist that we deploy on their broken infrastructure. Since FAM is discontinued, it's also hard to get that change approved, red-tape-wise.

The only way to improve the speed of the original version in the question is to make sure your file list is as succinct as possible, i.e. only hash the directories and files that really matter if changed. Cutting out directories that aren't relevant can give big speed boosts.
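
As a rough illustration of trimming the file list, something along these lines restricts the hash to chosen subdirectories and to one file type. The *.php filter, the directory names in the usage line and the function name relevant_checksum are all made up for the example; adjust them to whatever actually affects your cache.

```shell
# Pick the MD5 tool: md5sum (GNU coreutils) if present, otherwise BSD/macOS md5.
if command -v md5sum >/dev/null 2>&1; then HASH=md5sum; else HASH=md5; fi

# Hash only the parts of the tree that matter.
# $1 = application root; remaining arguments = subdirectories (relative to $1).
relevant_checksum() {
    root=$1
    shift
    # cd first so the hashed paths are relative and the result does not depend
    # on where the application is checked out. Only *.php files are hashed
    # (an illustrative filter), and .svn directories are pruned as before.
    ( cd "$root" && \
      find "$@" -path '*/.svn' -prune -o -type f -name '*.php' -print0 \
        | xargs -0 "$HASH" | "$HASH" )
}
```

For example, `relevant_checksum application/ controllers models views` would ignore logs, uploads, cached files and anything else outside those three directories.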

Past that, the application was using the function to check whether there were changes, in order to clear the cache if so. Since we don't really want to halt the application while it's doing this, this sort of thing is best farmed out carefully as an asynchronous process using fsockopen. That gives the best 'speed boost' overall; just be careful of race conditions.
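
To show the race-condition concern in isolation, here is a sketch in shell rather than PHP. It is not the fsockopen approach itself, just a background job guarded by a lock directory so that two overlapping requests don't both run the check and clear the cache at once. run_guarded and the lock path are hypothetical names.

```shell
# Run a command in the background, but only if no other guarded run is active.
# $1 = lock directory path; remaining arguments = the command to run.
# Returns 0 if the command was started, 1 if another run already holds the lock.
run_guarded() {
    lock=$1
    shift
    # mkdir is atomic: of several simultaneous callers, only one succeeds.
    if mkdir "$lock" 2>/dev/null; then
        # Run the command in a background subshell and release the lock afterwards.
        ( "$@"; rmdir "$lock" ) &
        return 0
    fi
    return 1
}
```

A caller would do something like `run_guarded /tmp/app-check.lock ./check-and-clear-cache.sh` (script name invented here) and carry on serving the request either way. One caveat of this simple scheme: if the background job is killed before rmdir runs, the stale lock has to be removed by hand.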

Marking this as the 'answer' and upvoting the FAM answer.

jhogendorn answered Nov 09 '22 02:11