I want to recurse through several directories and find duplicate files across those n directories.
My knee-jerk idea is to have a global hashtable or some other data structure holding each file I find, then check each subsequent file against that "master" list. Obviously, I don't think this would be very efficient, and "there's got to be a better way!" keeps ringing in my brain.
Any advice on a better way to handle this situation would be appreciated.
You could avoid hashing by first comparing file sizes. If you never find two files with the same size, you don't have to hash anything. You only hash a file once you find another file of the same size; then you hash them both.
That should be significantly faster than blindly hashing every single file, although it'd be more complicated to implement that two-tiered check.
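A minimal sketch of that two-tiered check, using LINQ grouping (the directory enumeration and the choice of MD5 here are illustrative assumptions, not part of the answer above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Returns groups of paths whose contents are (almost certainly) identical:
    // tier 1 groups by file size, tier 2 hashes only the size collisions.
    public static IEnumerable<IGrouping<string, string>> FindDuplicates(
        IEnumerable<string> directories)
    {
        var files = directories.SelectMany(d =>
            Directory.EnumerateFiles(d, "*", SearchOption.AllDirectories));

        return files
            .GroupBy(path => new FileInfo(path).Length) // cheap: size only
            .Where(g => g.Count() > 1)                  // unique sizes can't be dupes
            .SelectMany(g => g)
            .GroupBy(Md5Of)                             // expensive: full read
            .Where(g => g.Count() > 1);
    }

    static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(stream));
    }
}
```

Only the files that survive the size filter are ever read from disk, which is where the savings come from.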
I'd suggest keeping multiple in-memory indexes of files.
Create one that indexes all files by file length:
Dictionary<long, List<FileInfo>> IndexBySize;
When you're processing a new file Fu, it's a quick lookup to find all other files that are the same size.
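Maintaining that index might look like the sketch below (the key is long because FileInfo.Length is a long; the helper method names are my own):

```csharp
using System.Collections.Generic;
using System.IO;

var IndexBySize = new Dictionary<long, List<FileInfo>>();

// Record a newly discovered file under its length.
void Add(FileInfo file)
{
    if (!IndexBySize.TryGetValue(file.Length, out var list))
        IndexBySize[file.Length] = list = new List<FileInfo>();
    list.Add(file);
}

// All previously seen files with the same length as fu.
List<FileInfo> SameSize(FileInfo fu) =>
    IndexBySize.TryGetValue(fu.Length, out var list)
        ? list
        : new List<FileInfo>();
```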
Create another that indexes all files by modification timestamp:
Dictionary<DateTime, List<FileInfo>> IndexByModification;
Given file Fu, you can find all files modified at the same time.
Repeat for each significant file characteristic. You can then use the Intersect()
extension method to compare multiple criteria efficiently.
For example:
var matchingFiles = IndexBySize[fu.Length].Intersect(IndexByModification[fu.LastWriteTime]);
This would allow you to avoid the byte-by-byte scan until you need to. Then, for files that have been hashed, create another index:
Dictionary<string, List<FileInfo>> IndexByHash;
You might want to calculate multiple hashes at the same time to reduce collisions.
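Populating the hash index could look like this sketch; the hash is stored as a hex string because a byte[] key wouldn't compare by value in a Dictionary (the method name is my own):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

var IndexByHash = new Dictionary<string, List<FileInfo>>();

// Only called once a file has collided on the cheaper criteria,
// since this is the step that reads the whole file from disk.
void AddHashed(FileInfo file)
{
    string key;
    using (var md5 = MD5.Create())
    using (var stream = file.OpenRead())
        key = BitConverter.ToString(md5.ComputeHash(stream));

    if (!IndexByHash.TryGetValue(key, out var list))
        IndexByHash[key] = list = new List<FileInfo>();
    list.Add(file);
}
```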