Get duplicate file list by computing their MD5

Tags: c#, linq, md5

I have an array that contains file paths. I want to build a list of the files that are duplicates, based on their MD5 hashes. I calculate the MD5 values like this:

private void calcMD5(Array files)  //Array contains a path of all files
{
    int i=0;
    string[] md5_val = new string[files.Length];
    foreach (string file_name in files)
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(file_name))
            {
                md5_val[i] = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                i += 1;
            }
        }
    }                
}

With the code above I am able to calculate the MD5 hashes, but how do I get a list of only those files that are duplicates? If there is another way to do this, please let me know. I am also new to LINQ.

asked Feb 28 '13 11:02 by Manish


2 Answers

1. Rewrite your calcMD5 function to take in a single file path and return the MD5.
2. Store your file names in a string[] or List<string>, not an untyped array, if possible.
3. Use the following LINQ to get groups of files with the same hash:

var groupsOfFilesWithSameHash = files
    // or files.Cast<string>() if you're stuck with an Array
    .GroupBy(f => calcMD5(f))
    .Where(g => g.Count() > 1);

4. You can get to the groups with nested foreach loops, for example:

foreach(var group in groupsOfFilesWithSameHash)
{
    Console.WriteLine("Shared MD5: " + group.Key);
    foreach (var file in group)
        Console.WriteLine("    " + file);
}
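
Steps 1 to 3 could be put together roughly as follows. This is only a sketch: the names DuplicateFinder, CalcMD5 and FindDuplicates are made up for illustration, and the hashing itself is taken from the question. It assumes every path is readable and does not handle I/O errors.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper class; the names are illustrative only.
public static class DuplicateFinder
{
    // Step 1: hash a single file and return its MD5 as a lowercase hex string.
    public static string CalcMD5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream))
                               .Replace("-", "")
                               .ToLowerInvariant();
        }
    }

    // Steps 2 and 3: take typed paths, group them by hash,
    // and keep only the groups that contain more than one file.
    public static Dictionary<string, List<string>> FindDuplicates(IEnumerable<string> files)
    {
        return files
            .GroupBy(f => CalcMD5(f))
            .Where(g => g.Count() > 1)
            // Materialize once so the files are not re-hashed on every enumeration.
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

The LINQ query on its own is lazy, so enumerating it twice would read and hash every file twice. Calling ToDictionary (or ToList) runs it once and keeps the result, which is probably what you want when the files are large.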

answered Oct 01 '22 11:10 by Rawling


    static void Main(string[] args)
    {
        // returns a list of file names, which have duplicate MD5 hashes
        var duplicates = CalcDuplicates(new[] {"Hello.txt", "World.txt"});
    }

    private static IEnumerable<string> CalcDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames.GroupBy(CalcMd5OfFile)
                        .Where(g => g.Count() > 1)
                        // skip SelectMany() if you'd like the duplicates grouped by their hashes as group key
                        .SelectMany(g => g);
    }

    private static string CalcMd5OfFile(string path)
    {
        // I took your implementation - I don't know if there are better ones
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(path))
            {
                return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
            }
        }
    }

answered Oct 02 '22 11:10 by Michael Schnerring