I have an array containing file paths, and I want to get a list of the files that are duplicates based on their MD5 hashes. I calculate the MD5 like this:
private void calcMD5(Array files) // files contains the paths of all files
{
    int i = 0;
    string[] md5_val = new string[files.Length];
    foreach (string file_name in files)
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(file_name))
            {
                md5_val[i] = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                i += 1;
            }
        }
    }
}
From the above I am able to calculate their MD5 hashes, but how do I get a list of only the files that are duplicates? If there is another way to do the same, please let me know; also, I am new to LINQ.
Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content. If you're really afraid of accidental collisions, just compute both MD5 and SHA1 hashes and compare them.
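If you do want that belt-and-braces approach, one way to use both hashes is to concatenate them into a single key, so two files only ever compare equal if they collide under both algorithms at once. This is just a sketch; the DoubleHash class and Md5PlusSha1 method names are my own, not part of any library:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class DoubleHash
{
    // Returns the MD5 hex digest (32 chars) followed by the SHA1 hex
    // digest (40 chars) of the file's contents, as one 72-char key.
    // Note: reads the whole file into memory; for very large files you
    // would instead hash a stream twice (rewinding between passes).
    public static string Md5PlusSha1(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        using (var md5 = MD5.Create())
        using (var sha1 = SHA1.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(data)).Replace("-", "").ToLower()
                 + BitConverter.ToString(sha1.ComputeHash(data)).Replace("-", "").ToLower();
        }
    }
}
```

You can then group files by this combined key exactly as you would group by a single MD5 string.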
A popular technique for finding duplicate files is hashing. Hashing generates a hash code for each file: a fixed-length code that represents a binary source of any length.
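For example, MD5 always produces a 128-bit (32 hex character) code, whether the input is one byte or a megabyte. A minimal illustration (the HashDemo class and Md5Hex method names are just for this example):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class HashDemo
{
    // Returns the lowercase hex MD5 digest of the given bytes.
    public static string Md5Hex(byte[] input)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(input)).Replace("-", "").ToLower();
    }
}

// HashDemo.Md5Hex(Encoding.UTF8.GetBytes("a")).Length  -> 32
// HashDemo.Md5Hex(new byte[1000000]).Length            -> 32
```

Because equal contents always produce equal codes, comparing these short fixed-length codes stands in for comparing the files byte by byte.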
1. Rewrite your calcMD5 function to take in a single file path and return the MD5.
2. Store your file names in a string[] or List<string>, not an untyped Array, if possible.
3. Use the following LINQ to get groups of files with the same hash:
var groupsOfFilesWithSameHash = files
    // or files.Cast<string>() if you're stuck with an Array
    .GroupBy(f => calcMD5(f))
    .Where(g => g.Count() > 1);
4. You can iterate over the groups with nested foreach loops, for example:
foreach (var group in groupsOfFilesWithSameHash)
{
    Console.WriteLine("Shared MD5: " + group.Key);
    foreach (var file in group)
        Console.WriteLine("    " + file);
}
static void Main(string[] args)
{
    // returns the file names that have duplicate MD5 hashes
    var duplicates = CalcDuplicates(new[] { "Hello.txt", "World.txt" });
}
private static IEnumerable<string> CalcDuplicates(IEnumerable<string> fileNames)
{
    return fileNames.GroupBy(CalcMd5OfFile)
                    .Where(g => g.Count() > 1)
                    // skip SelectMany() if you'd like the duplicates grouped by their hash as group key
                    .SelectMany(g => g);
}
private static string CalcMd5OfFile(string path)
{
    // I took your implementation - I don't know if there are better ones
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
    }
}
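One common refinement, not required by the answers above but worth knowing: group files by size first, and only hash files that share a size, since two files of different lengths can never be duplicates. This skips hashing entirely for files with unique sizes. A sketch, assuming a hash function like the CalcMd5OfFile above is passed in (the DupFinder class name is hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class DupFinder
{
    // Returns groups of duplicate files, keyed by their shared hash.
    // Files are first grouped by size (cheap), and only files that
    // share a size with another file are hashed (expensive).
    public static IEnumerable<IGrouping<string, string>> FindDuplicates(
        IEnumerable<string> fileNames, Func<string, string> hash)
    {
        return fileNames
            .GroupBy(f => new FileInfo(f).Length)      // cheap pre-filter
            .Where(sizeGroup => sizeGroup.Count() > 1) // unique sizes can't be duplicates
            .SelectMany(sizeGroup => sizeGroup.GroupBy(hash))
            .Where(hashGroup => hashGroup.Count() > 1);
    }
}
```

For large directory trees this can avoid reading the contents of most files at all.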