I'm running the following MD5 check on 500 million files to look for duplicates. The script is taking forever to run and I was wondering how to speed it up. Could I use a try/catch block instead of Contains, and throw an error when the hash already exists? What would you all recommend?
$folder = Read-Host -Prompt 'Enter a folder path'
$hash = @{}
$lineCheck = 0
Get-ChildItem $folder -Recurse | Where-Object { !$_.PSIsContainer } | ForEach-Object {
    $lineCheck++
    Write-Host $lineCheck
    $tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash
    if (!$hash.Contains($tempMD5)) {
        $hash.Add($tempMD5, $_.FullName)
    }
    else {
        Remove-Item -LiteralPath $_.FullName
    }
}
As suggested in the comments, you might consider hashing a file only when another file with the same length has already been found. That way you never invoke the expensive hash method for any file whose length is unique.
Note: the Write-Host command is quite expensive by itself, so I would not display every iteration (Write-Host $lineCheck) but, for example, only write output when a match is found.
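If you do want periodic feedback, a cheaper option is to report only every so many files. A minimal sketch, dropped in place of the per-iteration Write-Host in your script (the 1000-file interval is an arbitrary choice):

$lineCheck++
if ($lineCheck % 1000 -eq 0) {
    # Write-Progress updates a single status line instead of flooding the console
    Write-Progress -Activity 'Scanning files' -Status "$lineCheck files processed"
}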
$Folder = Read-Host -Prompt 'Enter a folder path'
$FilesBySize = @{}
$FilesByHash = @{}

Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}

Get-ChildItem $Folder -Recurse | Where-Object -Not PSIsContainer | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # Another file with this length already exists: lazily hash the first one, then this one
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        If ($Files.Count -ge 1) {If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}
Display the found duplicates:
ForEach ($Hash in $FilesByHash.GetEnumerator()) {
    If ($Hash.Value.Count -gt 1) {
        Write-Host 'Hash:' $Hash.Name
        ForEach ($File in $Hash.Value) {
            Write-Host 'File:' $File
        }
    }
}
I'd guess that the slowest part of your code is the Get-FileHash invocation, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).
You could try replacing it with a call to a native tool that has a more optimized MD5 implementation and see if it helps.
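For example, certutil.exe ships with Windows and can hash a file natively. A rough sketch of wrapping it (Get-NativeMD5 is a hypothetical helper name, and the parsing assumes the digest is on the second output line, which holds on recent Windows versions):

Function Get-NativeMD5([String]$Path) {
    # certutil prints a header line, then the digest, then a status line
    $output = certutil.exe -hashfile $Path MD5
    # strip the whitespace older builds insert between byte pairs; uppercase to match Get-FileHash
    ($output[1] -replace '\s', '').ToUpper()
}

# drop-in replacement for the Get-FileHash call:
$tempMD5 = Get-NativeMD5 $_.FullName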
Could I use a try/catch block instead of Contains, and throw an error when the hash already exists?
Exceptions are slow, and using them for flow control is not recommended:
While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handlers as part of the regular program execution logic can be expensive and should be avoided.
There is a definitive answer to this from the guy who implemented them, Chris Brumme. He wrote an excellent blog article about the subject (warning: it's very long; second warning: it's very well written, so if you're a techie you'll read it to the end and then have to make up your hours after work :) ).
The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even cross the ring 0 CPU boundary!
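For completeness, here is roughly what the try/catch variant from your question would look like. Hashtable.Add() throws an ArgumentException on a duplicate key, so it does work, but the exception dispatch on every duplicate costs far more than the Contains() test it replaces (note the saved $file variable: inside a catch block, $_ is the ErrorRecord, not the pipeline object):

$file = $_.FullName                 # save it; $_ means the ErrorRecord inside catch
try {
    $hash.Add($tempMD5, $file)      # throws ArgumentException on a duplicate key
}
catch [System.ArgumentException] {
    Remove-Item -LiteralPath $file
}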