PowerShell Speed: How to speed up ForEach-Object MD5/hash check

I'm running the following MD5 check on 500 million files to look for duplicates. The script is taking forever to run, and I was wondering how to speed it up. Could I use a try/catch block instead of Contains, so that a thrown error signals that the hash already exists? What would you all recommend?

$folder = Read-Host -Prompt 'Enter a folder path'

$hash = @{}
$lineCheck = 0

Get-ChildItem $folder -Recurse | Where-Object { !$_.PSIsContainer } | ForEach-Object {
    $lineCheck++
    Write-Host $lineCheck
    $tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash

    if (!$hash.Contains($tempMD5)) {
        # First time this hash is seen: remember the file
        $hash.Add($tempMD5, $_.FullName)
    }
    else {
        # Hash already seen: delete the duplicate
        Remove-Item -LiteralPath $_.FullName
    }
}


2 Answers

As suggested in the comments, you might consider hashing a file only once another file with the same length has been found. That way you never invoke the expensive hash method for any file whose length is unique.

Note that the Write-Host command is quite expensive by itself, so I would not display every iteration (Write-Host $lineCheck) but only, e.g., when a match is found.
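If you do want some progress feedback, a minimal sketch is to throttle it (the 1000-file interval below is an arbitrary choice):

$lineCheck++
if ($lineCheck % 1000 -eq 0) { Write-Host "Processed $lineCheck files" }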

$Folder = Read-Host -Prompt 'Enter a folder path'

$FilesBySize = @{}
$FilesByHash = @{}

# Hash the given file and register its path under that hash;
# returns $True when the hash was already seen (i.e. a duplicate).
Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}

Get-ChildItem $Folder -Recurse | Where-Object -Not PSIsContainer | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # A second file with this length has appeared: hash the first one retroactively
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        If ($Files.Count -ge 1) {If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        # First file with this length: record it without hashing
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}

Display the found duplicates:

ForEach($Hash in $FilesByHash.GetEnumerator()) {
    If ($Hash.Value.Count -gt 1) {
        Write-Host 'Hash:' $Hash.Name
        ForEach ($File in $Hash.Value) {
            Write-Host 'File:' $File
        }
    }
}


I'd guess that the slowest part of your code is the Get-FileHash invocation, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).

You could try replacing it with an invocation of a native tool that has a more optimized MD5 implementation and see if that helps.
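For example, certutil.exe (which ships with Windows) can compute MD5 hashes. A minimal sketch, assuming the hash appears on the second line of certutil's output (the exact format varies slightly between Windows versions, so verify on your system):

# Sketch: hash a file with certutil.exe instead of Get-FileHash
Function Get-NativeMD5([String]$Path) {
    $Output = certutil.exe -hashfile $Path MD5   # header line, hash line, status line
    ($Output[1] -replace '\s', '').ToUpper()     # strip spaces (older Windows) and normalize case
}

Get-NativeMD5 'C:\Windows\notepad.exe'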

Could I use a try/catch block instead of Contains, so that a thrown error signals that the hash already exists?

Exceptions are slow, and using them for flow control is not recommended (see the sketch after the quotes below):

  • DA0007: Avoid using exceptions for control flow

While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handlers as part of the regular program execution logic can be expensive and should be avoided.

  • https://stackoverflow.com/a/162027/4424236

There is the definitive answer to this from the guy who implemented them, Chris Brumme. He wrote an excellent blog article about the subject (warning: it's very long) (warning 2: it's very well written; if you're a techie you'll read it to the end and then have to make up your hours after work :) )

The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even pass the ring 0 CPU boundary!
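To make the cost concrete, here is a sketch of what the try/catch variant would look like inside your original loop (illustrative only; the comparison-based test you already have is the better choice):

# Capture the path first: inside a catch block, $_ is the error record, not the file
$file = $_.FullName
try   { $hash.Add($tempMD5, $file) }      # Hashtable.Add throws if the key already exists
catch { Remove-Item -LiteralPath $file }  # an exception is thrown for EVERY duplicate

# The cheap alternative:
if (!$hash.Contains($tempMD5)) { $hash.Add($tempMD5, $file) }
else                           { Remove-Item -LiteralPath $file }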
