
How to hash only image data in a jpg file with dotnet?

I have ~20,000 JPG images, some of which are duplicates. Unfortunately, some files have been tagged with EXIF metadata, so a simple file hash cannot identify the duplicates.

I am attempting to create a PowerShell script to process these, but I can find no way to extract only the bitmap data.

System.Drawing.Bitmap can only return a bitmap object, not bytes. There's a GetHash() function, but it apparently acts on the whole file.

How can I hash these files so that the EXIF information is excluded? I'd prefer to avoid external dependencies if possible.

JSacksteder asked Jan 16 '10 16:01


1 Answer

This is a PowerShell V2.0 advanced function implementation. It is a bit long, but I have verified that it produces the same hash (computed from the bitmap pixels) for the same picture saved with different metadata and file sizes. This version is pipeline-capable and also accepts wildcards and literal paths:

function Get-BitmapHashCode
{
    [CmdletBinding(DefaultParameterSetName="Path")]
    param(
        [Parameter(Mandatory=$true, 
                   Position=0, 
                   ParameterSetName="Path", 
                   ValueFromPipeline=$true, 
                   ValueFromPipelineByPropertyName=$true,
                   HelpMessage="Path to bitmap file")]
        [ValidateNotNullOrEmpty()]
        [string[]]
        $Path,

        [Alias("PSPath")]
        [Parameter(Mandatory=$true, 
                   Position=0, 
                   ParameterSetName="LiteralPath", 
                   ValueFromPipelineByPropertyName=$true,
                   HelpMessage="Path to bitmap file")]
        [ValidateNotNullOrEmpty()]
        [string[]]
        $LiteralPath
    )

    Begin {
        Add-Type -AssemblyName System.Drawing
        $sha = new-object System.Security.Cryptography.SHA256Managed
    }

    Process {
        if ($psCmdlet.ParameterSetName -eq "Path")
        {
            # In -Path case we may need to resolve a wildcarded path
            $resolvedPaths = @($Path | Resolve-Path | Convert-Path)
        }
        else 
        {
            # Must be -LiteralPath
            $resolvedPaths = @($LiteralPath | Convert-Path)
        }

        # Hash the decoded pixel data of each resolved path
        foreach ($rpath in $resolvedPaths) 
        {           
            Write-Verbose "Processing $rpath"
            try {
                $bmp    = new-object System.Drawing.Bitmap $rpath
                $stream = new-object System.IO.MemoryStream
                $writer = new-object System.IO.BinaryWriter $stream
                for ($w = 0; $w -lt $bmp.Width; $w++) {
                    for ($h = 0; $h -lt $bmp.Height; $h++) {
                        $pixel = $bmp.GetPixel($w,$h)
                        $writer.Write($pixel.ToArgb())
                    }
                }
                $writer.Flush()
                [void]$stream.Seek(0,'Begin')
                $hash = $sha.ComputeHash($stream)
                [BitConverter]::ToString($hash) -replace '-',''
            }
            finally {
                if ($bmp)    { $bmp.Dispose() }
                if ($writer) { $writer.Close() }
            }
        }  
    }
}
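
With ~20,000 files, the per-pixel GetPixel loop above is likely to be the slow part. The following is only a sketch of one possible alternative, not a drop-in replacement: Get-PixelHash is a hypothetical helper name, it assumes PowerShell 3.0 or later on .NET 4+ (for [IntPtr]::Add), and it reads the decoded pixels in bulk with Bitmap.LockBits. Because it hashes bytes row by row in memory order rather than column by column as ARGB integers, its hashes will not match those produced by Get-BitmapHashCode; use one function or the other consistently.

```powershell
Add-Type -AssemblyName System.Drawing

# Hypothetical helper (not from the answer above): hashes decoded pixel
# bytes obtained via LockBits, so file metadata never enters the hash.
function Get-PixelHash {
    param([Parameter(Mandatory=$true)][string]$LiteralPath)

    $bmp = $null; $data = $null
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $bmp  = New-Object System.Drawing.Bitmap (Convert-Path -LiteralPath $LiteralPath)
        $rect = New-Object System.Drawing.Rectangle 0, 0, $bmp.Width, $bmp.Height
        # Ask for 32bpp ARGB so every image is hashed in the same layout
        $data = $bmp.LockBits($rect,
                    [System.Drawing.Imaging.ImageLockMode]::ReadOnly,
                    [System.Drawing.Imaging.PixelFormat]::Format32bppArgb)
        $rowBytes = $bmp.Width * 4
        $bytes    = New-Object byte[] ($rowBytes * $bmp.Height)
        # Copy row by row so any stride padding is skipped
        for ($y = 0; $y -lt $bmp.Height; $y++) {
            $ptr = [IntPtr]::Add($data.Scan0, $y * $data.Stride)
            [System.Runtime.InteropServices.Marshal]::Copy($ptr, $bytes, $y * $rowBytes, $rowBytes)
        }
        [BitConverter]::ToString($sha.ComputeHash($bytes)) -replace '-', ''
    }
    finally {
        if ($data) { $bmp.UnlockBits($data) }
        if ($bmp)  { $bmp.Dispose() }
        $sha.Dispose()
    }
}

# Hypothetical helper: groups files by pixel hash and keeps only groups
# with more than one member, i.e. candidate duplicates.
function Find-DuplicateImage {
    param([Parameter(Mandatory=$true)][string[]]$LiteralPath)

    $LiteralPath |
        ForEach-Object {
            New-Object PSObject -Property @{ Path = $_; Hash = Get-PixelHash -LiteralPath $_ }
        } |
        Group-Object Hash |
        Where-Object { $_.Count -gt 1 }
}
```

A possible invocation, assuming the images sit in the current directory: Find-DuplicateImage (Get-ChildItem *.jpg | ForEach-Object { $_.FullName }). Each returned group's Group property lists the paths that share the same pixel data.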
Keith Hill answered Sep 28 '22 07:09