I have inherited a database that has a table (300gb) full of SQL datatype Image
. I understand this datatype is depreciated.
As a routine cleanup I want to delete all duplicate Image
's from the table where certain conditions are met.
How do I compare binary data using SQL efficiently? Is the = equality operator sufficient?
Here is a scenario:
Table 'Paperwork'
int ID
int EmployeeID
int AnotherID
int AnotherFKID
image Attachment
I want to find all rows where the Attachment
, EmployeeID
, AnotherID
and AnotherFKID
are the same. It needs to be done with minimal impact on the database as there are over 1,116,313 rows.
Edit
The SQL Server Image
data type does not support LIKE
or the usual comparison operators.
Edit
Thanks to @Martin who suggested the Image
be casted to varbinary. I have added to this to get the MD5 checksum using Hashbytes
HASHBYTES('MD5',CAST(cast([Attachment] as varbinary(max))as varbinary)) AS AttachmentMD5
Jeremy,
Any all in one script will kill the buffer cache when it reads in the 300g. Break the job down into several tasks.
Task 1
ID
and a grouping to show duplicates of the three int columnsTable example
TableID PaperWorkID GroupID
1 14 1
2 15 1
3 21 2
4 55 2
Now we know PaperWorkID
s 14 and 15 share the same the three int columns because they are in the same group.
Task 2
bigint
) to the table and populate the column with the DATALENGTH
of the Image column in table Paperwork
based on the PaperWorkID
in the tableGroupID
Task 3
varbinary(max)
to the table.PaperWorkID
in the tableGroupID
Task 4
PaperWork
tablePaperwork
based on the items remaining in the table.If the data for the image column was scanned from paper there is very little chance two scans will produce the exact same image. If the data was uploaded twice then you are in luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With