Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I effectively find duplicate blob rows in MySQL?

I have a table of the form

CREATE TABLE data
{
   pk INT PRIMARY KEY AUTO_INCREMENT,
   dt BLOB
};

It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.

Something like 3000 of the blobs are identical. So what I want is a query that will give me a re map table that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

SELECT a.pk, MIN(b.pk) 
    FROM data AS a 
    JOIN data AS b
  ON a.dt=b.dt
  WHERE b.pk < a.pk
  GROUP BY a.pk;

I happen to have, for other reasons, a table that has the sizes of the blobs:

CREATE TABLE sizes
(
   fk INT,  // note: non-unique
   sz INT
   // other cols
);

By building indexes for both fk and another for sz the direct query from that takes about 24 sec with 50k rows:

SELECT da.pk,MIN(db.pk) 
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
  ON
        sa.size=sb.size
    AND da.pk=sa.fk
    AND db.pk=sb.fk
  WHERE
        sb.fk<sa.fk
    AND da.dt=db.dt 
  GROUP BY da.pk;

However that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low I'd think that an index scan would be better. With that in mind in added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.

OK so for the question: Am I going to get much better than the second select? If so, how?

A bit of a corollary is: if I have a table where the key column's get very heavy use but the rest should only get rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?


Xgc on #[email protected] points out that the adding a utility table like sizes but with a unique constraint on fk might help a lot. Some fun with triggers and what not might make it even not to bad to keep up to date.

like image 226
BCS Avatar asked Nov 26 '08 05:11

BCS


1 Answers

You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.

The question is if you can save the hashes in your database?

like image 151
Baget Avatar answered Sep 21 '22 17:09

Baget