I have a table of the form
CREATE TABLE data
{
pk INT PRIMARY KEY AUTO_INCREMENT,
dt BLOB
};
It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.
Something like 3000 of the blobs are identical. So what I want is a query that will give me a re map table that will allow me to remove the duplicates.
The naive approach took about an hour on 30-40k rows:
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
ON a.dt=b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;
I happen to have, for other reasons, a table that has the sizes of the blobs:
CREATE TABLE sizes
(
fk INT, // note: non-unique
sz INT
// other cols
);
By building indexes for both fk and another for sz the direct query from that takes about 24 sec with 50k rows:
SELECT da.pk,MIN(db.pk)
FROM data AS da
JOIN data AS db
JOIN sizes AS sa
JOIN sizes AS sb
ON
sa.size=sb.size
AND da.pk=sa.fk
AND db.pk=sb.fk
WHERE
sb.fk<sa.fk
AND da.dt=db.dt
GROUP BY da.pk;
However that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low I'd think that an index scan would be better. With that in mind in added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.
OK so for the question: Am I going to get much better than the second select? If so, how?
A bit of a corollary is: if I have a table where the key column's get very heavy use but the rest should only get rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?
Xgc on #[email protected] points out that the adding a utility table like sizes but with a unique constraint on fk
might help a lot. Some fun with triggers and what not might make it even not to bad to keep up to date.
You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.
The question is if you can save the hashes in your database?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With