I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.
create table test
(
id int not null,
feature double not null,
count int not null
);
The task is to find the 500 most similar items to a given reference item. Similarity is measured in number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.
select
t1.id,
t2.id,
sum( least( t1.count, t2.count )) as priority
from test as t1
inner join test as t2
on t2.feature = t1.feature
where t1.id = {some user supplied id value}
group by t1.id, t2.id
order by priority desc
limit 500;
Any ideas on how to improve on this? The schema can be modified and indices added as needed.
With the current schema, this query hardly can be improved.
You already have an index on feature
and this is the best you can do with the current schema design.
The problem is more similar than is not a relationship of order. If a
is more similar to b
than it is to c
, it does not imply that c
is less similar to a
than it is to b
. Hence, you cannot build a single index describing this relationship, and need to do it for each item separately, which would make your index N^2
entries long, where N
is the number of items.
If you always need only top 500
items, you can limit your index to that figure (in which case it will hold 500 * N
entries).
MySQL
does not support indexed or materialized views, so you will have to do it yourself:
Create a table like this:
CREATE TABLE similarity
(
id1 INT NOT NULL,
id2 INT NOT NULL,
similarity DOUBLE NOT NULL,
PRIMARY KEY (id1, id2),
KEY (id1, similarity)
)
Whenever you insert a new feature into the table, reflect the changes in the similarity
:
INSERT
INTO similarity
SELECT @newid, id,
LEAST(@newcount, count) AS ns
FROM test
WHERE feature = @newfeature
AND id <> @newid
ON DUPLICATE KEY UPDATE
SET similarity = similarity + ns;
INSERT
INTO similarity
SELECT @newid, id,
LEAST(@newcount, count) AS ns
FROM test
WHERE feature = @newfeature
AND id <> @newid
ON DUPLICATE KEY UPDATE
SET similarity = similarity + ns;
On a timely basis, remove the excess similarities:
DELETE s
FROM (
SELECT id1,
(
SELECT similarity
FROM similarity si
WHERE si.id1 = s.id1
ORDER BY
si.id1 DESC, si.similarity DESC
LIMIT 499, 1
) AS cs
FROM (
SELECT DISTINCT id1
FROM similarity
) s
) q
JOIN similarity s
ON s.id1 = q.id1
AND s.similarity < q.cs
Query your data:
SELECT id2
FROM similarity
WHERE id1 = @myid
ORDER BY
similarity DESC
LIMIT 500
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With