mySQL: is it possible to make this query any faster?

Question

I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.

create table test 
(
    id      int not null,
    feature double not null,
    count   int not null
);

The task is to find the 500 most similar items to a given reference item. Similarity is measured in number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.

select 
    t1.id,
    t2.id,
    sum( least( t1.count, t2.count )) as priority 
from test as t1
inner join test as t2 
     on t2.feature = t1.feature
where t1.id = {some user supplied id value} 
group by t1.id, t2.id 
order by priority desc
limit 500;

Any ideas on how to improve on this? The schema can be modified and indices added as needed.

Quassnoi · Accepted Answer

With the current schema, this query hardly can be improved.

You already have an index on feature and this is the best you can do with the current schema design.

The problem is more similar than is not a relationship of order. If a is more similar to b than it is to c, it does not imply that c is less similar to a than it is to b. Hence, you cannot build a single index describing this relationship, and need to do it for each item separately, which would make your index N^2 entries long, where N is the number of items.

If you always need only top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).

MySQL does not support indexed or materialized views, so you will have to do it yourself:

Create a table like this:

CREATE TABLE similarity
        (
        id1 INT NOT NULL,
        id2 INT NOT NULL,
        similarity DOUBLE NOT NULL,
        PRIMARY KEY (id1, id2),
        KEY (id1, similarity)
        )

Whenever you insert a new feature into the table, reflect the changes in the similarity:

INSERT
INTO    similarity
SELECT  @newid, id,
        LEAST(@newcount, count) AS ns
FROM    test
WHERE   feature = @newfeature
        AND id <> @newid
ON DUPLICATE KEY UPDATE
SET     similarity = similarity + ns;


INSERT
INTO    similarity
SELECT  @newid, id,
        LEAST(@newcount, count) AS ns
FROM    test
WHERE   feature = @newfeature
        AND id <> @newid
ON DUPLICATE KEY UPDATE
SET     similarity = similarity + ns;

On a timely basis, remove the excess similarities:

DELETE  s
FROM    (
        SELECT  id1,
                (
                SELECT  similarity
                FROM    similarity si
                WHERE   si.id1 = s.id1
                ORDER BY
                        si.id1 DESC, si.similarity DESC
                LIMIT 499, 1
                ) AS cs
        FROM    (
                SELECT  DISTINCT id1
                FROM    similarity
                ) s
        ) q
JOIN    similarity s
ON      s.id1 = q.id1
        AND s.similarity < q.cs

Query your data:

SELECT  id2
FROM    similarity
WHERE   id1 = @myid
ORDER BY
        similarity DESC
LIMIT 500

mySQL: is it possible to make this query any faster?

Tags:

sql

optimization

mysql

query-optimization

BuschnicK

1 Answers

Quassnoi

Recent Activity

Donate For Us

mySQL: is it possible to make this query any faster?

Tags:

sql

optimization

mysql

query-optimization

BuschnicK

1 Answers

Quassnoi

Related questions

Recent Activity

Donate For Us