Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mySQL: is it possible to make this query any faster?

I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.

create table test 
(
    id      int not null,
    feature double not null,
    count   int not null
);

The task is to find the 500 most similar items to a given reference item. Similarity is measured in number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.

select 
    t1.id,
    t2.id,
    sum( least( t1.count, t2.count )) as priority 
from test as t1
inner join test as t2 
     on t2.feature = t1.feature
where t1.id = {some user supplied id value} 
group by t1.id, t2.id 
order by priority desc
limit 500;

Any ideas on how to improve on this? The schema can be modified and indices added as needed.

like image 834
BuschnicK Avatar asked Nov 29 '10 18:11

BuschnicK


1 Answers

With the current schema, this query hardly can be improved.

You already have an index on feature and this is the best you can do with the current schema design.

The problem is more similar than is not a relationship of order. If a is more similar to b than it is to c, it does not imply that c is less similar to a than it is to b. Hence, you cannot build a single index describing this relationship, and need to do it for each item separately, which would make your index N^2 entries long, where N is the number of items.

If you always need only top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).

MySQL does not support indexed or materialized views, so you will have to do it yourself:

  1. Create a table like this:

    CREATE TABLE similarity
            (
            id1 INT NOT NULL,
            id2 INT NOT NULL,
            similarity DOUBLE NOT NULL,
            PRIMARY KEY (id1, id2),
            KEY (id1, similarity)
            )
    
  2. Whenever you insert a new feature into the table, reflect the changes in the similarity:

    INSERT
    INTO    similarity
    SELECT  @newid, id,
            LEAST(@newcount, count) AS ns
    FROM    test
    WHERE   feature = @newfeature
            AND id <> @newid
    ON DUPLICATE KEY UPDATE
    SET     similarity = similarity + ns;
    
    
    INSERT
    INTO    similarity
    SELECT  @newid, id,
            LEAST(@newcount, count) AS ns
    FROM    test
    WHERE   feature = @newfeature
            AND id <> @newid
    ON DUPLICATE KEY UPDATE
    SET     similarity = similarity + ns;
    
  3. On a timely basis, remove the excess similarities:

    DELETE  s
    FROM    (
            SELECT  id1,
                    (
                    SELECT  similarity
                    FROM    similarity si
                    WHERE   si.id1 = s.id1
                    ORDER BY
                            si.id1 DESC, si.similarity DESC
                    LIMIT 499, 1
                    ) AS cs
            FROM    (
                    SELECT  DISTINCT id1
                    FROM    similarity
                    ) s
            ) q
    JOIN    similarity s
    ON      s.id1 = q.id1
            AND s.similarity < q.cs
    
  4. Query your data:

    SELECT  id2
    FROM    similarity
    WHERE   id1 = @myid
    ORDER BY
            similarity DESC
    LIMIT 500
    
like image 138
Quassnoi Avatar answered Sep 28 '22 05:09

Quassnoi