Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing a Temporary B Tree Sort from a SQLite Query

I have a pretty basic implementation of an Image uploading service, where you can upload images and tag them. This is my schema:

CREATE TABLE Tag(
    orm_id INTEGER PRIMARY KEY AUTOINCREMENT, 
    pid_high UNSIGNED BIG INT NOT NULL, 
    pid_low UNSIGNED BIG INT NOT NULL, 
    name STRING NOT NULL, 
    CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);

CREATE TABLE TagBridge(
    orm_id INTEGER PRIMARY KEY AUTOINCREMENT, 
    pid_high UNSIGNED BIG INT NOT NULL, 
    pid_low UNSIGNED BIG INT NOT NULL, 
    image_id_high UNSIGNED BIG INT NOT NULL, 
    image_id_low UNSIGNED BIG INT NOT NULL, 
    tag_id_high UNSIGNED BIG INT NOT NULL, 
    tag_id_low UNSIGNED BIG INT NOT NULL, 
    CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);

CREATE TABLE Image(
    orm_id INTEGER PRIMARY KEY AUTOINCREMENT, 
    pid_high UNSIGNED BIG INT NOT NULL,
    pid_low UNSIGNED BIG INT NOT NULL, 
    filehash STRING NOT NULL, 
    mime STRING NOT NULL, 
    uploadedDate INTEGER NOT NULL, 
    ratingsAverage REAL, 
    CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);

And Indices

CREATE INDEX ImageTest on Image(pid_high, pid_low, uploadedDate DESC);
CREATE INDEX ImagefilehashIndex ON Image (filehash);
CREATE INDEX ImageuploadedDateIndex ON Image (uploadedDate);
CREATE INDEX TagnameIndex ON Tag (name);

The reason that there are pid_high/pid_low fields instead of your standard primary key is because this service uses client-authoritative 128-bit GUIDs, but this does not impact the query speed significantly.

Since this is the internet, the vast majority of the images on this service are cats, and are tagged with 'cat'. In fact, about 47,000 out of 50,000 images are tagged with 'cat'. The query to get all images that is tagged with 'cat' is

select i.* from Tag t, TagBridge b, Image i 
where 
    b.tag_id_high = t.pid_high AND b.tag_id_low = t.pid_low 
AND b.image_id_high = i.pid_high and b.image_id_low = i.pid_low 
AND t.name ='cat' 
order by uploadedDate DESC LIMIT 20;

The query plan for this is

sele  order          from  deta
----  -------------  ----  ----
0     0              0     SEARCH TABLE Tag AS t USING INDEX TagnameIndex (name=?) (~1 rows)
0     1              1     SCAN TABLE TagBridge AS b (~472 rows)
0     2              2     SEARCH TABLE Image AS i USING INDEX ImageTest (pid_high=? AND pid_low=?) (~1 rows)
0     0              0     USE TEMP B-TREE FOR ORDER BY

The main problem here is the last row, USE TEMP B-TREE FOR ORDER BY. This slows down the query significantly. Without the 'order by' clause, the entire query takes about 0.001s to run. With the order by clause, the query takes 0.483s, which is a 400x performance penalty.

I would like to get this query under 0.1 seconds, but I am not sure how. I have tried many other queries, and adding and removing indices, but this is the fastest that I have been able to run.

like image 768
rraawwrr Avatar asked Mar 16 '12 22:03

rraawwrr


1 Answers

This is a general problem of choosing between filtering and ordering index:

  • Choosing index

You should keep a list of popular tags (for which the ordering index is more beneficial) and somehow forbid the filtering index if the tag is popular, say, like this:

SELECT  i.*
FROM    Tag t, TagBridge b, Image i 
WHERE   b.tag_id_high = t.pid_high AND b.tag_id_low = t.pid_low 
        AND b.image_id_high = i.pid_high AND b.image_id_low = i.pid_low 
        AND t.name || '' = 'cat' 
ORDER BY
        i.uploadedDate DESC
LIMIT 20

Alternatively, you could denormalize your schema and add uploadedDate to TagBridge, filling it with a trigger or whatever. Then create a composite index on TagBridge (pid_high, pid_low, uploadedDate, image_id_high, image_id_low) and rewrite the query a little:

SELECT  i.*
FROM    TagBridge b, Image i
WHERE   b.tag_id_high =
        (
        SELECT  t.pid_high
        FROM    Tag t
        WHERE   t.name = 'cat'
        )
        AND b.tag_id_low =
        (
        SELECT  t.pid_low
        FROM    Tag t
        WHERE   t.name = 'cat'
        )
        AND i.pid_high = b.image_id_high
        AND i.pid_low = b.image_id_low
ORDER BY
        b.uploadedDate DESC
LIMIT 20;

The double subquery is because SQLite does not understand tuple syntax.

like image 185
Quassnoi Avatar answered Sep 18 '22 14:09

Quassnoi