I have a pretty basic implementation of an Image uploading service, where you can upload images and tag them. This is my schema:
CREATE TABLE Tag(
orm_id INTEGER PRIMARY KEY AUTOINCREMENT,
pid_high UNSIGNED BIG INT NOT NULL,
pid_low UNSIGNED BIG INT NOT NULL,
name STRING NOT NULL,
CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);
CREATE TABLE TagBridge(
orm_id INTEGER PRIMARY KEY AUTOINCREMENT,
pid_high UNSIGNED BIG INT NOT NULL,
pid_low UNSIGNED BIG INT NOT NULL,
image_id_high UNSIGNED BIG INT NOT NULL,
image_id_low UNSIGNED BIG INT NOT NULL,
tag_id_high UNSIGNED BIG INT NOT NULL,
tag_id_low UNSIGNED BIG INT NOT NULL,
CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);
CREATE TABLE Image(
orm_id INTEGER PRIMARY KEY AUTOINCREMENT,
pid_high UNSIGNED BIG INT NOT NULL,
pid_low UNSIGNED BIG INT NOT NULL,
filehash STRING NOT NULL,
mime STRING NOT NULL,
uploadedDate INTEGER NOT NULL,
ratingsAverage REAL,
CONSTRAINT KeyConstraint UNIQUE (pid_high, pid_low) ON CONFLICT FAIL);
And Indices
CREATE INDEX ImageTest on Image(pid_high, pid_low, uploadedDate DESC);
CREATE INDEX ImagefilehashIndex ON Image (filehash);
CREATE INDEX ImageuploadedDateIndex ON Image (uploadedDate);
CREATE INDEX TagnameIndex ON Tag (name);
The reason that there are pid_high/pid_low fields instead of your standard primary key is because this service uses client-authoritative 128-bit GUIDs, but this does not impact the query speed significantly.
Since this is the internet, the vast majority of the images on this service are cats, and are tagged with 'cat'. In fact, about 47,000 out of 50,000 images are tagged with 'cat'. The query to get all images that is tagged with 'cat' is
select i.* from Tag t, TagBridge b, Image i
where
b.tag_id_high = t.pid_high AND b.tag_id_low = t.pid_low
AND b.image_id_high = i.pid_high and b.image_id_low = i.pid_low
AND t.name ='cat'
order by uploadedDate DESC LIMIT 20;
The query plan for this is
sele order from deta
---- ------------- ---- ----
0 0 0 SEARCH TABLE Tag AS t USING INDEX TagnameIndex (name=?) (~1 rows)
0 1 1 SCAN TABLE TagBridge AS b (~472 rows)
0 2 2 SEARCH TABLE Image AS i USING INDEX ImageTest (pid_high=? AND pid_low=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
The main problem here is the last row, USE TEMP B-TREE FOR ORDER BY. This slows down the query significantly. Without the 'order by' clause, the entire query takes about 0.001s to run. With the order by clause, the query takes 0.483s, which is a 400x performance penalty.
I would like to get this query under 0.1 seconds, but I am not sure how. I have tried many other queries, and adding and removing indices, but this is the fastest that I have been able to run.
This is a general problem of choosing between filtering and ordering index:
You should keep a list of popular tags (for which the ordering index is more beneficial) and somehow forbid the filtering index if the tag is popular, say, like this:
SELECT i.*
FROM Tag t, TagBridge b, Image i
WHERE b.tag_id_high = t.pid_high AND b.tag_id_low = t.pid_low
AND b.image_id_high = i.pid_high AND b.image_id_low = i.pid_low
AND t.name || '' = 'cat'
ORDER BY
i.uploadedDate DESC
LIMIT 20
Alternatively, you could denormalize your schema and add uploadedDate
to TagBridge
, filling it with a trigger or whatever. Then create a composite index on TagBridge (pid_high, pid_low, uploadedDate, image_id_high, image_id_low)
and rewrite the query a little:
SELECT i.*
FROM TagBridge b, Image i
WHERE b.tag_id_high =
(
SELECT t.pid_high
FROM Tag t
WHERE t.name = 'cat'
)
AND b.tag_id_low =
(
SELECT t.pid_low
FROM Tag t
WHERE t.name = 'cat'
)
AND i.pid_high = b.image_id_high
AND i.pid_low = b.image_id_low
ORDER BY
b.uploadedDate DESC
LIMIT 20;
The double subquery is because SQLite
does not understand tuple syntax.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With