So essentially I have two tables, containing URLS and TAGS, with a has-and-belongs-to-many relationship between the two via a joins tables TAGS_URLS.
A simple query to find URL's by tags would be:
SELECT urls.id FROM urls
INNER JOIN tags_urls ON urls.id=tags_urls.url_id
INNER JOIN tags ON tags_urls.tag_id=tags.id
WHERE tags.tag IN ("sample","tag","list");
However, I'm trying to recover an intersection of all URL's that contain all of a set of tags. I.e., only URL's that contain the tag "sample" AND "tag" AND "list".
I have a working query, but I cannot get the query to execute in less than 30 seconds.
SELECT a.id
FROM
(SELECT DISTINCT urls.id FROM urls
INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
WHERE tags.tag = 'sample') a
JOIN
(SELECT DISTINCT urls.id FROM urls
INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
WHERE tags.tag = 'list') b
ON a.id = b.id;
The result set is correct, but the performance is horrific.
I do also currently have the data duplicated in a Redis database as a list of URL id's stored in tag sets so I can do something like this and get a result set VERY quickly.
SINTER "tag-sample" "tag-list"
Would it be possible, with reasonable effort, to bring the MySQL performance for this task up to the Redis levels with SINTER?
I am not 100% sure, but I think the underlying engine is creating a temp table for each of your subselects. Depending on the size of your data, this can be quite costly. If they are big (and they are in your case) temp tables have to write their contents out to disk because they are too big to hold in memory at once. So basically your query is copying huge amounts of data as it tries to build out two temporary tables that match the selection criteria for your two subselects. Once this is done, it finally executes the outer select and this most likely rather fast.
I would try to factor the subselects out for inner joins. I think the following will give you what you are looking for:
select urls.id from urls
inner join tags_urls tu1 on tu1.url_id = urls.id
inner join tags t1 on tu1.tag_id = t1.id and t1.tag = 'sample'
inner join tag_urls tu2 on tu2.url_id = urls.id
inner join tags t2 on t2.id = tu2.tag_id and t2.tag = 'list'
You would continue to add pairs of inner joins to tag_urls and tags for each 'tag' you wanted to intersect with. Again, run this through explain and make sure everything has the right index.
DBMS's can do pretty well with a several inner joins but as you increase the number of intersections, your performance will decrease.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With