Using the database schema for tagging from this question's accepted answer, is it possible to write a query using GROUP_CONCAT that performs well with a large amount of data? I need to get items with their tags for all items tagged with tag x. A query with GROUP_CONCAT over ~0.5 million tags is very slow at > 15 seconds; without GROUP_CONCAT (items without their tags) it takes ~0.05 seconds.
As a side question, how does SO solve this problem?
This is probably a case of a poor indexing strategy. Adapting the schema shown in the accepted answer of the question to which you linked:
CREATE TABLE Items (
    Item_ID    SERIAL,
    Item_Title VARCHAR(255),
    Content    TEXT
) ENGINE=InnoDB;

CREATE TABLE Tags (
    Tag_ID    SERIAL,
    Tag_Title VARCHAR(255)
) ENGINE=InnoDB;

CREATE TABLE Items_Tags (
    Item_ID BIGINT UNSIGNED REFERENCES Items (Item_ID),
    Tag_ID  BIGINT UNSIGNED REFERENCES Tags (Tag_ID),
    PRIMARY KEY (Item_ID, Tag_ID)
) ENGINE=InnoDB;
Note that:

- MySQL's SERIAL data type is an alias for BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE and, as such, is indexed;
- defining the foreign key constraints in Items_Tags creates indexes on the foreign key columns.
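For reference, the kind of query the question describes — items with a concatenated tag list, restricted to items carrying tag x — might look like the following sketch against the schema above (the tag title 'x' is a placeholder):

```sql
SELECT i.Item_ID,
       i.Item_Title,
       GROUP_CONCAT(t.Tag_Title) AS Tags        -- concatenate all tags per item
FROM   Items i
JOIN   Items_Tags it ON it.Item_ID = i.Item_ID
JOIN   Tags t        ON t.Tag_ID   = it.Tag_ID
WHERE  i.Item_ID IN (SELECT Item_ID
                     FROM   Items_Tags
                     WHERE  Tag_ID = (SELECT Tag_ID FROM Tags WHERE Tag_Title = 'x'))
GROUP  BY i.Item_ID, i.Item_Title;
```

The cost here is not the tag-x filter but the GROUP_CONCAT aggregation, which must touch every tag row of every matching item.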
I would propose a hybrid between normalized and denormalized data. Building on the normalized structure provided by eggyal, I would add the following denormalized table:
CREATE TABLE Items_Tags_Denormalized (
    Item_ID BIGINT UNSIGNED REFERENCES Items (Item_ID),
    Tags    BLOB,
    PRIMARY KEY (Item_ID)
) ENGINE=InnoDB;
The Tags column would hold all the tags (Tag_Title) for the corresponding Item_ID.
Now you have two ways to populate this table:

1. Create a cron job that runs periodically and rebuilds Items_Tags_Denormalized using GROUP_CONCAT or whatever suits you (advantage: no additional load when you insert into or delete from the Items_Tags table; disadvantage: the denormalized table will not always be up to date, depending on how often the cron runs).
2. Create insert and delete triggers on the Items_Tags table to keep Items_Tags_Denormalized up to date (advantage: the denormalized table is always up to date; disadvantage: additional load on every insert or delete in Items_Tags).
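Both options can be sketched as follows, assuming the tables above. The trigger name and the use of REPLACE are illustrative, not from the original answer; note that a MySQL trigger may not modify the table it fires on, but writing to a different table, as here, is fine. A matching AFTER DELETE trigger would do the same using OLD.Item_ID.

```sql
-- Option 1: periodic full rebuild, run from cron
REPLACE INTO Items_Tags_Denormalized (Item_ID, Tags)
SELECT it.Item_ID, GROUP_CONCAT(t.Tag_Title)
FROM   Items_Tags it
JOIN   Tags t ON t.Tag_ID = it.Tag_ID
GROUP  BY it.Item_ID;

-- Option 2: refresh only the affected item's row from a trigger
DELIMITER //
CREATE TRIGGER Items_Tags_ai AFTER INSERT ON Items_Tags
FOR EACH ROW
BEGIN
    REPLACE INTO Items_Tags_Denormalized (Item_ID, Tags)
    SELECT it.Item_ID, GROUP_CONCAT(t.Tag_Title)
    FROM   Items_Tags it
    JOIN   Tags t ON t.Tag_ID = it.Tag_ID
    WHERE  it.Item_ID = NEW.Item_ID
    GROUP  BY it.Item_ID;
END//
DELIMITER ;
```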
Choose whichever solution suits your needs best, considering the advantages and disadvantages. Either way, you end up with the Items_Tags_Denormalized table, which you only ever read from, with no additional operations at query time.
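Reading then reduces to primary-key lookups, with no join against Tags and no GROUP_CONCAT at query time. A sketch, where the tag id 123 is a placeholder for the tag you are filtering on:

```sql
SELECT d.Item_ID, i.Item_Title, d.Tags
FROM   Items_Tags_Denormalized d
JOIN   Items i ON i.Item_ID = d.Item_ID
WHERE  d.Item_ID IN (SELECT Item_ID FROM Items_Tags WHERE Tag_ID = 123);
```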
Why would you use GROUP_CONCAT for that? For a given tag x, you said that selecting the list of items is fast. For a given list of items, getting all their tags should be fast too. And isn't there normally some kind of restriction? Normal websites don't show 100,000 entries on one page.
I would suggest:
drop temporary table if exists lookup_item;

create temporary table lookup_item (item_id serial, primary key (item_id));

insert into lookup_item
select i.Item_ID as item_id
from Items i
where exists (select * from Items_Tags where Item_ID = i.Item_ID and Tag_ID = <tag_id>)
  and <other conditions or limits>;

select i.*, t.*
from lookup_item li
inner join Items i       on i.Item_ID  = li.item_id
inner join Items_Tags it on it.Item_ID = i.Item_ID
inner join Tags t        on t.Tag_ID   = it.Tag_ID
order by i.<priority>, t.<priority>;
priority could be last-modified for items and some kind of importance for tags.
Then you get every item with its tags. The only work left in application code is detecting when a result row starts the next item.