
Tagging query with group_concat

Using the database schema for tagging from this question's accepted answer, is it possible to have a query using GROUP_CONCAT that works with a large amount of data? I need to get items together with their tags, for all items tagged with tag x. With ~0.5 million tags, a query using GROUP_CONCAT is very slow at > 15 seconds; without GROUP_CONCAT (items only, no tags) it takes ~0.05 seconds.

As a side question, how does SO solve this problem?

asked May 18 '13 by el_pup_le

3 Answers

This is probably a case of a poor indexing strategy. Adapting the schema shown in the accepted answer of the question to which you linked:

CREATE Table Items (
  Item_ID    SERIAL,
  Item_Title VARCHAR(255),
  Content    TEXT
) ENGINE=InnoDB;

CREATE TABLE Tags (
  Tag_ID     SERIAL,
  Tag_Title  VARCHAR(255)
) ENGINE=InnoDB;

CREATE TABLE Items_Tags (
  Item_ID    BIGINT UNSIGNED,
  Tag_ID     BIGINT UNSIGNED,
  PRIMARY KEY (Item_ID, Tag_ID),
  FOREIGN KEY (Item_ID) REFERENCES Items (Item_ID),
  FOREIGN KEY ( Tag_ID) REFERENCES Tags  ( Tag_ID)
) ENGINE=InnoDB;

Note that:

  • MySQL's SERIAL data type is an alias for BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE and, as such, is indexed;

  • defining the foreign key constraints in Items_Tags creates indexes on the foreign key columns (note that MySQL only enforces constraints written as table-level FOREIGN KEY clauses; it silently ignores the inline column-level REFERENCES syntax).
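With those indexes in place, the query from the question can look like the following sketch. The tag id 123 is a placeholder for tag x, and the alphabetical ORDER BY inside GROUP_CONCAT is a choice of mine, not something from the original answer:

```sql
-- All items tagged with tag 123, each with its full tag list.
-- The (Item_ID, Tag_ID) primary key serves the EXISTS probe, and the
-- foreign-key indexes serve the joins, so no full scans are needed.
SELECT i.Item_ID,
       i.Item_Title,
       GROUP_CONCAT(t.Tag_Title ORDER BY t.Tag_Title SEPARATOR ',') AS Tag_List
FROM   Items i
JOIN   Items_Tags it ON it.Item_ID = i.Item_ID
JOIN   Tags t        ON t.Tag_ID   = it.Tag_ID
WHERE  EXISTS (SELECT 1 FROM Items_Tags x
               WHERE x.Item_ID = i.Item_ID
                 AND x.Tag_ID  = 123)          -- placeholder: id of tag x
GROUP  BY i.Item_ID, i.Item_Title;
```

The EXISTS subquery restricts the outer query to items carrying tag x before the tag lists are aggregated, which keeps the GROUP_CONCAT work proportional to the result set rather than to the whole table.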

answered Nov 03 '22 by eggyal

I would propose a hybrid of normalized and denormalized data.
Building on the normalized structure provided by eggyal, I would add the following denormalized table:

CREATE TABLE Items_Tags_Denormalized (
  Item_ID    BIGINT UNSIGNED,
  Tags       TEXT,
  PRIMARY KEY (Item_ID),
  FOREIGN KEY (Item_ID) REFERENCES Items (Item_ID)
) ENGINE=InnoDB;

The Tags column holds all the tags (Tag_Title values) for the corresponding Item_ID, concatenated.
There are two ways to keep it populated:

  • create a cron job that periodically rebuilds Items_Tags_Denormalized using GROUP_CONCAT or whatever suits you (advantage: no additional load on inserts or deletes in Items_Tags; disadvantage: the denormalized table will not always be up to date, depending on how often the cron runs)

  • create insert and delete triggers on Items_Tags that keep Items_Tags_Denormalized up to date (advantage: the denormalized table is always current; disadvantage: additional load on every insert or delete in Items_Tags)
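Both variants can be sketched as follows, assuming the tables defined above. The trigger names and the alphabetical ORDER BY are my own choices for illustration:

```sql
-- Cron approach: periodically rebuild the whole denormalized table.
REPLACE INTO Items_Tags_Denormalized (Item_ID, Tags)
SELECT it.Item_ID,
       GROUP_CONCAT(t.Tag_Title ORDER BY t.Tag_Title SEPARATOR ',')
FROM   Items_Tags it
JOIN   Tags t ON t.Tag_ID = it.Tag_ID
GROUP  BY it.Item_ID;

-- Trigger approach: refresh only the affected item's row on each change.
DELIMITER //

CREATE TRIGGER Items_Tags_after_insert AFTER INSERT ON Items_Tags
FOR EACH ROW
BEGIN
  REPLACE INTO Items_Tags_Denormalized (Item_ID, Tags)
  SELECT it.Item_ID,
         GROUP_CONCAT(t.Tag_Title ORDER BY t.Tag_Title SEPARATOR ',')
  FROM   Items_Tags it
  JOIN   Tags t ON t.Tag_ID = it.Tag_ID
  WHERE  it.Item_ID = NEW.Item_ID
  GROUP  BY it.Item_ID;
END//

CREATE TRIGGER Items_Tags_after_delete AFTER DELETE ON Items_Tags
FOR EACH ROW
BEGIN
  -- Drop the stale row, then re-insert it only if the item still has
  -- tags (the GROUP BY makes the SELECT return no rows otherwise).
  DELETE FROM Items_Tags_Denormalized WHERE Item_ID = OLD.Item_ID;
  INSERT INTO Items_Tags_Denormalized (Item_ID, Tags)
  SELECT it.Item_ID,
         GROUP_CONCAT(t.Tag_Title ORDER BY t.Tag_Title SEPARATOR ',')
  FROM   Items_Tags it
  JOIN   Tags t ON t.Tag_ID = it.Tag_ID
  WHERE  it.Item_ID = OLD.Item_ID
  GROUP  BY it.Item_ID;
END//

DELIMITER ;
```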

Choose whatever solution suits your needs best considering the advantages and disadvantages.

Either way, in the end you read only from Items_Tags_Denormalized, with no aggregation work at query time.
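The read path then becomes a plain indexed join, sketched below (123 is a placeholder for the id of tag x):

```sql
-- All items tagged with tag 123, each with its precomputed tag list.
-- No GROUP_CONCAT runs at query time; the lists were built in advance.
SELECT d.Item_ID, d.Tags
FROM   Items_Tags_Denormalized d
JOIN   Items_Tags it ON it.Item_ID = d.Item_ID
WHERE  it.Tag_ID = 123;  -- placeholder: id of tag x
```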

answered Nov 03 '22 by Stephan

Why would you use GROUP_CONCAT for that? For a given tag x you said that selecting the list of items is fast. For a given list of items, getting all the tags should be fast too. And isn't there normally some kind of restriction? Normal websites don't show 100,000 entries on one page.

I would suggest:

drop temporary table if exists lookup_item;

create temporary table lookup_item (item_id bigint unsigned, primary key (item_id));

insert into lookup_item
select i.Item_ID
from Items i
where exists (select 1 from Items_Tags where Item_ID = i.Item_ID and Tag_ID = <tag_id>)
and <other conditions or limits>;

select i.*, t.Tag_Title
from lookup_item li
inner join Items i       on i.Item_ID  = li.item_id
inner join Items_Tags it on it.Item_ID = li.item_id
inner join Tags t        on t.Tag_ID   = it.Tag_ID
order by i.<priority>, t.<priority>;

<priority> could be last-modified for items and some kind of importance for tags.

This way you get every item together with its tags, one row per (item, tag) pair. The only work left in application code is detecting when a result row starts a new item.

answered Nov 03 '22 by flaschenpost