High-performance multi-tier tag filtering

Tags:

I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are several thousand (or even hundred thousand) tags applicable to each item type.

I am trying to accomplish two tasks, and I'm having a very hard time getting the queries to perform acceptably.

Task 1) Get all tracks that have any given tags (if provided) by artists that have any given tags (if provided) on albums with any given tags (if provided). Any set of tags may not be present (i.e. only a track tag is active, no artist or album tags)

Variation: The results are also presentable by artist or by album rather than by track

Task 2) Get a list of tags that are applied to the results from the previous filter, along with a count of how many tracks have each given tag.

What I am after is some general guidance in approach. I have tried temp tables, inner joins, IN(), all my efforts thus far result in slow responses. A good example of the results I am after can be seen here: http://www.yachtworld.com/core/listing/advancedSearch.jsp, except they only have one tier of tags, I am dealing with three.

Table structures:

Table: attribute_tag_groups
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 name       | character varying(255)      | 
 type       | enum (track, album, artist) | 

Table: attribute_tags
   Column                       |          Type               |   
--------------------------------+-----------------------------+
 id                             | integer                     |
 attribute_tag_group_id         | integer                     |
 name                           | character varying(255)      | 

Table: track_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 track_id   | integer                     |
 tag_id     | integer                     | 

Table: artist_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 artist_id  | integer                     |
 tag_id     | integer                     | 

Table: album_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 album_id   | integer                     |
 tag_id     | integer                     | 

Table: artists
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 name       | varchar(350)                | 

Table: albums
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 artist_id  | integer                     | 
 name       | varchar(300)                | 

Table: tracks
   Column    |          Type               |   
-------------+-----------------------------+
 id          | integer                     |
 artist_id   | integer                     | 
 album_id    | integer                     | 
 compilation | boolean                     | 
 name        | varchar(300)                |

EDIT I am using PHP, and I am not opposed to doing any sorting or other hijinx in script, my #1 concern is speed of return.

532

asked Aug 05 '11 18:08

Chris Baker

4 Answers

If you want speed, I would suggest you look into Solr/Lucene. You can store your data, and have very speedy lookups by calling Solr and parsing the result from PHP. And as an added benefit you get faceted searches as well (which is task 2 of your question if I interpret it correctly). The downside is of course that you might have redundant information (once stored in DB, once in the Solr document store). And it does take a while to setup (well, you could learn a lot from Drupal Solr integration).

Just check out the PHP reference docs for Solr.

Here's on article on how to use Solr with PHP, just in case : http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/.

189

answered Oct 14 '22 08:10

wimvds

You probably should try to denormalize your data. Your structure is optimised for insert/update load, but not for queries. As I got it, your will have much more select queries than insert/update queries.

For example you can do something like this:

store your data in normalized structure.

create agregate table like this

  track_id, artist_tags, album_tags, track_tags
   1 , jazz/pop/,  jazz/rock, /heavy-metal/  

    or 

    track_id, artist_tags, album_tags, track_tags
    1 , 1/2/,  1/3, 4/

to spead up search you probably should create FULLTEXT index on *_tags columns

query this table with sql like

select * from aggregate where album_tags  MATCH (track_tags) AGAINST ('rock')

rebuild this table incrementally once a day.

answered Oct 14 '22 10:10

Andrey Frolov

I think the answer greately depends on how much money you wish to spend on your project - there are some tasks that are even theoretically impossible to accomplish given strict conditions(for example that you must use only one weak server). I will assume that you are ready to upgrade your system.

First of all - your table structure forces JOIN's - I think you should avoid them if possible when writing high performace applications. I don't know "attribute_tag_groups" is, so I propose a table structure: tag(varchar 255), id(int), id_type(enum (track, album, artist)). Id can be artist_id,track_id or album_id depending on id_type. This way you will be able too lokup all your data in one table, but of cource it will use much more memory.

Next - you should consider using several databases. It will help even more if each database contains only part of your data(each lookup will be faster). Deciding how to spread your data between databases is usually rather hard task: I suggest you make some statistics about tag length, find ranges of length that will get similar trac/artists results count and hard-code it into your lookup code.

Of cource you should consider MySql tuning(I am sure you did that, but just in case) - all your tables should reside in RAM - if that is impossible try to get SSD discs, raids etc.. Proper indexing and database types/settings are really important too (MySql may even show some bottlenecks in internal statistics).

This suggestion may sound mad - but sometimes it is good to let PHP do some calculations that MySql can do itself. MySql databases are much harder to scale, while a server for PHP processing can be added in in the matter of minutes. And different PHP threads can run on different CPU cores - MySql have problems with it. You can increase your PHP performace by using some advanced modules(you can even write them yourself - profile your PHP scripts and hard code bottlenecks in fast C code).

Last but I think the most important - you must use some type of caching. I know that it is really hard, but I don't think that there was any big project without a really good caching system. In your case some tags will surely be much more popular then others, so it should greately increase performance. Caching is a form of art - depending on how much time you can spend on it and how much resources are avaliable you can make 99% of all requests use cache.

Using other databases/indexing tools may help you, but you should always consider theoretical query speed comparison(O(n), O(nlog(n))...) to understand if they can really help you - using this tools sometimes give you low performance gain(like constant 20%), but they may complicate your application design and most of the time it is not worth it.

answered Oct 14 '22 08:10

XzKto

From my experience most 'slow' MySQL database doesn't have correct index and/or queries. So I would check these first:

Make sure all data talbes' id fields is primary index. Just in case.
For all data tables, create an index on the external id fields and then the id, so that MySQL can use it in search.
For your glue tables, setting a primary key on the two fields, first the subject, then the tag. This is for normal browsing. Then create a normal index on the tag id. This is for searching.
Still slow? Are you using MyISAM for your tables? It is designed for quick queries.
If still slow, run an EXPLAIN on a slow query and post both the query and result in the question. Preferably with an importable sql dump of your complete database structure.

answered Oct 14 '22 09:10

Sheepy

Related questions
                            
                                handling large arrays with array_diff
                            
                                Question about var_dump output
                            
                                Debug VM with xDebug and PhpStorm
                            
                                Track memory usage of a method
                            
                                Get Browser Tab Index/Id
                            
                                PHP Login framework to include google/facebook/openid etc
                            
                                Bulk Parameterized Inserts
                            
                                Blowfish salt length for the Crypt() function?
                            
                                Killing processes opened with popen()?
                            
                                calculate an unique id from an array/class
                            
                                soapClient use SoapFault or Exception or both to catch Error?
                            
                                PHP locale configuration in php.ini
                            
                                what happens with old CMS/blog websites?
                            
                                MySQLi Prepared Statements and Transactions
                            
                                TCPDF - Pagenumbers not exactly right aligned
                            
                                Deploying / Continuous integration of a Symfony 2 application with Jenkins/Hudson
                            
                                how friend suggestions or 2nd degree related (linkedin) algorithm works
                            
                                Tutorial for PHP Amazon Simple Storage Service [closed]
                            
                                Suppress PHP warnings for expected Oracle exceptions
                            
                                Is there jsoup-like HTML parser for PHP?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

High-performance multi-tier tag filtering

Tags:

performance

sql

join

php

large-data