Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP, MySQL, Efficient tag-driven search algorithm

I'm currenlty building a webshop. This shop allows users to filter products by category, and a couple optional, additional filters such as brand, color, etc.

At the moment, various properties are stored in different places, but I'd like to switch to a tag-based system. Ideally, my database should store tags with the following data:

  • product_id
  • tag_url_alias (unique)
  • tag_type (unique) (category, product_brand, product_color, etc.)
  • tag_value (not unique)

First objective

I would like to search for product_id's that are associated with anywhere between 1-5 particular tags. The tags are extracted from a SEO-friendly url. So I will be retrieving a unique strings (the tag_url_alias) for each tag, but I won't know the tag_type. The search will be an intersection, so my search should return the product_id's that match all of the provided tags.

Second objective

Besides displaying the products that match the current filter, I would also like to display the product-count for other categories and filters which the user might supply.

For instance, my current search is for products that match the tags:

Shoe + Black + Adidas

Now, a visitor of the shop might be looking at the resulting products and wonder which black shoes other brands have to offer. So they might go to the "brand" filter, and choose any of the other listed brands. Lets say they have 2 different options (in practice, this will probably have many more), resulting in the following searches:

Shoe + Black + Nike > 103 results
Shoe + Black + K-swiss > 0 results

In this case, if they see the brand "K-swiss" listed as an available choise in their filter, their search will return 0 results.

This is obviously rather disappointing to the user... I'd much rather know that switching the "brand" from "adidas" to "k-swiss" will 0 results, and simply remove the entire option from the filter.

Same thing goes for categories, colors, etc.

In practice this would mean a single page view would not only return the filtered product list described in my primary objective, but potentially hundreds of similar yet different lists. One for each filter value that could replace another filter value, or be added to the existing filter values.

Capacity

I suspect my database will eventually contain:

between 250 and 1.000 unique tags

And it will contain:

between 10.000 and 100.000 unique products

Current Ideas

I did some Google searches and found the following article: http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html

Judging by that article, running hundreds of queries to achieve the 2nd objective, is going to be a painfully slow route. The "toxy" example might work for my needs and it might be acceptable for my First objective, but it would be unacceptably slow for the Second objective.

I was thinking I might run individual queries that match 1 tag to it's associated product_id's, cache those queries, and then calculate intersections on the results. But, do I calculate these intersections in MySQL? or in PHP? If I use MySQL, is there a particular way I should cache these individual queries, or is supplying the right indexes all I need?

I would imagine it's also quite possible to maybe even cache the intersections between two of these tag/product_id sets. The amount of intersections would be limited by the fact that a tag_type can have only one particular value, but I'm not sure how to efficiently manage this type of caching. Again, I don't know if I should do this in MySQL or in PHP. And if I do this in MySQL, what would be the best way to store and combine this type of cached results?

like image 838
Ruben Vreeken Avatar asked Oct 15 '12 14:10

Ruben Vreeken


1 Answers

Using sphinx search engine can make this magic for you. Its is VERY fast, and even can handle wordforms, what can be useful with SEO requests.

In terms of sphinx, make a document - "product", index by tags, choose proper ranker for query (ex, MATCH_ALL_WORDS) and run batch request with different tag combinations to get best results. Dont forget to use cachers like memcahed or any other.

like image 125
Alexey Petushkov Avatar answered Sep 25 '22 01:09

Alexey Petushkov