Automatic product classification and query weighting

Question

I'm facing ranking problems using Solr and I'm stuck.

On an e-commerce site, the query "ipad" returns:

ipad case for ipad 2
ipad case
ipad connection kit
ipad 32gb wifi

This is a problem: we want the main products (the products themselves) ranked first, but TF/IDF ranks the accessories first because of descriptions like "ipad case compatible with ipad, ipad2, ipad3, ipad retina, ipad mini, etc".

Furthermore, the categories give us no way of telling whether an item is an accessory or a product.

I wonder whether automatic classification would help. Any other approach that improves this ranking (such as Named Entity Recognition) would also be appreciated.

Thomas Jungblut · Accepted Answer

Could you provide tagged data?

If you have >50k items, a Naive Bayes classifier with a bigram language model trained on the product names will catch almost all accessories, with 99% accuracy. I guess you can train such a Naive Bayes model with Mahout; however, product names contain a fairly limited number of bigrams, so this can be trained easily and quickly even on a smartphone nowadays.

This is a typical Mechanical Turk task; it shouldn't be that expensive to tag a few items. However, if you insist on a semi-supervised algorithm, I found iterative similarity aggregation pretty useful.
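
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes over word bigrams of the product name, written in plain Python rather than Mahout. The class name `BigramNaiveBayes`, the helper `bigrams`, the labels and the tiny training set are all made up for illustration; a real model would be trained on the tagged data discussed above.

```python
import math
from collections import Counter, defaultdict


def bigrams(name):
    """Return word bigrams of a product name, with start/end markers."""
    tokens = ["<s>"] + name.lower().split() + ["</s>"]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


class BigramNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing constant
        self.feature_counts = defaultdict(Counter)  # label -> bigram counts
        self.label_counts = Counter()               # label -> number of names
        self.vocab = set()                          # all bigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.label_counts[label] += 1
            for feat in bigrams(name):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total)        # log prior
            for feat in bigrams(name):              # log likelihood per bigram
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-tagged examples; real training data would be far larger.
TRAIN = [
    ("ipad case for ipad 2", "accessory"),
    ("ipad case", "accessory"),
    ("ipad connection kit", "accessory"),
    ("leather case for iphone", "accessory"),
    ("ipad 32gb wifi", "product"),
    ("ipad 16gb wifi", "product"),
    ("iphone 64gb black", "product"),
    ("ipad mini 32gb", "product"),
]

clf = BigramNaiveBayes().fit([n for n, _ in TRAIN], [l for _, l in TRAIN])
print(clf.predict("ipad case for iphone"))
print(clf.predict("ipad 64gb wifi"))
```

The predicted label could then be stored as an extra field on each document at index time and used to boost or filter results.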

The main idea is that you give it a few seed tokens like "case"/"power adapter", and it iteratively finds new tokens that are indicators of spam (here, accessories) because they appear in the same context.
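
The following is only a rough sketch of that seed-expansion idea, not the algorithm from the paper: it describes each token by its left and right neighbours in the product titles, and repeatedly adds tokens whose average Jaccard similarity to the current set is above a threshold. The function names, the threshold of 0.5 and the sample titles are assumptions chosen for illustration.

```python
from collections import defaultdict


def context_sets(titles):
    """Map each token to the set of (side, neighbour) pairs it occurs next to."""
    contexts = defaultdict(set)
    for title in titles:
        tokens = ["<s>"] + title.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add(("L", tokens[i - 1]))
            contexts[tokens[i]].add(("R", tokens[i + 1]))
    return contexts


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_seeds(titles, seeds, threshold=0.5, max_rounds=10):
    """Grow the seed set with tokens that share contexts with its members."""
    contexts = context_sets(titles)
    members = {s for s in seeds if s in contexts}
    if not members:
        return set()
    for _ in range(max_rounds):
        added = set()
        for token, ctx in contexts.items():
            if token in members:
                continue
            # Aggregate similarity: mean over all current members.
            score = sum(jaccard(ctx, contexts[m]) for m in members) / len(members)
            if score >= threshold:
                added.add(token)
        if not added:
            break
        members |= added
    return members


# Hypothetical product titles.
TITLES = [
    "ipad case for ipad 2",
    "ipad cover for ipad 2",
    "iphone case for iphone 4",
    "iphone sleeve for iphone 4",
    "ipad 32gb wifi",
    "ipad 16gb wifi",
    "iphone 64gb black",
]

print(sorted(expand_seeds(TITLES, {"case"})))
```

On this toy data "cover" and "sleeve" join the set because they sit in the same slot as "case", while storage sizes such as "32gb" do not. The threshold matters: set it too low and the set drifts into product tokens.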

Here is the paper, but I have also written a blog post about it that sums up the idea in plain language. This paper also mentions the same "let the user find the right item" paradigm that Sean has proposed, so both can be used in conjunction.

Oh, and if you need some advice on machine learning with Lucene and Solr, I can recommend the talk my friend Tommaso Teofili gave at ApacheCon Europe this year. You can find the slides on SlideShare. There is also a YouTube video of the talk out there, just search for it ;)

Sean Owen · Answer

TF/IDF is just going to rank based on the words in the query vs. the words in the title, as you have found. That does not sound like the right definition of a "good result" here, since you want to favor products over accessories.

Of course, you can simply attach heuristics to patch the problem. For example, treat the title as a set of words rather than a multiset, so that "iPad" appearing several times makes no difference. Or just boost the score of items that you know are products. These aren't learning per se, but they are simple, directly reflect your business knowledge, and probably have some positive effect.
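
As a toy illustration of both heuristics (this is plain Python, not Solr's actual scoring), the sketch below compares a raw term-count score with a set-based score that also multiplies known products by a boost. The function names, the `is_product` flag and the boost value of 2.0 are assumptions made up for the example.

```python
def tf_score(query, title):
    """Raw term frequency: repeated words in the title raise the score."""
    words = title.lower().split()
    return sum(words.count(term) for term in query.lower().split())


def set_score(query, title, is_product=False, product_boost=2.0):
    """Count distinct matching terms, then boost items known to be products."""
    matches = set(query.lower().split()) & set(title.lower().split())
    return len(matches) * (product_boost if is_product else 1.0)


# (title, is_product) pairs; the flag is assumed to come from business knowledge.
CATALOG = [
    ("ipad case for ipad 2", False),
    ("ipad case", False),
    ("ipad connection kit", False),
    ("ipad 32gb wifi", True),
]

by_tf = sorted(CATALOG, key=lambda it: tf_score("ipad", it[0]), reverse=True)
by_set = sorted(CATALOG, key=lambda it: set_score("ipad", it[0], it[1]), reverse=True)

print([title for title, _ in by_tf])
print([title for title, _ in by_set])
```

With raw counts the case that mentions "ipad" twice comes first; with the set-based score plus boost the actual product does.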

If you want to learn here, you probably need to use the one best source of knowledge about what the best results are: your users. You know what they click in response to each query. You can learn a term-item model that associates search terms with the items clicked. You can view that as many types of problem -- actually a latent-factor recommender model could work well there.
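
A very simple version of such a term-item model is a table of click counts per query term. The sketch below is just that co-occurrence baseline, not a latent-factor model; the function names and the click log are invented for illustration.

```python
from collections import Counter, defaultdict


def build_click_model(click_log):
    """click_log: iterable of (query, clicked_item) pairs."""
    model = defaultdict(Counter)  # term -> Counter of clicked items
    for query, item in click_log:
        for term in query.lower().split():
            model[term][item] += 1
    return model


def click_score(model, query, item):
    """Average, over query terms, of the share of that term's clicks on item."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        clicks = model.get(term)
        if clicks:
            total += clicks[item] / sum(clicks.values())
    return total / len(terms)


# Hypothetical click log.
LOG = [
    ("ipad", "ipad 32gb wifi"),
    ("ipad", "ipad 32gb wifi"),
    ("ipad", "ipad 32gb wifi"),
    ("ipad", "ipad case"),
    ("ipad case", "ipad case"),
    ("ipad case", "ipad case for ipad 2"),
]

model = build_click_model(LOG)
for item in ("ipad 32gb wifi", "ipad case"):
    print(item, click_score(model, "ipad", item), click_score(model, "ipad case", item))
```

A score like this could be blended with the text relevance score, so that items users actually pick for "ipad" move up without hand-written rules.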

Have a look at Ted's slides on how to use a recommender as a "search engine": http://www.slideshare.net/tdunning/search-as-recommendation

Tags: machine-learning, solr, lucene, mahout

Samuel García
