
Amazon like search with Solr

We have an online store where we use Solr for searching products. The basic setup works fine, but it currently lacks some features. I looked at some online shops like Amazon and liked the search features they offer, so I wondered how I could configure Solr to offer some of those features to our end users.

Our product data is fairly standard:

  • title of a product
  • description
  • a product is in multiple categories and sub-categories
  • a product can have multiple variants with options, like a T-Shirt in red, blue, green, S, M, L, XL... or an iPad with 16GB, 32GB...
  • a product has a brand
  • a product has a retailer

For now, we are using this schema file to index and perform queries on Solr:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldType>
  • EdgeNGramFilterFactory indexes a word like shirt as sh, shi, shir, shirt
  • WordDelimiterFilterFactory breaks up words like wi-fi into wi, fi, wifi
  • PorterStemFilterFactory works well for stemming
  • PhoneticFilterFactory provides a kind of fuzzy search
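
To make the first bullet concrete, here is a minimal Python sketch of what edge n-gramming with minGramSize="2" and maxGramSize="15" produces for a single token. It only mimics the idea; it is not Solr's implementation, and the function name edge_ngrams is made up for illustration.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> List[str]:
    """Return the leading prefixes of `token` whose length is between
    min_gram and max_gram characters (inclusive)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    print(edge_ngrams("shirt"))  # ['sh', 'shi', 'shir', 'shirt']
```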

One problem is that the fuzzy search doesn't work very well. If I search for the book Inferno and misspell it as Infenro, the search doesn't return any results. I've read about the SpellCheckComponent (http://wiki.apache.org/solr/SpellCheckComponent), but I'm not sure if that's the best way to do a fuzzy search or a "Did you mean?" feature.

The second problem is that it should be possible to search for Shirts red to find red T-Shirts (where red is an option value of the option type color), or to search for woman shoes or adidas shoes woman. Is it possible to do this with Solr?

And the third problem is that I'm not sure which of the tokenizers and filters in the schema.xml are a good choice for such features.

I hope someone has used such features with Solr and can help me here. Thanks!

EDIT

Here is some of the data that we store in Solr:

<doc>
  <str name="id">572</str>
  <arr name="taxons">
    <str>cat1</str>
    <str>cat1/cat2</str>
    <str>cat1/cat2/cat3</str>
    <str>cat1/cat4</str>
  </arr>
  <arr name="options">
    <str>color_blue</str>
    <str>color_red</str>
    <str>size_39</str>
    <str>size_40</str>
  </arr>
  <int name="count_on_hand">321</int>
  <arr name="name_text">
    <str>Riddle-Shirt Tech</str>
  </arr>
  <arr name="description_text">
    <str>The Riddle Shirt Tech Men's Hoodie features signature details, along with ultra-lightweight fleece for optimum warmth.</str>
  </arr>
  <arr name="brand_text">
    <str>Riddle</str>
  </arr>
  <arr name="retailer_text">
    <str>Supershop</str>
  </arr>
</doc>

I'm not sure if the options key-value pairs are stored in a proper way, but that's the first approach I came up with.

asked Nov 08 '13 08:11 by 23tux


1 Answer

Disclaimer:

I've made some assumptions about the schema, so please check the gist with the example schema and data - https://gist.github.com/rchukh/7385672#file-19854599

E.g. for taxons I've used a special text field with PathHierarchyTokenizerFactory.
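
As a rough illustration of why a path-hierarchy tokenizer suits the taxons field, the Python sketch below mimics the tokens such a tokenizer emits for one category path. It is an approximation for explanation only; the function name path_hierarchy_tokens is hypothetical and the "/" delimiter is assumed from the sample data.

```python
from typing import List


def path_hierarchy_tokens(path: str, delimiter: str = "/") -> List[str]:
    """Emit every ancestor path of `path`, from the root down to the full path."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


if __name__ == "__main__":
    print(path_hierarchy_tokens("cat1/cat2/cat3"))
    # ['cat1', 'cat1/cat2', 'cat1/cat2/cat3']
```

With tokens like these in the index, a query for cat1/cat2 can match any product filed anywhere below that category, provided the query side is not split into the same ancestor paths.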

First problem (fuzzy search):

Inferno doesn't match Infenro because it's not a phonetic misspelling: two letters are swapped, so the words no longer sound alike. The phonetic filter is not meant for that kind of match.

If you're interested in some details - here is a pretty good article about the algorithms supported by lucene/solr: http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html


You will probably be interested in the SpellCheck Collate feature

http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

From wiki:

A collation is the original query string with the best suggestions for each term replaced in it. If spellcheck.collate is true, Solr will take the best suggestion for each token (if it exists) and construct a new query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
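
To show what collation does with the wiki's example, here is a small Python sketch. The collate function is a toy stand-in for Solr's behaviour (it just swaps each token for its best suggestion), and the suggestion dictionary is hard-coded from the quote above. The request parameters spellcheck and spellcheck.collate are the ones described on the wiki page; a working setup would also need a spellcheck component configured on the request handler, which is not shown here.

```python
from typing import Dict
from urllib.parse import urlencode


def collate(query: str, best_suggestion: Dict[str, str]) -> str:
    """Toy collation: replace each token by its best suggestion, if it has one."""
    return " ".join(best_suggestion.get(tok, tok) for tok in query.split())


# Suggestions taken from the wiki example quoted above.
suggestions = {"jawa": "java", "lording": "loading"}

# Query-string parameters that ask Solr for suggestions plus a collation.
params = {
    "q": "jawa class lording",
    "spellcheck": "true",
    "spellcheck.collate": "true",
}
query_string = urlencode(params)

if __name__ == "__main__":
    print(collate("jawa class lording", suggestions))  # java class loading
    print(query_string)
```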


You can also use the fuzzy search feature, which is based on edit-distance algorithms. (The same ~ operator applied to a quoted phrase performs a proximity search instead.) Here's an example from the Solr wiki:

roam~

This search will match terms like foam and roams. It will also match the word "roam" itself.
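
To see how close the terms in question are, here is a minimal Python sketch of an edit distance that counts an adjacent swap as one edit (the "optimal string alignment" variant of Damerau-Levenshtein). As far as I know, Solr 4.x fuzzy queries use a distance of this kind and allow up to 2 edits by default, but treat both points as assumptions to check against your version; this code is an illustration, not Lucene's implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent characters swapped, e.g. "nr" vs "rn".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for pair in [("roam", "foam"), ("roam", "roams"), ("infenro", "inferno")]:
        print(pair, osa_distance(*pair))
```

All three pairs are a single edit apart, which is why a fuzzy term query can catch the Infenro typo while a phonetic filter cannot.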

So Infenro~ in the query should match Inferno in the index... but my bet is to go with the "Google-like" approach:

[image: google misspellings]

That is, notify the user that the results shown are for the corrected spelling, but let them search for the original spelling as well (sometimes the user is right and the machine is wrong).

Second problem

This problem can be solved with edismax, e.g. if you want to search by name_text AND options:

q=shirt%20AND%20red&defType=edismax&qf=name_text%20options

Here you can see the explain plan of this query - http://explain.solr.pl/explains/w1qb7zie
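
If you build such requests in code, the query string can be assembled from the user's terms. The Python sketch below does this with the standard library; the helper name edismax_query is hypothetical, and the field names are the ones assumed in the gist.

```python
from typing import List
from urllib.parse import quote, urlencode


def edismax_query(terms: List[str], fields: List[str]) -> str:
    """Build the query string for an edismax search that requires every term
    and looks for each of them in any of the given fields."""
    params = {
        "q": " AND ".join(terms),
        "defType": "edismax",
        "qf": " ".join(fields),
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(edismax_query(["shirt", "red"], ["name_text", "options"]))
    # q=shirt%20AND%20red&defType=edismax&qf=name_text%20options
```

Real user input would also need Solr's query-syntax characters escaped before being joined like this; that step is left out to keep the sketch short.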


The issue with storing options in a multivalued field with a separator is that the search query will start matching the key, e.g. "color".

For example - the following request:

q=shirt%20AND%20color&defType=edismax&qf=name_text%20options

will match all shirts that have "color" option - http://explain.solr.pl/explains/pn6fbpfq
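
One possible way around this (my own suggestion here, not something the gist does) is to split each key_value string before indexing, so that only the values land in searchable fields and the key becomes part of the field name. A minimal Python sketch, assuming hypothetical field names of the form option_<key>:

```python
from collections import defaultdict
from typing import Dict, List


def split_options(options: List[str]) -> Dict[str, List[str]]:
    """Turn ['color_blue', 'size_39'] into {'option_color': ['blue'], 'option_size': ['39']}.

    The 'option_' prefix is a made-up naming convention; in Solr it would need
    a matching dynamic field definition in the schema.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for option in options:
        key, _, value = option.partition("_")  # split on the first underscore only
        fields[f"option_{key}"].append(value)
    return dict(fields)


if __name__ == "__main__":
    print(split_options(["color_blue", "color_red", "size_39", "size_40"]))
```

With the values separated like this, qf can list the option fields, and a query for color no longer matches every product that merely has a color option.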

Third problem

I have some doubts about using any FilterFactory after stemmers, but I can't provide any meaningful information on that at the moment.

answered Nov 19 '22 14:11 by rchukh