We have an online store where we use Solr for searching products. The basic setup works fine, but it is currently lacking some features. I looked at some online shops like Amazon and liked the search features they offer, so I wondered how I could configure Solr to offer some of those features to our end users.
Our product data consists of kinda standard data for products like
For now, we are using this schema file to index and perform queries on Solr:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldType>
EdgeNGramFilterFactory indexes a word like shirt into sh, shi, shir, shirt
WordDelimiterFilterFactory breaks up words like wi-fi into wi, fi, wifi
PorterStemFilterFactory works well for stemming
PhoneticFilterFactory provides a kind of fuzzy search
One problem is that the fuzzy search doesn't work very well. If I search for the book Inferno and misspell it as Infenro, the search doesn't return any results. I've read about the SpellCheckComponent (http://wiki.apache.org/solr/SpellCheckComponent), but I'm not sure if that's the best way to do a fuzzy search or a "Did you mean?" feature.
The second problem is that it should be possible to search for Shirts red to find red T-Shirts (where red is an option value of the option type color), or to search for woman shoes or adidas shoes woman. Is it possible to do this with Solr?
And the third problem is that I'm not sure which of the tokenizers and filters inside the schema.xml are a good choice to achieve such features.
I hope someone has used such features with Solr and can help me in this case. Thx!
EDIT
Here is some of the data that we store inside Solr:
<doc>
  <str name="id">572</str>
  <arr name="taxons">
    <str>cat1</str>
    <str>cat1/cat2</str>
    <str>cat1/cat2/cat3</str>
    <str>cat1/cat4</str>
  </arr>
  <arr name="options">
    <str>color_blue</str>
    <str>color_red</str>
    <str>size_39</str>
    <str>size_40</str>
  </arr>
  <int name="count_on_hand">321</int>
  <arr name="name_text">
    <str>Riddle-Shirt Tech</str>
  </arr>
  <arr name="description_text">
    <str>The Riddle Shirt Tech Men's Hoodie features signature details, along with ultra-lightweight fleece for optimum warmth.</str>
  </arr>
  <arr name="brand_text">
    <str>Riddle</str>
  </arr>
  <arr name="retailer_text">
    <str>Supershop</str>
  </arr>
</doc>
I'm not sure if the options key-value pairs are stored in a proper way, but that's the first approach I came up with.
Disclaimer:
I've made some assumptions about the schema, so please check the gist with the example schema and data - https://gist.github.com/rchukh/7385672#file-19854599
E.g. for taxons I've used a special text field with PathHierarchyTokenizerFactory.
The reason Inferno doesn't match Infenro is that it's not a phonetic misspelling. The phonetic filter is not meant for that kind of match.
If you're interested in some details - here is a pretty good article about the algorithms supported by lucene/solr: http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
You will probably be interested in the SpellCheck Collate feature
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
From wiki:
A collation is the original query string with the best suggestions for each term replaced in it. If spellcheck.collate is true, Solr will take the best suggestion for each token (if it exists) and construct a new query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
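As a rough sketch of what a request using this feature could look like, the snippet below builds the query string for a misspelled search. The `spellcheck` and `spellcheck.collate` parameters are the ones described on the wiki page above; the helper name `spellcheck_params` is made up for illustration, and the sketch assumes the request handler you query already has a SpellCheckComponent configured.

```python
from urllib.parse import quote, urlencode


def spellcheck_params(user_query):
    """Build a query string that asks Solr for a collated spelling suggestion.

    Assumption: the target request handler has a SpellCheckComponent attached;
    otherwise these parameters have no effect.
    """
    params = {
        "q": user_query,
        "spellcheck": "true",          # turn spell checking on for this request
        "spellcheck.collate": "true",  # ask for the rewritten ("collated") query
    }
    # quote_via=quote encodes spaces as %20 rather than +
    return urlencode(params, quote_via=quote)


print(spellcheck_params("Infenro"))
```

The resulting string can be appended to whatever select URL your Solr core exposes.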
You can also use the fuzzy search feature, which is based on edit-distance algorithms (though as I understand it, it's more useful for phrase searches, e.g. proximity search). Here's an example from the Solr wiki:
roam~
This search will match terms like foam and roams. It will also match the word "roam" itself.
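To make the "distance" part concrete, here is a small sketch that computes the plain Levenshtein edit distance between the terms from this example and from the question. It is only an illustration of the idea: Lucene's own fuzzy matching is implemented differently and, as far as I know, may count a swapped pair of letters as a single edit, so treat the numbers as an upper bound rather than as what Solr computes internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for term in ("foam", "roams", "roam"):
    print("roam", term, levenshtein("roam", term))
print("infenro", "inferno", levenshtein("infenro", "inferno"))
```

With plain Levenshtein, foam and roams are each one edit away from roam, while infenro is two edits away from inferno (the swapped r and n count as two substitutions).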
So Infenro~ in the query should match Inferno in the index... but my bet is to go with the "google-like" approach: notify the user that the results shown are for the corrected spelling, but allow them to search for the original spelling as well (sometimes the user is right and the machine is wrong).
The second problem (searching for Shirts red) can be solved with edismax, e.g. if you want to search by name_text AND options:
q=shirt%20AND%20red&defType=edismax&qf=name_text%20options
Here you can see the explain plan of this query - http://explain.solr.pl/explains/w1qb7zie
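If the query comes from user input, the same request can be assembled programmatically. The sketch below is one way to do it; the helper name `edismax_query` is hypothetical, and requiring every term with AND is simply a choice that mirrors the example above.

```python
from urllib.parse import quote, urlencode


def edismax_query(terms, fields):
    """Build an edismax query string that requires every term (joined with AND)
    and searches across the given fields via the qf parameter."""
    params = {
        "q": " AND ".join(terms),
        "defType": "edismax",
        "qf": " ".join(fields),
    }
    # quote_via=quote encodes spaces as %20, matching the request shown above
    return urlencode(params, quote_via=quote)


print(edismax_query(["shirt", "red"], ["name_text", "options"]))
# prints: q=shirt%20AND%20red&defType=edismax&qf=name_text%20options
```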
The issue with storing options as a multivalued field with a separator is that the search query will start matching the key as well, e.g. "color".
For example - the following request:
q=shirt%20AND%20color&defType=edismax&qf=name_text%20options
will match all shirts that have "color" option - http://explain.solr.pl/explains/pn6fbpfq
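One possible way around this, offered only as a sketch and not something I have tested against your setup, is to split each option into its key and value before indexing, so that only the values end up in the searchable text. The `option_` prefix and the resulting field names below are assumptions for illustration; your schema would need matching (for example dynamic) fields.

```python
from collections import defaultdict


def split_options(options, prefix="option_"):
    """Group "key_value" option strings into {field_name: [values]}.

    Assumption: each option uses the first underscore as the separator,
    as in the sample document (color_red, size_39). The field names are
    hypothetical and must exist in your schema.
    """
    fields = defaultdict(list)
    for option in options:
        key, separator, value = option.partition("_")
        if not separator:
            continue  # skip entries that do not follow the key_value pattern
        fields[prefix + key].append(value)
    return dict(fields)


print(split_options(["color_blue", "color_red", "size_39", "size_40"]))
```

With the values stored on their own, a query for "color" would no longer match every product that merely has a color option.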
As for the choice of filters, I have some doubts about using any FilterFactory after stemmers, but I can't provide any meaningful information on that at the moment.