
Solr search dash in part number

Tags: search, solr4

I'm having difficulty working out either how to construct the Solr query or how to set up the schema so that searches in our web store work better.

First, some configuration (Solr 4.2.1):

<field name="mfgpartno" type="text_en_splitting_tight" indexed="true" stored="true" />
<field name="mfgpartno_sort" type="string" indexed="true" stored="false" />
<field name="mfgpartno_search" type="sku_partial" indexed="true" stored="true" />

<copyField source="mfgpartno" dest="mfgpartno_sort" />
<copyField source="mfgpartno" dest="mfgpartno_search" />

<fieldType name="sku_partial" class="solr.TextField" omitTermFreqAndPositions="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="100" side="front" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    </analyzer>
</fieldType>

Let me break this down into stages. I'm only going into enough detail to replicate the problem; the initial stages don't use edismax, which is what we've chosen to use on our website.

  1. q=DV\-5PBRP <- With this query I get 18 results, but not the one I'm looking for (most likely because the default df searches the productname field, which is fine).
  2. q=mfgpartno_search:DV\-5PBRP <- This gives me the 1 result I'm looking for, but because of the query building I need to do on the website, it's better if I can use the q parameter as in stage 1.
  3. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search <- This also gives me the 1 result I'm looking for, but for the website search qf needs to span more fields (actual qf = productname_search shortdesc_search fulldesc_search mfgpartno_search productname shortdesc fulldesc keywords). To get more accurate results across those fields, I implemented stage 4.
  4. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND <- With this test I get 0 results, even though q.op=AND works great for most searches on our site.

My big problem with search has been special characters like the dash, which sometimes must be treated literally and sometimes act as separators, as in product names or descriptions. Sometimes people even replace the dash with a space when searching for a part number, and the search should still show relevant results.

I'm stuck on how to get this special-character search working, especially as it pertains to the mfgpartno_search field. How might I configure the schema, the query, or both to get this working?

asked Apr 30 '15 19:04 by Chris


1 Answer

Maybe you could try the Regular Expression Pattern Tokenizer with a regular expression suited to your article numbers. Lucene (which Solr is built upon) is very focused on tokenization for prose.

What you want here is probably an N-gram split that also includes 1-grams, and perhaps dashes replaced with spaces, something like:

DV-5PBRP -> {DV 5PBRP, DV, 5P, BR, PB, RP, D, V, 5, P, B, R}

As you can see, the index will be quite large even for very small fields. Make sure the ranking of the results is heavily weighted toward the larger n-grams.

I do think you should remove the stop word list for the article numbers field.

The N-gram size should probably start at 1 or 2.

Simply make sure the various analyzers don't:

  • swallow the dash
  • remove single or few characters (these are often in stop word lists)
  • remove numbers
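
As a rough, untested sketch of the suggestions above (not a drop-in fix), a field type along these lines splits on dashes and whitespace with the pattern tokenizer, drops the stop word filter, and starts the n-grams at 2. The name sku_ngram, the split pattern, and maxGramSize="20" are assumptions chosen for illustration; adjust them to your part numbers.

<fieldType name="sku_ngram" class="solr.TextField">
    <analyzer type="index">
        <!-- Assumption: dashes and whitespace both act as separators, so DV-5PBRP becomes DV and 5PBRP -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\-]+"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- No StopFilterFactory here, so short tokens such as "dv" are kept -->
        <!-- minGramSize="2" keeps two-character tokens searchable; maxGramSize is an arbitrary cap -->
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="20"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <!-- Same split as at index time, but no n-grams: each query token must match an indexed gram -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\-]+"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

With a setup like this, both DV-5PBRP and DV 5PBRP should analyze to the query tokens dv and 5pbrp, each of which exists as an indexed gram, so q.op=AND has a chance of matching. The Analysis screen in the Solr admin UI is a good way to check what each stage actually emits before reindexing.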
answered Oct 20 '22 01:10 by claj