
Solr WordDelimiterFilter generate word parts and catenate in query

Tags: solr

I want the query wi-fi to match documents that have wifi in the index, so I'm using solr.WordDelimiterFilterFactory to catenate words at query time:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0" preserveOriginal="0"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0" preserveOriginal="0"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

But with this configuration, the query LGA1155 doesn't match LGA 1155, because the query title:LGA1155 is parsed as: (title:lga title:1155 title:lga1155)~3

If I don't catenate words in the query, LGA1155 matches LGA 1155, because the query is parsed as: (title:lga title:1155)~2. But then wi-fi doesn't match wifi.

I'm using the edismax query parser with q.op set to AND. Solr version: 4.5.

So how can I make both wi-fi match wifi and LGA1155 match LGA 1155 (and other similar queries)?

Rinas asked Dec 26 '22 18:12



1 Answer

As you describe it, you want to catenate word parts, but you want to split on numerics.

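The catenateAll="1" you have in there works against that: it adds the rejoined token lga1155 alongside lga and 1155, which undoes the split on numerics (LGA1155 becoming LGA 1155) you want to achieve. With q.op set to AND, that extra token is the third required clause in your parsed query.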

Try it with these settings of the WordDelimiterFilterFactory in your analyzer.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" catenateWords="1"
            generateNumberParts="1" catenateNumbers="0" splitOnNumerics="1"
            catenateAll="0" splitOnCaseChange="0"
            stemEnglishPossessive="0" preserveOriginal="0" />
        <filter class="solr.ICUFoldingFilterFactory" />
    </analyzer>
</fieldType>

This would produce the following tokens:

  • wi-fi -> wifi
  • Wi-Fi -> wifi
  • WiFi -> wifi
  • LGA1155 -> lga 1155
  • LGA 1155 -> lga 1155
  • LGA-1155 -> lga 1155

As you can see wifi becomes one word and LGA1155 gets separated.
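
If you want to check these tokens against your own schema, the Analysis screen in the Solr admin UI shows the output of every filter stage. The same check can be done over HTTP with the field analysis handler. A minimal sketch, assuming the handler is not already registered in your solrconfig.xml (the example config shipped with Solr 4.x should already contain a similar entry, so look before adding it):

```xml
<!-- solrconfig.xml: exposes /analysis/field so analyzer output can be inspected.
     Assumption: your config does not define this handler yet. -->
<requestHandler name="/analysis/field"
                startup="lazy"
                class="solr.FieldAnalysisRequestHandler" />
```

With that in place, a request such as /solr/<core>/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=LGA%201155&analysis.query=LGA1155 (the core name is a placeholder) returns the index-time and query-time token streams, so you can compare them for each of the inputs listed above.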


One more thing, as my sample shows: if the analyzer should be the same at query time and index time, as it is in your sample, you can leave out the type attribute on the analyzer element and delete one of the two elements completely.

So instead of

<fieldType ... >
    <analyzer type="query">
       <!-- same stuff -->
    </analyzer>
    <analyzer type="index">
       <!-- same stuff -->
    </analyzer>
</fieldType>

Just

<fieldType ... >
    <analyzer>
       <!-- used at both index and query time -->
    </analyzer>
</fieldType>
cheffe answered Feb 23 '23 00:02
