I want the query wi-fi to match documents that have wifi in the index, so I'm using solr.WordDelimiterFilterFactory to catenate words in the query:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0" preserveOriginal="0"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0" preserveOriginal="0"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>
But with this configuration, the query LGA1155 doesn't match LGA 1155, because the query title:LGA1155 is parsed as: (title:lga title:1155 title:lga1155)~3
If I don't catenate words in the query, LGA1155 matches LGA 1155, because the query is parsed as: (title:lga title:1155)~2. But then wi-fi doesn't match wifi.
I'm using the edismax query parser and q.op is AND. Solr version: 4.5.
So, how can I make both wi-fi match wifi and LGA1155 match LGA 1155 (and other similar queries)?
As you describe it, you want to catenate word parts, but you want to split on numerics.
The catenateAll="1" you have in there works against that: it re-joins the parts and so undoes the split on numerics (LGA1155 becoming LGA 1155) that you want to achieve.
Try it with these settings for the WordDelimiterFilterFactory in your analyzer:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" catenateWords="1"
generateNumberParts="1" catenateNumbers="0" splitOnNumerics="1"
catenateAll="0" splitOnCaseChange="0"
stemEnglishPossessive="0" preserveOriginal="0" />
<filter class="solr.ICUFoldingFilterFactory" />
</analyzer>
</fieldType>
This would produce the following tokens:
wifi
wifi
wifi
lga
1155
lga
1155
lga
1155
As you can see, wi-fi becomes the single token wifi, and LGA1155 gets separated into lga and 1155.
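To make the effect of the flags easier to follow, here is a rough Python sketch of the token logic. It is a toy model, not Lucene's implementation: the function names (`word_delimiter`, `analyze`, `matches`) and the two config dictionaries are made up for illustration, it only handles ASCII letters and digits, it assumes splitOnNumerics is on and splitOnCaseChange is off, and it approximates q.op=AND as "every query token must appear in the document".

```python
import re


def word_delimiter(token, *, generate_word_parts, generate_number_parts,
                   catenate_words, catenate_numbers, catenate_all):
    """Toy model of a word-delimiter filter (hypothetical, ASCII only)."""
    # Split on non-alphanumerics and on letter/digit boundaries.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = []

    def emit(text):
        text = text.lower()  # stands in for the folding filter
        if text not in out:
            out.append(text)

    # Group consecutive parts of the same kind into runs.
    runs = []
    for part in parts:
        kind = "num" if part.isdigit() else "word"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(part)
        else:
            runs.append((kind, [part]))

    for kind, members in runs:
        generate = generate_number_parts if kind == "num" else generate_word_parts
        catenate = catenate_numbers if kind == "num" else catenate_words
        if generate:
            for member in members:
                emit(member)
        # A run of one part is only emitted via catenation when the
        # part itself was not already generated.
        if catenate and (len(members) > 1 or not generate):
            emit("".join(members))

    if catenate_all and len(parts) > 1:
        emit("".join(parts))
    return out


def analyze(text, cfg):
    """Whitespace-tokenize, then run each chunk through the toy filter."""
    tokens = []
    for chunk in text.split():
        tokens.extend(word_delimiter(chunk, **cfg))
    return tokens


def matches(query, document, cfg):
    """Approximate AND semantics: all query tokens must be indexed."""
    indexed = set(analyze(document, cfg))
    return all(tok in indexed for tok in analyze(query, cfg))


# Flags from the question (catenateAll on) and from the settings above.
QUESTION_CFG = dict(generate_word_parts=True, generate_number_parts=True,
                    catenate_words=True, catenate_numbers=True,
                    catenate_all=True)
ANSWER_CFG = dict(generate_word_parts=False, generate_number_parts=True,
                  catenate_words=True, catenate_numbers=False,
                  catenate_all=False)

print(analyze("LGA1155", QUESTION_CFG))
print(matches("LGA1155", "LGA 1155", QUESTION_CFG))
print(matches("LGA1155", "LGA 1155", ANSWER_CFG))
print(matches("wi-fi", "wifi", ANSWER_CFG))
```

In this model the question's flags add an extra lga1155 token to the query that the document LGA 1155 never contains, which is why the AND query fails. With the settings above, both sides reduce to the same tokens. The Analysis screen in the Solr admin UI is the place to confirm the real token streams.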
Another thing, as you can see in my sample: if the analyzer at query time and index time is meant to be the same, as in your sample, you can leave out the type attribute on the analyzer element and delete one of the two elements completely.
So instead of
<fieldType ... >
<analyzer type="query">
<!-- same stuff -->
</analyzer>
<analyzer type="index">
<!-- same stuff -->
</analyzer>
</fieldType>
Just
<fieldType ... >
<analyzer>
<!-- will be taken to index and query time -->
</analyzer>
</fieldType>