Adding ShingleFilterFactory to a field type in Solr (at index time) changes how highlighting behaves at query time.
Sample Text: "in a ship a dragon was in a box"
Without ShingleFilterFactory, both "in" tokens are highlighted separately:
<em>in</em> a ship a dragon was <em>in</em> a box
With it, the whole segment between them is returned as a single highlight:
<em>in a ship a dragon was in</em>
Why does using ShingleFilterFactory affect the highlighting?
EDIT:
Adding schema info as requested:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Using text_general, which contains the shingle filter, results in unusually large highlighted regions, as described above.
Maybe you can use this highlighter:
https://issues.apache.org/jira/browse/LUCENE-1522
The problem you are pointing out is known, and some patches are available:
https://issues.apache.org/jira/browse/LUCENE-1489
Edit: The second link is the same one that Bereng sent.
This won't help much, but it will shed some light:
https://issues.apache.org/jira/browse/LUCENE-1489
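To illustrate the kind of issue discussed in that ticket, here is a minimal Python sketch of an offset-based highlighter. It is a simplified model, not Lucene's actual code. The assumptions are mine: tokens whose character offsets overlap are merged into one group, a group is highlighted from its first matching token to its last, the query term is `in`, and stop words and synonyms are ignored. The helper names `tokenize` and `highlight` are hypothetical. Under those assumptions, 2-word shingles chain every token to its neighbor, so the whole sentence becomes one group and both "in" matches end up inside a single highlight.

```python
import re

def tokenize(text, shingles=False):
    """Return (term, start, end) tuples; optionally add 2-word shingles."""
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)]
    if not shingles:
        return words
    tokens = []
    for i, (term, start, end) in enumerate(words):
        tokens.append((term, start, end))  # the unigram itself
        if i + 1 < len(words):
            nxt, _, nxt_end = words[i + 1]
            # the shingle spans from this word's start to the next word's end
            tokens.append((term + " " + nxt, start, nxt_end))
    return tokens

def highlight(text, tokens, query_terms):
    """Group tokens with overlapping offsets; wrap each group's match span."""
    spans = []          # (start, end) regions to wrap in <em>
    group_end = None    # end offset of the current group of overlapping tokens
    match = None        # [start, end] covered by matching tokens in the group
    for term, start, end in tokens:
        if group_end is not None and start >= group_end:
            # this token does not overlap the current group: close the group
            if match:
                spans.append(tuple(match))
            group_end, match = None, None
        group_end = end if group_end is None else max(group_end, end)
        if term in query_terms:
            if match is None:
                match = [start, end]
            else:
                match = [min(match[0], start), max(match[1], end)]
    if match:
        spans.append(tuple(match))
    out, pos = [], 0
    for s, e in spans:
        out.append(text[pos:s])
        out.append("<em>" + text[s:e] + "</em>")
        pos = e
    out.append(text[pos:])
    return "".join(out)

text = "in a ship a dragon was in a box"
plain = highlight(text, tokenize(text), {"in"})
shingled = highlight(text, tokenize(text, shingles=True), {"in"})
print(plain)
print(shingled)
```

If this model is close to what the highlighter does, a common way around it would be to highlight on a field that is indexed without the shingle filter, but that is a suggestion rather than something confirmed in this thread.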