ShingleFilterFactory affects size of highlighted section in Solr

Adding ShingleFilterFactory to a field type in Solr (at index time) changes the behavior of highlighting at query time.

Sample Text: "in a ship a dragon was in a box"

Without ShingleFilterFactory, both "in" tokens are highlighted separately:

<em>in</em> a ship a dragon was <em>in</em> a box

With it, the whole segment is returned as a single highlight:

<em>in a ship a dragon was in</em>

Why does using ShingleFilterFactory affect the highlighting?

EDIT:

Adding schema info as requested:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Using text_general, which contains the shingle filter, results in unusually large highlighted fragments, as described above.

Th 00 mÄ s asked May 05 '15 13:05


2 Answers

Maybe you can use this highlighter:

https://issues.apache.org/jira/browse/LUCENE-1522

The problem you are pointing out is a known issue, and some patches are available:

https://issues.apache.org/jira/browse/LUCENE-1489

Edit: The second link is the same one that Bereng posted.
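
To illustrate why shingles can interfere with offset-based highlighting, here is a minimal Python sketch. It is not Solr or Lucene code: `tokenize`, `shingles`, and `merge_overlapping` are hypothetical helpers written for this example. It assumes that a shingle of two words carries the character offsets of the whole two-word span, and that a highlighter treats tokens with overlapping offsets as one group. It ignores the stop word and synonym filters from the schema, so it will not reproduce the exact fragment from the question; it only shows how overlapping shingles can chain neighbouring words into one large span.

```python
import re

def tokenize(text):
    """Split into words, keeping (term, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def shingles(tokens, max_size=2, output_unigrams=True):
    """Emit unigrams plus word n-grams up to max_size.

    Each shingle keeps the offsets of the full span it covers
    (assumption made for this sketch).
    """
    out = []
    smallest = 1 if output_unigrams else 2
    for i in range(len(tokens)):
        for n in range(smallest, max_size + 1):
            if i + n > len(tokens):
                break
            group = tokens[i:i + n]
            term = " ".join(t[0] for t in group)
            out.append((term, group[0][1], group[-1][2]))
    return out

def merge_overlapping(tokens):
    """Merge tokens whose character offsets overlap into (start, end) groups."""
    groups = []
    for _term, start, end in sorted(tokens, key=lambda t: (t[1], t[2])):
        if groups and start < groups[-1][1]:
            # Overlaps the current group: extend it instead of starting a new one.
            groups[-1][1] = max(groups[-1][1], end)
        else:
            groups.append([start, end])
    return [tuple(g) for g in groups]

text = "in a ship a dragon was in a box"
unigrams = tokenize(text)
with_shingles = shingles(unigrams, max_size=2, output_unigrams=True)

print(merge_overlapping(unigrams))       # one group per word
print(merge_overlapping(with_shingles))  # one group spanning the whole text
```

With plain unigrams every word stays in its own group, so each "in" can be highlighted on its own. With the shingles added, every bigram overlaps its neighbours, so the groups collapse into a single span.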

alexf answered Sep 20 '22 17:09


This won't help much, but it will shed some light:

https://issues.apache.org/jira/browse/LUCENE-1489

Bereng answered Sep 17 '22 17:09