
Solr does not highlight some words

Tags: highlight, solr

I configured Solr 4.10 (and also tried 5.3) with highlighting. It works fine for most words, but I found some words that do not get highlighted: Solr returns the matching docs, but does not highlight the word in some of them.

What can cause such effect?

solrconfig.xml

 <requestHandler name="/select" class="solr.SearchHandler">
 <lst name="defaults">
   <str name="wt">json</str>
   <str name="indent">true</str>
   <str name="defType">edismax</str>
   <str name="bf">product(concount)</str>
   <str name="df">text bio text_syn text_syn_other</str>
   <str name="qf">
    text^25 bio^16 text_syn^8 text_syn_other^3
   </str>
   <str name="hl">on</str>
   <str name="hl.fl">text bio text_syn text_syn_other</str>
   <str name="hl.preserveMulti">true</str>
   <str name="hl.encoder">html</str>
   <str name="f.text.hl.fragsize">100</str>
   <str name="hl.snippets">20</str>
   <arr name="components">
     <str>highlight</str>
   </arr>
 </lst>
 </requestHandler>

schema.xml

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_abbr.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en_syn_other" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_other.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="text_syn" type="text_en_syn" indexed="true" stored="false" multiValued="true" />
<field name="text_syn_other" type="text_en_syn_other" indexed="true" stored="false" multiValued="true" />

<field name="text_exact" type="string" indexed="true" stored="false" multiValued="false" />

<field name="bio" type="text_en" indexed="true" stored="true" multiValued="false" />

<field name="bio_exact" type="string" indexed="true" stored="false" multiValued="false" />

<field name="concount" type="long" indexed="true" stored="true" multiValued="false" />

<field name="concount_exact" type="long" indexed="true" stored="false" multiValued="false" />

<copyField source="text" dest="text_syn"/>
<copyField source="bio" dest="text_syn"/>
<copyField source="text" dest="text_syn_other"/>
<copyField source="bio" dest="text_syn_other"/>

For the query http://localhost:8983/solr/select?q=senior I get docs containing the word senior, but in the highlighting section of the Solr response that word is not highlighted.


UPDATE 1: I found out that the word senior appears in my synonyms_abbr.txt file, in the line senior,lead. When I commented out that line, or swapped the order of the words to lead,senior, the word senior surprisingly started getting highlighted. Any ideas?


UPDATE 2: Words from synonyms.txt and synonyms_other.txt are highlighted normally, but words from synonyms_abbr.txt behave strangely. For example, with the line lead,head,senior in synonyms_abbr.txt:

  • the queries http://localhost:8983/solr/select?q=senior and http://localhost:8983/solr/select?q=head do not highlight any word,
  • the query http://localhost:8983/solr/select?q=lead highlights not only the word lead, but also head and senior.
Mher asked Oct 20 '15 11:10



2 Answers

Your second update shows that only the first word of the line lead,head,senior is actually used for synonym matching and highlighting.

The Solr wiki page https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters describes how the expand parameter produces this effect:

The synonyms parameter names an external file defining the synonyms. If ignoreCase is true, matching will lowercase before checking equality. If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.

The page also gives an example:

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

This seems to be consistent with the behaviour you are observing. It implies that you should either change the synonym filter definition in schema.xml to use expand="true", or rewrite the entries in your synonyms file as explicit mappings.
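As a minimal sketch of the first option, the text_en field type from the question could look like this, with only the expand attribute of the index-time synonym filter changed. It has not been tested against this index, and it assumes synonyms_abbr.txt keeps the comma-separated line lead,head,senior.

```xml
<!-- Sketch: text_en from the question, with expand switched to "true" on the index-time synonym filter -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- expand="true": a token matching lead, head or senior is expanded to all three
         instead of being reduced to the first entry of the line -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_abbr.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <!-- Query-time analyzer is unchanged from the question -->
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

With this setting a query for any one of the three words would be expected to match, and highlight, occurrences of all three, since they are treated as equivalent at index time.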

Additionally, since the index analyzer runs at indexing time, you will need to reindex your documents for the change to take effect.

vvs answered Sep 28 '22 13:09


Some of your fields are not stored and therefore cannot be returned or highlighted. Because they are indexed, they are still searchable. Change your schema to have stored="true" for all the fields you want to highlight:

<field name="text_syn" type="text_en_syn" indexed="true" stored="true" multiValued="true" />
<field name="text_syn_other" type="text_en_syn_other" indexed="true" stored="true" multiValued="true" />

Judging by your config, I presume highlighting already works on the fields bio and text?
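If storing the copyField targets is not desirable, one possible alternative is to list only the already stored fields in hl.fl. This is a sketch under that assumption, not something verified against this setup; the fragment shows only the relevant entries of the /select handler's defaults.

```xml
<!-- Sketch: highlighting defaults restricted to fields that have stored="true" in the question's schema -->
<lst name="defaults">
  <str name="hl">on</str>
  <!-- text and bio are stored; text_syn and text_syn_other are not -->
  <str name="hl.fl">text bio</str>
</lst>
```

The unstored fields would still take part in matching through qf; they would just no longer be requested for highlighting.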

ilinca answered Sep 28 '22 11:09