I configured solr 4.10 (also 5.3) with highlighting functionality. It works fine with most of the words, however I found some words which "does not" allow highlightings, that is, solr returns the required docs, but does not highlights some of them.
What can cause such effect?
solrconfig.xml
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="wt">json</str>
<str name="indent">true</str>
<str name="defType">edismax</str>
<str name="bf">product(concount)</str>
<str name="df">text bio text_syn text_syn_other</str>
<str name="qf">
text^25 bio^16 text_syn^8 text_syn_other^3
</str>
<str name="hl">on</str>
<str name="hl.fl">text bio text_syn text_syn_other</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="f.text.hl.fragsize">100</str>
<str name="hl.snippets">20</str>
<arr name="components">
<str>highlight</str>
</arr>
</lst>
schema.xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_abbr.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en_syn" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en_syn_other" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_other.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\n,/\\]" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="text_syn" type="text_en_syn" indexed="true" stored="false" multiValued="true" />
<field name="text_syn_other" type="text_en_syn_other" indexed="true" stored="false" multiValued="true" />
<field name="text_exact" type="string" indexed="true" stored="false" multiValued="false" />
<field name="bio" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="bio_exact" type="string" indexed="true" stored="false" multiValued="false" />
<field name="concount" type="long" indexed="true" stored="true" multiValued="false" />
<field name="concount_exact" type="long" indexed="true" stored="false" multiValued="false" />
<copyField source="text" dest="text_syn"/>
<copyField source="bio" dest="text_syn"/>
<copyField source="text" dest="text_syn_other"/>
<copyField source="bio" dest="text_syn_other"/>
For the query http://localhost:8983/solr/select?q=senior
I got docs containing the word senior
, but in highlighting section of solr response that word is not highlighted.
UPDATE 1:
I find out that I have the word senior
in my synonyms_abbr.txt
file, the line senior,lead
. When I commented that line or replaced the places of words, lead,senior
, surprisingly the word senior
started geting highlighting. Any ideas ?
UPDATE 2:
Words from synonyms.txt
and synonyms_other.txt
are getting highlighting normally, but words from synonyms_abbr.txt
behave strangely as follows. For example, I have the line lead,head,senior
in synonyms_abbr.txt
then
http://localhost:8983/solr/select?q=senior
and http://localhost:8983/solr/select?q=head
does not highlight any word,http://localhost:8983/solr/select?q=lead
highlights not only the word lead
, but also head
and senior
.Original Highlighter. ( hl.method=original ) The Original Highlighter, sometimes called the "Standard Highlighter" or "Default Highlighter", is Lucene's original highlighter – a venerable option with a high degree of customization options.
XML schema, a way to define the structure, content, and to some extent, the semantics of XML documents) (Elasticsearch index configuration is done with HTTP / JSON commands. No files required. You define types, mappings, analysis with simple commands.) Solr index configuration is done through 2 files: schema.
From your update2 it is clear that only the first word among lead,head,senior
is actually used for synonym matching and highlighting.
If you look at Docs on SolrWiki https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters there is a mention of expand=true
having a certain effect
The synonyms parameter names an external file defining the synonyms. If ignoreCase is true, matching will lowercase before checking equality. If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.
The site also presents and example
# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod
This seems to be consistent with the behaviour you are observing. This implies that you should change the Synonym filters definition in schema.xml to use expand=true OR change the way your synonyms file defines the filter to use explicit mapping.
Additionally since the Analyzers work at time of indexing, you may have to reindex documents for this to work.
Some fields are not stored thus cannot be returned. Since they are indexed they are searchable. Change your schema to have stored="true" for all the fields you want to highlight.
<field name="text_syn" type="text_en_syn" indexed="true" stored="true" multiValued="true" />
<field name="text_syn_other" type="text_en_syn_other" indexed="true" stored="true" multiValued="true" />
By looking at your config I presume highlighting works on the fields bio and text?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With