
Why is this simple Solr highlighting attempt failing?

Tags:

solr

I've read the Solr highlighting wiki document several times, searched everywhere, but cannot get even basic highlighting to work with my Solr installation. I am running Solr 3.5 on the demo Jetty 6.1 server.

I have indexed 250K documents, and am able to search them just fine. Other than configuring my document field definitions, most of the Solr configuration is "stock," although I have temporarily commented out the solrconfig.xml's "Highlighting defaults" to make sure they aren't causing this problem:

  <!-- Highlighting defaults
   <str name="hl">on</str>
   <str name="hl.fl">title snippet</str>
   <str name="f.name.hl.fragsize">0</str>
   <str name="f.name.hl.alternateField">name</str> -->

My URL query string is very simple. I've tried many variations, but here is my latest, stripped down to the most basic query:

hl=on&hl.fl=title&indent=on&version=2.2&q=toyota&fq=&start=0&rows=1&fl=*%2Cscore
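
For reference, the same query string can be built programmatically, which avoids hand-encoding mistakes such as the `%2C` in `fl`. This is a minimal sketch using only Python's standard library; `build_query` is an illustrative name, and the parameters are copied from the query string above.

```python
from urllib.parse import urlencode

def build_query(term="toyota", highlight_field="title"):
    """Return the query string for a basic highlighted Solr search."""
    params = {
        "hl": "on",                # turn highlighting on
        "hl.fl": highlight_field,  # field(s) to highlight; names are case-sensitive
        "indent": "on",
        "version": "2.2",
        "q": term,
        "fq": "",
        "start": "0",
        "rows": "1",
        "fl": "*,score",
    }
    # safe="*" keeps the asterisk literal, matching the hand-written URL.
    return urlencode(params, safe="*")

print(build_query())
```

Passing a different `highlight_field` makes it easy to try variations without retyping the whole URL.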

Here is the resulting XML:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">32</int>
  <lst name="params">
    <str name="explainOther"/>
    <str name="indent">on</str>
    <str name="hl.fl">title</str>
    <str name="wt"/>
    <str name="hl">true</str>
    <str name="version">2.2</str>
    <str name="rows">1</str>
    <str name="fl">*,score</str>
    <str name="start">0</str>
    <str name="q">toyota</str>
    <str name="qt"/>
    <str name="fq"/>
  </lst>
</lst>
<result name="response" numFound="9549" start="0" maxScore="0.9960097">
  <doc>
    <float name="score">0.9960097</float>
    <str name="id">2-33-200</str>
    <str name="title">1992 Toyota Camry 2.2L CV Boots</str>
  </doc>
</result>
<lst name="highlighting">
  <lst name="2-33-200"/>
</lst>
</response>
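
The symptom in this response is that the `highlighting` section lists the document id but contains no snippets for it. A small sketch like the following can flag that condition automatically when trying different settings. It assumes the response shape shown above; `SAMPLE` is a trimmed copy of that response and `empty_highlight_ids` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (XML declaration and header omitted).
SAMPLE = """<response>
<result name="response" numFound="9549" start="0">
  <doc>
    <str name="id">2-33-200</str>
    <str name="title">1992 Toyota Camry 2.2L CV Boots</str>
  </doc>
</result>
<lst name="highlighting">
  <lst name="2-33-200"/>
</lst>
</response>"""

def empty_highlight_ids(xml_text):
    """Return ids of documents whose highlighting entry has no snippets."""
    root = ET.fromstring(xml_text)
    highlighting = root.find("lst[@name='highlighting']")
    if highlighting is None:
        return []  # highlighting was not enabled for this request
    # An entry with no child elements means no field produced a snippet.
    return [entry.get("name") for entry in highlighting.findall("lst")
            if len(entry) == 0]

print(empty_highlight_ids(SAMPLE))
```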

How can I debug this issue further? Thanks!

Edit: Here is the <highlighting> section from solrconfig.xml. As I stated, it is stock. That could be the issue, but I'm new to Solr and not familiar with the highlighting ins and outs yet (obviously).

<highlighting>
  <!-- Configure the standard fragmenter -->
  <!-- This could most likely be commented out in the "default" case -->
  <fragmenter name="gap" 
              default="true"
              class="solr.highlight.GapFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">100</int>
    </lst>
  </fragmenter>

  <!-- A regular-expression-based fragmenter 
       (for sentence extraction) 
    -->
  <fragmenter name="regex" 
              class="solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
    </lst>
  </fragmenter>

  <!-- Configure the standard formatter -->
  <formatter name="html" 
             default="true"
             class="solr.highlight.HtmlFormatter">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<em>]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
  </formatter>

  <!-- Configure the standard encoder -->
  <encoder name="html" 
           class="solr.highlight.HtmlEncoder" />

  <!-- Configure the standard fragListBuilder -->
  <fragListBuilder name="simple" 
                   default="true"
                   class="solr.highlight.SimpleFragListBuilder"/>

  <!-- Configure the single fragListBuilder -->
  <fragListBuilder name="single" 
                   class="solr.highlight.SingleFragListBuilder"/>

  <!-- default tag FragmentsBuilder -->
  <fragmentsBuilder name="default" 
                    default="true"
                    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <!-- 
    <lst name="defaults">
      <str name="hl.multiValuedSeparatorChar">/</str>
    </lst>
    -->
  </fragmentsBuilder>

  <!-- multi-colored tag FragmentsBuilder -->
  <fragmentsBuilder name="colored" 
                    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <lst name="defaults">
      <str name="hl.tag.pre"><![CDATA[
           <b style="background:yellow">,<b style="background:lawgreen">,
           <b style="background:aquamarine">,<b style="background:magenta">,
           <b style="background:palegreen">,<b style="background:coral">,
           <b style="background:wheat">,<b style="background:khaki">,
           <b style="background:lime">,<b style="background:deepskyblue">]]></str>
      <str name="hl.tag.post"><![CDATA[</b>]]></str>
    </lst>
  </fragmentsBuilder>

  <boundaryScanner name="default" 
                   default="true"
                   class="solr.highlight.SimpleBoundaryScanner">
    <lst name="defaults">
      <str name="hl.bs.maxScan">10</str>
      <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
    </lst>
  </boundaryScanner>

  <boundaryScanner name="breakIterator" 
                   class="solr.highlight.BreakIteratorBoundaryScanner">
    <lst name="defaults">
      <!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
      <str name="hl.bs.type">WORD</str>
      <!-- language and country are used when constructing Locale object.  -->
      <!-- And the Locale object will be used when getting instance of BreakIterator -->
      <str name="hl.bs.language">en</str>
      <str name="hl.bs.country">US</str>
    </lst>
  </boundaryScanner>
</highlighting>

Edit: Although my "title" field was initially set to indexed="false", I have since tested setting it to true (no change, still no highlighting), and also termVectors="true" termPositions="true" termOffsets="true"... still no effect. (I tried these based on reading this post on SO.)

And here is my "title" field definition as of now:

<field name="title" type="string" indexed="true" stored="true" required="true" termVectors="true" termPositions="true" termOffsets="true" />

Initially I started with:

<field name="title" type="string" indexed="false" stored="true" required="true" />

Edit: I've now also tried this definition:

<field name="title" type="text_general" indexed="true" stored="true" required="true" termVectors="true" termPositions="true" termOffsets="true" />

and no change in highlighting, still not working. My text_general definition is the default one that comes with Solr's demo:

 <!-- A general text field that has reasonable, generic
        cross-language defaults: it tokenizes with StandardTokenizer,
 removes stop words from case-insensitive "stopwords.txt"
 (empty by default), and down cases.  At query time only, it
 also applies synonyms. -->
 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
 </fieldType>

Edit: I've now also tried re-indexing title with the text_en_splitting fieldtype, which uses WhitespaceTokenizerFactory instead of StandardTokenizerFactory, and still no highlighting. For what it's worth, I am using the standard query parser, which according to debugQuery=on is the LuceneQParser.

FINALLY! Thanks to @javanna for the help. I've done a lot of experimenting, and the three key takeaways are:

  1. You must use a tokenizing field type. The string fieldtype won't work. It doesn't seem necessary to have indexed=true or termVectors=true, but the field type must be tokenized.
  2. You must be careful to refer to your fields with the proper case. In addition to screwing up the tokenizing, I had also changed the case on my fields during development, and forgot to change the case on the hl.fl (highlighted field) definition -- preventing highlighting from working.
  3. Make sure you re-index between each configuration change. To be safe, I was deleting all documents from the index, and rebuilding it from scratch, but that may not be necessary.
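
The case mismatch in takeaway 2 is easy to miss by eye, so here is a rough sketch of a check that compares the names in hl.fl against the schema's field names. It assumes field names are compared case-sensitively, as described above; `check_hl_fields` and its arguments are illustrative names, not part of Solr.

```python
import re

def check_hl_fields(hl_fl, schema_fields):
    """Map each hl.fl name missing from the schema to a likely intended name.

    The value is the schema field that differs only by case, or None
    when nothing similar exists.
    """
    by_lower = {name.lower(): name for name in schema_fields}
    problems = {}
    # hl.fl accepts a space- or comma-delimited list of field names.
    for name in re.split(r"[,\s]+", hl_fl.strip()):
        if name and name not in schema_fields:
            problems[name] = by_lower.get(name.lower())
    return problems

# hl.fl still says "title" after the field was renamed to "Title".
print(check_hl_fields("title", ["id", "Title"]))
```

An empty result means every highlighted field name matches the schema exactly.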

My definition now appears as:

<field name="Title" type="text_general" indexed="false" stored="true" required="true" />

And my solrconfig.xml has this set:

<str name="hl">on</str>
<str name="hl.fl">Title</str>

Mason G. Zhwiti asked Mar 23 '12 16:03

1 Answer

The way you're requesting highlighting looks fine, but your solrconfig.xml looks a bit messy. Unfortunately the example you started from uses basically all the available options, and I guess you don't need them. Unless you need something different from the defaults, I'd start by commenting out all your highlighting configuration, as well as your default parameters. Then I'd play around with the URL parameters you need, just a couple to start: hl=on and hl.fl=title. Once you've found the right parameters you can configure them as defaults.

That said, given your title fieldType I suspect it isn't tokenized, unless you changed the default string type definition. In that case your query wouldn't match the title field, which is why you don't get highlighting on it. Are you maybe using edismax (or dismax)? If so, what is your qf parameter? Is it possible that the toyota term is in another field that matches your query? If you're using edismax you can try searching for q=title:toyota and see if you get results.

You can also check where your match comes from by enabling debugQuery=on and checking the debug output.

UPDATE
I saw you changed the title fieldType to text_general, but this doesn't change anything because that type isn't tokenized on whitespace. You haven't said yet which query parser you're using; anyway, if I'm right you should use WhitespaceTokenizerFactory instead of StandardTokenizerFactory:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> 

After that, remember to reindex all your data, otherwise you won't see any change. Basically, if you index something like toyota whatever without tokenizing on whitespace, you won't get any results searching for toyota, and toyota won't be highlighted in that field either, because it doesn't match. My assumption is that you're using the dismax or edismax query parser and searching on more than one field, and that some of those fields, but not title, match your search. That would explain why you get results but no highlighting on title, the only field you selected for highlighting. Can you post the results you get searching for toyota? Does the toyota term appear in fields other than title?
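
To illustrate the matching argument above, here is a toy model in Python. It is not Solr's analysis chain: it only assumes that an untokenized string field indexes the whole value as one term, while a tokenized field indexes lowercased words, and both function names are made up for the example.

```python
def string_field_terms(value):
    """Toy model of an untokenized field: the whole value is a single term."""
    return {value}

def tokenized_field_terms(value):
    """Toy model of a tokenized field: lowercase, then split on whitespace."""
    return set(value.lower().split())

title = "1992 Toyota Camry 2.2L CV Boots"

# The query term never equals the full title, so nothing matches
# and there is nothing to highlight.
print("toyota" in string_field_terms(title))

# After tokenizing, "toyota" is an indexed term and can be highlighted.
print("toyota" in tokenized_field_terms(title))
```

In this model the first check fails and the second succeeds, which is the difference between the string and tokenized field types being discussed.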

javanna answered Oct 13 '22 23:10