Can Solr highlighting also indicate the position or offset of the returned fragments within the original field?

Background

I'm using Solr 4.0.0. I've indexed the text of a set of sample documents and enabled term vectors so I can use the Fast Vector Highlighter:

<field name="raw_text" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

For highlighting I'm using the Break Iterator Boundary Scanner with SENTENCE boundaries.

<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
  <lst name="defaults">
    <!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
    <str name="hl.bs.type">SENTENCE</str>
  </lst>
</boundaryScanner>

I do a simple query

http://localhost:8983/solr/documents/select?q=raw_text%3AArtibonite&wt=xml&hl=true&hl.fl=raw_text&hl.useFastVectorHighlighter=true&hl.snippets=100&hl.boundaryScanner=breakIterator

Highlighting is working fairly well

<response>
...
<result name="response" numFound="5" start="0">
<doc>
  <str name="id">-1071691270</str>
  <str name="raw_text">
     Final Report of the Independent Panel of Experts on the Cholera
     Outbreak in Haiti Dr. Alejando Cravioto (Chair) International
     Center for Diarrhoeal Disease Research, Dhaka, Bangladesh Dr.
     Claudio F. Lanata Instituto de Investigación Nutricional, and
     The US Navy Medical Research Unit 6, Lima, Peru Engr. Daniele
     S. Lantagne Harvard University... ~SNIP~
  </str>
</doc>
<lst name="highlighting">
  <lst name="-1071691270">
    <arr name="raw_text">
      ...
      <str>
        The timeline suggests that the outbreak spread along
        the <em>Artibonite</em> River. After establishing that
        the cases began in the upper reaches of the Artibonite
        River, potential sources of contamination that could have
        initiated the outbreak were investigated.
      </str>
      ...
    </arr>
  </lst>
</lst>

Problem

I want to send the resulting sentences on for further processing (entity extraction, etc.), but I also need to track the start/end offsets of each highlighted sentence within the original (long) text field. Is there a straightforward way to do this?

Would it be better to set hl.fragsize to return the entire field and then process/extract the sentences of interest this way?

Mike Willekes asked Dec 13 '12 15:12



1 Answer

Out of the box, Solr cannot return offset information for the fragments along with the highlighting results; that requires some form of customization.

You have a few options it seems:

1) You can extend the Solr Highlighter by creating a custom Formatter that encodes the offset information into the string. The TokenGroup that is passed to the Formatter for each term has offset and position information stored in it. If your formatter returned something like <span data-offset=X>text</span>, the offsets would travel with each fragment. This doesn't seem to be the most straightforward approach, though.

2) As you said, return the entire field using hl.fragsize=0.

3) Use the TermVectorsComponent in an additional request and map the offset/position information returned from it with the highlighted fragments.
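For option 2, once the whole field comes back as one highlighted string, the offsets of the matched terms can be recovered on the client by walking the highlight markers and subtracting their length. This is only a minimal sketch. It assumes the default <em> and </em> markers, no HTML escaping by the highlighter (hl.encoder left at its default), and a field value that contains no literal <em> of its own. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EmOffsets {
    // Default Solr highlight markers; change these if hl.tag.pre/hl.tag.post
    // (or hl.simple.pre/hl.simple.post) are overridden in the request.
    static final String PRE = "<em>";
    static final String POST = "</em>";

    /**
     * Returns {start, end} pairs (end exclusive) for every highlighted term,
     * measured against the text with the markers removed, i.e. the stored value.
     */
    static List<int[]> matchOffsets(String highlighted) {
        List<int[]> spans = new ArrayList<>();
        int removed = 0; // number of marker characters skipped so far
        int from = 0;
        while (true) {
            int open = highlighted.indexOf(PRE, from);
            if (open < 0) break;
            int close = highlighted.indexOf(POST, open + PRE.length());
            if (close < 0) break; // unbalanced markup: stop rather than guess
            int start = open - removed;
            removed += PRE.length();
            int end = close - removed;
            removed += POST.length();
            spans.add(new int[] {start, end});
            from = close + POST.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        String hl = "spread along the <em>Artibonite</em> River.";
        for (int[] s : matchOffsets(hl)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

The offsets only line up with the original document if the highlighted string covers the field from its first character, which is why this pairs with hl.fragsize=0 rather than with ordinary fragments.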

If you are doing your own fragmenting anyway, the best solution for you would probably be to do no fragmenting in Solr and handle it all yourself. Alternatively, you could write your own BoundaryScanner implementation in Java and use your own knowledge of entity extraction when breaking up the fragments.
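Handling the fragmenting yourself could look roughly like the following. It splits the stored field text into sentences with java.text.BreakIterator (the same JDK class that BreakIteratorBoundaryScanner is built around) and keeps the offsets of every sentence that mentions the term. Treat it as a sketch under assumptions: the class and method names are hypothetical, and the match is a plain case-insensitive substring test, not the analyzer-aware matching that Solr performs (no stemming, no synonyms).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceOffsets {
    /**
     * Returns {start, end} pairs (end exclusive) of every sentence in text
     * that contains term, with offsets relative to text itself.
     */
    static List<int[]> sentencesContaining(String text, String term) {
        List<int[]> hits = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        String needle = term.toLowerCase(Locale.ROOT);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Offsets come from the original text; lowercasing is used only for the test.
            String sentence = text.substring(start, end);
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(new int[] {start, end});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The outbreak spread along the Artibonite River. "
                + "Other rivers were not affected.";
        for (int[] s : sentencesContaining(text, "artibonite")) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]).trim());
        }
    }
}
```

Each pair can be passed along with the sentence to the entity-extraction step, so results can be mapped back to positions in the full document. Sentence boundaries around abbreviations such as "Dr." may not match what a human would pick, so the output is worth spot-checking on real data.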

smerchek answered Sep 30 '22 03:09