
Lucene/Solr: Store offset information for certain keywords

Tags: solr, lucene

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.

The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.

For example, this document

Bill studied The Bill of Rights last summer.

could be accompanied by the following keywords (with offsets in parentheses):

William Brown (0:4)
legal term (13:31)  
summer 2011 (32:43)

(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)

I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.

But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.

I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...

EDIT: edited to make clear that Bill can refer to different people/things in different documents.

EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).

Jirka asked Dec 06 '15 21:12


2 Answers

Two Q Monte

Solution Pros:

  • Annotations logically stored with source docs
  • No knowledge of highlighter implementation or custom Java highlighter development required
  • Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.

Solution Cons:

  • Requires two queries to be run
  • Requires code in your search client to merge results from one query into the other.

With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...

curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
  {
    "id": "123",
    "text" : "Bill studied The Bill of Rights last summer.",
    "content_type": "source",
    "_childDocuments_": [
      {
        "id": "123-1",
        "content_type": "source_annotation",
        "annotation": "William Brown",
        "start_offset": 0,
        "end_offset": 4
      },
      {
        "id": "123-2",
        "content_type": "source_annotation",
        "annotation": "legal term",
        "start_offset": 13,
        "end_offset": 31
      },
      {
        "id": "123-3",
        "content_type": "source_annotation",
        "annotation": "summer 2011",
        "start_offset": 32,
        "end_offset": 43
      }
    ]
  }
]'

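... and then query the annotations. Child documents are indexed as regular documents, so a plain query on the annotation field finds them. Block join parsers such as {!parent which=content_type:source} are only needed if you want Solr itself to map matching children to their parents.

1) Annotation Query: http://localhost:8983/solr/query?fl=id,content_type,annotation,start_offset,end_offset&fq=content_type:source_annotation&q=annotation:"William Brown"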

"response":{"numFound":1,"start":0,
    "docs":[
      {
            "id": "123-1",
            "content_type": "source_annotation",
            "annotation": "William Brown",
            "start_offset": 0,
            "end_offset": 4
      }
    ]
  }

Store these results in your code so that you can fold in the annotation offsets after the next query returns.
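A minimal sketch of this step in client code, assuming the docs list has already been parsed from the JSON response. The group_annotations helper is hypothetical, and it assumes child ids follow the "<parent id>-<n>" pattern used in the example above (Solr does not enforce that convention).

```python
from collections import defaultdict


def group_annotations(docs):
    """Group annotation hits by the id of their parent document.

    Assumption: child ids look like "<parent id>-<n>", e.g. "123-1"
    belongs to parent "123". Adapt this if your ids are built differently.
    """
    spans_by_parent = defaultdict(list)
    for doc in docs:
        # Everything before the last "-" is treated as the parent id.
        parent_id, _, _ = doc["id"].rpartition("-")
        spans_by_parent[parent_id].append((doc["start_offset"], doc["end_offset"]))
    return dict(spans_by_parent)


# The single hit from the annotation query above.
annotation_docs = [
    {"id": "123-1", "annotation": "William Brown", "start_offset": 0, "end_offset": 4},
]
spans = group_annotations(annotation_docs)
```

Keeping the spans keyed by parent id makes it easy to look them up once the source documents come back.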

2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123

(The parent id 123, taken from the annotation hit 123-1 returned by the Annotation Query, gets ORed into the second query.)

"response":{"numFound":1,"start":0,
    "docs":[
      {
            "id": "123",
            "content_type": "source",
            "text": "Bill studied The Bill of Rights last summer."
      }
    ],
    "highlighting":{}
  }
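One way to assemble the source query from the parent ids found in the first step is sketched below. build_source_query is a hypothetical helper; the base URL and field names are taken from the example, and the sketch does not escape Solr query syntax characters in the search terms, which real code would need to do.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_source_query(search_terms, parent_ids,
                       base_url="http://localhost:8983/solr/query"):
    """Build the URL for the source query with highlighting enabled.

    Assumption: search_terms and the ids contain no characters that
    need escaping in Solr query syntax.
    """
    q = f'text:"{search_terms}"'
    if parent_ids:
        # OR in every parent document that had a matching annotation.
        ids = " OR ".join(f"id:{pid}" for pid in parent_ids)
        q = f"{q} OR {ids}"
    params = {"q": q, "fq": "content_type:source", "hl": "true", "hl.fl": "text"}
    # urlencode percent-encodes quotes, colons and spaces in the values.
    return f"{base_url}?{urlencode(params)}"


url = build_source_query("William Brown", ["123"])
decoded = parse_qs(urlparse(url).query)  # round-trip to inspect the parameters
```

URL-encoding the parameters avoids problems with the spaces and quotes in the query string.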

Note: In this example no highlighting information is returned, because the search terms didn't match the text of any content_type:source document. However, we have the explicit annotations and offsets from the first query!

Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
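The merge step could look roughly like this. apply_highlights is a hypothetical helper, the <em> markers are just an example, and the sketch assumes the spans are non-overlapping character offsets with an exclusive end, as in the question.

```python
def apply_highlights(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of text in highlight markers.

    Assumptions: offsets index characters of `text`, end is exclusive,
    and spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])             # untouched text before the span
        pieces.append(pre + text[start:end] + post)   # the highlighted span
        cursor = end
    pieces.append(text[cursor:])                      # remainder after the last span
    return "".join(pieces)


text = "Bill studied The Bill of Rights last summer."
highlighted = apply_highlights(text, [(0, 4)])
```

With the span (0:4), only the first Bill is wrapped; the second Bill inside The Bill of Rights is left alone.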


More block join info on Yonik's blog here.

Peter Dixon-Moses answered Nov 03 '22 16:11


When text is tokenized, for instance by the StandardTokenizer, Solr records the start/end offsets of each token. This information is encoded in the underlying index. The use case that you describe here sounds a lot like the SynonymFilterFactory.

When you define a synonym with the SynonymFilterFactory, for instance foo => foo, bar (foo is equivalent to bar), the bar term is added to the token stream generated when the text is tokenized, and it gets the same offset information as the original token. So if your text is "foo is awesome", the term foo has the offsets (start=0, end=3), and a new token bar (start=0, end=3) is added to your index (assuming that you're using the SynonymFilterFactory at index time):

   text:   foo    is    awesome
   start:  0      4     7
   end:    3      6     14

Once the SynonymFilterFactory is applied:

           bar
   text:   foo    is    awesome
   start:  0      4     7
   end:    3      6     14

So if you run a query for foo, the document will match, and if you use bar as your query the document will also match, since a bar token was added by the SynonymFilterFactory.
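The idea can be illustrated outside of Solr with a toy token stream. This is only a simulation under simple assumptions (whitespace tokenization, single-term synonyms); tokenize_with_synonyms is a hypothetical helper and not how the SynonymFilterFactory is implemented.

```python
import re


def tokenize_with_synonyms(text, synonyms):
    """Return (term, start, end) tuples, injecting synonyms at the same offsets.

    Toy model: tokens are runs of non-whitespace, end offsets are exclusive.
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        term, start, end = match.group(), match.start(), match.end()
        tokens.append((term, start, end))
        # Each synonym reuses the offsets of the original token.
        for alias in synonyms.get(term, []):
            tokens.append((alias, start, end))
    return tokens


tokens = tokenize_with_synonyms("foo is awesome", {"foo": ["bar"]})
```

Both foo and bar end up in the stream pointing at the same span of the original text, which is why a match on bar can be highlighted where foo appears.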

In your particular case you're trying to accomplish multi-term synonyms, which is a difficult problem, and you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.

Do you need to fetch the stored offsets for some later processing?

Jorge Luis answered Nov 03 '22 16:11