Indexing and Searching Over Word Level Annotation Layers in Lucene

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like:


Word   POS  Chunk  NER
=====  ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location

I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, end users might enter such a query as follows:

Query: Word=Washington,NER=Person

I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all documents where there's a word tagged Person, followed by the words arrived at, followed by a word tagged Location. Such a query might look like:

Query: "NER=Person Word=arrived Word=at NER=Location"

What's a good way to go about approaching this with Lucene? Is there any way to index and search over document fields that contain structured tokens?

Payloads

One suggestion was to try Lucene payloads. But I thought payloads could only be used to adjust the rankings of documents, and that they aren't used to select which documents are returned.

The latter is important since, for some use-cases, the number of documents that contain a pattern is really what I want.

Also, only the payloads on terms that match the query are examined. This means that payloads could, at best, help with the ranking of the first example query, Word=Washington,NER=Person, where we just want to make sure the term Washington is tagged as a Person. However, for the second example query, "NER=Person Word=arrived Word=at NER=Location", I need to check the tags on unspecified, and thus non-matching, terms.

asked May 21 '10 by dmcer


2 Answers

Perhaps one way to achieve what you're asking is to index each class of annotation (i.e., Word, POS, Chunk, NER) at the same position and prefix each annotation with a unique string. Don't bother with prefixes for words. You will need a custom Analyzer to preserve the prefixes, but then you should be able to use the syntax you want for queries.

To be specific, what I am proposing is that you index the following tokens at the specified positions:

Position  Word   POS      Chunk     NER
========  =====  =======  ========  ============
1         The    POS=DT   CHUNK=NP  NER=Person
2         man    POS=NN   CHUNK=NP  NER=Person
3         went   POS=VBD  CHUNK=VP  -
4         to     POS=TO   CHUNK=PP  -
5         the    POS=DT   CHUNK=NP  NER=Location
6         store  POS=NN   CHUNK=NP  NER=Location
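
In case it helps to see the mechanics, here is a minimal sketch of the indexing side, assuming Lucene's attribute-based TokenStream API (4.x or later); the class name AnnotationTokenStream and the list-of-lists input format are my own invention, not part of the answer. The essential trick is that every token after the first at a given position gets a position increment of 0, so the annotations stack on top of the word:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

final class AnnotationTokenStream extends TokenStream {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);

    private final Iterator<List<String>> positions;  // each inner list: the word, then its prefixed annotations
    private Iterator<String> tokensAtCurrentPosition;
    private boolean firstTokenAtPosition;

    AnnotationTokenStream(List<List<String>> positions) {
        this.positions = positions.iterator();
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (tokensAtCurrentPosition == null || !tokensAtCurrentPosition.hasNext()) {
            if (!positions.hasNext()) {
                return false;  // no more words
            }
            tokensAtCurrentPosition = positions.next().iterator();
            firstTokenAtPosition = true;
        }
        clearAttributes();
        termAtt.append(tokensAtCurrentPosition.next());
        // The word itself advances the position by 1; the stacked POS=/CHUNK=/NER=
        // tokens reuse the same position via a position increment of 0.
        posIncrAtt.setPositionIncrement(firstTokenAtPosition ? 1 : 0);
        firstTokenAtPosition = false;
        return true;
    }
}

Position 1 of the table would then be fed in as the list ["The", "POS=DT", "CHUNK=NP", "NER=Person"], and positions with no NER annotation would simply omit the NER= token.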

To get the sequence semantics, use span queries (e.g., SpanTermQuery clauses combined with SpanNearQuery) to preserve token order.

I haven't tried this, but indexing the different classes of terms at the same position should allow position-sensitive queries to do the right thing when evaluating expressions such as

NER=Person arrived at NER=Location

Note the difference from your example: I dropped the Word= prefix to treat plain words as the default. Also, your choice of prefix syntax (e.g., "class=") may constrain the contents of the documents you are indexing. Make sure the documents either don't contain such strings themselves, or escape them in some way during pre-processing. This is, of course, related to the analyzer you'll need to use.
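
As a rough illustration of the query side, the sequence above could be expressed with span queries along these lines; the field name "text" is an assumption, and in recent Lucene versions the span classes live under org.apache.lucene.queries.spans rather than org.apache.lucene.search.spans:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class SequenceQueryExample {
    // Builds the span query for: NER=Person arrived at NER=Location
    static SpanQuery personArrivedAtLocation() {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("text", "NER=Person")),
            new SpanTermQuery(new Term("text", "arrived")),
            new SpanTermQuery(new Term("text", "at")),
            new SpanTermQuery(new Term("text", "NER=Location"))
        };
        // Slop 0 with inOrder=true requires the four clauses to match at
        // consecutive positions, i.e. the annotated word sequence the query describes.
        return new SpanNearQuery(clauses, 0, true);
    }
}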

Update: I used this technique for indexing sentence and paragraph boundaries in text (using break=sen and break=para tokens) so that I could decide where to break phrase query matches. Seems to work just fine.
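
If you index boundary tokens like break=sen this way, one possible way (my own sketch, not necessarily what was done here) to keep a span match inside a single sentence is SpanNotQuery, which discards matches of the wrapped query that overlap the excluded term:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class SentenceBoundedQueryExample {
    // Drops matches of `inner` that overlap a break=sen token, i.e. matches
    // that would otherwise straddle a sentence boundary.
    static SpanQuery withinOneSentence(SpanQuery inner) {
        return new SpanNotQuery(inner, new SpanTermQuery(new Term("text", "break=sen")));
    }
}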

answered Nov 15 '22 by Gene Golovchinsky


What you are looking for is payloads. Lucid Imagination has a detailed blog entry on the subject. Payloads allow you to store a byte array of metadata about individual terms. Once you have indexed your data with the payloads included, you can create a new similarity mechanism that takes your payloads into account when scoring.
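
For illustration, attaching an NER tag as a payload on each term might look roughly like the following sketch, assuming Lucene 4.x-or-later APIs where payloads are BytesRef values; the filter name NerPayloadFilter and the per-word tag lookup are hypothetical:

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class NerPayloadFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final Map<String, String> nerTagByWord;  // e.g. {"washington" -> "Person"}

    NerPayloadFilter(TokenStream input, Map<String, String> nerTagByWord) {
        super(input);
        this.nerTagByWord = nerTagByWord;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Attach the tag for this term, if any, as a byte-array payload.
        String tag = nerTagByWord.get(termAtt.toString());
        if (tag != null) {
            payloadAtt.setPayload(new BytesRef(tag.getBytes(StandardCharsets.UTF_8)));
        }
        return true;
    }
}

On the query side, payload-aware queries (PayloadTermQuery in older releases, PayloadScoreQuery in newer ones) feed those bytes into scoring, which, as the question notes, influences ranking rather than which documents match.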

answered Nov 15 '22 by Eric Hauser