
How to increase position offsets in a lucene index to correspond to <p> tags?

I am using Lucene 3.0.3. In preparation for using SpanQuery and PhraseQuery, I would like to mark paragraph boundaries in my index in a way that discourages these queries from matching across paragraph boundaries. I understand that I need to increment the position by some suitably large value via the PositionIncrementAttribute when processing text, to mark paragraph boundaries. Let's assume that in the source document, my paragraph boundaries are marked by <p>...</p> pairs.

How do I set up my token stream to detect the tags? Also, I don't actually want to index the tags themselves. For indexing purposes, I would rather increment the position of the next legitimate token than emit a token corresponding to the tag, since I don't want the tag to affect search.

asked Apr 21 '11 by Gene Golovchinsky

1 Answer

The easiest way to add gaps (= PositionIncrement > 1) is to provide a custom TokenStream. You do not need to change your Analyzer for that. However, HTML parsing should be done upstream (i.e., you should segment and clean your input text accordingly before feeding it to Lucene).
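For example, the upstream segmentation could be as simple as a regex split. This is only a sketch (the class name `ParagraphSplitter` is made up here, and it assumes well-formed, non-nested <p> tags); a real HTML parser such as jsoup would be more robust:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphSplitter {

    // Naive <p>...</p> extraction; assumes well-formed, non-nested tags.
    private static final Pattern P_TAG =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the text content of each <p>...</p> block, in order. */
    public static List<String> split(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = P_TAG.matcher(html);
        while (m.find()) {
            paragraphs.add(m.group(1).trim());
        }
        return paragraphs;
    }
}
```

Each returned paragraph would then be added as its own field value, with the gap-producing TokenStream (shown below) inserted between them.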

Here is a full, working example (imports omitted). Note that it targets the Lucene 4.10 API (see Version.LUCENE_4_10_1) rather than the 3.0.3 mentioned in the question, but the same position-increment approach applies:

public class GapTest {

    public static void main(String[] args) throws Exception {
        final Directory dir = new RAMDirectory();
        final IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_4_10_1, new SimpleAnalyzer());
        final IndexWriter iw = new IndexWriter(dir, iwConfig);

        Document doc = new Document();
        doc.add(new TextField("body", "A B C", Store.YES));
        doc.add(new TextField("body", new PositionIncrementTokenStream(10)));
        doc.add(new TextField("body", "D E F", Store.YES));

        System.out.println(doc);
        iw.addDocument(doc);
        iw.close();

        final IndexReader ir = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(ir);

        QueryParser qp = new QueryParser("body", new SimpleAnalyzer());

        for (String q : new String[] { "\"A B C\"", "\"A B C D\"",
                "\"A B C D\"~10", "\"A B C D E F\"~10",
                "\"A B C D F E\"~10", "\"A B C D F E\"~11" }) {
            Query query = qp.parse(q);
            TopDocs docs = is.search(query, 10);
            System.out.println(docs.totalHits + "\t" + q);
        }
        ir.close();
    }

    /**
     * A gaps-only TokenStream (uses {@link PositionIncrementAttribute})
     *
     * @author Christian Kohlschuetter
     */
    private static final class PositionIncrementTokenStream extends TokenStream {
        private boolean first = true;
        private PositionIncrementAttribute attribute;
        private final int positionIncrement;

        public PositionIncrementTokenStream(final int positionIncrement) {
            super();
            this.positionIncrement = positionIncrement;
            attribute = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (first) {
                first = false;
                attribute.setPositionIncrement(positionIncrement);
                return true;
            } else {
                return false;
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            first = true;
        }
    }

}

answered Oct 19 '22 by Christian Kohlschütter