
How to increase position offsets in a lucene index to correspond to <p> tags?

I am using Lucene 3.0.3. In preparation for using SpanQuery and PhraseQuery, I would like to mark paragraph boundaries in my index in a way that discourages these queries from matching across paragraph boundaries. I understand that I need to increment the position by some suitably large value via the PositionIncrementAttribute when processing text, to mark paragraph boundaries. Let's assume that in the source document, my paragraph boundaries are marked by <p>...</p> pairs.

How do I set up my token stream to detect the tags? Also, I don't actually want to index the tags themselves. For indexing purposes, I would rather increment the position of the next legitimate token than emit a token corresponding to the tag, since I don't want the tag to affect search.

asked Apr 21 '11 by Gene Golovchinsky

1 Answer

The easiest way to add gaps (= PositionIncrement > 1) is to provide a custom TokenStream. You do not need to change your Analyzer for that. However, HTML parsing should be done upstream (i.e., you should segment and clean your input text accordingly before feeding it to Lucene).
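For example, the upstream segmentation could be as simple as a regex split. This is only a sketch (the class name `ParagraphSplitter` is made up here, and it assumes well-formed, non-nested <p> tags); a real HTML parser such as jsoup would be more robust:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphSplitter {

    // Naive <p>...</p> extraction; assumes well-formed, non-nested tags.
    private static final Pattern P_TAG =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the text content of each <p>...</p> block, in order. */
    public static List<String> split(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = P_TAG.matcher(html);
        while (m.find()) {
            paragraphs.add(m.group(1).trim());
        }
        return paragraphs;
    }
}
```

Each returned paragraph would then be added as its own field value, with the gap-producing TokenStream (shown below) inserted between them.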

Here is a full, working example (imports omitted). Note that it targets the Lucene 4.10 API (see Version.LUCENE_4_10_1) rather than the 3.0.3 mentioned in the question, but the same position-increment approach applies:

public class GapTest {

    public static void main(String[] args) throws Exception {
        final Directory dir = new RAMDirectory();
        final IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_4_10_1, new SimpleAnalyzer());
        final IndexWriter iw = new IndexWriter(dir, iwConfig);

        Document doc = new Document();
        doc.add(new TextField("body", "A B C", Store.YES));
        doc.add(new TextField("body", new PositionIncrementTokenStream(10)));
        doc.add(new TextField("body", "D E F", Store.YES));

        System.out.println(doc);
        iw.addDocument(doc);
        iw.close();

        final IndexReader ir = DirectoryReader.open(dir);
        IndexSearcher is = new IndexSearcher(ir);

        QueryParser qp = new QueryParser("body", new SimpleAnalyzer());

        for (String q : new String[] { "\"A B C\"", "\"A B C D\"",
                "\"A B C D\"~10", "\"A B C D E F\"~10",
                "\"A B C D F E\"~10", "\"A B C D F E\"~11" }) {
            Query query = qp.parse(q);
            TopDocs docs = is.search(query, 10);
            System.out.println(docs.totalHits + "\t" + q);
        }
        ir.close();
    }

    /**
     * A gaps-only TokenStream (uses {@link PositionIncrementAttribute})
     *
     * @author Christian Kohlschuetter
     */
    private static final class PositionIncrementTokenStream extends TokenStream {
        private boolean first = true;
        private PositionIncrementAttribute attribute;
        private final int positionIncrement;

        public PositionIncrementTokenStream(final int positionIncrement) {
            super();
            this.positionIncrement = positionIncrement;
            attribute = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (first) {
                first = false;
                attribute.setPositionIncrement(positionIncrement);
                return true;
            } else {
                return false;
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            first = true;
        }
    }

}

answered Oct 19 '22 by Christian Kohlschütter