I am using Lucene 3.0.3. In preparation for using SpanQuery and PhraseQuery, I would like to mark paragraph boundaries in my index in a way that will discourage these queries from matching across paragraph boundaries. I understand that I need to increment the position by some suitably large value in the PositionIncrementAttribute when processing text to mark paragraph boundaries. Let's assume that in the source document, my paragraph boundaries are marked by <p>...</p> pairs.
How do I set up my token stream to detect the tags? Also, I don't actually want to index the tags themselves. For the purposes of indexing, I would prefer to increment the position of the next legitimate token instead of emitting a token that corresponds to the tag, since I don't want the tag to affect search.
The easiest way to add gaps (= PositionIncrement > 1) is to provide a custom TokenStream. You do not need to change your Analyzer for that. However, HTML parsing should be done upstream (i.e., you should segment and clean your input text accordingly before feeding it to Lucene).
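To make the "segment and clean upstream" step concrete, here is a minimal sketch in plain Java, with no Lucene dependency. The class name ParagraphSplitter and the method splitParagraphs are hypothetical names chosen for illustration, and the regular expression assumes simple, well-formed, non-nested <p>...</p> pairs. For real-world HTML a proper parser would be the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits markup into plain-text paragraphs before indexing.
public class ParagraphSplitter {
    // Matches <p ...>...</p>, case-insensitively and across line breaks.
    // Assumption: paragraphs are well-formed and not nested.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining inline tag inside a paragraph.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> splitParagraphs(String markup) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(markup);
        while (m.find()) {
            // Replace inline tags with a space, then collapse whitespace.
            String text = TAG.matcher(m.group(1)).replaceAll(" ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // prints [A B C, D E F]
        System.out.println(splitParagraphs("<p>A B C</p>\n<p>D <b>E</b> F</p>"));
    }
}
```

Each returned string could then be added to the document as its own field instance, with a gap inserted between consecutive paragraphs.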
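Here is a full, working example of such a gap-only TokenStream. Note that it is written against the Lucene 4.10.1 API, not 3.0.3:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class GapTest {
        public static void main(String[] args) throws Exception {
            final Directory dir = new RAMDirectory();
            final IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_4_10_1, new SimpleAnalyzer());
            final IndexWriter iw = new IndexWriter(dir, iwConfig);

            // Three instances of the same field: paragraph, gap, paragraph.
            Document doc = new Document();
            doc.add(new TextField("body", "A B C", Store.YES));
            doc.add(new TextField("body", new PositionIncrementTokenStream(10)));
            doc.add(new TextField("body", "D E F", Store.YES));
            System.out.println(doc);
            iw.addDocument(doc);
            iw.close();

            final IndexReader ir = DirectoryReader.open(dir);
            IndexSearcher is = new IndexSearcher(ir);
            QueryParser qp = new QueryParser("body", new SimpleAnalyzer());

            // Phrase queries within and across the gap, with varying slop.
            for (String q : new String[] { "\"A B C\"", "\"A B C D\"",
                    "\"A B C D\"~10", "\"A B C D E F\"~10",
                    "\"A B C D F E\"~10", "\"A B C D F E\"~11" }) {
                Query query = qp.parse(q);
                TopDocs docs = is.search(query, 10);
                System.out.println(docs.totalHits + "\t" + q);
            }
            ir.close();
        }

        /**
         * A gaps-only TokenStream (uses {@link PositionIncrementAttribute}).
         *
         * @author Christian Kohlschuetter
         */
        private static final class PositionIncrementTokenStream extends TokenStream {
            private boolean first = true;
            private final PositionIncrementAttribute attribute;
            private final int positionIncrement;

            public PositionIncrementTokenStream(final int positionIncrement) {
                super();
                this.positionIncrement = positionIncrement;
                attribute = addAttribute(PositionIncrementAttribute.class);
            }

            @Override
            public boolean incrementToken() throws IOException {
                // Emit exactly one token that carries the configured increment.
                if (first) {
                    first = false;
                    attribute.setPositionIncrement(positionIncrement);
                    return true;
                } else {
                    return false;
                }
            }

            @Override
            public void reset() throws IOException {
                super.reset();
                first = true;
            }
        }
    }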
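To see why a large position increment keeps exact phrase queries from crossing a paragraph boundary, the following plain-Java sketch models the arithmetic only. It is a simplified model, not Lucene code. The class PositionGapModel and its methods are hypothetical, each term is assumed to occur once, and positions are assumed to accumulate increments starting from -1, so the first token lands at position 0. The exact positions Lucene assigns around field boundaries may differ slightly. Here the gap is folded into the increment of the first token of the next paragraph, which is the variant the question asks about.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of token positions; not part of Lucene.
public class PositionGapModel {

    /** Assigns each term a position by accumulating its increment. */
    public static Map<String, Integer> assignPositions(String[] terms, int[] increments) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = -1; // assumption: the first increment of 1 yields position 0
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            positions.put(terms[i], position);
        }
        return positions;
    }

    /** Returns true if the phrase terms sit at consecutive positions (slop 0). */
    public static boolean matchesExactPhrase(Map<String, Integer> positions, String... phrase) {
        if (phrase.length == 0 || !positions.containsKey(phrase[0])) {
            return false;
        }
        for (int i = 1; i < phrase.length; i++) {
            Integer previous = positions.get(phrase[i - 1]);
            Integer current = positions.get(phrase[i]);
            if (previous == null || current == null || current - previous != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] terms = { "A", "B", "C", "D", "E", "F" };
        // "D" starts the second paragraph: normal increment 1 plus a gap of 10.
        int[] increments = { 1, 1, 1, 11, 1, 1 };
        Map<String, Integer> positions = assignPositions(terms, increments);

        System.out.println(positions); // {A=0, B=1, C=2, D=13, E=14, F=15}
        System.out.println(matchesExactPhrase(positions, "A", "B", "C")); // true
        System.out.println(matchesExactPhrase(positions, "C", "D"));      // false
    }
}
```

In this model a phrase that spans the boundary can only match if the query's slop is at least as large as the gap, so the gap should be chosen larger than any slop you intend to allow.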