I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?
As of Lucene 7 and later, the above and some related links are obsolete.
Here's what's current:
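import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// IndexReader has leaves (one per segment), you'll iterate through those
// `reader` is assumed to be an already-open IndexReader
final String fieldName = "content";
for (LeafReaderContext leaf : reader.leaves()) {
    System.out.println("l: " + leaf.ord);
    // specify the field here; terms() returns null if this leaf has no such field
    Terms fieldTerms = leaf.reader().terms(fieldName);
    if (fieldTerms == null) {
        continue;
    }
    TermsEnum terms = fieldTerms.iterator();
    // this stops at 20 just to sample the head; next() returns null once the terms run out
    BytesRef bytes;
    for (int i = 0; i < 20 && (bytes = terms.next()) != null; i++) {
        // and to get it out, copy the bytes, since the enum may reuse the BytesRef it returns
        final Term content = new Term(fieldName, BytesRef.deepCopyOf(bytes));
        System.out.println("i: " + i + ", term: " + content);
    }
}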
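Note that each leaf has its own term dictionary, so the same word can show up in more than one leaf. To get the unique words for the whole index, you still have to merge the per-leaf results. Here is a minimal sketch of that merge step using only the JDK. The class `UniqueTerms`, the method `union`, and the `perLeafTerms` parameter are hypothetical names for illustration, not part of Lucene; I'm assuming you have already turned each leaf's terms into strings (for example with `BytesRef.utf8ToString()` inside the loop above, with the limit of 20 removed).

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class UniqueTerms {
    // perLeafTerms is a hypothetical stand-in: one list of term strings per leaf,
    // e.g. collected from that leaf's TermsEnum in the loop shown earlier.
    static SortedSet<String> union(List<List<String>> perLeafTerms) {
        SortedSet<String> unique = new TreeSet<>();
        for (List<String> leafTerms : perLeafTerms) {
            // the set silently drops terms already seen in another leaf
            unique.addAll(leafTerms);
        }
        return unique;
    }
}
```

For example, `UniqueTerms.union(List.of(List.of("apple", "lucene"), List.of("index", "lucene")))` yields a set with three entries, `apple`, `index` and `lucene`. A `TreeSet` keeps the result sorted; if order doesn't matter and the vocabulary is large, a `HashSet` would probably be the cheaper choice.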