
How can I get the list of unique terms from a specific field in Lucene?

Tags:

java

lucene

I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?

Hossein asked Jan 18 '12 12:01


1 Answer

As of Lucene 7+, the older answers and some related links are obsolete.

Here's what's current:

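import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// IndexReader has leaves (one per segment), you'll iterate through those
final List<LeafReaderContext> leaves = reader.leaves();
final String fieldName = "content";

for (int l = 0; l < leaves.size(); l++) {
  System.out.println("l: " + l);
  // specify the field here; terms() returns null if this leaf has no such field
  final Terms terms = leaves.get(l).reader().terms(fieldName);
  if (terms == null) {
    continue;
  }
  final TermsEnum termsEnum = terms.iterator();
  BytesRef bytes;
  // this stops at 20 just to sample the head; drop the i < 20 check to read every term
  for (int i = 0; i < 20 && (bytes = termsEnum.next()) != null; i++) {
    // deep-copy before keeping the value, because the enum may reuse its BytesRef
    final Term content = new Term(fieldName, BytesRef.deepCopyOf(bytes));
    System.out.println("i: " + i + ", term: " + content);
  }
}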
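
Note that each leaf only lists terms that are unique within its own segment, so the same term can show up in more than one leaf. To get the unique terms for the whole index you still need to merge the per-leaf results. Below is a minimal stand-alone sketch of that merge step, not Lucene API code: the class name UniqueTerms, the merge method, and the sample data are all hypothetical, and the nested lists stand in for the strings you would collect from each leaf's TermsEnum (for example via bytes.utf8ToString()).

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Stand-alone sketch: the nested lists stand in for the terms read from each leaf.
final class UniqueTerms {

  // Merge the per-leaf term lists into one sorted, duplicate-free set.
  static SortedSet<String> merge(List<List<String>> termsPerLeaf) {
    SortedSet<String> unique = new TreeSet<>();
    for (List<String> leafTerms : termsPerLeaf) {
      unique.addAll(leafTerms); // a set silently drops terms it already holds
    }
    return unique;
  }

  public static void main(String[] args) {
    // Hypothetical sample data: "banana" occurs in both segments.
    List<List<String>> termsPerLeaf = Arrays.asList(
        Arrays.asList("apple", "banana"),
        Arrays.asList("banana", "cherry"));
    System.out.println(merge(termsPerLeaf)); // prints [apple, banana, cherry]
  }
}
```

A TreeSet keeps the merged terms sorted; if order does not matter, a HashSet would do the same deduplication. For a very large index, holding every term in memory may not be practical, so treat this as an illustration of the idea rather than a tuned solution.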
Alex Moore-Niemi answered Oct 31 '22 15:10