
In lucene 4, IndexReader.getTermVector(docID, fieldName) returns null for every doc

Tags: lucene

I'm using the newly released Lucene 4, and I understand that the API related to document term vectors has changed considerably. I've read through the migration doc and various related blog and mailing list posts, and I believe I'm using the API correctly. However, I always get a null Terms reference back from IndexReader.getTermVector(). Here's what I'm doing:

// Indexing, given "bodyString" as a String containing document text
Document doc = new Document();
doc.add(new TextField("body", bodyString, Field.Store.YES));
MyIndexWriter.addDocument(doc);


// much later, enumerating document term vectors for "body" field for every doc
TermsEnum term = null; // reused across documents via terms.iterator(term)
for (int i = 0; i < Reader.maxDoc(); ++i) {
  final Terms terms = Reader.getTermVector(i, "body");
  if (terms != null) {
    int numTerms = 0;
    // record term occurrences for corpus terms above threshold
    term = terms.iterator(term);
    while (term.next() != null) {
      ++numTerms;
    }
    System.out.println("Document " + i + " had " + numTerms + " terms");
  }
  else {
    System.err.println("Document " + i + " had a null terms vector for body");
  }
}

Of course, it prints out that I have null term vectors for every doc, i.e., Reader.getTermVector(i, "body") always returns null.

When I look at the index in Luke, I have documents which have stored body fields. However, when I click on "TV" button (in the Documents tab) whilst having the body field highlighted, Luke tells me "Term Vector not available." Do I need to add some other kind of option to record this information when indexing?

Any ideas? Thanks!

Jon

Update: I should note that the IndexReader in question is an instance of SlowCompositeReaderWrapper, which wraps a DirectoryReader. I am using a SlowCompositeReaderWrapper because I want the corpus term frequencies as well, and it's not clear to me how to iterate over all docs across all IndexReader leaves (do doc IDs get reused across them, etc.?).

Is SlowCompositeReaderWrapper the culprit?

Jon Stewart asked Jan 16 '13 16:01


2 Answers

According to the TextField API it is "A field that is indexed and tokenized, without term vectors." If you wish to store TermVectors, you should just use a Field, and set it to store TermVectors in the FieldType.

Something like:

Document doc = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(true);
Field field = new Field("body", bodyString, type);
doc.add(field);
MyIndexWriter.addDocument(doc);

femtoRgon answered Nov 09 '22 03:11


You are using TextField, which is indexed and tokenized but does not store term vectors. That is why getTermVector() returns null. Instead of TextField, construct a Field with a custom FieldType on which setStoreTermVectors(true) has been called.
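To check the fix end to end, here is a minimal sketch that indexes a few strings with term vectors enabled and then counts the unique terms per document through getTermVector(). It assumes Lucene 4.0 with the core and analyzers-common jars on the classpath; the class name TermVectorDemo and the method countTermsPerDoc are made up for illustration, and the RAMDirectory and StandardAnalyzer are stand-ins for whatever the real index uses.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical helper class, not part of Lucene.
class TermVectorDemo {

  // Indexes each string as a "body" field with term vectors, then returns
  // the number of unique terms per document (-1 if no vector was stored).
  static int[] countTermsPerDoc(String... bodies) throws IOException {
    Directory dir = new RAMDirectory(); // in-memory index, assumption for the demo

    // Start from TextField's stored type (indexed, tokenized, stored)
    // and switch term vectors on.
    FieldType type = new FieldType(TextField.TYPE_STORED);
    type.setStoreTermVectors(true);

    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(dir, config);
    try {
      for (String body : bodies) {
        Document doc = new Document();
        doc.add(new Field("body", body, type));
        writer.addDocument(doc);
      }
    } finally {
      writer.close();
    }

    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      int[] counts = new int[reader.maxDoc()];
      for (int i = 0; i < reader.maxDoc(); ++i) {
        Terms terms = reader.getTermVector(i, "body");
        if (terms == null) {
          counts[i] = -1; // no term vector for this doc/field
          continue;
        }
        TermsEnum termsEnum = terms.iterator(null);
        int numTerms = 0;
        while (termsEnum.next() != null) {
          ++numTerms;
        }
        counts[i] = numTerms;
      }
      return counts;
    } finally {
      reader.close();
    }
  }
}
```

Calling TermVectorDemo.countTermsPerDoc("hello world hello", "lucene") should give one count per document, and the same loop run against an index built with plain TextField would hit the null branch for every document.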

user2121984 answered Nov 09 '22 04:11