In Lucene 4, IndexReader.getTermVector(docID, fieldName) returns null for every doc

Question

I'm using the newly released Lucene 4, and I understand that the API for document term vectors has changed considerably. I've read through the migration doc and various related blog and mailing list posts, and I believe I'm using the API correctly. However, I always get a null Terms reference back from IndexReader.getTermVector(). Here's what I'm doing:

// Indexing, given "bodyString" as a String containing document text
Document doc = new Document();
doc.add(new TextField("body", bodyString, Field.Store.YES));
MyIndexWriter.addDocument(doc);


// much later, enumerating document term vectors for "body" field for every doc
for (int i = 0; i < Reader.maxDoc(); ++i) {
  final Terms terms = Reader.getTermVector(i, "body");
  if (terms != null) {
    int numTerms = 0;
    // record term occurrences for corpus terms above threshold
    TermsEnum term = terms.iterator(null); // null: no existing TermsEnum to reuse
    while (term.next() != null) {
      ++numTerms;
    }
    System.out.println("Document " + i + " had " + numTerms + " terms");
  }
  else {
    System.err.println("Document " + i + " had a null terms vector for body");
  }
}

Of course, it prints out that I have null term vectors for every doc, i.e., Reader.getTermVector(i, "body") always returns null.

When I look at the index in Luke, I see documents with stored body fields. However, when I click the "TV" button (in the Documents tab) with the body field highlighted, Luke tells me "Term Vector not available." Do I need to set some other option to record this information when indexing?

Any ideas? Thanks!

Jon

Update: I should note that the IndexReader in question is an instance of SlowCompositeReaderWrapper, which is wrapping a DirectoryReader. I am using a SlowCompositeReaderWrapper because I want the corpus term frequencies as well, and it's not clear to me how to iterate over all docs across all IndexReader leaves (for example, do doc IDs get reused across them?).

Is SlowCompositeReaderWrapper the culprit?

femtoRgon · Accepted Answer

According to the TextField API documentation, it is "A field that is indexed and tokenized, without term vectors." If you want term vectors stored, use a plain Field instead and enable term vectors in its FieldType.

Something like:

Document doc = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(true);
Field field = new Field("body", bodyString, type);
doc.add(field);
MyIndexWriter.addDocument(doc);

user2121984 · Answer

You are using TextField, a field that is indexed and tokenized, without term vectors. That's why you get null from getTermVector(). Instead of using TextField, construct a Field with a custom FieldType on which you have called setStoreTermVectors(true).
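On the question's side note about iterating over leaves: as far as I know, in Lucene 4 each leaf numbers its documents from 0, so leaf-local doc IDs do repeat across leaves, and AtomicReaderContext.docBase (reached through IndexReader.leaves()) is the offset that turns a local ID into an index-wide one. The sketch below only models that arithmetic in plain Java with no Lucene dependency; the class DocBaseSketch and its methods are hypothetical names made up for illustration, not Lucene API.

```java
import java.util.Arrays;

// Hypothetical model of per-leaf doc ID arithmetic; this is NOT Lucene code.
public class DocBaseSketch {

    // bases[i] is the global ID of the first doc in leaf i,
    // i.e. the sum of maxDoc over all earlier leaves.
    public static int[] docBases(int[] leafMaxDocs) {
        int[] bases = new int[leafMaxDocs.length];
        int next = 0;
        for (int i = 0; i < leafMaxDocs.length; i++) {
            bases[i] = next;
            next += leafMaxDocs[i];
        }
        return bases;
    }

    // Global doc ID = the leaf's docBase + the leaf-local doc ID.
    public static int toGlobal(int[] bases, int leaf, int localDoc) {
        return bases[leaf] + localDoc;
    }

    // The leaf holding globalDoc is the last leaf whose docBase <= globalDoc.
    public static int leafFor(int[] bases, int globalDoc) {
        int leaf = 0;
        for (int i = 1; i < bases.length; i++) {
            if (bases[i] <= globalDoc) {
                leaf = i;
            }
        }
        return leaf;
    }

    public static void main(String[] args) {
        // Three leaves holding 3, 2 and 4 documents.
        int[] bases = docBases(new int[] {3, 2, 4});
        System.out.println(Arrays.toString(bases)); // [0, 3, 5]
        System.out.println(toGlobal(bases, 1, 1));  // 4
        System.out.println(leafFor(bases, 5));      // 2
    }
}
```

In real code the same pattern would be a loop over reader.leaves(), using each context's docBase when an index-wide ID is needed. Since getTermVector(docID, field) is declared on IndexReader itself, a DirectoryReader should also accept index-wide doc IDs directly, so the wrapper is probably not needed just to read term vectors.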

Tags:

lucene

Jon Stewart
