I'm using the newly released Lucene 4, and I understand that the API related to document term vectors has changed considerably. I've read through the migration doc and various related blog and mailing list posts, and I believe I'm using the API correctly. However, I always get a null Terms reference back from IndexReader.getTermVector(). Here's what I'm doing:
// Indexing, given "bodyString" as a String containing document text
Document doc = new Document();
doc.add(new TextField("body", bodyString, Field.Store.YES));
MyIndexWriter.addDocument(doc);
// much later, enumerating document term vectors for "body" field for every doc
TermsEnum term = null; // reused across documents
for (int i = 0; i < Reader.maxDoc(); ++i) {
    final Terms terms = Reader.getTermVector(i, "body");
    if (terms != null) {
        int numTerms = 0;
        // record term occurrences for corpus terms above threshold
        term = terms.iterator(term);
        while (term.next() != null) {
            ++numTerms;
        }
        System.out.println("Document " + i + " had " + numTerms + " terms");
    }
    else {
        System.err.println("Document " + i + " had a null terms vector for body");
    }
}
Of course, it prints out that I have null term vectors for every doc, i.e., Reader.getTermVector(i, "body") always returns null.
When I look at the index in Luke, I have documents which have stored body fields. However, when I click on the "TV" button (in the Documents tab) whilst having the body field highlighted, Luke tells me "Term Vector not available." Do I need to add some other kind of option to record this information when indexing?
Any ideas? Thanks!
Jon
Update
I should note that the IndexReader in question is an instance of SlowCompositeReaderWrapper, which is wrapping a DirectoryReader. I am using a SlowCompositeReaderWrapper because I want the corpus term frequencies as well, and it's not precisely clear how to iterate all docs over all IndexReader leaves (do doc IDs get reused across them? etc.).
Is SlowCompositeReaderWrapper the culprit?
According to the TextField API it is "A field that is indexed and tokenized, without term vectors." If you wish to store term vectors, use a plain Field and enable term vectors in its FieldType.
Something like:
Document doc = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(true);
Field field = new Field("body", bodyString, type);
doc.add(field);
MyIndexWriter.addDocument(doc);
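You are using TextField, a field that is indexed and tokenized but does not store term vectors. That is why getTermVector() returns null. Instead of TextField, construct a Field with a custom FieldType on which setStoreTermVectors(true) has been called. Term vectors are written at index time, so documents already indexed with TextField need to be re-indexed before they will have them.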
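To check the suggestion end to end, here is a minimal round-trip sketch. It assumes Lucene 4.0 with lucene-core and lucene-analyzers-common on the classpath. The class name TermVectorSketch, the helper countTerms, the in-memory RAMDirectory, and the choice of StandardAnalyzer are all illustrative assumptions, not part of the original question. It copies TextField.TYPE_STORED so the field stays indexed, tokenized, and stored, and only adds term vectors on top.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermVectorSketch {

    // Indexes one document whose "body" field stores term vectors, then
    // returns the number of distinct terms in that document's term vector.
    static int countTerms(String bodyString) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index, for illustration only
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, config);

        // Start from TextField's stored type and switch term vectors on.
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setStoreTermVectors(true);
        type.freeze(); // prevent accidental changes after first use

        Document doc = new Document();
        doc.add(new Field("body", bodyString, type));
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            Terms terms = reader.getTermVector(0, "body");
            if (terms == null) {
                return -1; // no term vector was stored for this field
            }
            TermsEnum termsEnum = terms.iterator(null);
            int numTerms = 0;
            while (termsEnum.next() != null) {
                ++numTerms;
            }
            return numTerms;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Distinct terms: " + countTerms("quick brown fox quick"));
    }
}
```

Note that the sketch calls getTermVector() directly on the DirectoryReader. As far as I can tell, composite readers answer that call themselves, so SlowCompositeReaderWrapper should not be needed for this step and is unlikely to be the culprit.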