I want to count the number of documents in Lucene that match a term on a field. I know three ways of doing that; I am curious which would be the best and fastest practice.
I will search a long-typed, single-valued field ("field") for the term (so numeric data, not text!). Shared setup for all three options:
Directory dirIndex = FSDirectory.open(new File("/path/to/index/"));
IndexReader indexReader = DirectoryReader.open(dirIndex);
// encode the long value the same way the numeric field was indexed
final BytesRefBuilder bytes = new BytesRefBuilder();
NumericUtils.longToPrefixCoded(Long.parseLong(longTerm), 0, bytes);

Option 1) take the document frequency straight from the TermsEnum:

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
int count = termEnum.docFreq();
Option 2) search with a TermQuery and a TotalHitCountCollector:

IndexSearcher searcher = new IndexSearcher(indexReader);
TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
int count = collector.getTotalHits();
Option 3) seek to the term and walk its postings, skipping deleted documents:

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
Bits liveDocs = MultiFields.getLiveDocs(indexReader);
DocsEnum docsEnum = termEnum.docs(liveDocs, null);
int count = 0;
if (docsEnum != null) {
    while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
    }
}
Option 1) wins for shortest code, but is basically useless if you ever update and delete documents in your index: docFreq() counts deleted documents as if they were still there. This is not documented in many places (the official Javadoc mentions it, but answers here on S.O. don't), yet it is something to be aware of. Perhaps there is a way around this; otherwise the enthusiasm about this method is a bit misplaced. Options 2) and 3) do produce correct results, but which should be preferred? Or, better yet, is there a faster way of doing this?
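The closest thing to a workaround I can see (a sketch, not benchmarked; countByPostings() is a hypothetical helper standing in for option 3):

// Trust docFreq() only when the reader carries no deletions at all.
int countDocs(IndexReader reader, TermsEnum termsEnum, BytesRef term) throws IOException {
    if (!termsEnum.seekExact(term)) {
        return 0; // the term does not occur in the field
    }
    if (!reader.hasDeletions()) {
        return termsEnum.docFreq(); // exact when nothing was ever deleted
    }
    return countByPostings(termsEnum, reader); // hypothetical fallback: option 3
}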
Measured in a test, fetching the docs via the term's postings rather than searching for them (i.e. option 3 instead of option 2) appears to be faster: on average, option 3 was 8 times faster in the 100-doc sample I could run. I also reversed the order of the two tests to make sure that running one before the other doesn't affect the results: it doesn't.
So it appears the searcher adds considerable overhead for a simple document count; if you only need the count for a single term, a direct lookup in the index is fastest.
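Wrapped up as a reusable method, a sketch of the option-3 lookup against the Lucene 4.x APIs used above (field name and value are parameters; nothing here beyond what the test code below already does):

// Count live (non-deleted) documents containing the given long value in a field.
static int countLiveDocs(IndexReader reader, String field, long value) throws IOException {
    BytesRefBuilder bytes = new BytesRefBuilder();
    NumericUtils.longToPrefixCoded(value, 0, bytes);
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
        return 0; // field does not exist or is not indexed
    }
    TermsEnum termsEnum = terms.iterator(null);
    if (!termsEnum.seekExact(bytes.toBytesRef())) {
        return 0; // value never indexed in this field
    }
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    DocsEnum docsEnum = termsEnum.docs(liveDocs, null);
    int count = 0;
    if (docsEnum != null) {
        while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            count++;
        }
    }
    return count;
}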
Code used to test (using the first 100 records in a Solr index):
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {
        Directory dirIndex = FSDirectory.open(new File("/path/to/index/"));
        IndexReader indexReader = DirectoryReader.open(dirIndex);

        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // reused across iterations
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i = 0; i < maxDoc; i++) {
            if (docsPassed == 100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // skip deleted documents
            }
            Document doc = indexReader.document(i);

            // get longTerm from this doc and encode it into the reusable BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.parseLong(longTerm), 0, bytes);

            // time before the first test
            long time_start = System.nanoTime();

            // test 1: look up longTerm in the "field" postings and count live documents
            int indexCount = 0;
            termEnum.seekExact(bytes.toBytesRef());
            DocsEnum docsEnum = termEnum.docs(liveDocs, null);
            if (docsEnum != null) {
                while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    indexCount++;
                }
            }

            // mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            // test 2: search for longTerm in "field"; the collector must be fresh per
            // search, since TotalHitCountCollector accumulates hits across searches
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            TotalHitCountCollector collector = new TotalHitCountCollector();
            searcher.search(query, collector);
            int searchCount = collector.getTotalHits();

            // end point: test 2 done
            long time_end = System.nanoTime();

            // write the term and both timings (ns) to stdout
            System.out.println(longTerm + "\t" + (time_mid - time_start) + "\t" + (time_end - time_mid));
            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}
A slight modification of the above to work with Lucene 5 (DocsEnum became PostingsEnum, and FSDirectory.open() now takes a Path):
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {
        Directory dirIndex = FSDirectory.open(Paths.get("/path/to/index/"));
        IndexReader indexReader = DirectoryReader.open(dirIndex);

        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // reused across iterations
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i = 0; i < maxDoc; i++) {
            if (docsPassed == 100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // skip deleted documents
            }
            Document doc = indexReader.document(i);

            // get longTerm from this doc and encode it into the reusable BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.parseLong(longTerm), 0, bytes);

            // time before the first test
            long time_start = System.nanoTime();

            // test 1: look up longTerm in the "field" postings and count live documents
            int indexCount = 0;
            termEnum.seekExact(bytes.toBytesRef());
            PostingsEnum docsEnum = termEnum.postings(liveDocs, null);
            if (docsEnum != null) {
                while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    indexCount++;
                }
            }

            // mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            // test 2: search for longTerm in "field"; the collector must be fresh per
            // search, since TotalHitCountCollector accumulates hits across searches
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            TotalHitCountCollector collector = new TotalHitCountCollector();
            searcher.search(query, collector);
            int searchCount = collector.getTotalHits();

            // end point: test 2 done
            long time_end = System.nanoTime();

            // write the term and both timings (ns) to stdout
            System.out.println(longTerm + "\t" + (time_mid - time_start) + "\t" + (time_end - time_mid));
            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}
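As an aside: if you are on Lucene 5.1 or later, I believe IndexSearcher.count(Query) wraps the option-2 approach in a single call (a minimal sketch; indexReader and the encoded bytes are as above):

// Lucene 5.1+: one-call count of live documents matching a query.
IndexSearcher searcher = new IndexSearcher(indexReader);
int count = searcher.count(new TermQuery(new Term("field", bytes.toBytesRef())));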