Can someone help me understand how to work with customized implementations of the abstract Collector class in Lucene?
I've implemented two ways of querying the index with some test text:
1. Total hits is equal to 2. Both file names are the same, hence the results size is equal to 1 because I keep them in a set.
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    String fileName = doc.get(FILENAME_FIELD);
    results.add(fileName);
}
2. CountCollect is equal to 2. Both documents from which I get the file names in the collect method of the Collector are unique, hence the final results size is also equal to 2. The countNextReader variable equals 10 at the end.
private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
    final Set<String> results = new HashSet<String>();
    Collector collector = new Collector() {
        private int base;
        private Scorer scorer;
        private int countCollect;
        private int countNextReader;

        @Override
        public void collect(int doc) throws IOException {
            Document document = searcher.doc(doc);
            String filename = document.get(FILENAME_FIELD);
            results.add(filename);
            countCollect++;
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            this.base = ctx.docBase;
            countNextReader++;
        }

        @Override
        public String toString() {
            LOG.info("CountCollect: " + countCollect);
            LOG.info("CountNextReader: " + countNextReader);
            return null;
        }
    };
    searcher.search(query, collector);
    collector.toString();
    return results;
}
I don't understand why I get different documents and different file names within the collect method compared to the previous implementation. I would expect the same results, wouldn't I?
The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only the ones that you get back. In fact, you usually get back only the top documents, which are effectively the ones that you show to the users.
I would suggest not doing things like:
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
which would force Lucene to collect and return every single matching document.
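If all you actually need is the number of matching documents, Lucene ships a TotalHitCountCollector that counts hits without loading or ranking them. A minimal sketch against the Lucene 4.x API:
// Count matches without retrieving or scoring any document.
TotalHitCountCollector countingCollector = new TotalHitCountCollector();
searcher.search(query, countingCollector);
LOG.info("Total matches: " + countingCollector.getTotalHits());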
Anyway, if you only have two matching documents (or you are asking for all the documents that match), the number of documents that you get back and the number of calls to the collect method should be the same.
The setNextReader method is something completely different that you shouldn't worry about too much. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, the setNextReader method is called so that the Collector can do operations at the segment level. For example, the internal Lucene document id is unique only within its segment, thus you need to add docBase to it to make it unique within the whole index. That's why you need to store docBase when the segment changes and take it into account. Your countNextReader variable simply contains the number of segments that were analyzed for your query; it has nothing to do with your documents.
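To make the per-segment ids concrete, here is a minimal sketch of a segment-aware collector that gathers index-wide document ids (same Lucene 4.x classes as your code; globalDocIds is a hypothetical list defined outside the collector):
final List<Integer> globalDocIds = new ArrayList<Integer>();
Collector idCollector = new Collector() {
    private int docBase; // offset of the current segment within the whole index

    @Override
    public void setNextReader(AtomicReaderContext ctx) throws IOException {
        // Remember where the new segment starts so local ids can be rebased.
        this.docBase = ctx.docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        // 'doc' is segment-local; adding docBase makes it index-wide.
        globalDocIds.add(docBase + doc);
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // Scores are not needed to collect ids.
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
};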
Looking deeper at your Collector code, I also noticed you are not taking the docBase into account when retrieving documents by id. Your Collector already stores it in its base field, so this should fix it:
Document document = searcher.doc(doc + base);
Also keep in mind that loading a stored field within a Collector is not really a wise thing to do: it makes your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you actually want to return. Within a Collector you usually load only the information needed to score documents, like payloads or similar things, often making use of the Lucene field cache too.
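If you do need the file names of every match, a pattern that keeps disk access out of the collect hotspot is to gather only the global document ids during the search (as in the sketch above) and load the stored field afterwards. A sketch, reusing the hypothetical globalDocIds list from the previous example:
// Load stored fields only after the search, outside the collect hotspot.
Set<String> fileNames = new HashSet<String>();
for (int globalId : globalDocIds) {
    Document document = searcher.doc(globalId);
    fileNames.add(document.get(FILENAME_FIELD));
}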