
How to work with a custom Collector on a Lucene index

Tags:

lucene

Can someone help me understand how to work with custom implementations of the abstract Collector class in Lucene?

I've implemented two ways of querying the index with some test text:

1. Total hits equals 2. Both file names are the same, so the results size is 1 because I keep them in a set.

TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDosArray = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDosArray) {
    Document doc = searcher.doc(scoreDoc.doc);
    String fileName = doc.get(FILENAME_FIELD);
    results.add(fileName);
}

2. countCollect equals 2. The two documents from which I read file names in the Collector's collect method are different, so the final results size is also 2. The countNextReader variable equals 10 at the end of the search.

private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
    final Set<String> results = new HashSet<String>();
    Collector collector = new Collector() {
        private int base;
        private Scorer scorer;
        private int countCollect;
        private int countNextReader;

        @Override
        public void collect(int doc) throws IOException {
            Document document = searcher.doc(doc);
            String filename = document.get(FILENAME_FIELD);
            results.add(filename);
            countCollect++;
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            this.base = ctx.docBase;
            countNextReader++;
        }

        @Override
        public String toString() {
            LOG.info("CountCollect: " + countCollect);
            LOG.info("CountNextReader: " + countNextReader);
            return null;
        }
    };
    searcher.search(query, collector);
    collector.toString();
    return results;
}

I don't understand why the collect method gives me different documents and different file names than the previous implementation. I would expect the same result, wouldn't I?

damax asked Mar 22 '13 10:03



1 Answer

The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only for the ones you get back. In fact, you usually get back only the top documents, which are the ones you actually show to users.

I would advise against calls like:

TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);

which asks Lucene to collect and return every matching document instead of only the top few.

Anyway, if you only have two matching documents (or you ask for all the documents that match), the number of documents you get back and the number of calls to the collect method should be the same.

The setNextReader method is something completely different that you shouldn't care that much about. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, setNextReader is called so that the Collector can do segment-level work. For example, the internal Lucene document id passed to collect is unique only within its segment, so you need to add the segment's docBase to it to get an id that is unique within the whole index. That's why you need to store docBase when the segment changes and take it into account later. Your countNextReader variable just contains the number of segments that were visited for your query; it has nothing to do with your documents.
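To make the docBase arithmetic concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class SegmentedIndexSketch and its methods are hypothetical names invented for illustration, and it assumes the usual layout in which a segment's docBase is the number of documents in all earlier segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a segmented index; hypothetical classes, not Lucene API.
class SegmentedIndexSketch {

    // Each inner list is one segment. A document's position in its list is
    // its segment-local id, which restarts at 0 for every segment.
    private final List<List<String>> segments = new ArrayList<>();

    void addSegment(List<String> fileNames) {
        segments.add(new ArrayList<>(fileNames));
    }

    // docBase of a segment = number of documents in all earlier segments.
    int docBase(int segmentIndex) {
        int base = 0;
        for (int i = 0; i < segmentIndex; i++) {
            base += segments.get(i).size();
        }
        return base;
    }

    // Lookup by index-wide id, the kind of id a searcher-level lookup expects.
    String docByGlobalId(int globalId) {
        int remaining = globalId;
        for (List<String> segment : segments) {
            if (remaining < segment.size()) {
                return segment.get(remaining);
            }
            remaining -= segment.size();
        }
        throw new IllegalArgumentException("No document with id " + globalId);
    }

    public static void main(String[] args) {
        SegmentedIndexSketch index = new SegmentedIndexSketch();
        index.addSegment(Arrays.asList("a.txt", "b.txt", "c.txt"));
        index.addSegment(Arrays.asList("d.txt", "e.txt"));

        int segment = 1;  // the search has moved on to the second segment
        int localDoc = 1; // the collector is handed local id 1, i.e. "e.txt"

        // Treating the local id as index-wide picks a document from the first segment.
        System.out.println(index.docByGlobalId(localDoc));
        // Adding the segment's docBase picks the document that actually matched.
        System.out.println(index.docByGlobalId(index.docBase(segment) + localDoc));
    }
}
```

In this model the first lookup returns a file from the wrong segment, which is the same kind of mismatch as the "different documents and different file names" described in the question.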

Looking deeper at your Collector code, I also noticed that you don't take the docBase into account when retrieving documents by id. This should fix it (base is the field in which your setNextReader stores ctx.docBase):

Document document = searcher.doc(base + doc);

Also keep in mind that loading a stored field within a Collector is not really a wise thing to do. It will make your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you want to return. Within a Collector you usually load only the information needed to score documents, like payloads or similar things, usually making use of the Lucene field cache too.
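The cost difference can be sketched with another toy model in plain Java. Nothing here is Lucene API: DeferredLoadSketch, loadFileName and the counter are hypothetical, and loadFileName merely stands in for an expensive stored-field read. For brevity the sketch returns the first `limit` matches, whereas a real search would return the top-scoring ones.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Lucene API: stored fields are read after collection,
// and only for the hits that are actually returned.
class DeferredLoadSketch {

    // Counts simulated disk reads of stored fields.
    int storedFieldLoads = 0;

    // Hypothetical stand-in for an expensive stored-field lookup.
    String loadFileName(int globalDocId) {
        storedFieldLoads++;
        return "file-" + globalDocId + ".txt";
    }

    List<String> search(int[] matchingDocIds, int limit) {
        // Collection phase: remember ids only, with no disk access per match.
        List<Integer> collected = new ArrayList<>();
        for (int docId : matchingDocIds) {
            collected.add(docId);
        }
        // Load phase: read stored fields for the returned subset only.
        List<String> fileNames = new ArrayList<>();
        int count = Math.min(limit, collected.size());
        for (int i = 0; i < count; i++) {
            fileNames.add(loadFileName(collected.get(i)));
        }
        return fileNames;
    }

    public static void main(String[] args) {
        DeferredLoadSketch sketch = new DeferredLoadSketch();
        int[] matches = new int[1000];
        for (int i = 0; i < matches.length; i++) {
            matches[i] = i;
        }
        List<String> page = sketch.search(matches, 10);
        System.out.println(page.size() + " results, "
                + sketch.storedFieldLoads + " stored-field loads");
    }
}
```

With 1000 matches and a limit of 10, the sketch performs 10 simulated loads, while loading inside the collection loop would perform one per match.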

javanna answered Oct 19 '22 18:10
