Solr/SolrJ: how to iterate results without creating a giant ArrayList

Tags: solr, solrj

Is there a way to iterate over a SolrJ response so that the results are fetched incrementally during iteration, rather than returned as one giant in-memory ArrayList?

Or do we have to resort to this:

    SolrQuery query = new SolrQuery();
    query.setQuery("*:*");
    int fetchSize = 1000;
    query.setRows(fetchSize);
    QueryResponse rsp = server.query(query);

    long offset = 0;
    long totalResults = rsp.getResults().getNumFound();

    while (offset < totalResults)
    {
        query.setStart((int) offset);  // requires an int? wtf?
        query.setRows(fetchSize);

        for (SolrDocument doc : server.query(query).getResults())
        {
             log.info((String) doc.getFieldValue("title"));
        }

        offset += fetchSize;
    }

And while I'm on the topic, why does SolrQuery.setStart() require an integer, when SolrDocumentList.getStart()/getNumFound() return long?

George Armhold asked Feb 19 '11 14:02



2 Answers

That code looks correct. You could also wrap it in an Iterator so that your client code doesn't have to know anything about the underlying paging.
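A minimal sketch of that Iterator wrapper, under stated assumptions: `PagingIterator` and its `fetchPage` callback are hypothetical names, not part of SolrJ. The callback takes `(start, rows)` and returns one page of results, so the paging logic is separate from the Solr call itself and the class compiles without SolrJ on the classpath.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

/**
 * Hypothetical helper (not a SolrJ class): iterates over a paged result set,
 * holding only one page in memory at a time.
 */
class PagingIterator<T> implements Iterator<T> {
    // Callback that fetches one page: (start offset, page size) -> results.
    private final BiFunction<Integer, Integer, List<T>> fetchPage;
    private final int pageSize;
    private int nextStart = 0;
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false;

    PagingIterator(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only when the current one has been consumed.
        while (!current.hasNext() && !exhausted) {
            List<T> page = fetchPage.apply(nextStart, pageSize);
            nextStart += page.size();
            // A short (or empty) page means there is nothing left to fetch.
            exhausted = page.size() < pageSize;
            current = page.iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

With the `query` and `server` objects from the question, the callback would presumably set `query.setStart(start)` and `query.setRows(rows)` and return `server.query(query).getResults()`. Since `server.query` throws a checked exception, the lambda would need to catch it and rethrow it as an unchecked one.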

As for SolrQuery.setStart() requiring an Integer, it certainly looks odd. I think you're right that it should be a long as well. Try asking on the solr-user or lucene-dev mailing lists.

Mauricio Scheffer answered Sep 30 '22 13:09


The reason, Caffeine, is that Solr is designed to give you the top X search results, and the expectation is that you will ask for a "reasonable" number of them. If Solr has to look deep into the search results (into the thousands), you're going against the grain of what Solr was designed for. It will work, but query responses get slower and slower the deeper into the results you go. There is some ongoing work in Solr to make this use case more efficient, but I've seen no progress on it lately.

David Smiley answered Sep 30 '22 11:09