
Return all records in one query in Elasticsearch

I have an index in Elasticsearch and want to display all of its records on a page of my web site. I wrote a bean that connects to the Elasticsearch node, searches for records, and returns a response. The Java code that does the searching is:

SearchResponse response = getClient().prepareSearch(indexName)
    .setTypes(typeName)              
    .setQuery(queryString("*:*"))
    .setExplain(true)
    .execute().actionGet();

But Elasticsearch sets the default size to 10, so I get only 10 hits in the response, even though there are more than 10 records in my index. If I set the size to Integer.MAX_VALUE, the search becomes very slow, which is not what I want.

How can I get all the records in one action, in an acceptable amount of time, without setting the size of the response?

San4o asked Feb 27 '13 14:02



1 Answer

The current highest-ranked answer works, but it requires loading the whole list of results in memory, which can cause memory issues for large result sets, and is in any case unnecessary.

I created a Java class that implements an Iterator over SearchHits, so you can iterate through all results. Internally, it handles pagination by issuing queries that set the from field, and it keeps only one page of results in memory at a time.

Usage:

// build your query here -- no need for setFrom(int)
SearchRequestBuilder requestBuilder = client.prepareSearch(indexName)
                                            .setTypes(typeName)
                                            .setQuery(QueryBuilders.matchAllQuery());

SearchHitIterator hitIterator = new SearchHitIterator(requestBuilder);
while (hitIterator.hasNext()) {
    SearchHit hit = hitIterator.next();

    // process your hit
}

Note that, when creating your SearchRequestBuilder, you don't need to call setFrom(int), as this is done internally by the SearchHitIterator. If you want to specify the page size (i.e. the number of search hits per page), call setSize(int); otherwise Elasticsearch's default value is used.

SearchHitIterator:

import java.util.Iterator;
import java.util.NoSuchElementException;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

/** Iterates over all hits of a search, fetching one page at a time via the "from" offset. */
public class SearchHitIterator implements Iterator<SearchHit> {

    private final SearchRequestBuilder initialRequest;

    private int searchHitCounter;            // total hits returned so far; used as the next "from" offset
    private SearchHit[] currentPageResults;  // the only page of results kept in memory
    private int currentResultIndex;          // index of the last hit returned from the current page

    public SearchHitIterator(SearchRequestBuilder initialRequest) {
        this.initialRequest = initialRequest;
        this.searchHitCounter = 0;
        this.currentResultIndex = -1;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only when there is no current page or it has been fully consumed.
        if (currentPageResults == null || currentResultIndex + 1 >= currentPageResults.length) {
            SearchRequestBuilder paginatedRequestBuilder = initialRequest.setFrom(searchHitCounter);
            SearchResponse response = paginatedRequestBuilder.execute().actionGet();
            currentPageResults = response.getHits().getHits();

            // An empty page means there are no more results.
            if (currentPageResults.length < 1) return false;

            currentResultIndex = -1;
        }

        return true;
    }

    @Override
    public SearchHit next() {
        // Follow the Iterator contract: throw instead of returning null when no hits remain.
        if (!hasNext()) throw new NoSuchElementException();

        currentResultIndex++;
        searchHitCounter++;
        return currentPageResults[currentResultIndex];
    }

}

In fact, realizing how convenient it is to have such a class, I wonder why ElasticSearch's Java client does not offer something similar.
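
The same paging idea can be tried out without an Elasticsearch cluster. Below is a minimal sketch, not part of any Elasticsearch API: PagedIterator and its fetchPage function are hypothetical names introduced here for illustration, with fetchPage standing in for "run the search with this from offset and this size". It assumes the data source returns a short (or empty) page once the results run out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Hypothetical helper: iterates over a paged source while holding one page in memory.
class PagedIterator<T> implements Iterator<T> {

    // (from, size) -> page of results; a stand-in for a paginated search request.
    private final BiFunction<Integer, Integer, List<T>> fetchPage;
    private final int pageSize;

    private List<T> currentPage = new ArrayList<>();
    private int indexInPage = 0;       // position of the next item inside currentPage
    private int consumed = 0;          // items returned so far; doubles as the next "from" offset
    private boolean exhausted = false; // set once a short page signals the end of the results

    PagedIterator(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        if (pageSize < 1) throw new IllegalArgumentException("pageSize must be positive");
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        if (indexInPage < currentPage.size()) return true; // items left in the current page
        if (exhausted) return false;                       // last page already consumed

        // Current page is used up: request the next one, starting after the consumed items.
        currentPage = fetchPage.apply(consumed, pageSize);
        indexInPage = 0;
        if (currentPage.size() < pageSize) exhausted = true; // short page: nothing follows it
        return !currentPage.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        consumed++;
        return currentPage.get(indexInPage++);
    }
}
```

To use this with a SearchRequestBuilder, one would presumably pass a function that calls setFrom(from) and setSize(size), executes the request, and returns the hits as a list. Tracking the short page avoids one extra empty request at the end, at the cost of assuming the source always fills a page when it can.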

Alphaaa answered Oct 08 '22 19:10