
Looping over all documents in an Elasticsearch index

Using the Elasticsearch javascript client (node.js), what is the best (or simplest) way to loop through every document in an index (ca. 100 000 documents)?

asked May 24 '14 by user1612947


People also ask

What is the Elasticsearch query to get all documents from an index?

Elasticsearch will get significantly slower if you just pass a very large number as size; one way to get all documents is to use scan and scroll IDs. The results contain a _scroll_id, which you pass back in a follow-up request to get the next chunk. Note that search_type=scan is now deprecated.

How many documents can Elasticsearch handle?

You could have one document per product or one document per order. There is no limit to how many documents you can store in a particular index.

How are documents stored in Elasticsearch?

Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names of fields or properties) with their corresponding values (strings, numbers, Booleans, dates, arrays of values, geolocations, or other types of data).
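For illustration only, a hypothetical product document showing a string, a number, a Boolean, a date, an array of values, and a geolocation might look like this:

{
  "name": "USB cable",
  "price": 3.99,
  "in_stock": true,
  "added": "2014-05-24",
  "tags": ["electronics", "accessories"],
  "location": { "lat": 52.37, "lon": 4.89 }
}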

How do I view all data in Elasticsearch?

You can use the search API to search and aggregate data stored in Elasticsearch data streams or indices. The API's query request body parameter accepts queries written in Query DSL. The following request searches my-index-000001 using a match query, which matches documents with a user.id value of kimchy.
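As a rough sketch, that same docs request through the legacy elasticsearch Node.js client could look like this (the index my-index-000001 and the value kimchy come from the request described above):

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// match query on the user.id field, mirroring the docs example
client.search({
  index: 'my-index-000001',
  body: {
    query: {
      match: { 'user.id': 'kimchy' }
    }
  }
}, function (err, response) {
  if (err) { throw err; }
  console.log(response.hits.hits);
});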


1 Answer

I think a good place to start is with scan queries using the scroll API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Basically it's similar to a cursor in a database: you open the query with a time limit and it returns a scroll ID. You then use that scroll ID to retrieve the first batch of results, and each response returns the documents along with a new scroll ID. Examples below:

curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=1000' -d '
{
    "query" : {
        "match_all" : {}
    }
}
'

This will return a _scroll_id that you then use to retrieve documents:

curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d '<_SCROLL_ID_HERE>'

Note that this will return 1000 documents PER PRIMARY SHARD - so if you have 4 primary shards it will return 4000 documents per batch. In addition to the documents, each call returns a new _scroll_id, which you then use for the next call. The "scroll=10m" parameter keeps the scroll context open for 10 minutes between calls.
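Since the original question asks about the Node.js client, here is a minimal sketch of the same loop using the legacy elasticsearch JavaScript client (an assumption on my part - the answer itself only shows curl). It skips the deprecated search_type=scan, so the first response already contains hits, and the index name myindex is a placeholder:

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Fetch the first batch, then keep scrolling until a batch comes back empty
client.search({
  index: 'myindex',   // placeholder - use your own index name
  scroll: '10m',      // keep the scroll context alive between calls
  size: 1000,
  body: { query: { match_all: {} } }
}, function getMoreUntilDone(err, response) {
  if (err) { throw err; }

  response.hits.hits.forEach(function (doc) {
    console.log(doc._id);   // process each document here
  });

  if (response.hits.hits.length > 0) {
    // pass the most recent _scroll_id back to get the next batch
    client.scroll({ scrollId: response._scroll_id, scroll: '10m' }, getMoreUntilDone);
  } else {
    console.log('all documents processed');
  }
});

The named callback (getMoreUntilDone) is just a convenient way to reuse the same handler for the initial search and every subsequent scroll call.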

answered Sep 28 '22 by John Petrone