
Looping over all documents in an Elasticsearch index

Using the Elasticsearch javascript client (node.js), what is the best (or simplest) way to loop through every document in an index (ca. 100 000 documents)?

asked May 24 '14 by user1612947


People also ask

What is the Elasticsearch query to get all documents from an index?

Elasticsearch will get significantly slower if you just pass a very large number as size; one way to get all documents is to use scan and scroll IDs. The results contain a _scroll_id, which you pass back in a follow-up request to get the next chunk. Note that search_type=scan is now deprecated.

How many documents can Elasticsearch handle?

You could have one document per product or one document per order. There is no limit to how many documents you can store in a particular index.

How are documents stored in Elasticsearch?

Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names of fields or properties) with their corresponding values (strings, numbers, Booleans, dates, arrays of values, geolocations, or other types of data).
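For illustration only, a hypothetical product document showing a string, a number, a Boolean, a date, an array of values, and a geolocation might look like this:

{
  "name": "USB cable",
  "price": 3.99,
  "in_stock": true,
  "added": "2014-05-24",
  "tags": ["electronics", "accessories"],
  "location": { "lat": 52.37, "lon": 4.89 }
}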

How do I view all data in Elasticsearch?

You can use the search API to search and aggregate data stored in Elasticsearch data streams or indices. The API's query request body parameter accepts queries written in Query DSL. The following request searches my-index-000001 using a match query, which matches documents with a user.id value of kimchy.
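As a rough sketch, that same docs request through the legacy elasticsearch Node.js client could look like this (the index my-index-000001 and the value kimchy come from the request described above):

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// match query on the user.id field, mirroring the docs example
client.search({
  index: 'my-index-000001',
  body: {
    query: {
      match: { 'user.id': 'kimchy' }
    }
  }
}, function (err, response) {
  if (err) { throw err; }
  console.log(response.hits.hits);
});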


1 Answer

I think a good place to start is with scan queries using the scroll API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Basically it's similar to a cursor in a database: you open the query with a time limit and it returns a scroll ID. You then use that scroll ID to retrieve the first batch of results, and each response returns the documents along with a new scroll ID. Examples below:

curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=1000' -d '
{
    "query" : {
        "match_all" : {}
    }
}
'

This will return a _scroll_id that you then use to retrieve documents:

curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d '<_SCROLL_ID_HERE>'

Note that this will return 1000 documents PER PRIMARY SHARD - so if you have 4 primary shards it will return 4000 documents per batch. In addition to the documents, each call returns a new _scroll_id, which you then use for the next call. The "scroll=10m" parameter keeps the scroll context open for 10 minutes between calls.
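Since the original question asks about the Node.js client, here is a minimal sketch of the same loop using the legacy elasticsearch JavaScript client (an assumption on my part - the answer itself only shows curl). It skips the deprecated search_type=scan, so the first response already contains hits, and the index name myindex is a placeholder:

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Fetch the first batch, then keep scrolling until a batch comes back empty
client.search({
  index: 'myindex',   // placeholder - use your own index name
  scroll: '10m',      // keep the scroll context alive between calls
  size: 1000,
  body: { query: { match_all: {} } }
}, function getMoreUntilDone(err, response) {
  if (err) { throw err; }

  response.hits.hits.forEach(function (doc) {
    console.log(doc._id);   // process each document here
  });

  if (response.hits.hits.length > 0) {
    // pass the most recent _scroll_id back to get the next batch
    client.scroll({ scrollId: response._scroll_id, scroll: '10m' }, getMoreUntilDone);
  } else {
    console.log('all documents processed');
  }
});

The named callback (getMoreUntilDone) is just a convenient way to reuse the same handler for the initial search and every subsequent scroll call.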

answered Sep 28 '22 by John Petrone