Using the Elasticsearch JavaScript client (Node.js), what is the best (or simplest) way to loop through every document in an index (roughly 100,000 documents)?
Elasticsearch gets significantly slower if you simply set size to some large number. One way to fetch all documents is to use scan and scroll IDs. Each response contains a _scroll_id, which you pass in the next request to get the next chunk of 100 documents. Note that search_type=scan is now deprecated, so this answer is out of date in that respect.
You could have one document per product or one document per order. There is no limit to how many documents you can store in a particular index.
Elasticsearch stores data as JSON documents. Each document correlates a set of keys (names of fields or properties) with their corresponding values (strings, numbers, Booleans, dates, arrays of values, geolocations, or other types of data).
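To make the key/value description concrete, here is a small sketch of what one such document might look like as a plain JavaScript object. Every field name and value below is hypothetical and chosen only to show one value of each kind (string, number, Boolean, date, array, geolocation).

```javascript
// Hypothetical "order" document; none of these field names come from a real index.
const orderDocument = {
  order_id: 'A-1001',                    // string
  quantity: 3,                           // number
  paid: true,                            // Boolean
  created_at: '2024-01-15T09:30:00Z',    // date, written as an ISO 8601 string
  tags: ['gift', 'express'],             // array of values
  ship_to: { lat: 52.52, lon: 13.405 },  // geolocation as a lat/lon object
};

// The document travels to and from Elasticsearch as JSON text.
const wire = JSON.stringify(orderDocument);
const restored = JSON.parse(wire);

console.log(Object.keys(restored));
```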
You can use the search API to search and aggregate data stored in Elasticsearch data streams or indices. The API's query request body parameter accepts queries written in Query DSL. For example, a match query against my-index-000001 can match documents with a user.id value of kimchy.
I think a good place to start is with scan queries using the scroll API:
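A match query like the one described above might be assembled as follows in JavaScript. This is only a sketch: buildMatchSearch is a hypothetical helper, and passing the result straight to client.search() assumes a client that accepts query at the top level of the request (as the 8.x @elastic/elasticsearch client does); older clients expect it under a body key.

```javascript
// Hypothetical helper: builds search parameters for a single-field match query.
function buildMatchSearch(index, field, value) {
  return {
    index,
    query: {
      match: { [field]: value },  // Query DSL: match documents where `field` matches `value`
    },
  };
}

const request = buildMatchSearch('my-index-000001', 'user.id', 'kimchy');

// With a connected client this could then be sent as: await client.search(request)
console.log(JSON.stringify(request, null, 2));
```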
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
Basically it's similar to a database cursor: you open the query with a time limit and it returns a scroll ID. You then use that scroll ID to retrieve the first batch of results, which comes back along with a new scroll ID. Examples below:
curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=1000' -d '
{
"query" : {
"match_all" : {}
}
}
'
This will return a _scroll_id that you then use to retrieve documents:
curl -XGET 'localhost:9200/_search/scroll?scroll=10m' -d '<_SCROLL_ID_HERE>'
Note that with search_type=scan this will return 1000 documents PER PRIMARY SHARD, so if you have 4 primary shards each call returns up to 4000 documents. In addition to the documents, each call returns a new _scroll_id, which you then use for the next call. The scroll=10m parameter sets a time limit of 10 minutes to keep the scroll open between calls.
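Since the question asks about the Node.js client, the same open-then-scroll loop could be sketched in JavaScript roughly as below. Several things here are assumptions rather than part of the answer above: forEachDocument is a hypothetical helper; the client is assumed to expose search(), scroll() and clearScroll() that resolve directly to the response body and accept scroll_id in snake case (as the 8.x @elastic/elasticsearch client does; the 7.x client wraps responses in { body }, and the legacy client names some parameters differently); and because search_type=scan is deprecated, the sketch uses a plain scroll sorted by _doc, in which case size applies to the whole batch rather than to each shard.

```javascript
// Sketch: visit every document in an index through the scroll API.
// `client` is assumed to return response bodies directly (8.x-style client).
async function forEachDocument(client, index, onDocument, batchSize = 1000) {
  // Open the scroll; the first response already contains the first batch.
  let response = await client.search({
    index,
    scroll: '10m',             // keep the scroll context alive between calls
    size: batchSize,
    sort: ['_doc'],            // index order, used here in place of search_type=scan
    query: { match_all: {} },
  });

  let count = 0;
  try {
    // An empty batch means the scroll is exhausted.
    while (response.hits.hits.length > 0) {
      for (const hit of response.hits.hits) {
        await onDocument(hit);
        count += 1;
      }
      // Always pass the most recent _scroll_id to the next call.
      response = await client.scroll({
        scroll_id: response._scroll_id,
        scroll: '10m',
      });
    }
  } finally {
    // Release the server-side scroll context instead of waiting for the timeout.
    if (response._scroll_id) {
      await client.clearScroll({ scroll_id: response._scroll_id });
    }
  }
  return count;
}
```

A call might look like `await forEachDocument(client, 'my-index', hit => console.log(hit._source))`, where my-index is a placeholder name. Taking the client as a parameter keeps the loop independent of how the connection is configured and makes it easy to exercise with a stub.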