ElasticSearch Scroll API with multi threading

Question

First of all, I already understand the basic logic of the Elasticsearch Scroll API. To use it, we first call the search method with a scroll value such as 1m; it returns a _scroll_id, which is then used for the consecutive scroll calls in a loop until all of the documents have been returned. My problem is that I want to run this process across multiple threads rather than serially. For example:

If I have 300000 documents, then I want to process/get the docs this way:

The 1st thread will process the first 100000 documents
The 2nd thread will process the next 100000 documents
The 3rd thread will process the remaining 100000 documents

So my question is: since I didn't find any way to set a from value on the Scroll API, how can I make the scrolling faster with threading, instead of processing the documents serially?

My sample Python code:

if index_name is not None and doc_type is not None and body is not None:
   es = init_es()
   page = es.search(index=index_name, doc_type=doc_type, scroll='30s', size=10, body=body)
   sid = page['_scroll_id']
   # Number of hits on the first page (not the total hit count)
   scroll_size = len(page['hits']['hits'])

   # Aggregations are only returned with the initial search response
   print(page.get('aggregations'))

   # Start scrolling
   while scroll_size > 0:
       # Handle the current page before asking for the next one
       print("scrolled data:")
       print(page['hits']['hits'])

       print("Scrolling...")
       page = es.scroll(scroll_id=sid, scroll='30s')
       # Update the scroll ID
       sid = page['_scroll_id']
       print("scroll id: " + sid)

       # Get the number of results returned by the last scroll
       scroll_size = len(page['hits']['hits'])
       print("scroll size: " + str(scroll_size))

Paul · Accepted Answer

Have you tried a sliced scroll? According to the linked docs:

For scroll queries that return a lot of documents it is possible to split the scroll in multiple slices which can be consumed independently.

and

Each scroll is independent and can be processed in parallel like any scroll request.

I have not used this myself (the largest result set I need to process is ~50k documents) but this seems to be what you're looking for.
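As a rough sketch of how a sliced scroll could be spread over threads with elasticsearch-py: each worker opens its own scroll with a "slice" clause in the request body and reads it to the end. The function names scroll_slice and parallel_scroll are made up for illustration, the es.search call assumes the older body= keyword style used in the question, and the client object is passed in rather than created here. Note that Elasticsearch requires the slice max to be greater than 1.

```python
from concurrent.futures import ThreadPoolExecutor


def scroll_slice(es, index, query, slice_id, max_slices, scroll="1m", size=1000):
    """Read one slice of a sliced scroll to the end and return its hits.

    Hypothetical helper: `es` is assumed to be an elasticsearch-py client.
    """
    body = {
        # Each worker asks for a different slice id out of max_slices
        "slice": {"id": slice_id, "max": max_slices},
        "query": query,
    }
    page = es.search(index=index, body=body, scroll=scroll, size=size)
    sid = page["_scroll_id"]
    hits = page["hits"]["hits"]
    docs = []
    try:
        while hits:
            docs.extend(hits)
            page = es.scroll(scroll_id=sid, scroll=scroll)
            sid = page["_scroll_id"]
            hits = page["hits"]["hits"]
    finally:
        # Release the search context on the cluster once the slice is done
        es.clear_scroll(scroll_id=sid)
    return docs


def parallel_scroll(es, index, query, max_slices=3, **kwargs):
    """Run one scroll_slice per thread and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        futures = [
            pool.submit(scroll_slice, es, index, query, i, max_slices, **kwargs)
            for i in range(max_slices)
        ]
        return [doc for future in futures for doc in future.result()]
```

With a real client this might be called as parallel_scroll(es, index_name, {"match_all": {}}, max_slices=3). The slices are not guaranteed to be equal in size, so the three threads would each get roughly, not exactly, a third of the 300000 documents. Collecting everything into one list is only for brevity; for large result sets each thread would more likely process its hits as they arrive.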

Tags: multithreading, elasticsearch, elasticsearch-py

Always Sunny
