
Get the status of an Elasticsearch task for a long-running update query

Assume I have a long-running update query that updates roughly 200k to 500k documents, perhaps even more. Why I need to update so many documents is beyond the scope of the question.

Since the client times out (I use the official ES python client), I would like to have a way to check what the status of the bulk update request is, without having to use enormous timeout values.

For a short request, I can simply use the response of the request. Is there a way to get the response of a long-running request as well, or to assign a name or id to a request so that I can reference it later?

For a request that is still running, I can use the tasks API to get this information.

But how do I get it for the other statuses, completed or failed? If I try to access a task that has already completed, I get resource not found.

P.S. I am using update_by_query for the update

Shivendra Soni asked Mar 22 '18 22:03




2 Answers

With the task id you can look up the task directly:

GET /_tasks/taskId:1

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it, then it'll come back with a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.

From here https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query-task-api
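The flow described in the quote can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it uses only the standard library, assumes an unsecured cluster at http://localhost:9200, and the helper names (http_json, start_update_by_query, wait_for_task) are made up for illustration.

```python
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without security


def http_json(method, url, body=None):
    """Send one JSON request and decode the JSON reply (standard library only)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def start_update_by_query(index, body, send=http_json):
    """Launch the update in the background and return its task id."""
    url = f"{ES_URL}/{index}/_update_by_query?wait_for_completion=false"
    return send("POST", url, body)["task"]


def wait_for_task(task_id, send=http_json, interval=5.0, sleep=time.sleep):
    """Poll GET /_tasks/<task_id> until the reply reports completed=true."""
    while True:
        info = send("GET", f"{ES_URL}/_tasks/{task_id}")
        if info.get("completed"):
            return info
        sleep(interval)
```

Each poll is a short request, so no large client timeout is needed. With the official Python client, the equivalent calls should be es.update_by_query(..., wait_for_completion=False) and es.tasks.get(task_id=...), though that is worth checking against the client version in use.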

My use case went like this: I needed to run an update_by_query, with Painless as the script language. While testing, I first did a reindex. Then I tried update_by_query (the two APIs resemble each other a lot). I sent a request to the tasks API while the operation was still running, and I could see the task being executed. When it finished, I ran a query and found that the data in the fields I was manipulating had disappeared. The script itself worked, since I had used the same script with the reindex API and everything went as it should have. I didn't investigate further for lack of time, but... yeah, test thoroughly.

Alkis Kalogeris answered Oct 23 '22 02:10


I find GET /_tasks/taskId:1 confusing. It should be

GET http://localhost:9200/_tasks/taskId

A taskId looks something like this NCvmGYS-RsW2X8JxEYumgA:1204320.


Here is my simple explanation of this topic.

To check a task, you need to know its taskId.

A task id is a string that consists of a node_id, a colon, and a task_sequence_number. An example is taskId = NCvmGYS-RsW2X8JxEYumgA:1204320, where node_id = NCvmGYS-RsW2X8JxEYumgA and task_sequence_number = 1204320. Some people, including myself, thought taskId = 1204320, but that is not how the Elasticsearch developers define it at the moment.
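To make the structure concrete, here is a small Python sketch that splits a task id into its two parts and rejects a bare number. The TaskId type and parse_task_id function are hypothetical names for illustration only, not part of any Elasticsearch client.

```python
from typing import NamedTuple


class TaskId(NamedTuple):
    node_id: str
    task_number: int


def parse_task_id(task_id: str) -> TaskId:
    """Split '<node_id>:<task_number>' at the last colon."""
    node_id, sep, number = task_id.rpartition(":")
    if not sep or not node_id or not number.isdigit():
        raise ValueError(f"expected '<node_id>:<task_number>', got {task_id!r}")
    return TaskId(node_id, int(number))
```

Calling parse_task_id("1204320") raises a ValueError, which catches the common mistake of passing only the sequence number.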

A taskId can be found in two ways.

  1. wait_for_completion=false. When you send a request to ES with this parameter, the response will be {"task" : "NCvmGYS-RsW2X8JxEYumgA:1204320"}. You can then check the status of that task like this: GET http://localhost:9200/_tasks/NCvmGYS-RsW2X8JxEYumgA:1204320
  2. GET http://localhost:9200/_tasks?detailed=false&actions=*/delete/byquery. This example returns the status of all running tasks whose action is a delete by query (for update_by_query, the pattern would be */update/byquery). If you know there is only one such task running on ES, you can find your taskId in the response.
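The second approach can be sketched in Python like this. It is a rough sketch: list_tasks_url and task_ids are hypothetical helper names, and the parsing assumes the list reply groups tasks per node under "nodes" and "tasks", keyed by the full task id.

```python
from urllib.parse import urlencode


def list_tasks_url(base_url, action_pattern):
    """Build the task-list URL, keeping '*' and '/' readable in the pattern."""
    query = urlencode(
        {"detailed": "false", "actions": action_pattern}, safe="*/"
    )
    return f"{base_url}/_tasks?{query}"


def task_ids(list_response):
    """Collect every '<node_id>:<task_number>' key from a task-list reply."""
    ids = []
    for node in list_response.get("nodes", {}).values():
        ids.extend(node.get("tasks", {}).keys())
    return ids
```

If task_ids returns more than one id, the reply alone does not say which task is yours, so starting the request with wait_for_completion=false and keeping the returned id is the more reliable route.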

After you know the taskId, you can get the status of a task with this.

GET /_tasks/taskId

Notice that you can only check the status of a task while it is running, or if it was started with wait_for_completion=false.

To explain a bit more: wait_for_completion is true by default. Based on my understanding, tasks started with wait_for_completion=true are "in-memory" only. You can check the status of such a task while it is running, but it is completely gone once it has completed or been canceled, meaning that checking the status will return a 'resource_not_found_exception'. Tasks started with wait_for_completion=false are stored in an ES system index, .tasks. You can still check their status after they finish. However, you might want to delete the task document from the .tasks index once you are done with it, to save some space. The deletion request looks like this:

DELETE http://localhost:9200/.tasks/task/NCvmGYS-RsW2X8JxEYumgA:1204320

You will receive a resource_not_found_exception if the taskId is not present (for example, if you delete the same task twice, or if you try to delete an in-memory task that was started with wait_for_completion=true).
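As a rough Python sketch of how a stored task could be interpreted and cleaned up: summarize_task and task_document_url are hypothetical helpers, the reply fields ("completed", "error", "response", "failures") are assumed from the task API and update_by_query documentation, and the document path is the one quoted in this thread, which may differ on newer Elasticsearch versions.

```python
def summarize_task(info):
    """Classify a GET /_tasks/<id> reply as running, failed, or succeeded."""
    if not info.get("completed"):
        # Still running: report the in-progress counters, if present.
        return "running", info.get("task", {}).get("status", {})
    if "error" in info:
        return "failed", info["error"]
    response = info.get("response", {})
    if response.get("failures"):
        # The task finished, but some documents could not be updated.
        return "failed", response["failures"]
    return "succeeded", response


def task_document_url(base_url, task_id):
    """URL of the stored task document, to be used with an HTTP DELETE."""
    # Assumption: the .tasks/task/<id> path quoted in this thread.
    return f"{base_url}/.tasks/task/{task_id}"
```

Checking the failures list separately matters because a task can report completed without an error field while still having skipped some documents.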

About this confusing taskId thing: I made a pull request, https://github.com/elastic/elasticsearch/pull/31122, to help clarify the Elasticsearch documentation. Unfortunately, they rejected it. Ugh.

Jinggang answered Oct 23 '22 04:10