I wanted to know more about Elasticsearch delete, its Java high-level delete API & whether it's feasible to perform a bulk delete.
Following is the configuration information.
Elasticsearch dependencies added:
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.1.1</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>7.1.1</version>
</dependency>
In my case, around 10K records are added daily into the index dev-answer. I want to trigger a delete operation (this can be triggered daily, once a week, or once a month) which will basically delete all documents from the above index if a specific condition is satisfied (which I'll give in the DeleteByQueryRequest).
For delete there is an API, as given in the latest documentation I'm referring to:
DeleteByQueryRequest request = new DeleteByQueryRequest("source1", "source2");
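For reference, here is roughly how I plan to build and fire the request; client is my RestHighLevelClient instance, and the createdDate range condition is just a placeholder for my actual condition:
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
// Placeholder condition: delete documents older than 30 days ("createdDate" is a made-up field)
request.setQuery(QueryBuilders.rangeQuery("createdDate").lte("now-30d"));
request.setBatchSize(1000); // internal scroll batch size

client.deleteByQueryAsync(request, RequestOptions.DEFAULT,
        new ActionListener<BulkByScrollResponse>() {
            @Override
            public void onResponse(BulkByScrollResponse response) {
                // response.getDeleted() is the number of documents actually removed
            }

            @Override
            public void onFailure(Exception e) {
                // log or retry
            }
        });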
While reading the documentation I came across the following questions which I'm unable to understand.
1. As in the doc: "It's also possible to limit the number of processed documents by setting size."
request.setSize(10);
What does "processed documents" mean? Will it delete only 10 documents?
2. What batch size should I set?
request.setBatchSize(100);
Is its performance based on how many documents we are going to delete? Should I first make a call to get the number of documents & change setBatchSize based on that?
3. request.setSlices(2);
Should slices depend on how many cores the executor machine has, or on the number of cores in the Elasticsearch cluster?
4. In the documentation the method setSlices(2) is given, which I'm unable to find in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?
5. Let's consider that I'm executing this delete query in async mode and it takes 0.5-1.0 sec; if I meanwhile do a get request on this index, will it give some exception? Also, if at the same time I insert a new document & then retrieve it, will it be able to give a response?
From the Elasticsearch documentation: while processing a delete by query request, Elasticsearch performs multiple search requests sequentially to find all of the matching documents to delete. A bulk delete request is performed for each batch of matching documents. If a search or bulk request is rejected, the requests are retried up to 10 times, with exponential back off.
Unlike the delete API, it does not support wait_for. If the request contains wait_for_completion=false, Elasticsearch performs some preflight checks, launches the request, and returns a task you can use to cancel or get the status of the task. Elasticsearch creates a record of this task as a document at .tasks/task/${taskId}.
If you have not already, you should read the search/_scroll documentation. _delete_by_query performs a scroll search using the query given as a parameter. The size parameter corresponds to the number of documents returned by each call to the scroll endpoint. If you have 10 documents matching your query and a size of 2, Elasticsearch will internally perform 5 search/_scroll calls (i.e., 5 batches), while if you set the size to 5, only 2 search/_scroll calls will be performed. Regardless of the size parameter, all documents matching the query will be removed, but it will be more or less efficient.
The setBatchSize() method is equivalent to setting the size parameter in the query. You can read this article to determine the correct value for the size parameter. You would have to run the search request twice to get the number of documents to delete, and I believe that would not be efficient. I advise you to find a constant value and stick to it.
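For instance (reusing the DeleteByQueryRequest and QueryBuilders classes from the snippets above; the value 1000 is only illustrative, not a recommendation):
DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
request.setQuery(QueryBuilders.matchAllQuery()); // placeholder query
// With ~10K matching documents and a batch size of 1000, the request will internally run
// about 10 search/_scroll + bulk delete rounds; all matching documents are removed either way.
request.setBatchSize(1000);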
The number of slices should be set based on your Elasticsearch cluster configuration. It is used to parallelize the search, both across shards and within each shard. You can read the documentation for hints on how to set this parameter; usually it is the number of shards of your index.
You are right, that is probably an error in the documentation. I have never tried it, but I believe you should use forSlice(TaskId slicingTask, SearchRequest slice, int totalSlices).
First, as stated in the documentation, the _delete_by_query endpoint creates a snapshot of the index and works on this copy.
For a get request, it depends on whether the document has already been deleted or not. It will never throw an exception; you will just get the same result as if you were retrieving an existing or a non-existing document. Please note that unless you specify a sort in the search query, the order in which the documents are deleted is not determined.
If you insert (or update) a document while the request is being processed, this document will not be taken into account by the _delete_by_query endpoint, even if it matches the _delete_by_query query. This is where the snapshot is used. So if you insert a new document, you will be able to retrieve it. The same goes if you update an existing document: it will be created again if it had already been deleted, and it will not be deleted if it had not been deleted yet.
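To illustrate, a concurrent get (org.elasticsearch.action.get.GetRequest / GetResponse) while the asynchronous delete is running simply tells you whether the document still exists; no exception is thrown for an already deleted document (the id "42" is made up):
GetRequest getRequest = new GetRequest("dev-answer", "42");
GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
if (getResponse.isExists()) {
    // not deleted yet, or inserted/updated after the snapshot was taken
} else {
    // already deleted: isExists() is false, no exception
}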
As a side note, deleted documents will still be searchable (even after the delete_by_query task has finished) until a refresh operation has occurred.
_delete_by_query does not support the refresh parameter. The "request return" mentioned in the documentation for the refresh operation refers to requests that can have a refresh parameter. If you want to force a refresh, you can use the _refresh endpoint. By default, a refresh operation occurs every second, so once the _delete_by_query operation is finished, after at most 1 second the deleted documents will no longer be searchable.
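If you need the deletions to be visible immediately, you can trigger a refresh through the high-level client (org.elasticsearch.action.admin.indices.refresh.RefreshRequest), roughly like this:
RefreshRequest refreshRequest = new RefreshRequest("dev-answer");
client.indices().refresh(refreshRequest, RequestOptions.DEFAULT);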