
How does Elasticsearch delete_by_query work? What happens when we insert new data and retrieve it while documents are being deleted?

I wanted to know more about Elasticsearch deletes, its Java high-level delete API, and whether it's feasible to perform bulk deletes.

Following is the configuration:

  • Java: 8
  • Elastic Version: 7.1.1
  • Elastic dependencies added:

    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.1.1</version>
    </dependency>
    
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>7.1.1</version>
    </dependency>
    

In my case around 10K records are added daily into the index dev-answer. I want to trigger a delete operation (this can be triggered daily, once a week, or once a month) which will delete all documents from the above index if a specific condition is satisfied. (Which I'll give in the DeleteByQueryRequest.)

For deletion there is an API, as given in the latest documentation I'm referring to:

DeleteByQueryRequest request = new DeleteByQueryRequest("source1", "source2");
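To make the intent above concrete, here is a minimal sketch of how such a scheduled cleanup could look with the high-level client. The `createdAt` field and the 30-day cutoff are hypothetical placeholders for the "specific condition"; the index name `dev-answer` is taken from the question.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class DeleteOldAnswers {

    // Builds the delete-by-query request. The range condition below is a
    // placeholder: substitute whatever condition selects the documents
    // that should be removed.
    static DeleteByQueryRequest buildRequest() {
        DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
        // Hypothetical condition: delete documents older than 30 days,
        // assuming a date field named "createdAt" exists in the index.
        request.setQuery(QueryBuilders.rangeQuery("createdAt").lte("now-30d"));
        return request;
    }

    // Fires the delete asynchronously; the listener is invoked when the
    // whole delete-by-query task completes or fails.
    static void deleteAsync(RestHighLevelClient client) {
        client.deleteByQueryAsync(buildRequest(), RequestOptions.DEFAULT,
            new ActionListener<BulkByScrollResponse>() {
                @Override
                public void onResponse(BulkByScrollResponse response) {
                    System.out.println("Deleted: " + response.getDeleted());
                }

                @Override
                public void onFailure(Exception e) {
                    e.printStackTrace();
                }
            });
    }
}
```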

While reading the documentation I came across the following points which I'm unable to understand.

  1. As in the doc: "It's also possible to limit the number of processed documents by setting size." request.setSize(10); What does "processed documents" mean? Will it delete only 10 documents?

  2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

    Should I first make a call to get the number of documents, and change setBatchSize based on that?

  3. request.setSlices(2); Should slices depend on how many cores the executor machine has, or on the number of cores in the Elasticsearch cluster?

  4. In the documentation the method setSlices(2) is given, but I'm unable to find it in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

  5. Let's say I'm executing this delete query in async mode and it takes 0.5-1.0 s. If I do a get request on this index in the meantime, will it throw an exception? Also, if I insert a new document at the same time and retrieve it, will I get a response?

asked Jul 05 '19 by AshwinK



1 Answer

1. As in the doc: "It's also possible to limit the number of processed documents by setting size." request.setSize(10); What does "processed documents" mean? Will it delete only 10 documents?

If you haven't already, you should read the search/_scroll documentation. _delete_by_query performs a scroll search using the query given as a parameter.

The size parameter corresponds to the number of documents returned by each call to the scroll endpoint. If you have 10 documents matching your query and a size of 2, Elasticsearch will internally perform 5 search/_scroll calls (i.e., 5 batches), while if you set the size to 5, only 2 search/_scroll calls will be performed.

Regardless of the size parameter, all documents matching the query will be removed, but the operation will be more or less efficient.

2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

The setBatchSize() method is equivalent to setting the size parameter in the query. You can read this article to determine the correct value for the size parameter.
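As a sketch of what this looks like in code: the batch size is set on the request itself, and only controls how many documents each internal scroll batch fetches. The value 1000 below is an arbitrary starting point, not a recommendation.

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class BatchSizeExample {

    static DeleteByQueryRequest buildRequest() {
        DeleteByQueryRequest request = new DeleteByQueryRequest("dev-answer");
        request.setQuery(QueryBuilders.matchAllQuery());
        // Each internal search/_scroll call fetches up to 1000 matching
        // documents. All matches are still deleted either way; the batch
        // size only changes how many scroll round-trips are needed.
        request.setBatchSize(1000);
        return request;
    }
}
```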

3. Should I first make a call to get the number of documents, and change setBatchSize based on that?

You would have to run the search request twice to get the number of documents to delete, which I believe would not be efficient. I advise you to find a constant value and stick to it.

4. Should slices depend on how many cores the executor machine has, or on the number of cores in the Elasticsearch cluster?

The number of slices should be set based on the Elasticsearch cluster configuration. Slicing allows parallelizing the search both across shards and within each shard.

You can read the documentation for hints on how to set this parameter. Usually it is the number of shards of your index.

5. In the documentation the method setSlices(2) is given, but I'm unable to find it in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

You are right, that is probably an error in the documentation. I have never tried it, but I believe you should use forSlice(TaskId slicingTask, SearchRequest slice, int totalSlices).

6. Let's say I'm executing this delete query in async mode and it takes 0.5-1.0 s. If I do a get request on this index in the meantime, will it throw an exception? Also, if I insert a new document at the same time and retrieve it, will I get a response?

First, as stated in the documentation, the _delete_by_query endpoint creates a snapshot of the index and works on this copy.

For a get request, it depends on whether the document has already been deleted or not. It will never throw an exception; you will get the same result as if you were retrieving an existing or a non-existing document. Please note that unless you specify a sort in the search query, the order in which documents are deleted is not determined.

If you insert (or update) a document during the processing, this document will not be taken into account by the _delete_by_query endpoint, even if it matches the _delete_by_query query; this is where the snapshot is used. So if you insert a new document, you will be able to retrieve it. Likewise, if you update an existing document, it will be created again if it has already been deleted, or it will be updated but not deleted if it has not been deleted yet.
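A concurrent get while the delete-by-query task runs can therefore be written without any special error handling; the call simply reports found or not found. A minimal sketch, assuming the index name from the question:

```java
import java.io.IOException;

import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class GetDuringDelete {

    // Fetching a document while a delete-by-query task is in progress:
    // the response is simply "exists" or "does not exist" depending on
    // whether this document was already deleted; no exception is thrown
    // because a delete is running.
    static boolean exists(RestHighLevelClient client, String id) throws IOException {
        GetRequest get = new GetRequest("dev-answer", id);
        GetResponse response = client.get(get, RequestOptions.DEFAULT);
        return response.isExists();
    }
}
```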

As a side note, deleted documents will still be searchable (even after the delete_by_query task has finished) until a refresh operation has occurred.

_delete_by_query does not support the refresh parameter. The request return mentioned in the documentation for the refresh operation refers to requests that can have a refresh parameter. If you want to force a refresh, you can call the _refresh endpoint. By default, a refresh occurs every 1 second, so at most 1 second after the _delete_by_query operation finishes, the deleted documents will no longer be searchable.
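With the high-level client, forcing that refresh is a one-liner via the indices client; a sketch, again assuming the index name from the question:

```java
import java.io.IOException;

import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class ForceRefresh {

    // Forces a refresh so that documents removed by a finished
    // _delete_by_query task stop appearing in search results immediately,
    // instead of after the next scheduled refresh (every 1 s by default).
    static void refresh(RestHighLevelClient client) throws IOException {
        client.indices().refresh(new RefreshRequest("dev-answer"),
                RequestOptions.DEFAULT);
    }
}
```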

answered Oct 18 '22 by Pierre-Nicolas Mougel