
Documents are automatically getting deleted in Elasticsearch after insertion

I created an index in Elasticsearch with the following settings. After inserting data into the index using the Bulk API, the docs.deleted count keeps increasing. Does this mean the documents are automatically getting deleted? If so, what did I do wrong?

PUT /inc_index/
{
  "mappings": {
    "store": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "description": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "category": {
          "type": "string"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

The output of "GET /_cat/indices?v" is shown below; the "docs.deleted" value keeps increasing:

health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   inc_index   5   1    2009877       584438      6.8gb          3.6gb
Sra1 asked Oct 19 '15 15:10



2 Answers

If your bulk operations also include updates to existing documents (inserts or updates for documents with the same ID), then this is normal. In Elasticsearch, an update is a combination of a delete and an insert: https://www.elastic.co/guide/en/elasticsearch/guide/current/update-doc.html

The deleted documents you see there are documents that have only been marked as deleted. When Lucene merges segments, the deleted documents are physically removed from disk.
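
To make the counting concrete, here is a minimal sketch in Python. It is a toy model only, not Elasticsearch or Lucene code: the `ToyIndex` class and its methods are hypothetical names made up for illustration. It assumes, as described above, that indexing a document whose ID already exists marks the old copy as deleted and writes a new one.

```python
class ToyIndex:
    """Toy model of docs.count / docs.deleted (hypothetical, not Elasticsearch code)."""

    def __init__(self):
        self.live = {}     # doc ID -> current source
        self.deleted = 0   # old copies marked as deleted, not yet merged away

    def index(self, doc_id, source):
        # Re-indexing an existing ID is a delete + insert:
        # the old copy is marked as deleted and a new copy is written.
        if doc_id in self.live:
            self.deleted += 1
        self.live[doc_id] = source

    def stats(self):
        return {"docs.count": len(self.live), "docs.deleted": self.deleted}


idx = ToyIndex()
# A "bulk" load in which ID "1" appears three times.
for doc_id, title in [("1", "a"), ("2", "b"), ("1", "a2"), ("1", "a3")]:
    idx.index(doc_id, {"title": title})

print(idx.stats())  # {'docs.count': 2, 'docs.deleted': 2}
```

Nothing is lost here: both IDs are still present with their latest content, and docs.deleted only counts the superseded copies.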

Andrei Stefan answered Oct 27 '22 01:10


Elasticsearch indexes are composed of "segments". Segments are write-once, so when we delete or update a document in Elasticsearch, it is not actually removed. It is only marked as deleted, which increases the "docs.deleted" count.

More segments mean slower searches and more memory used. Elasticsearch solves this problem by merging segments in the background. Small segments are merged into bigger segments, which, in turn, are merged into even bigger segments. While merging, any documents marked as deleted are not copied into the bigger segment. Once merging has finished, the old segments are deleted. That is why the "docs.deleted" value drops again after a merge.
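
The merge step described above can be sketched roughly like this in Python. This is a simplified illustration under my own assumptions, not how Lucene is actually implemented: `Segment`, `merge` and `deleted_count` are hypothetical names, and real merges are driven by a merge policy rather than merging everything at once.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    docs: dict                                  # doc ID -> source, written once
    deleted: set = field(default_factory=set)   # IDs marked as deleted


def merge(segments):
    """Copy only live documents into one new, bigger segment."""
    merged = {}
    for seg in segments:
        for doc_id, source in seg.docs.items():
            if doc_id not in seg.deleted:       # skip copies marked as deleted
                merged[doc_id] = source
    return Segment(docs=merged)


def deleted_count(segments):
    return sum(len(seg.deleted) for seg in segments)


# Doc "1" was updated: the old copy is marked as deleted in the old segment,
# and the new copy lives in a newer segment.
old = Segment({"1": "v1", "2": "v1"}, deleted={"1"})
new = Segment({"1": "v2", "3": "v1"})

segments = [old, new]
before = deleted_count(segments)

segments = [merge(segments)]                    # the old segments are dropped
after = deleted_count(segments)

print(before, after)  # 1 0
```

After the merge the single remaining segment holds only the live copies of documents "1", "2" and "3", so the deleted count goes back down.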

Shweta answered Oct 26 '22 23:10