I have read a lot of articles about index refreshing in Elasticsearch. I understand the implication of intervals greater than 0: the interval is the time that elapses between consecutive segment flushes, which make newly indexed documents available for search. However, I am not sure what refresh_interval: -1 does exactly. In my understanding, it is a means of disabling automatic index refreshing, but not completely: Elasticsearch still flushes segments from time to time even though refresh_interval is set to -1. I wonder which mechanism governs this flushing activity if automatic refresh is disabled.
Sorry, I know I don't have much code to post, so I will give a bit of background on what I am after. My application doesn't need near real-time search; it only needs eventual consistency. However, this eventuality should be reasonable, i.e. within a few seconds to less than a minute, not half an hour. I was wondering whether I can leave it to Elasticsearch to decide when best to refresh at its convenience, rather than refreshing at a regular interval. The reason is that disabling automatic refreshing does bring some performance benefits to my application, e.g. JVM heap usage rises less aggressively between garbage collections (see graph below).
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.
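As a rough illustration of changing that setting, here is a minimal Python sketch that builds (but does not send) an update-settings request for one index. The host http://localhost:9200, the index name my-index and the helper build_refresh_interval_request are assumptions made up for this example; only the index.refresh_interval setting itself comes from the text above.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT <index>/_settings request.

    Hypothetical helper for illustration; `interval` is a time value
    such as "30s", or "-1" to disable periodic refresh.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed host and index name, for illustration only.
request = build_refresh_interval_request("http://localhost:9200", "my-index", "30s")

# Against a live cluster the request could then be sent with:
#   urllib.request.urlopen(request)
```

An interval such as "30s" would fit a "less than a minute" visibility requirement, while "-1" turns periodic refresh off entirely.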
An index is defined as: An index is like a 'database' in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.
Why Is Elasticsearch Tuning Required? Elasticsearch gives you moderate performance for both searching and ingesting logs, maintaining a balance between the two. But as service utilization or the number of services within the infrastructure grows, logs grow in similar proportion.
Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.
There is a bit of confusion in your understanding. Refreshing the index and writing to disk are two different processes and are not necessarily related, hence your observation that segments are still being written even if refresh_interval is -1.
When a document is indexed, it is added to the in-memory buffer and appended to the translog file. When a refresh takes place, the docs in the buffer are written to a new segment without an fsync, the segment is opened to make it visible to search, and the buffer is cleared. The translog is not yet cleared and nothing is actually persisted to disk (as there was no fsync).
Now imagine the refresh is not happening: there is no index refresh, you cannot search your documents, and no segments are created in the cache.
The translog settings dictate when the flush (writing to disk) happens: by default, when the translog reaches 512mb in size or after 30 minutes. The flush is what actually persists data on disk; everything else is in the filesystem cache (if the node dies or the machine is rebooted, the cache is lost and the translog is the only salvation).
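To make the difference between refresh and flush concrete, here is a toy Python model of the write path described above. It is only a sketch under simplifying assumptions, not how Elasticsearch or Lucene is actually implemented: the ToyShard class and its fields are invented for illustration, and the flush here also turns any buffered docs into a segment first.

```python
class ToyShard:
    """Toy model of one shard's write path (illustrative only)."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer, not searchable
        self.translog = []  # operations recorded since the last flush
        self.segments = []  # each entry: {"docs": [...], "fsynced": bool}

    def index(self, doc):
        # A new doc goes to the buffer and is appended to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Buffered docs become a new searchable segment, without an fsync.
        if self.buffer:
            self.segments.append({"docs": list(self.buffer), "fsynced": False})
            self.buffer.clear()

    def flush(self):
        # Simplification: write out the buffer, fsync every segment,
        # then clear the translog because the data is now durable.
        self.refresh()
        for segment in self.segments:
            segment["fsynced"] = True
        self.translog.clear()

    def search(self):
        # Only docs that live in segments are visible to search.
        return [doc for segment in self.segments for doc in segment["docs"]]


shard = ToyShard()
shard.index({"id": 1})
before_refresh = shard.search()              # doc is still only in the buffer

shard.refresh()
after_refresh = shard.search()               # doc is now searchable
translog_after_refresh = list(shard.translog)  # but the translog is kept

shard.flush()
translog_after_flush = list(shard.translog)  # cleared once data is persisted
```

In this model a doc is invisible to search until refresh() runs, and the translog keeps growing until flush() runs, which mirrors why the two mechanisms are governed by separate settings.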