
Understanding Segments in Elasticsearch

I was under the assumption that each shard in Elasticsearch is an index. But I read somewhere that each segment is a Lucene index.

What exactly is a segment? How does it affect search performance? I have indices that reach around 450 GB in size every day (I create a new one each day) with default Elasticsearch settings.

When I execute curl -XPOST "http://localhost:9200/logstash-2013.03.0$i/_optimize?max_num_segments=1", I get num_committed_segments=11 and num_search_segments=11.

Shouldn't the above values be 1? Maybe it's because of the index.merge.policy.segments_per_tier value? What is this tier anyway?
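
For what it's worth, I can also inspect the segments of an index directly with the segments API (the index name below is just one of my daily indices as an example; availability of the endpoint may depend on the Elasticsearch version):

  curl -XGET "http://localhost:9200/logstash-2013.03.05/_segments?pretty"

It lists each segment in each shard, with its document count, size on disk, and whether it is committed and searchable.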

Asked Mar 15 '13 by Abhijeet Rastogi

People also ask

What are segments in ES?

A segment is a small Lucene index, and Lucene searches all segments sequentially. Lucene creates a segment when a new writer is opened, and when a writer commits or is closed; once written, segments are immutable. When you add new documents to your Elasticsearch index, Lucene creates a new segment and writes them there.

How do Elasticsearch indices work?

Elasticsearch takes in unstructured data from different locations, stores and indexes it according to user-specified mapping (which can also be derived automatically from data), and makes it searchable. Its distributed architecture makes it possible to search and analyze huge volumes of data in near real time.

What is index and shard in Elasticsearch?

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

What is Elasticsearch Lucene index?

Lucene is a Java library. You can include it in your project and call its functions directly. Elasticsearch is a JSON-based, distributed web server built on top of Lucene. Though it's Lucene that does the actual work underneath, Elasticsearch provides a convenient layer over it.


1 Answer

The word "index" gets abused a bit in Elasticsearch -- applies to too many things.

To explain:

index

An "index" in Elasticsearch is a bit like a database in a relational DB. It's where you store/index your data. But actually, that's just what your application sees. Internally, an index is a logical namespace that points to one or more shards.

Also, "to index" means to "put" your data into Elasticsearch. Your data is both stored (for retrieval) and "indexed" for search.

inverted index

An "inverted index" is the data structure that Lucene uses to make data searchable. It processes the data, pulls out unique terms or tokens, then records which documents contain those tokens. See http://en.wikipedia.org/wiki/Inverted_index for more.

shard

A "shard" is an instance of Lucene. It is a fully functional search engine in its own right. An "index" could consist of a single shard, but generally consists of several shards, to allow the index to grow and to be split over several machines.

A "primary shard" is the main home for a document. A "replica shard" is a copy of the primary shard that provides (1) failover in case the primary dies and (2) increased read throughput

segment

Each shard contains multiple "segments", where a segment is an inverted index. A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
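
On reasonably recent versions you can see this per-segment breakdown in tabular form (index name illustrative):

  curl -XGET "http://localhost:9200/_cat/segments/my_index?v"

which lists every segment of every shard with its size, doc count, and whether it is committed and/or searchable.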

While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and "refreshes" the search.

This makes the data in the new segment visible to search (i.e. it is "searchable"), but the segment has not been fsync'ed to disk, so it is still at risk of data loss.
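
The refresh can also be triggered or tuned explicitly; a sketch, with an illustrative index name (the refresh_interval setting defaults to 1s and can be changed dynamically):

  # force a refresh right now
  curl -XPOST "http://localhost:9200/my_index/_refresh"

  # refresh less often during heavy indexing
  curl -XPUT "http://localhost:9200/my_index/_settings" -d '{
    "index": { "refresh_interval": "30s" }
  }'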

Every so often, Elasticsearch will "flush", which means fsync'ing the segments (they are now "committed") and clearing out the transaction log, which is no longer needed because we know the new data has been written to disk.
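
You can also trigger a flush manually, for example before taking a node down (index name illustrative):

  curl -XPOST "http://localhost:9200/my_index/_flush"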

The more segments there are, the longer each search takes. So Elasticsearch will merge a number of segments of a similar size ("tier") into a single bigger segment, through a background merge process. Once the new bigger segment is written, the old segments are dropped. This process is repeated on the bigger segments when there are too many of the same size.
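
Your _optimize call forces exactly this kind of merge, down to max_num_segments. One caveat: if the index is still being written to, new small segments appear again straight away, which could explain why you still see 11. The tier size the merge policy aims for can also be adjusted, for example (endpoint and setting names as they existed around 0.90/1.x; depending on the version the setting may be dynamic or may require closing the index, and in later versions _optimize was renamed _forcemerge):

  # allow at most 5 segments per tier instead of the default 10,
  # i.e. merge more aggressively
  curl -XPUT "http://localhost:9200/logstash-2013.03.05/_settings" -d '{
    "index.merge.policy.segments_per_tier": 5
  }'

  # force-merge an index that is no longer being written to
  curl -XPOST "http://localhost:9200/logstash-2013.03.05/_optimize?max_num_segments=1"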

Segments are immutable. When a document is updated, it actually just marks the old document as deleted, and indexes a new document. The merge process also expunges these old deleted documents.
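
If an index accumulates a lot of deletes, you can also ask the optimize/force-merge API to target just those (parameter name as in the 0.90/1.x API; index name illustrative):

  curl -XPOST "http://localhost:9200/my_index/_optimize?only_expunge_deletes=true"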

Answered Oct 11 '22 by DrTech