What are segments in Lucene?
What are the benefits of segments?
The segment files in Solr are part of the underlying Lucene index. The index format is described in the Lucene index documentation. In principle, each segment contains a part of the index. New files are created as you add documents, and you can generally ignore them because Lucene manages them itself.
A Lucene Index Is an Inverted Index
An index may store a heterogeneous set of documents, with any number of different fields that may vary from document to document in arbitrary ways. Lucene indexes terms, which means that a Lucene search is a search over terms. A term combines a field name with a token.
Lucene uses a well-known index structure called an inverted index. Quite simply, and probably unsurprisingly, an inverted index is an inside-out arrangement of documents in which terms take center stage. Each term refers to the documents that contain it.
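To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy model, not Lucene's implementation or API: the function name build_inverted_index, the whitespace tokenizer, and the sample documents are all assumptions made for illustration. Each term is a (field, token) pair, as described above, and maps to the ids of the documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term, a (field, token) pair, to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, fields in enumerate(docs):
        for field, text in fields.items():
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            for token in text.lower().split():
                postings[(field, token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents; they do not need to share the same fields.
docs = [
    {"title": "Lucene in Action", "body": "segments and merges"},
    {"title": "Solr basics", "body": "lucene segments explained"},
    {"author": "unknown"},
]

index = build_inverted_index(docs)
print(index[("body", "segments")])  # [0, 1]
print(index[("title", "lucene")])   # [0]
```

Note that the same token under a different field is a different term: ("title", "lucene") and ("body", "lucene") have separate posting lists.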
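The Lucene index is split into smaller chunks called segments. Each segment is an index in its own right. By default, Lucene searches the segments one after another.
A new segment is created whenever a writer flushes the documents it has buffered: when it commits, when it is closed, or when its in-memory buffer fills up. A newly opened writer always writes to new segments rather than appending to an existing one.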
The advantage of this system is that you never have to modify the files of a segment once it is created. When you add new documents to your index, they go into a new segment. Previous segments are never modified.
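The write-once behavior can be sketched with a small toy model. This is not Lucene's API or file format: the class ToyIndex and its methods are hypothetical names, and a segment is reduced to an immutable tuple of document texts. The point is only that a commit freezes buffered documents into a new segment and leaves older segments alone, while a search visits each segment in turn.

```python
class ToyIndex:
    """Toy model of an append-only, segmented index (not Lucene's API)."""

    def __init__(self):
        self.segments = []  # frozen segments, each an immutable tuple of documents
        self.buffer = []    # documents added since the last commit

    def add(self, text):
        self.buffer.append(text)

    def commit(self):
        # Freeze the buffered documents into a new segment.
        # Existing segments are never touched.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, token):
        # Visit the segments one after another and collect
        # (segment number, local document id) pairs.
        hits = []
        for seg_no, segment in enumerate(self.segments):
            for doc_id, text in enumerate(segment):
                if token in text.lower().split():
                    hits.append((seg_no, doc_id))
        return hits


index = ToyIndex()
index.add("lucene segments")
index.add("solr basics")
index.commit()
first_segment = index.segments[0]

index.add("more about segments")
index.commit()

print(len(index.segments))                 # 2
print(index.segments[0] is first_segment)  # True
print(index.search("segments"))            # [(0, 0), (1, 0)]
```

In this sketch, documents that are still in the buffer are not searchable until the next commit.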
Deleting a document only records, in a separate file, which document of a segment is deleted; physically, the document stays in the segment until that segment is merged. Documents in Lucene aren't really updated either. What happens is that the previous version of the document is marked as deleted in its original segment and the new version of the document is added to the current segment. Because existing segment files are not constantly rewritten when content changes, the chance of corrupting the index is minimized. It also allows for easy backup and synchronization of the index across different machines.
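Here is a rough sketch of deletes and updates under the same idea. It is a toy model with hypothetical names (Segment, delete_by_key, update), not Lucene's API. The set of deleted ids stands in for the separate file that marks deletions, and for simplicity an update places the new version directly in a fresh segment instead of going through a writer's buffer.

```python
class Segment:
    """Toy segment: immutable documents plus a separate record of deletions."""

    def __init__(self, docs):
        self.docs = tuple(docs)  # (key, text) pairs; never changed after creation
        self.deleted = set()     # local ids marked as deleted, kept beside the data

    def live_docs(self):
        return [doc for i, doc in enumerate(self.docs) if i not in self.deleted]


def delete_by_key(segments, key):
    # Mark matching documents as deleted; the documents themselves stay in place.
    for segment in segments:
        for i, (doc_key, _) in enumerate(segment.docs):
            if doc_key == key:
                segment.deleted.add(i)


def update(segments, key, new_text):
    # An "update" is a delete of the old version plus an add of the new one.
    delete_by_key(segments, key)
    segments.append(Segment([(key, new_text)]))


segments = [Segment([("a", "old text"), ("b", "other")])]
update(segments, "a", "new text")

print(segments[0].docs)         # (('a', 'old text'), ('b', 'other'))
print(segments[0].live_docs())  # [('b', 'other')]
print(segments[1].live_docs())  # [('a', 'new text')]
```

The old version of document "a" is still physically present in the first segment; it is only hidden from readers by the deletion record.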
At some point, however, Lucene may decide to merge some segments into a larger one. This operation can also be triggered explicitly with an optimize (called forceMerge in recent Lucene versions).
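A merge can be pictured as follows. This is a simplified sketch, not Lucene's merge logic: the function name merge and the representation of a segment as a (documents, deleted ids) pair are assumptions. It writes a new segment from the documents that are not marked deleted, which is the point at which the space taken by deleted documents can be reclaimed, and it leaves the input segments unmodified.

```python
def merge(segments):
    """Combine segments into one new segment, copying only documents not marked deleted."""
    merged_docs = []
    for docs, deleted in segments:
        merged_docs.extend(doc for i, doc in enumerate(docs) if i not in deleted)
    # The merged segment starts with no deletions of its own.
    return (tuple(merged_docs), set())


# Each toy segment is a pair: (immutable documents, ids marked as deleted).
seg_a = (("lucene segments", "stale doc"), {1})
seg_b = (("fresh doc",), set())

merged = merge([seg_a, seg_b])
print(merged[0])  # ('lucene segments', 'fresh doc')
```

Once the merged segment is in place, the old segments can simply be dropped, which is consistent with never modifying a segment after it has been written.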
A segment is, very simply, a section of the index. The idea is that you can add documents to an index that is currently being served by creating a new segment that contains only the new documents. This way, you don't have to pay the cost of rebuilding the entire index every time you add documents.