
Understanding Lucene segments

Tags:

lucene

I have these 3 files in a folder and they are all related to an index created by Lucene:

  • _0.cfs
  • segments_2
  • segments.gen

What is each of them used for, and is it possible to convert any of them to a human-readable format to learn a bit more about how Lucene works with its indexes?

Nick asked Jul 12 '13 19:07


1 Answer

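The two segments files store metadata about the index's segments: segments_2 is the current commit point, listing the segments that make up the index, and segments.gen records the current generation as a fallback for locating that file. The .cfs file is a compound file that packs a segment's other index files (term dictionary, postings, stored fields, and so on) into a single file.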
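
To make the naming scheme concrete, here is a minimal sketch in plain Java (no Lucene dependency) that labels the three files from the question by role. The class LuceneFileKinds and its describe method are hypothetical helpers written for this illustration, not part of the Lucene API. It assumes the usual Lucene convention that the suffix of segments_N is the commit generation written in base 36, and that the part of a .cfs name before the extension is the segment name.

```java
/** Hypothetical helper (not part of Lucene): labels index file names by role. */
final class LuceneFileKinds {

    private static final String COMMIT_PREFIX = "segments_";
    private static final String COMPOUND_EXT = ".cfs";

    /** Returns a short description of what a Lucene index file is for. */
    static String describe(String fileName) {
        if (fileName.equals("segments.gen")) {
            return "fallback record of the current commit generation";
        }
        if (fileName.startsWith(COMMIT_PREFIX)) {
            // Assumption: the suffix is the commit generation in base 36.
            String suffix = fileName.substring(COMMIT_PREFIX.length());
            long generation = Long.parseLong(suffix, Character.MAX_RADIX);
            return "commit point listing the live segments (generation " + generation + ")";
        }
        if (fileName.endsWith(COMPOUND_EXT)) {
            // Assumption: everything before ".cfs" is the segment name, e.g. "_0".
            String segment = fileName.substring(0, fileName.length() - COMPOUND_EXT.length());
            return "compound file bundling the index files of segment " + segment;
        }
        return "other index file";
    }

    public static void main(String[] args) {
        String[] names = {"_0.cfs", "segments_2", "segments.gen"};
        for (String name : names) {
            System.out.println(name + ": " + describe(name));
        }
    }
}
```

This only inspects file names; it does not open or parse the files themselves, which still requires the Lucene API.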

For documentation of the different file types that make up a Lucene index, see this summary of file extensions.

Generally, no, Lucene files are not human-readable. They are designed for efficiency and speed rather than human readability. The way to get a human-readable view is to access them through the Lucene API (via Luke, or Solr, or something like that).

If you want a thorough understanding of the file formats in use, the codecs package would be the place to look.

femtoRgon answered Sep 18 '22 23:09