
Lucene indexing: Store and indexing modes explained

Tags:

lucene

I think I still don't understand the Lucene indexing options.

The available options are

  • Store.Yes
  • Store.No

and

  • Index.Tokenized
  • Index.Un_Tokenized
  • Index.No
  • Index.No_Norms

I don't really understand the store option. Why would you ever want to NOT store your field?
As I understand it, tokenizing is splitting up the content and removing the noise words/separators (like "and", "or", etc.).
I don't have a clue what norms could be. How are tokenized values stored?
What happens if I store a value "my string" in "fieldName"? Why doesn't a query

fieldName:my string

return anything?

Boris Callens asked Mar 16 '09 14:03



2 Answers

Store.Yes

Means that the value of the field will be stored in the index

Store.No

Means that the value of the field will NOT be stored in the index

Store.Yes/No does not affect indexing or searching. It only tells Lucene whether you want it to act as a datastore for the field's value. If you use Store.Yes, the value of that field will be included in the Documents returned by a search.

If you're storing your data in a database and only using the Lucene index for searching, then you can get away with Store.No on all of your fields. However, if you're using the index as storage as well, then you'll want Store.Yes.
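To make the store/index split concrete, here is a minimal toy model in Python. It is not Lucene's API; `ToyIndex`, its methods, and the field names are all made up for illustration. It only shows that being searchable (indexed) and being retrievable (stored) are independent.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model only: keeps 'indexed' (searchable) apart from 'stored' (retrievable)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> {field: original value}

    def add(self, doc_id, field, value, store, index=True):
        if index:
            # Crude stand-in for an analyzer: lowercase and split on whitespace.
            for term in value.lower().split():
                self.postings[(field, term)].add(doc_id)
        doc = self.stored.setdefault(doc_id, {})
        if store:
            doc[field] = value  # keep the original value for retrieval

    def search(self, field, term):
        hits = sorted(self.postings.get((field, term.lower()), set()))
        # Each hit carries only the fields that were stored.
        return [(doc_id, self.stored.get(doc_id, {})) for doc_id in hits]


idx = ToyIndex()
idx.add(1, "title", "Lucene in Action", store=True)
idx.add(1, "body", "a long article body kept in the database", store=False)

title_hits = idx.search("title", "lucene")
body_hits = idx.search("body", "article")
```

Both searches find document 1, but the returned document only contains `title`. The `body` text is searchable yet would have to be fetched from wherever it really lives, such as the database.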

Index.Tokenized

Means that the field will be tokenized when it's indexed (you got that one). This is useful for long fields with multiple words.

Index.Un_Tokenized

Means that the field will not be analyzed; the whole value is indexed as a single term. This is useful for keyword/single-word fields (IDs, codes) and some short multi-word fields that should only match exactly.

Index.No

Exactly what it says. The field will not be indexed and is therefore unsearchable. However, you can use Index.No along with Store.Yes to store a value that you don't want to be searchable.

Index.No_Norms

Same as Index.Un_Tokenized, except that a few bytes are saved by not storing normalization data ("norms") for the field. That data is what is used for index-time boosting and field-length normalization when scoring.
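As a rough illustration of the four Index modes, here is another toy sketch in Python (again, not Lucene code). The mode strings and `index_field` are hypothetical names that mirror the options above. The norm here is a simplified 1/sqrt(number of terms) length factor, which I'm assuming as a stand-in; the real norm also folds in boosts and is stored in a compressed form.

```python
import math


def index_field(value, mode):
    """Return (terms, norm) for a toy model of the four Index modes.

    mode: "tokenized", "un_tokenized", "no" or "no_norms"
    (hypothetical strings mirroring the option names).
    """
    if mode == "no":
        return [], None                 # nothing indexed, nothing searchable
    if mode == "tokenized":
        terms = value.lower().split()   # crude stand-in for a real analyzer
    else:
        terms = [value]                 # whole value becomes one term
    if mode == "no_norms":
        return terms, None              # single term, length factor omitted
    # Simplified length normalization: shorter fields get a larger factor.
    norm = 1.0 / math.sqrt(len(terms)) if terms else 0.0
    return terms, norm


tokenized = index_field("my string", "tokenized")
untokenized = index_field("my string", "un_tokenized")
not_indexed = index_field("my string", "no")
no_norms = index_field("my string", "no_norms")
```

In this model the tokenized field yields the terms `my` and `string`, while both untokenized variants yield the single term `my string`, so a search for just `my` would only match the tokenized one.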

For further reading, the Lucene Javadocs are priceless (current API version 4.4.0):

  • Field.Index
  • Field.Store

As for your last question, about why your query isn't returning anything: without knowing any more about how you're indexing that field, I'd say it's because the fieldName qualifier is only attached to the term 'my', so 'string' is searched in the default field instead. To search for the phrase "my string" you want:

fieldName:"my string"

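To search for the individual words "my" and "string" in the fieldName field (with the query parser's default OR operator either word matches; use fieldName:(+my +string) to require both):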

fieldName:(my string)
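The way the field qualifier binds can be sketched with a tiny parser in Python. This is a simplified imitation, not Lucene's query parser: it ignores operators, boosts, and escaping, and the default field name `contents` is an assumption.

```python
import re

# One clause: optional "field:" prefix, then a phrase, a group, or a bare term.
CLAUSE = re.compile(r'''
    (?:(?P<field>\w+):)?          # optional field qualifier
    (?:
        "(?P<phrase>[^"]*)"       # quoted phrase
      | \((?P<group>[^)]*)\)      # parenthesised group of terms
      | (?P<term>[^\s"()]+)       # bare term
    )
''', re.VERBOSE)


def parse(query, default_field="contents"):
    """Return a list of (field, kind, words) clauses for a simple query string."""
    clauses = []
    for m in CLAUSE.finditer(query):
        # A qualifier applies only to the clause directly after it.
        field = m.group("field") or default_field
        if m.group("phrase") is not None:
            clauses.append((field, "phrase", m.group("phrase").split()))
        elif m.group("group") is not None:
            for term in m.group("group").split():
                clauses.append((field, "term", [term]))
        else:
            clauses.append((field, "term", [m.group("term")]))
    return clauses


unquoted = parse('fieldName:my string')
phrase = parse('fieldName:"my string"')
grouped = parse('fieldName:(my string)')
```

In the unquoted form, `string` ends up in the default field, which is why the original query can come back empty. The quoted and grouped forms keep both words in `fieldName`.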

dustyburwell answered Sep 20 '22 14:09


In case any Java users stumble upon this, the same options in the March 2009 answer still exist in the Lucene 4.6.0 Java library but are deprecated. The current way to set these options is via FieldType.

Ian Durkan answered Sep 22 '22 14:09