
Analyzers in elasticsearch

I'm having trouble understanding the concept of analyzers in Elasticsearch with the Tire gem. I'm a newbie to these search concepts. Can someone point me to a reference article, or explain what analyzers actually do and why they are used?

I see different analyzers mentioned in the Elasticsearch docs, such as keyword, standard, simple, and snowball. Without understanding what analyzers do, I can't work out which one fits my needs.

Vamsi Krishna asked Oct 11 '12 09:10


2 Answers

Let me give you a short answer.

An analyzer is used at index time and at search time. It turns text into the terms that are stored in, and looked up from, the index.

To index a phrase, it is usually useful to break it into words. This is where the analyzer comes in.

It applies a tokenizer and token filters. A tokenizer could be, for example, the whitespace tokenizer, which splits a phrase into tokens at each space. The lowercase tokenizer splits a phrase at each non-letter and lowercases all letters.

A token filter is used to filter or convert tokens. For example, an ASCII folding filter converts characters like ê, é, è to e.

An analyzer is a mix of all of that.
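
To make the tokenizer and token filter ideas concrete, here is a minimal Python sketch that imitates them outside of Elasticsearch. The function names (`whitespace_tokenizer`, `lowercase_tokenizer`, `ascii_folding_filter`, `analyze`) are made up for this illustration and only approximate what the real components do; this is not Elasticsearch or Lucene code.

```python
import unicodedata


def whitespace_tokenizer(text):
    """Split the text at runs of whitespace and keep the tokens unchanged."""
    return text.split()


def lowercase_tokenizer(text):
    """Split the text at every non-letter and lowercase each token."""
    tokens = []
    current = []
    # Compose accents first so that a letter like 'é' is a single character.
    for ch in unicodedata.normalize("NFC", text):
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def ascii_folding_filter(tokens):
    """Token filter: replace accented letters with their unaccented base letter."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text, tokenizer, token_filters=()):
    """Hypothetical analyzer: one tokenizer followed by zero or more token filters."""
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


whitespace_terms = analyze("The Quick-Brown fox", whitespace_tokenizer)
lowercase_terms = analyze("The Quick-Brown fox", lowercase_tokenizer)
folded_terms = analyze("Une fête très élégante", lowercase_tokenizer, [ascii_folding_filter])

print(whitespace_terms)  # ['The', 'Quick-Brown', 'fox']
print(lowercase_terms)   # ['the', 'quick', 'brown', 'fox']
print(folded_terms)      # ['une', 'fete', 'tres', 'elegante']
```

The same input produces different terms depending on the tokenizer, which is why the choice of analyzer decides what a search can match.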

You should read the Analysis guide and look at all the different options listed on the right-hand side.

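By default, Elasticsearch applies the standard analyzer. It splits text on word boundaries and lowercases the terms. Older releases (before 1.0) also removed common English stop words by default; newer ones only do so if you configure a stop word list.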

You can also use the Analyze API to see what a given analyzer does with a piece of text. It is very useful.
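
As a rough sketch of calling the Analyze API from Python with only the standard library: the snippet below assumes a node listening on `http://localhost:9200` and the JSON request body (`analyzer` and `text`) accepted by recent Elasticsearch releases; older releases took these parameters differently. The helper names `build_analyze_request` and `analyze` are invented for this example.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer="standard", base_url="http://localhost:9200"):
    """Send the request and return the token strings from the response.

    Requires a reachable Elasticsearch node at base_url (an assumption here).
    """
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Building the request works without a running node; sending it does not.
request = build_analyze_request("The Quick-Brown fox")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a node running, calling `analyze("The Quick-Brown fox", analyzer="whitespace")` and then the same text with `analyzer="standard"` is a quick way to compare analyzers side by side.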

dadoonet answered Sep 21 '22 07:09


In Lucene, an analyzer is a combination of a tokenizer (splitter), a stemmer, and a stopword filter.

In Elasticsearch, an analyzer is a combination of:

  1. Character filter: "tidies up" a string before it is tokenized, e.g. removes HTML tags. An analyzer can have zero or more.
  2. Tokenizer: breaks the string up into individual terms or tokens. An analyzer must have exactly one.
  3. Token filter: changes, adds, or removes tokens. An analyzer can have zero or more. A stemmer is an example of a token filter: it reduces a word to its base form, e.g. happy and happiness both reduce to the same base, happi.
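
The three stages above can be sketched as a small Python pipeline. Everything here is hypothetical illustration code, not Elasticsearch internals: the `Analyzer` class, the regex-based `strip_html` and `word_tokenizer`, the tiny stop word list, and especially `toy_stem`, which is a two-rule stand-in and not the real Snowball algorithm.

```python
import re
from html import unescape


def strip_html(text):
    """Character filter: replace tags with spaces and decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def word_tokenizer(text):
    """Tokenizer: emit each run of word characters as one token."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: change tokens by lowercasing them."""
    return [token.lower() for token in tokens]


STOPWORDS = {"a", "an", "and", "is", "of", "the"}  # tiny list for the example


def remove_stopwords(tokens):
    """Token filter: remove tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOPWORDS]


def toy_stem(tokens):
    """Token filter: toy stemmer (NOT Snowball) with two suffix rules."""
    stemmed = []
    for token in tokens:
        if token.endswith("ness") and len(token) > 6:
            token = token[:-4]           # happiness -> happi
        if token.endswith("y") and len(token) > 3:
            token = token[:-1] + "i"     # happy -> happi
        stemmed.append(token)
    return stemmed


class Analyzer:
    """Zero or more character filters, exactly one tokenizer, zero or more token filters."""

    def __init__(self, tokenizer, char_filters=(), token_filters=()):
        if not callable(tokenizer):
            raise TypeError("an analyzer needs exactly one tokenizer")
        self.tokenizer = tokenizer
        self.char_filters = list(char_filters)
        self.token_filters = list(token_filters)

    def analyze(self, text):
        for char_filter in self.char_filters:
            text = char_filter(text)
        tokens = self.tokenizer(text)
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


analyzer = Analyzer(
    word_tokenizer,
    char_filters=[strip_html],
    token_filters=[lowercase, remove_stopwords, toy_stem],
)
terms = analyzer.analyze("<p>Happiness is the key &amp; a <b>happy</b> life</p>")
print(terms)  # ['happi', 'key', 'happi', 'life']
```

Because both "Happiness" and "happy" end up as the term happi, a search for one can match documents containing the other, provided the same analyzer is applied at index time and at search time.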

See Snowball demo here

This is a sample setting:

    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "analyzerWithSnowball": {
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "englishSnowball"]
              }
            },
            "filter": {
              "englishSnowball": {
                "type": "snowball",
                "language": "english"
              }
            }
          }
        }
      }
    }

Ref:

  1. Comparison of Lucene Analyzers
  2. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html

Tho answered Sep 23 '22 07:09