
Control order of token filters in ElasticSearch

I'm trying to control the order in which token filters are applied in ElasticSearch.

I know from the docs that the tokenizer is applied first, then the token filters, but they do not mention how the order of the token filters is determined.

Here's a YAML snippet from my analysis setup script:

       KeywordNameIndexAnalyzer :
           type : custom
           tokenizer : whitespace
           filter : [my_word_concatenator, keyword_ngram]

I would have thought that my_word_concatenator would be applied before keyword_ngram, but it seems like that isn't the case. Anyone know how (or if) the order of these filters can be controlled?

Thanks a lot!

Clay Wardell asked Sep 27 '12 19:09



1 Answer

An analyzer consists of a tokenizer, which splits your text into tokens, followed by zero or more token filters. The token filters are applied in the order you configured them, since you're providing an array. If you have doubts, I'd suggest having a look at the analyze API, which lets you test how an analyzer works without indexing any text.
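To check the order yourself, something along these lines could work. It is only a sketch: it assumes a recent Elasticsearch version where `_analyze` accepts a JSON body with `analyzer` and `text` (older releases, including those around when this was asked, took them as query-string parameters instead), and the host `http://localhost:9200` and index name `my_index` are placeholders, not anything from the question.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) an _analyze request for a named analyzer.

    Assumes the JSON-body form of the analyze API; base_url and index
    are placeholders for your own cluster and index.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, index, analyzer, text):
    """Send the request and return the emitted token strings in order."""
    req = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # The response carries a "tokens" list; each entry has a "token" field.
    return [t["token"] for t in payload["tokens"]]


# Example call (needs a running cluster with the analyzer defined):
# analyze("http://localhost:9200", "my_index",
#         "KeywordNameIndexAnalyzer", "foo bar")
```

If the tokens that come back look like n-grams of the individual words rather than of the concatenated string, that would suggest the problem is in how one of the filters is defined rather than in the order of the `filter` array.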

javanna answered Sep 25 '22 14:09