 

How does the edge ngram token filter differ from the ngram token filter?

As I am new to Elasticsearch, I am not able to identify the difference between the ngram token filter and the edge ngram token filter.

How do these two differ from each other when processing tokens?

Karunakar asked Jul 14 '15


People also ask

What is edge ngram in Elasticsearch?

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word. Edge N-Grams are useful for search-as-you-type queries.
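
For illustration only, here is a minimal sketch of how such a search-as-you-type setup could look in the Elasticsearch 1.x syntax that was current when this question was asked (the index name, tokenizer name, analyzer name, field name and gram sizes are all made-up examples, not taken from the question):

curl -XPUT 'localhost:9200/autocomplete_test' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "10",
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'

Indexing "Schalke" in the name field would then store the prefixes s, sc, sch, scha, schal, schalk, schalke, so a query for "scha" analyzed with the plain standard analyzer at search time matches the document.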

What is a token filter?

Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms). Elasticsearch has a number of built-in token filters you can use to build custom analyzers.
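
As a quick sketch of what such a chain looks like (the index and analyzer names here are made up for illustration), a custom analyzer that runs the standard tokenizer and then the built-in lowercase and stop token filters can be defined and tested like this:

curl -XPUT 'localhost:9200/filter_test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}'

curl 'localhost:9200/filter_test/_analyze?pretty=1&analyzer=my_custom_analyzer' -d 'The Quick Brown Foxes'
# quick, brown, foxes   (lowercased, with the stopword "the" removed)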

What is ASCII folding?

ASCII folding. Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists. For example, the filter changes à to a. This filter uses Lucene's ASCIIFoldingFilter.
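
A small example, assuming the built-in asciifolding token filter and the query-string form of the _analyze API used elsewhere on this page:

curl 'localhost:9200/_analyze?tokenizer=standard&filters=asciifolding' -d 'déjà vu'
# deja, vu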


1 Answer

I think the documentation is pretty clear on this:

This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for the nGram tokenizer again comes from the documentation:

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

                    "type" : "nGram",                     "min_gram" : "2",                     "max_gram" : "3",                     "token_chars": [ "letter", "digit" ] 

In short:

  • the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
  • nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, each token is split into small chunks, and a chunk can start at any character in the token (every position serves as an anchor for chunks).
  • edgeNGram does the same but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

For the same text as above, an edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
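
To reproduce that side by side, a sketch of the same tokenizer definition with the type swapped to edgeNGram (max_gram raised to 5 here so the longer chunks appear; with the max_gram of 3 used above the output would stop at Sch, and the analyzer name my_edge_ngram_analyzer is just an illustrative name):

    "type" : "edgeNGram",
    "min_gram" : "2",
    "max_gram" : "5",
    "token_chars": [ "letter", "digit" ]

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, Scha, Schal, 04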

Andrei Stefan answered Sep 19 '22