I am new to Elasticsearch and cannot tell the difference between the ngram token filter and the edge ngram token filter.
How do the two differ in the way they process tokens?
The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word. Edge N-Grams are useful for search-as-you-type queries.
Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms). Elasticsearch has a number of built-in token filters you can use to build custom analyzers.
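To make the "modify, delete, add" idea concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: the function names, the stopword set and the synonym map are all made up for illustration, and whitespace splitting stands in for a real tokenizer. Token positions, which Elasticsearch tracks for synonyms, are ignored here.

```python
def lowercase(tokens):
    # Modify tokens: normalise case.
    for tok in tokens:
        yield tok.lower()


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    # Delete tokens: drop anything in the (hypothetical) stopword set.
    for tok in tokens:
        if tok not in stopwords:
            yield tok


def add_synonyms(tokens, synonyms=None):
    # Add tokens: emit each token followed by its (hypothetical) synonyms.
    synonyms = synonyms or {"quick": ["fast"]}
    for tok in tokens:
        yield tok
        yield from synonyms.get(tok, [])


def run_filters(text):
    tokens = text.split()  # stand-in tokenizer: split on whitespace
    return list(add_synonyms(remove_stopwords(lowercase(tokens))))


print(run_filters("The Quick fox"))
# ['quick', 'fast', 'fox']
```

Each filter only sees the token stream produced by the previous step, which is why the order of filters in an analyzer matters.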
ASCII folding. Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 code points, i.e. the ASCII range) to their ASCII equivalent, if one exists. For example, the filter changes à to a. This filter uses Lucene's ASCIIFoldingFilter.
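A rough way to see what folding does, using only the Python standard library. This is an approximation and not Lucene's algorithm: it only handles characters that Unicode can decompose into a base letter plus combining marks, so characters such as ß or ø pass through unchanged. The function name strip_accents is my own.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters (e.g. "à" becomes "a" + combining grave accent),
    # then drop the combining marks and keep the base characters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("à"))
# a
```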
I think the documentation is pretty clear on this:
This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.
And the best example for the nGram tokenizer again comes from the documentation:
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
With this tokenizer definition:
"type" : "nGram", "min_gram" : "2", "max_gram" : "3", "token_chars": [ "letter", "digit" ]
In short:
The tokenizer first splits the text into tokens according to its configuration. With the token_chars setting above, "FC Schalke 04" becomes FC, Schalke, 04.
nGram generates groups of characters of minimum size min_gram and maximum size max_gram from each token. Basically, the tokens are split into small chunks and a chunk can start at any character of the token (it doesn't matter where this character is, every position produces chunks).
edgeNGram does the same, but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.
For the same text as above, an edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04 (this particular output corresponds to min_gram 2 and max_gram 5; with max_gram 3 it would stop at Sch). Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
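If it helps, the two behaviours can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how Elasticsearch or Lucene implement it: the function names are mine, and the regex is an assumed stand-in for token_chars = letter, digit.

```python
import re


def split_tokens(text):
    # Keep runs of letters and digits, similar to token_chars ["letter", "digit"].
    return re.findall(r"[^\W_]+", text)


def ngram(token, min_gram, max_gram):
    # Chunks may start at ANY character of the token.
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(token)
    ]


def edge_ngram(token, min_gram, max_gram):
    # Chunks always start at the FIRST character of the token.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def analyze(text, gram_fn, min_gram, max_gram):
    return [
        gram
        for token in split_tokens(text)
        for gram in gram_fn(token, min_gram, max_gram)
    ]


text = "FC Schalke 04"
print(analyze(text, ngram, 2, 3))
# ['FC', 'Sc', 'Sch', 'ch', 'cha', 'ha', 'hal', 'al', 'alk', 'lk', 'lke', 'ke', '04']
print(analyze(text, edge_ngram, 2, 5))
# ['FC', 'Sc', 'Sch', 'Scha', 'Schal', '04']
print(analyze(text, edge_ngram, 2, 3))
# ['FC', 'Sc', 'Sch', '04']
```

Note that the edgeNGram list quoted above (up to Schal) only comes out with max_gram set to 5; with the min_gram 2 / max_gram 3 settings from the nGram example the last call shows what you would get instead.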