 

How does the edge ngram token filter differ from the ngram token filter?

As I am new to Elasticsearch, I am not able to identify the difference between the ngram token filter and the edge ngram token filter.

How do these two differ from each other when processing tokens?

Karunakar asked Jul 14 '15


People also ask

What is edge ngram in Elasticsearch?

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word. Edge N-Grams are useful for search-as-you-type queries.
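
For illustration only, here is a minimal sketch of how such a search-as-you-type setup could look in the Elasticsearch 1.x syntax that was current when this question was asked (the index name, tokenizer name, analyzer name, field name and gram sizes are all made-up examples, not taken from the question):

curl -XPUT 'localhost:9200/autocomplete_test' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "10",
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'

Indexing "Schalke" in the name field would then store the prefixes s, sc, sch, scha, schal, schalk, schalke, so a query for "scha" analyzed with the plain standard analyzer at search time matches the document.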

What is a token filter?

Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms). Elasticsearch has a number of built-in token filters you can use to build custom analyzers.
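
As a quick sketch of what such a chain looks like (the index and analyzer names here are made up for illustration), a custom analyzer that runs the standard tokenizer and then the built-in lowercase and stop token filters can be defined and tested like this:

curl -XPUT 'localhost:9200/filter_test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}'

curl 'localhost:9200/filter_test/_analyze?pretty=1&analyzer=my_custom_analyzer' -d 'The Quick Brown Foxes'
# quick, brown, foxes   (lowercased, with the stopword "the" removed)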

What is ASCII folding?

ASCII folding. Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists. For example, the filter changes à to a. This filter uses Lucene's ASCIIFoldingFilter.
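
A small example, assuming the built-in asciifolding token filter and the query-string form of the _analyze API used elsewhere on this page:

curl 'localhost:9200/_analyze?tokenizer=standard&filters=asciifolding' -d 'déjà vu'
# deja, vu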


1 Answer

I think the documentation is pretty clear on this:

This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for the nGram tokenizer again comes from the documentation:

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

                    "type" : "nGram",                     "min_gram" : "2",                     "max_gram" : "3",                     "token_chars": [ "letter", "digit" ] 

In short:

  • the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
  • nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, each token is split into small chunks, and a chunk can start at any character in the token (every position serves as an anchor for chunks).
  • edgeNGram does the same but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

For the same text as above, an edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
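
To reproduce that side by side, a sketch of the same tokenizer definition with the type swapped to edgeNGram (max_gram raised to 5 here so the longer chunks appear; with the max_gram of 3 used above the output would stop at Sch, and the analyzer name my_edge_ngram_analyzer is just an illustrative name):

    "type" : "edgeNGram",
    "min_gram" : "2",
    "max_gram" : "5",
    "token_chars": [ "letter", "digit" ]

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, Scha, Schal, 04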

Andrei Stefan answered Sep 19 '22