I am trying to write a query in Elasticsearch that matches contiguous characters within words. So, if my index contains "John Doe", Elasticsearch should still return "John Doe" for the searches shown below.
I have tried the below query so far.
{
  "query": {
    "multi_match": {
      "query": "term",
      "operator": "OR",
      "type": "phrase_prefix",
      "max_expansions": 50,
      "fields": [
        "Field1",
        "Field2"
      ]
    }
  }
}
But this also returns unnecessary matches: I still get "John Doe" when I type "john x".
Querying Elasticsearch works by matching the queried terms against the terms in the inverted index; the queried terms and the ones in the inverted index must be exactly the same, or they won't match.
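You can inspect those indexed terms with the _analyze API. As a quick sketch, assuming the fields use the default standard analyzer, the following shows that "John Doe" is stored as the two separate tokens john and doe, which helps explain why a query containing john can still produce a match even when the rest of the input (e.g. x) is absent from the index:
POST _analyze
{
  "analyzer": "standard",
  "text": "John Doe"
}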
As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows, since they force ES to do full index scans. I'm still convinced that ngrams (more precisely, edge-ngrams) are the way to go, so I'm taking a stab at it below.
The idea is to index all the suffixes of the input and then use a prefix query to match any of them, since searching for prefixes doesn't suffer the same performance issues as searching for suffixes. Concretely, john doe would be indexed as follows:
john doe
ohn doe
hn doe
n doe
doe
oe
e
That way, using a prefix query we can match any sub-part of those tokens, which effectively achieves the goal of matching partial contiguous words while at the same time ensuring good performance.
The definition of the index would go like this:
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "suffixes",
              "reverse"
            ]
          }
        },
        "filter": {
          "suffixes": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
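Before indexing anything, you can sanity-check the analyzer with the _analyze API. This is just a verification step against the index created above:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john doe"
}
The response should list exactly the suffix tokens shown earlier (john doe, ohn doe, hn doe, and so on), confirming that the keyword tokenizer, the two reverse filters and the edge-ngram filter combine as intended.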
Then we can index a sample document:
PUT my_index/doc/1
{
  "name": "john doe"
}
And finally, all of the following searches will return the john doe document:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john doe"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john do"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "ohn do"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "n doe"
    }
  }
}
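And to verify that the original problem is gone, a search for a non-contiguous fragment such as the question's john x example should match nothing, since none of the indexed suffixes starts with that string:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john x"
    }
  }
}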