I am trying to write a query in Elasticsearch that matches contiguous characters within words. So, if my index contains "John Doe", Elasticsearch should still return "John Doe" for the searches shown below.
I have tried the below query so far.
{
  "query": {
    "multi_match": {
      "query": "term",
      "operator": "OR",
      "type": "phrase_prefix",
      "max_expansions": 50,
      "fields": [
        "Field1",
        "Field2"
      ]
    }
  }
}
But this also returns unnecessary matches: I still get "John Doe" when I type "john x".
Querying Elasticsearch works by matching the queried terms against the terms in the inverted index; the queried terms and the ones in the inverted index must be exactly the same, or they won't match.
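You can inspect those indexed terms with the _analyze API. As a quick sketch, assuming the fields use the default standard analyzer, the following shows that "John Doe" is stored as the two separate tokens john and doe, which helps explain why a query containing john can still produce a match even when the rest of the input (e.g. x) is absent from the index:
POST _analyze
{
  "analyzer": "standard",
  "text": "John Doe"
}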
As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows, since they force ES to do full index scans. I'm still convinced that ngrams (more precisely, edge-ngrams) are the way to go, so I'm taking a stab at it below.
The idea is to index all the suffixes of the input and then use a prefix query to match any of them, since searching for prefixes doesn't suffer the same performance issues as searching for suffixes. Concretely, john doe would be indexed as follows:
john doe
ohn doe
hn doe
n doe
doe
oe
e
That way, using a prefix query we can match any sub-part of those tokens, which effectively achieves the goal of matching partial contiguous words while at the same time ensuring good performance.
The definition of the index would go like this:
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "suffixes",
              "reverse"
            ]
          }
        },
        "filter": {
          "suffixes": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
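Before indexing anything, you can sanity-check the analyzer with the _analyze API. This is just a verification step against the index created above:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john doe"
}
The response should list exactly the suffix tokens shown earlier (john doe, ohn doe, hn doe, and so on), confirming that the keyword tokenizer, the two reverse filters and the edge-ngram filter combine as intended.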
Then we can index a sample document:
PUT my_index/doc/1
{
  "name": "john doe"
}
And finally, all of the following searches will return the john doe document:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john doe"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john do"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "ohn do"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john"
    }
  }
}

POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "n doe"
    }
  }
}
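And to verify that the original problem is gone, a search for a non-contiguous fragment such as the question's john x example should match nothing, since none of the indexed suffixes starts with that string:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john x"
    }
  }
}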