Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query in ElasticSearch to match part of the word

I am trying to write a query in ElasticSearch which matches contiguous characters in the words. So, if my index has "John Doe", I should still see "John Doe" returned by Elasticsearch for the below searches.

  1. john doe
  2. john do
  3. ohn do
  4. john
  5. n doe

I have tried the below query so far.

{
  "query": {
    "multi_match": {
      "query": "term",
      "operator": "OR",
      "type": "phrase_prefix",
      "max_expansions": 50,
      "fields": [
        "Field1",
        "Field2"
      ]
    }
  }
}

But this also returns unnessary matches like I will still get "John Doe" when i type john x.

like image 467
user2748107 Avatar asked May 22 '18 23:05

user2748107


People also ask

How do I query keyword in Elasticsearch?

Querying Elasticsearch works by matching the queried terms with the terms in the Inverted Index, the terms queried and the one in the Inverted Index must be exactly the same, else it won't get matched.

How do I search a specific field in Elasticsearch?

There are two recommended methods to retrieve selected fields from a search query: Use the fields option to extract the values of fields present in the index mapping. Use the _source option if you need to access the original data that was passed at index time.

How does match query work in Elasticsearch?

The match query analyzes any provided text before performing a search. This means the match query can search text fields for analyzed tokens rather than an exact term. (Optional, string) Analyzer used to convert the text in the query value into tokens. Defaults to the index-time analyzer mapped for the <field> .

Does Elasticsearch support text queries?

ElasticSearch is a search engine based on Apache Lucene, a free and open-source information retrieval software library. It provides a distributed, full-text search engine with an HTTP web interface and schema-free JSON documents.


1 Answers

As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows since that will force ES to do full index scans. I'm still convinced that ngrams (more precisely edge-ngrams) is the way to go, so I'm taking a stab at it below.

The idea is to index all the suffixes of the input and then use a prefix query to match any suffix as searching for prefixes doesn't suffer the same performance issues as searching for suffixes. So the idea is to index john doe as follows:

john doe
ohn doe
hn doe
n doe
doe
oe
e

That way, using a prefix query we can match any sub-part of those tokens which effectively achieves the goal of matching partial contiguous words while at the same time ensuring good performance.

The definition of the index would go like this:

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "suffixes",
              "reverse"
            ]
          }
        },
        "filter": {
          "suffixes": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Then we can index a sample document:

PUT my_index/doc/1
{
  "name": "john doe"
}

And finally all of the following searches will return the john doe document:

POST my_index/_search 
{
  "query": {
    "prefix": {
      "name": "john doe"
    }
  }
}

POST my_index/_search 
{
  "query": {
    "prefix": {
      "name": "john do"
    }
  }
}

POST my_index/_search 
{
  "query": {
    "prefix": {
      "name": "ohn do"
    }
  }
}

POST my_index/_search 
{
  "query": {
    "prefix": {
      "name": "john"
    }
  }
}

POST my_index/_search 
{
  "query": {
    "prefix": {
      "name": "n doe"
    }
  }
}
like image 194
Val Avatar answered Sep 19 '22 16:09

Val