
Elasticsearch: query for multiple words across multiple fields (with prefix)

I'm trying to implement an auto-suggest control powered by an ES index. The index has multiple fields, and I want to query across them using the AND operator while allowing partial matches (prefix only).

As an example, say I have two fields I want to query on: "colour" and "animal". I would like to be able to fulfil queries like "duc", "duck", "purpl", "purple", and "purple duck". I managed to get all of these working using multi_match() with the AND operator.

What I can't seem to do is match queries like "purple duc", as multi_match doesn't allow wildcards.

I've looked into match_phrase_prefix(), but as I understand it, it doesn't span multiple fields.

I'm now leaning toward implementing a custom tokenizer, as it feels like the solution may lie there. So ultimately my questions are:

1) Can someone confirm there's no out-of-the-box function that does what I want? It feels like a common enough pattern that there could be something ready to use.

2) Can someone suggest a solution? Are tokenizers part of it? I'm more than happy to be pointed in the right direction and do more research myself. Obviously, if someone has a working solution to share, that would be awesome.

Thanks in advance - F

Asked by Fab, Sep 06 '15 08:09



1 Answer

I actually wrote a blog post about this a while back for Qbox, which you can find here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. (Unfortunately, some of the links in the post are broken and can't easily be fixed at this point, but hopefully you'll get the idea.)

I'll refer you to the post for the details, but here is some code you can use to test it out quickly. Note that I'm using edge ngrams instead of full ngrams.

Also note in particular the use of the _all field and the "operator": "and" setting on the match query.
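To make the edge-versus-full n-gram distinction concrete, here is a small Python sketch. It is an illustration only, not Elasticsearch code: the helper names `edge_ngrams` and `full_ngrams` are made up for this example, and the output order is not meant to mirror the order in which Elasticsearch emits tokens.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of token with lengths min_gram..max_gram (anchored at the start)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def full_ngrams(token, min_gram=2, max_gram=20):
    """Every substring of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [
        token[start:start + n]
        for n in range(min_gram, upper + 1)
        for start in range(len(token) - n + 1)
    ]


print(edge_ngrams("duck"))  # ['du', 'duc', 'duck']
print(full_ngrams("duck"))  # ['du', 'uc', 'ck', 'duc', 'uck', 'duck']
```

Edge n-grams only ever produce prefixes, which fits the "prefix only" requirement in the question: "duc" would match "duck", but "uck" would not.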

Okay, so here are the index settings and mapping:

PUT /test_index
{
   "settings": {
      "analysis": {
         "filter": {
            "edgeNGram_filter": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "edgeNGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "edgeNGram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "enabled": true,
            "index_analyzer": "edgeNGram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "field1": {
               "type": "string",
               "include_in_all": true
            },
            "field2": {
               "type": "string",
               "include_in_all": true
            }
         }
      }
   }
}
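The request above uses the syntax that was current when this answer was written (Elasticsearch 1.x). As far as I know, later releases dropped `index_analyzer` in favour of `analyzer`, replaced the `string` type with `text`, and removed both the `_all` field and the mapping type level (`doc`). The sketch below is one possible port, written as a Python dict so it can be serialized and sent with any HTTP client. It assumes Elasticsearch 7.x or later and swaps `_all` for a `copy_to` target; the field name `all_text` and the function `build_index_body` are hypothetical names chosen for this example.

```python
import json

# Hypothetical name for the catch-all field that stands in for _all.
CATCH_ALL = "all_text"


def build_index_body(fields=("field1", "field2")):
    """Build index settings and mappings for newer Elasticsearch (assumed 7.x+)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "edge_ngram_filter": {
                        "type": "edge_ngram",
                        "min_gram": 2,
                        "max_gram": 20,
                    }
                },
                "analyzer": {
                    "edge_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "asciifolding", "edge_ngram_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # Edge n-grams at index time, plain standard analysis at search time.
                CATCH_ALL: {
                    "type": "text",
                    "analyzer": "edge_ngram_analyzer",
                    "search_analyzer": "standard",
                },
                # Each source field copies its value into the catch-all field.
                **{name: {"type": "text", "copy_to": CATCH_ALL} for name in fields},
            }
        },
    }


# Body for: PUT /test_index
print(json.dumps(build_index_body(), indent=2))
```

With a mapping like this, queries would target `all_text` rather than `_all`. Treat it as a starting point to check against the documentation for your version, not a tested drop-in replacement.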

Now add a few documents:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"purple duck","field2":"brown fox"}
{"index":{"_id":2}}
{"field1":"slow purple duck","field2":"quick brown fox"}
{"index":{"_id":3}}
{"field1":"red turtle","field2":"quick rabbit"}

And this query seems to illustrate what you're after:

POST /test_index/_search
{
   "query": {
      "match": {
         "_all": {
             "query": "purp fo slo",
             "operator": "and"
         }
      }
   }
}

returning:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19930676,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.19930676,
            "_source": {
               "field1": "slow purple duck",
               "field2": "quick brown fox"
            }
         }
      ]
   }
}

Here is the code I used to test it out:

http://sense.qbox.io/gist/b87e426062f453d946d643c7fa3d5480cd8e26ec
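To see why only document 2 comes back for "purp fo slo", the following Python sketch mimics the two analysis paths in a simplified way: edge n-grams over every field value at index time (the role `_all` plays here), and plain lowercased words at search time. It is a rough model, not Elasticsearch itself: asciifolding and scoring are left out, the standard analyzer is approximated by splitting on non-word characters, and all function names are made up for this example.

```python
import re

DOCS = {
    1: {"field1": "purple duck", "field2": "brown fox"},
    2: {"field1": "slow purple duck", "field2": "quick brown fox"},
    3: {"field1": "red turtle", "field2": "quick rabbit"},
}


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def index_terms(doc):
    """Index-time terms: whitespace split, lowercase, edge n-grams, over all fields."""
    terms = set()
    for value in doc.values():
        for token in value.lower().split():
            terms |= edge_ngrams(token)
    return terms


def search_terms(query):
    """Search-time terms: lowercase words only, no n-grams."""
    return [t for t in re.split(r"\W+", query.lower()) if t]


def search(query):
    """Return ids of documents containing every query term (operator "and")."""
    wanted = search_terms(query)
    return [
        doc_id
        for doc_id, doc in DOCS.items()
        if all(term in index_terms(doc) for term in wanted)
    ]


print(search("purp fo slo"))  # [2]
print(search("purple duc"))   # [1, 2]
```

Document 1 contains "purp" and "fo" as prefixes but nothing starting with "slo", so the "and" operator excludes it. Because the search side is not n-grammed, each typed word has to appear as a whole indexed prefix, which is the behaviour the question asks for.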

Answered by Sloan Ahrens, Oct 05 '22 13:10