
ElasticSearch phrase prefix search - How do I get the matched phrase?

I'm building an autocomplete feature using ElasticSearch. As the user types, I want to show a list of completions from the data, so the user can select one. For example, if the data contains the following phrases:

very unusual
very unlikely
very useful

and the user types:

very u

I want to display the phrases above.

I'm using this query:

  {
    "query": {
      "multi_match": {
        "query": "very u",
        "fields": [
          "name",
          "description",
          "contentBlocks.caption",
          "contentBlocks.text"
        ],
        "type": "phrase_prefix",
        "max_expansions": 10,
        "cutoff_frequency": 0.001
      }
    }
  }

This matches the content I'm looking for, but extracting the matched phrases from the search results is quite awkward. I have been using highlighting, and I collect the matched phrases by parsing the highlights. For example:

    "highlight": {
      "contentBlocks.text": [
        "turned the <em>very</em> <em>unusual</em> doorknob"
      ]
    }

    "highlight": {
      "contentBlocks.text": [
        "invented a <em>very</em> <em>useful</em> mechanism"
      ]
    }
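For reference, the highlight-parsing step described above can be sketched roughly as follows. This is only an illustration, not a recommended solution: it assumes the default `<em>`/`</em>` highlight tags and treats highlighted terms separated only by whitespace as one phrase. The function name `matched_phrases` is made up for the example.

```python
import re

# A run of one or more highlighted terms separated only by whitespace.
RUN = re.compile(r"(?:<em>[^<]*</em>\s*)+")
# A single highlighted term inside such a run.
TERM = re.compile(r"<em>([^<]*)</em>")


def matched_phrases(hits):
    """Collect distinct highlighted phrases from a list of search hits.

    Each hit is expected to look like the "highlight" fragments shown
    above (assumption: default <em> tags, no nested markup).
    """
    phrases = []
    for hit in hits:
        for fragments in hit.get("highlight", {}).values():
            for fragment in fragments:
                for run in RUN.finditer(fragment):
                    phrase = " ".join(TERM.findall(run.group(0))).lower()
                    if phrase not in phrases:
                        phrases.append(phrase)
    return phrases


hits = [
    {"highlight": {"contentBlocks.text": [
        "turned the <em>very</em> <em>unusual</em> doorknob"]}},
    {"highlight": {"contentBlocks.text": [
        "invented a <em>very</em> <em>useful</em> mechanism"]}},
]
print(matched_phrases(hits))
```

It works, but it depends on the highlighter's markup and needs one pass over every returned fragment, which is why it feels awkward.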

What's the right way to do this?


"Phrase Suggester" might be capable of doing what I have described, but it is not at all obvious how you would get it to do that.

I have indexed the fields of interest (for example, "description") as follows:

  "description" : {
    "index_analyzer" : "snowball_stem",
    "search_analyzer" : "snowball_stem",
    "type" : "string",
    "fields" : {
      "autocomplete" : {
        "index_analyzer" : "shingle_analyzer",
        "search_analyzer" : "shingle_analyzer",
        "type" : "string"
      }
    }
  },

I am using the snowball_stem analyzer for search, and the shingle_analyzer for the autocomplete function. shingle_analyzer looks like this:

"settings" : {
    "analysis" : {
        "analyzer" : {
            "shingle_analyzer" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : [
                    "standard",
                    "lowercase",
                    "shingle_filter"
                ],
                "char_filter" : [
                    "html_strip"
                ]
            }
        },
        "filter" : {
            "shingle_filter" : {
                "type" : "shingle",
                "min_shingle_size" : 2,
                "max_shingle_size" : 2
            }
        }
    }
},
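To make it concrete what ends up in the `description.autocomplete` field, here is a rough Python approximation of the token stream this analyzer produces. It is a sketch under assumptions: the standard tokenizer is approximated with a simple `\w+` split, `html_strip` is not modeled, and the shingle filter is assumed to keep its default of emitting single words alongside the two-word shingles (`output_unigrams` is not set in the settings above).

```python
import re


def shingle_tokens(text, size=2):
    """Approximate lowercase + shingle analysis (unigrams plus `size`-word shingles)."""
    words = re.findall(r"\w+", text.lower())
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)  # the single word at this position
        if i + size <= len(words):
            tokens.append(" ".join(words[i:i + size]))  # shingle starting here
    return tokens


tokens = shingle_tokens("turned the very unusual doorknob")
print(tokens)
```

Under these assumptions the field contains both single words such as "very" and two-word terms such as "very unusual".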

The documentation for the phrase suggester seems to be totally oriented toward "spelling correction" rather than completion. Since what I'm after is completion, I set the direct generator's min_word_length and prefix_length to the length of the input text, in this case, 2.

I put together a suggestion query based on the documentation:

{
    "text" : "sa",
    "autocomplete_description" : {
        "phrase" : {
            "analyzer" : "standard",
            "field" : "description.autocomplete",
            "size" : 10,
            "max_errors" : 2,
            "confidence" : 0.0,
            "gram_size" : 2,
            "direct_generator" : [
                {
                    "field" : "description.autocomplete",
                    "suggest_mode" : "always",
                    "size" : 10,
                    "min_word_length" : 2,
                    "prefix_length" : 2
                }
            ]
        }
    }
}

This search for suggestions for "sa" comes up with the following results:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "autocomplete_description" : [ {
    "text" : "sa",
    "offset" : 0,
    "length" : 2,
    "options" : [ {
      "text" : "say",
      "score" : 0.012580795
    }, {
      "text" : "sa",
      "score" : 0.01127677
    }, {
      "text" : "san",
      "score" : 0.0106529845
    }, {
      "text" : "sad",
      "score" : 0.008533429
    }, {
      "text" : "saw",
      "score" : 0.008107899
    }, {
      "text" : "sam",
      "score" : 0.007155634
    } ]
  } ]
}

What I expect to find for the input "sa" is words that begin with "sa" of any length. Why does it only return words of two or three characters? Why does it only return six options? The multi_match phrase_prefix query I've been using finds many longer words beginning with "sa", such as "saving", "sassy", "safari", and "salad".

When I search for suggestions for multi-word text, such as "one or" (which occurs plenty of times in the data), it finds nothing. The multi_match phrase_prefix query finds "one or more", "one or the", "one, or you", and "one or both".

How can I get this suggester to do what I want?

asked Apr 23 '14 22:04 by David Haimson


1 Answer

You can get roughly what you want with the completion suggester. The main problem is that it is no longer search-aware. You can partly work around this by adding a suggester context, but that only works for filters and doesn't take the search text into account.
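As a rough sketch of the completion suggester route, the Python helpers below build the mapping, a sample document, and a request body. The field name `suggest`, the suggestion name `autocomplete`, and the helper names are all made up for illustration. The request follows the shape used by the 1.x-era `_suggest` endpoint, which matches the mappings in the question; newer Elasticsearch versions nest suggesters under `suggest` in a search request and use `prefix` instead of `text`, so check the documentation for your version.

```python
def completion_mapping(field="suggest"):
    """Mapping fragment for a completion field (field name is hypothetical)."""
    return {field: {"type": "completion"}}


def completion_document(phrases, field="suggest"):
    """Document fragment listing the phrases that should be offered as completions."""
    return {field: {"input": list(phrases)}}


def completion_request(text, field="suggest", name="autocomplete", size=10):
    """Request body for a 1.x-style _suggest call (assumed request shape)."""
    return {name: {"text": text, "completion": {"field": field, "size": size}}}


print(completion_mapping())
print(completion_document(["very unusual", "very unlikely", "very useful"]))
print(completion_request("very u"))
```

Note that the phrases have to be supplied explicitly at index time, which is part of why this approach is not aware of the search context.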

The only way that I know of to get the "best" behavior (context aware search completions) is to do the following:

  • Create a suggestions field where the text is tokenized as you would want it to be seen by the user (probably standard analyzer or maybe add on a 2-shingle token filter).
  • Let's say the user issues the incomplete query very un. Behind the scenes, issue a search for very, then use a terms aggregation on the suggestions field to get a list of terms that match the search context, limiting the terms returned with "include": "un.*".
  • The resulting list will look like [unusual,unlikely,uncool].
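The steps above could be sketched as a small Python helper that turns the user's partial input into a search body. This is an illustration under assumptions, not a tested recipe: `description` as the searched field, `completions` as the aggregation name, and the helper name are hypothetical, and the last word is reduced to letters and digits so it can be used safely inside the `include` pattern.

```python
def build_completion_request(user_input, search_field="description",
                             suggest_field="suggestions", size=10):
    """Build a search body: query on the completed words, aggregate on the prefix."""
    words = user_input.split()
    if not words:
        raise ValueError("empty input")
    *context_words, last = words
    # Keep only letters and digits so the prefix cannot inject regex syntax.
    prefix = "".join(ch for ch in last.lower() if ch.isalnum())
    context = " ".join(context_words)
    if context:
        query = {"match": {search_field: context}}
    else:
        query = {"match_all": {}}  # nothing typed before the prefix yet
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": query,
        "aggs": {
            "completions": {
                "terms": {
                    "field": suggest_field,
                    "include": prefix + ".*",
                    "size": size,
                }
            }
        },
    }


body = build_completion_request("very un")
print(body)
```

The bucket keys of the `completions` aggregation in the response would then be the candidate completions.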

The only problem with this method, especially in a sharded environment, is that it takes a lot of queries and pulls a very high-cardinality field (suggestions) into memory. So I don't know whether it is practically feasible, and it may be better to go back to the completion suggester. If you try either of these, I'm interested in hearing about your experience.

answered Oct 21 '22 14:10 by JnBrymn