I'm building an autocomplete feature using Elasticsearch. As the user types, I want to show a list of completions drawn from the data, so the user can select one. For example, if the data contains the following phrases:
very unusual
very unlikely
very useful
and the user types:
very u
I want to display the phrases above.
I'm using this query:
"query": {
  "multi_match": {
    "query": "very u",
    "fields": [
      "name",
      "description",
      "contentBlocks.caption",
      "contentBlocks.text"
    ],
    "type": "phrase_prefix",
    "max_expansions": 10,
    "cutoff_frequency": 0.001
  }
}
This matches the content I'm looking for, but extracting the matched phrases from the search results is quite awkward. I have been using highlighting, and I collect the matched phrases by parsing the highlights. For example:
"highlight": {
  "contentBlocks.text": [
    "turned the <em>very</em> <em>unusual</em> doorknob"
  ]
}
"highlight": {
  "contentBlocks.text": [
    "invented a <em>very</em> <em>useful</em> mechanism"
  ]
}
What's the right way to do this?
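For reference, the highlight parsing I described amounts to something like the following Python sketch. It assumes the default `<em>` highlight tags, and `highlighted_phrases` is just a name I'm using here for illustration:

```python
import re

# A run of adjacent <em>...</em> fragments separated only by whitespace.
EM_RUN = re.compile(r"<em>[^<]*</em>(?:\s+<em>[^<]*</em>)*")
EM_TAG = re.compile(r"</?em>")


def highlighted_phrases(highlight):
    """Collect each run of adjacent highlighted words as one phrase.

    `highlight` is the "highlight" object of a single hit:
    a dict mapping field names to lists of highlighted fragments.
    """
    phrases = []
    for fragments in highlight.values():
        for fragment in fragments:
            for run in EM_RUN.findall(fragment):
                # Drop the tags and normalize whitespace between words.
                phrase = " ".join(EM_TAG.sub("", run).split())
                if phrase not in phrases:
                    phrases.append(phrase)
    return phrases
```

It works, but it depends on the highlighter's tag format and on the matched words being adjacent in the fragment, which is why it feels awkward.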
"Phrase Suggester" might be capable of doing what I have described, but it is not at all obvious how you would get it to do that.
I have indexed the fields of interest (for example, "description") as follows:
"description" : {
  "index_analyzer" : "snowball_stem",
  "search_analyzer" : "snowball_stem",
  "type" : "string",
  "fields" : {
    "autocomplete" : {
      "index_analyzer" : "shingle_analyzer",
      "search_analyzer" : "shingle_analyzer",
      "type" : "string"
    }
  }
},
I am using the snowball_stem analyzer for search, and the shingle_analyzer for the autocomplete function. shingle_analyzer looks like this:
"settings" : {
  "analysis" : {
    "analyzer" : {
      "shingle_analyzer" : {
        "type" : "custom",
        "tokenizer" : "standard",
        "filter" : [
          "standard",
          "lowercase",
          "shingle_filter"
        ],
        "char_filter" : [
          "html_strip"
        ]
      }
    },
    "filter" : {
      "shingle_filter" : {
        "type" : "shingle",
        "min_shingle_size" : 2,
        "max_shingle_size" : 2
      }
    }
  }
},
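To make concrete what this analyzer puts in the index, here is a rough Python emulation of the token stream. It is only an approximation under stated assumptions: the standard tokenizer is stood in for by a simple `\w+` regex, `html_strip` is not emulated, and I'm relying on the shingle filter's default of emitting unigrams alongside the shingles. `shingle_tokens` is a hypothetical helper, not part of any Elasticsearch client.

```python
import re


def shingle_tokens(text, size=2, output_unigrams=True):
    """Approximate the tokens produced by lowercase + a fixed-size shingle filter."""
    # Crude stand-in for the standard tokenizer plus the lowercase filter.
    words = re.findall(r"\w+", text.lower())
    tokens = []
    for i, word in enumerate(words):
        if output_unigrams:
            tokens.append(word)
        # Emit the shingle starting at this position, if enough words remain.
        if i + size <= len(words):
            tokens.append(" ".join(words[i:i + size]))
    return tokens
```

So under these assumptions a fragment like "very unusual doorknob" yields both the single words and the two-word shingles "very unusual" and "unusual doorknob" as terms in description.autocomplete.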
The documentation for the phrase suggester seems to be totally oriented toward "spelling correction" rather than completion. Since what I'm after is completion, I set the direct generator's min_word_length and prefix_length to the length of the input text, in this case, 2.
I put together a suggestion query based on the documentation:
{
  "text" : "sa",
  "autocomplete_description" : {
    "phrase" : {
      "analyzer" : "standard",
      "field" : "description.autocomplete",
      "size" : 10,
      "max_errors" : 2,
      "confidence" : 0.0,
      "gram_size" : 2,
      "direct_generator" : [
        {
          "field" : "description.autocomplete",
          "suggest_mode" : "always",
          "size" : 10,
          "min_word_length" : 2,
          "prefix_length" : 2
        }
      ]
    }
  }
}
This search for suggestions for "sa" comes up with the following results:
{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "autocomplete_description" : [ {
    "text" : "sa",
    "offset" : 0,
    "length" : 2,
    "options" : [ {
      "text" : "say",
      "score" : 0.012580795
    }, {
      "text" : "sa",
      "score" : 0.01127677
    }, {
      "text" : "san",
      "score" : 0.0106529845
    }, {
      "text" : "sad",
      "score" : 0.008533429
    }, {
      "text" : "saw",
      "score" : 0.008107899
    }, {
      "text" : "sam",
      "score" : 0.007155634
    } ]
  } ]
}
What I expect to find for the input "sa" is words that begin with "sa" of any length. Why does it only return words of two or three characters? Why does it only return six options? The multi_match phrase_prefix query I've been using finds many longer words beginning with "sa", such as "saving", "sassy", "safari", and "salad".
When I search for suggestions for multi-word text, such as "one or" (which occurs plenty of times in the data), it finds nothing. The multi_match phrase_prefix query finds "one or more", "one or the", "one, or you", and "one or both".
How can I get this suggester to do what I want?
You can get roughly what you want with the completion suggester. The main problem with this is that it's no longer search aware. You can sorta fix this by adding in a suggester context but it only works for filters and doesn't take into account the search text.
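As a rough sketch of the completion suggester route, here are two small Python helpers that build the mapping and the suggest request as plain dicts. The request shape follows the older `_suggest` style used in the question (suggestion name at the top level, with `text` and `completion` inside it); newer releases wrap this differently, so check the documentation for your version. The field name `suggest`, the suggestion name `autocomplete`, and both helper names are assumptions for illustration only.

```python
def completion_mapping(field="suggest"):
    """Mapping fragment declaring a completion-type field (field name is hypothetical)."""
    return {field: {"type": "completion"}}


def completion_request(text, field="suggest", size=10, name="autocomplete"):
    """Suggest request body asking for completions of `text` from `field`."""
    return {
        name: {
            "text": text,
            "completion": {
                "field": field,
                "size": size,  # number of suggestions to return
            },
        }
    }
```

For example, `completion_request("very u")` produces the body you would send to the suggest endpoint. Note that the completion field has to be populated at index time with the phrases you want offered; it does not look at the search results, which is the "no longer search aware" limitation.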
The only way that I know of to get the "best" behavior (context-aware search completions) is to do the following:
1. Index a suggestions field where the text is tokenized as you would want it to be seen by the user (probably the standard analyzer, or maybe add on a 2-shingle token filter).
2. Say the user types very un. Behind the scenes, issue a search for very and then use a terms aggregation to get a list of terms that match the search context, but limit the terms returned with "include": "un.*".
The only problem with this method, especially in a sharded environment, is that it's a lot of queries and you're pulling a very high-cardinality field (suggestions) into memory. So... I don't know if this is practically feasible. So maybe it's better to go back to the completion suggester. If you try either of these, I'm interested in hearing your experience with it.
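The two steps above could be sketched in Python roughly as follows. This only builds the request body; it is a minimal sketch under stated assumptions, not a tested production setup. The helper names, the aggregation name `completions`, and the use of a `match` query for the context are my own choices, and `re.escape` is used on the assumption that backslash escaping is acceptable in the `include` pattern.

```python
import re


def split_input(text):
    """Split typed text into (context, prefix): the words before the last one, and the last word."""
    words = text.strip().split()
    if not words:
        return "", ""
    return " ".join(words[:-1]), words[-1]


def suggestion_request(text, field="suggestions", size=10):
    """Search for the context words and aggregate terms starting with the prefix."""
    context, prefix = split_input(text)
    # With no context yet (a single word typed), fall back to matching everything.
    query = {"match": {field: context}} if context else {"match_all": {}}
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": query,
        "aggs": {
            "completions": {
                "terms": {
                    "field": field,
                    # Keep only terms that begin with what the user is typing.
                    "include": re.escape(prefix.lower()) + ".*",
                    "size": size,
                }
            }
        },
    }
```

For the input very un, this searches for very and restricts the returned terms with "include": "un.*", as described. The memory concern still applies: the terms aggregation has to load the suggestions field for every matching document.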