I'd like to take a query like "jan do" and have it match values like "jane doe", "don janek" -- and of course: "jan do", "do jan".
So the rules I can think of at the moment are: split the query into tokens, treat each token as a prefix of a word in the value, and don't require the tokens to appear in any particular order.
So far, I have this mapping:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}
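To sanity-check what this analyzer produces, I can run some text through the _analyze API (the request below uses the ES 1.x query-string form, so adjust it for your version):

GET /test/_analyze?analyzer=my_keyword&text=Jane Doe

This should return a single token, "jane doe", since the keyword tokenizer emits the whole value as one token before the lowercase and asciifolding filters run.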
I've been searching things as phrases:
POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "jan do",
              "type": "phrase_prefix"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "jan do",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
That works okay when a value actually starts with the phrase, but now I want to tokenize the query and treat each token as a prefix.
Is there a way I can do this (probably at query time)?
My other option is to just construct a query like this:
POST test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "answer.text.stemmed": "jan"
          }
        },
        {
          "prefix": {
            "answer.text.stemmed": "do"
          }
        }
      ]
    }
  }
}
This seems to work, but it doesn't preserve the order of the words. It also feels like cheating, and it's probably not the most performant option. What would happen with 10 prefixes? 100? I'd like to know whether anyone feels otherwise.
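One tweak if I go this route: with should clauses and no minimum_should_match, a document matching only one of the prefixes still counts as a hit. Requiring every token would mean moving the prefix clauses under must, something like this sketch:

POST test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "answer.text.stemmed": "jan"
          }
        },
        {
          "prefix": {
            "answer.text.stemmed": "do"
          }
        }
      ]
    }
  }
}

but that still says nothing about word order.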
As the comment above suggests, you should take a look at ngrams in Elasticsearch, and in particular edge ngrams.
I wrote up an introduction to using ngrams in this blog post for Qbox, but here is a quick example you can play with.
Here is an index definition that applies an edge ngram token filter as well as several other filters to a custom analyzer (using the standard tokenizer).
There have been some changes in the way analyzers are applied in ES 2.0, but notice that I am using the standard analyzer as the "search_analyzer". This is because I don't want the search text to be tokenized into ngrams; I want it matched directly against the indexed tokens. I'll refer you to the blog post for a description of the details.
Anyway, here is the mapping:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "stop",
            "kstem",
            "edgengram_filter"
          ]
        }
      },
      "filter": {
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        },
        "price": {
          "type": "integer"
        }
      }
    }
  }
}
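To see what the autocomplete analyzer does at index time, you can run some text through the _analyze API (again in the ES 1.x query-string form; adjust for your version):

GET /test_index/_analyze?analyzer=autocomplete&text=shoes

"shoes" gets stemmed to "shoe" by kstem and then expanded into the edge ngrams "sh", "sho", and "shoe", so a query of as few as two characters can match it. The standard search analyzer, by contrast, leaves a query like "ver sh" as the two whole tokens "ver" and "sh".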
Then I index a few simple documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"name": "very cool shoes","price": 26}
{"index":{"_id":2}}
{"name": "great shampoo","price": 15}
{"index":{"_id":3}}
{"name": "shirt","price": 25}
And now the following query will get me the expected autocomplete results:
POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ver sh",
        "operator": "and"
      }
    }
  }
}
...
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2169777,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2169777,
        "_source": {
          "name": "very cool shoes",
          "price": 26
        }
      }
    ]
  }
}
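Other prefix combinations should behave the same way; for example, this query ought to return the "great shampoo" document, since "gre" and "sham" are both edge ngrams of its tokens:

POST /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "gre sham",
        "operator": "and"
      }
    }
  }
}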
Here is all the code I used in the example:
http://sense.qbox.io/gist/c2ba05900d0749fa3b1ba516c66431ae1a9d5e61