We have a elasticsearch index containing a catalog of products, that we want to search by title and description.
We want it to have the following constraints:
I have tried to create a query like that, but the results are really less than accurate. Sometimes finding completely unrelated stuff. I think that's because of the wildcard query.
Also I think there must be a more elegant solution for the "created_at" scoring. Right?
I am using Elasticsearch 6.2
This is my current code. I would be happy to see an elegant solution for this:
{
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"min_score": 0.3,
"size": 12,
"from": 0,
"query": {
"bool": {
"filter": {
"terms": {
"category_id": [
"212",
"213"
]
}
},
"should": [
{
"match": {
"title_completion": {
"query": "Development",
"boost": 20
}
}
},
{
"wildcard": {
"title": {
"value": "*Development*",
"boost": 1
}
}
},
{
"wildcard": {
"title_completion": {
"value": "*Development*",
"boost": 10
}
}
},
{
"match": {
"title": {
"query": "Development",
"operator": "and",
"fuzziness": 1
}
}
},
{
"range": {
"created_at": {
"gte": 1563264817998,
"boost": 11
}
}
},
{
"range": {
"created_at": {
"gte": 1563264040398,
"boost": 4
}
}
},
{
"range": {
"created_at": {
"gte": 1563256264398,
"boost": 1
}
}
}
]
}
}
}
If a search request results in more than ten hits, ElasticSearch will, by default, only return the first ten hits. To override that default value in order to retrieve more or fewer hits, we can add a size parameter to the search request body.
You can use the search API to search and aggregate data stored in Elasticsearch data streams or indices. The API's query request body parameter accepts queries written in Query DSL. The following request searches my-index-000001 using a match query. This query matches documents with a user.id value of kimchy .
Both MySQL and Elasticsearch provide a powerful capability of full text search. If your system is using MySQL as the data store, the feature of full text search can be quickly enabled by creating full text indexes for the target data fields.
First of all, building a request returning relevant results is usually a difficult task. It can't be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid unrelevant results.
You can use boost
as you did in your query to give more importance to matches on title compared to description.
You should use AUTO
value for the fuzzy field to define different values of fuzziness depending on the length of the term. E.g., by default terms having less than 3 letters (most common terms where a change in letter can result in a different word) will not allows changes. Terms with more than 3 letters will allow one change and more than 5 will allow 2 changes. You can change this behavior depending of your tests.
Use a should
clause in the bool
statement. Clauses in a should
statements does not filter documents (unless specified otherwise). The queries in should
clause are only used to increase the score.
Use a must
of filter
clause in the bool
statement to ensure that all documents validate a constraint. If you don't want the subqueries to contribute to the score (I believe its your case), use filter
instead of match
because filter
will be able to cache the results. Your query is ok for this requirement.
You should use a function score
with a decay function. If decay function are not clear for you, you can skip the equations in the document and jump to the figure which self explanatory. The following query is an example using a gauss decay function.
{
"function_score": {
// Name of the decay function
"gauss": {
// Field to use
"created_at": {
"origin": "now", // "now" is the default so you can omit this field
"offset": "1d", // Values with less than 1 day will not be impacted
"scale": "10d", // Duration for which the scores will be scaled using a gauss function
"decay" : 0.01 // Score for values further than scale
}
}
}
}
Avoid wildcard queries: If you use *
they are not efficient and will consume a lot of memory. If you want to be able to search in part of a term (finding "penthouse" when the user search "house") you should create a subfield using ngram
tokenizer and write a standard match
query using the subfield.
Avoid setting a minimum score: The score is a relative value. A small score or a high score does not mean that the document is relevant or not. You can read this article about the subject.
Be carefull with fuzzy
queries: Fuzzy can generate a lot of noise and confuse users. In general, I would recommend to increase the default AUTO
threshold for fuzzy and accept that some queries with mispelling does not return good results. Usually, it is simpler for a user to detect a mispelling in his input compared to understanding why he has completly unrelated results.
This is just an example that you will need to adapt with your data.
{
"size": 12,
"query": {
"bool": {
"filter": {
"terms": {
"category_id": <CATEGORY_IDS>
}
},
"should": [
{
"match": {
"title": {
"query": <QUERY>,
"fuzziness": AUTO:4:12,
"boost": 3
}
}
},
{
"match": {
"title_completion": {
"query": <QUERY>,
"boost": 1
}
}
},
{
"match": {
// title_completion field with ngram tokenizer
"title_completion.ngram": {
"query": <QUERY>,
// Use lower boost because it match only partially
"boost": 0.5
}
}
}
]
},
"function_score": {
// Name of the decay function
"gauss": {
// Field to use
"created_at": {
"origin": "now", // "now" is the default so you can omit this field
"offset": "1d", // Values with less than 1 day will not be impacted
"scale": "10d", // Duration for which the scores will be scaled using a gauss function
"decay" : 0.01 // Score for values further than scale
}
}
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With