Elasticsearch Query for good title keyword results

We have an Elasticsearch index containing a catalog of products that we want to search by title and description.

We want it to have the following constraints:

  • We are searching title and description for occurrences (matches in title should be twice as important as description)
  • We want it to have a very light fuzzy search result (but still accurate results)
  • Results that do not match the search term should not be filtered out, only shown later (so matching results should be on top and worse results at the bottom)
  • category_id should filter products out (so no results of other categories should be shown)
  • The created_at attribute should be valued very high in sorting as well. Products should lose score the "older" they get. (This is very important, because they lose importance with every day)

I have tried to create a query like that, but the results are far from accurate and sometimes include completely unrelated products. I think that's because of the wildcard queries.

Also I think there must be a more elegant solution for the "created_at" scoring. Right?

I am using Elasticsearch 6.2

This is my current code. I would be happy to see an elegant solution for this:

{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "min_score": 0.3,
  "size": 12,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "category_id": [
            "212",
            "213"
          ]
        }
      },
      "should": [
        {
          "match": {
            "title_completion": {
              "query": "Development",
              "boost": 20
            }
          }
        },
        {
          "wildcard": {
            "title": {
              "value": "*Development*",
              "boost": 1
            }
          }
        },
        {
          "wildcard": {
            "title_completion": {
              "value": "*Development*",
              "boost": 10
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Development",
              "operator": "and",
              "fuzziness": 1
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264817998,
              "boost": 11
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264040398,
              "boost": 4
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563256264398,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}
SimonEritsch asked Jul 16 '19 08:07


1 Answer

First of all, building a request that returns relevant results is usually a difficult task, and it can't really be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid irrelevant results.

We are searching title and description for occurrences (matches in title should be twice as important as description)

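You can use boost, as you did in your query, to give matches on title more weight than matches on description. Note that your current query does not search description at all, so you will need to add a clause for it.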
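As a rough sketch of that weighting, a multi_match query can search both fields in one clause and boost title with the ^2 suffix. The Python helper below only builds the request body. The helper name is hypothetical, and the field names title and description are assumed from the question.

```python
import json


def build_text_clause(search_term: str) -> dict:
    """Build a multi_match clause in which title is boosted over description."""
    return {
        "multi_match": {
            "query": search_term,
            # "title^2" applies a boost of 2 to matches on title
            "fields": ["title^2", "description"],
        }
    }


if __name__ == "__main__":
    # Print the clause so it can be pasted into the "should" array of a bool query
    print(json.dumps(build_text_clause("Development"), indent=2))
```

The resulting clause can replace separate match clauses on title and description inside the should array.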

We want it to have a very light fuzzy search result (but still accurate results)

You should use the AUTO value for the fuzziness parameter, which picks the allowed edit distance from the length of the term. By default, terms of fewer than 3 characters (the most common terms, where changing one letter usually gives a different word) must match exactly, terms of 3 to 5 characters allow one edit, and terms of more than 5 characters allow two edits. You can change these thresholds with the AUTO:low,high syntax, depending on your tests.
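To make the thresholds concrete, here is a small sketch of how AUTO maps term length to an edit distance. It only mirrors the documented rule (defaults equivalent to AUTO:3,6) and is not Elasticsearch code; the function name is my own.

```python
def auto_fuzziness_edits(term: str, low: int = 3, high: int = 6) -> int:
    """Return the edit distance AUTO:low,high allows for a term of this length."""
    length = len(term)
    if length < low:
        return 0  # short terms must match exactly
    if length < high:
        return 1  # medium terms allow one edit
    return 2      # long terms allow two edits


if __name__ == "__main__":
    for word in ("tv", "data", "Development"):
        print(word, auto_fuzziness_edits(word), auto_fuzziness_edits(word, low=4, high=12))
```

With stricter thresholds such as AUTO:4,12, an 11-letter term like "Development" is allowed one edit instead of two, which cuts down on noisy matches.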

Results that do not match the search term should not be filtered out, only shown later (so matching results should be on top and worse results at the bottom)

Use a should clause in the bool query. When the bool query also contains a must or filter clause, the clauses in should do not filter documents (unless minimum_should_match says otherwise). They are only used to increase the score of the documents they match.

category_id should filter products out (so no results of other categories should be shown)

Use a must or filter clause in the bool query to ensure that all returned documents satisfy a constraint. If you don't want the subqueries to contribute to the score (I believe that's your case), use filter instead of must, because filter clauses skip scoring and their results can be cached. Your query is fine for this requirement.

The created_at attribute should be valued very high in sorting as well. Products should lose score the "older" they get. (This is very important, because they lose importance with every day)

You should use a function_score query with a decay function. If decay functions are not clear to you, you can skip the equations in the documentation and jump to the figure, which is self-explanatory. The following fragment is an example using a gauss decay function.

{
    "function_score": {
        // Name of the decay function
        "gauss": {
            // Field to use
            "created_at": {
                "origin": "now",  // "now" is the default for date fields, so you can omit it
                "offset": "1d",   // Documents less than 1 day old are not penalized
                "scale": "10d",   // Distance beyond the offset at which the score equals "decay"
                "decay": 0.01     // Score multiplier at that distance (here, at 11 days old)
            }
        }
    }
}
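
To get a feel for these parameters, the decay can be reproduced outside Elasticsearch. The sketch below follows the gauss formula given in the function score documentation; the function name and the use of plain day counts instead of dates are my own simplifications.

```python
import math


def gauss_decay(age_days: float, offset_days: float = 1.0,
                scale_days: float = 10.0, decay: float = 0.01) -> float:
    """Multiplier a gauss decay applies to a document that is age_days old."""
    # Distance from the origin, ignoring anything inside the offset
    distance = max(0.0, abs(age_days) - offset_days)
    # sigma^2 is chosen so that the multiplier equals `decay` at distance == scale
    sigma_squared = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_squared))


if __name__ == "__main__":
    for age in (0, 1, 3, 6, 11, 20):
        print(f"{age:>2} days old -> multiplier {gauss_decay(age):.4f}")
```

With the values above, documents up to 1 day old keep their full score and a document that is 11 days old is multiplied by 0.01. If that is too aggressive for your catalog, raise scale or decay.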

Hints for writing queries

  • Avoid wildcard queries: Patterns that start with * are inefficient and consume a lot of memory, because every term in the index has to be checked. If you want to search within part of a term (finding "penthouse" when the user searches for "house"), create a subfield that uses an ngram tokenizer and write a standard match query against that subfield.

  • Avoid setting a minimum score: The score is a relative value. A low or high score does not by itself tell you whether a document is relevant. You can read this article about the subject.

  • Be careful with fuzzy queries: Fuzziness can generate a lot of noise and confuse users. In general, I would recommend raising the default AUTO thresholds and accepting that some misspelled queries will not return good results. It is usually easier for users to spot a misspelling in their input than to understand why they got completely unrelated results.
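
To illustrate the ngram hint, here is a sketch of what such a subfield could look like, together with a tiny simulation of why it matches partial terms. The analyzer and tokenizer names are hypothetical, the gram sizes are arbitrary, and I am assuming title_completion is a text field. Adapt all of this to your own mapping.

```python
def ngrams(word: str, min_gram: int = 3, max_gram: int = 5) -> set:
    """Roughly what an ngram tokenizer plus lowercase filter emits for one word."""
    word = word.lower()
    return {
        word[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(word) - size + 1)
    }


# Hypothetical index settings: an analyzer built on an ngram tokenizer
NGRAM_ANALYSIS = {
    "analysis": {
        "tokenizer": {
            "title_ngram_tokenizer": {
                "type": "ngram",
                "min_gram": 3,
                "max_gram": 5,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "title_ngram": {
                "type": "custom",
                "tokenizer": "title_ngram_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}

# Hypothetical field mapping: title_completion gets an "ngram" subfield,
# queried as "title_completion.ngram"
TITLE_COMPLETION_MAPPING = {
    "title_completion": {
        "type": "text",
        "fields": {"ngram": {"type": "text", "analyzer": "title_ngram"}},
    }
}

if __name__ == "__main__":
    # Grams shared by the indexed word and the search term are what produce a match
    print(sorted(ngrams("penthouse") & ngrams("house")))
```

Because every gram of "house" is also a gram of "penthouse", a plain match query on the subfield finds the document without any wildcard. The trade-off is a larger index, so keep the gram range narrow.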

Example query

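This is just an example that you will need to adapt to your data. Replace <QUERY> with the search term (as a JSON string) and <CATEGORY_IDS> with the array of category IDs.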

{
  "size": 12,
  "query": {
    "function_score": {
      // The bool query filters by category and computes the text relevance score
      "query": {
        "bool": {
          "filter": {
            "terms": {
              "category_id": <CATEGORY_IDS>
            }
          },
          "should": [
            {
              "match": {
                "title": {
                  "query": <QUERY>,
                  "fuzziness": "AUTO:4,12",
                  "boost": 3
                }
              }
            },
            {
              "match": {
                "title_completion": {
                  "query": <QUERY>,
                  "boost": 1
                }
              }
            },
            {
              "match": {
                // title_completion subfield with ngram tokenizer
                "title_completion.ngram": {
                  "query": <QUERY>,
                  // Use a lower boost because it matches only partially
                  "boost": 0.5
                }
              }
            }
          ]
        }
      },
      // Name of the decay function; by default its result multiplies the query score
      "gauss": {
        // Field to use
        "created_at": {
          "origin": "now",  // "now" is the default for date fields, so you can omit it
          "offset": "1d",   // Documents less than 1 day old are not penalized
          "scale": "10d",   // Distance beyond the offset at which the score equals "decay"
          "decay": 0.01     // Score multiplier at that distance
        }
      }
    }
  }
}
Pierre-Nicolas Mougel answered Oct 14 '22 03:10