
Sort keyword field array within ElasticSearch document by relevance

I've got an ElasticSearch index that looks something like this:

{
    "mappings": {
        "article": {
            "properties": {
                "title": { "type": "text" },
                "tags": {
                    "type": "keyword"
                }
            }
        }
    }
}

And data that looks something like this:

{ "title": "Something about Dogs", "tags": ["articles", "dogs"] },
{ "title": "Something about Cats", "tags": ["articles", "cats"] },
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }

If I search for dog, I get the first and third documents, as I'd expect. And I can weight the search documents the way I like (in reality, I'm using a function_score query to weight on a bunch of fields irrelevant to this question).

What I'd like to do is sort the tags field so that the most relevant tags are returned first, without affecting the sort order of the documents themselves. So I'm hoping for a result like this:

{ "title": "Something about Dog Food", "tags": ["dogs", "dogfood", "articles"] }

Instead of what I get now:

{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }

The documentation on sort and function score doesn't cover my case. Any help appreciated. Thanks!

Joe Mastey asked Oct 24 '17 20:10


2 Answers

You cannot sort the _source of a document (your array of tags) by how well each value matches the query. One way around this is to use a nested field together with inner_hits, which lets you sort the matching nested objects.

My suggestion is to turn your tags into a nested field (I chose keyword here for simplicity, but you can also use text with the analyzer of your choice):

PUT test
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "text"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "value": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
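
With this mapping, each tag has to be indexed as an object with a value key rather than as a bare string. A minimal sketch in Python of reshaping an existing document before reindexing it; the helper name to_nested_tags is hypothetical, and the key name value simply follows the mapping above:

```python
def to_nested_tags(doc):
    """Return a copy of doc whose flat tag strings are wrapped as {"value": tag} objects."""
    reshaped = dict(doc)
    reshaped["tags"] = [{"value": tag} for tag in doc.get("tags", [])]
    return reshaped


doc = {"title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"]}
print(to_nested_tags(doc))
```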

And use this kind of query:

GET test/_search
{
  "_source": {
    "exclude": "tags"
  },
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "dogs"
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "bool": {
                "should": [
                  {
                    "match_all": {}
                  },
                  {
                    "match": {
                      "tags.value": "dogs"
                    }
                  }
                ]
              }
            },
            "inner_hits": {
              "sort": {
                "_score": "desc"
              }
            }
          }
        }
      ]
    }
  }
}

Here you match the nested tags.value field against the same text you match on title. The match_all clause makes sure every tag is returned, while tags that also match the search text score higher. The inner_hits sort then orders the nested values by their own score.
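
The sorted tags come back under inner_hits in each hit, not in _source. A rough sketch in Python of pulling them out; the helper tags_by_relevance and the trimmed sample hit (including its scores) are made up for illustration, and it assumes each inner hit's _source is the nested object itself, which may differ between Elasticsearch versions:

```python
def tags_by_relevance(hit):
    """Collect tag values from a hit's inner_hits, which arrive already sorted by score."""
    inner = hit["inner_hits"]["tags"]["hits"]["hits"]
    return [inner_hit["_source"]["value"] for inner_hit in inner]


# Hypothetical, trimmed search hit; the scores are invented for illustration.
sample_hit = {
    "_source": {"title": "Something about Dog Food"},
    "inner_hits": {
        "tags": {
            "hits": {
                "hits": [
                    {"_score": 1.9, "_source": {"value": "dogs"}},
                    {"_score": 1.0, "_source": {"value": "articles"}},
                    {"_score": 1.0, "_source": {"value": "dogfood"}},
                ]
            }
        }
    },
}
print(tags_by_relevance(sample_hit))
```

Note that inner_hits returns only the top three nested hits by default, so documents with more tags would need a larger size inside inner_hits.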

@Val's suggestion is very good, as long as plain substring matching (i1.indexOf(params.search)) is enough to decide which tags are "relevant". The biggest advantage of his solution is that you don't have to change the mapping.

The big advantage of my solution is that it uses Elasticsearch's actual search capabilities to determine the "relevant" tags. The drawback is that you need a nested field instead of a simple keyword field.

Andrei Stefan answered Oct 30 '22 09:10


What you get from a search call are the source documents. The documents in the response are returned in exactly the same form as when you indexed them, which means that if you indexed ["articles", "dogs", "dogfood"], you'll always get that array in that unaltered form.

One way to get around this is to declare a script_field that applies a small script to sort your array and return the result of that sort.

The script simply moves the tags that contain the search term to the front of the list:

{
    "_source": ["title"],
    "query" : {
        "match_all": {}
    },
    "script_fields" : {
        "sorted_tags" : {
            "script" : {
                "lang": "painless",
                "source": "return params._source.tags.stream().sorted((i1, i2) -> i1.indexOf(params.search) > -1 ? -1 : 1).collect(Collectors.toList())",
                "params" : {
                    "search": "dog"
                }
            }
        }
    }
}

This returns something like the following. As you can see, the sorted_tags array lists the tags that contain the search term first.

{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "tests",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "Something about Dog Food"
        },
        "fields": {
          "sorted_tags": [
            "dogfood",
            "dogs",
            "articles"
          ]
        }
      }
    ]
  }
}
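
The same substring-based reordering can also be done client-side once the search returns. A minimal sketch in Python, separate from the Painless script above; tags_matching_first is a hypothetical helper. Unlike the script's comparator, it keeps the indexed order among tags that match, so its result can differ from the sorted_tags shown above:

```python
def tags_matching_first(tags, search):
    """Move tags containing the search term to the front, keeping the original order otherwise."""
    # False sorts before True, so tags that contain the term come first;
    # sorted() is stable, so ties keep their indexed order.
    return sorted(tags, key=lambda tag: search not in tag)


print(tags_matching_first(["articles", "dogs", "dogfood"], "dog"))
```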

Val answered Oct 30 '22 11:10