Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scoring documents by both textual match and distance to a point

I have an ElasticSearch index with a list of "shops".

I'd like to allow customers to search these shops by both geo_distance (so, search for a point and get shops near that location), and textual match, like matches on shop name / address.

I'd like to get results that match either of these two criteria, and I'd like the order of these results to be a combination of both. The stronger the textual match, and the closer to the point searched, the higher the result. (Obviously, there's going to be a formula to combine these two, that'll need tweaking, not too worried about that part yet).

My issue / what I've tried:

  • geo_distance is a filter, not a query, so I can't combine both on the query part of the request.

  • I can use a bool => should filter (rather than query) that matches on either name or location. This gives me the results I want, but not in order.

  • I can also have _geo_distance as part of a sort clause so that documents closer to the point rank higher.

What I haven't figured out is how I would take the "regular" _score that ElasticSearch gives to documents when doing textual matches, and combine that with the geo_distance score.

By having the textual match in the filter, it doesn't seem to affect the score of documents (which makes sense). And I don't see how I could combine the textual match in the query part and a geo_distance filter so it's an OR rather than an AND.

I guess my best bet would be the equivalent of this:

{
  function_score: {
    query: {  ... },
    functions: [
      { geo_distance function },
      { multi_match_result score },
    ],
    score_mode: 'multiply'
  }
}

but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.

Any pointers will be greatly appreciated.

I'm working with ElasticSearch v1.4, but I can upgrade if necessary.

like image 977
Daniel Magliola Avatar asked Jun 24 '16 14:06

Daniel Magliola


1 Answers

but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.

You can't really do it in the way that you're asking, but you can do what you want just as easily. For the simpler case, you get scoring just by using a normal query.

The problem with filters is that they're yes/no questions, so if you use them in a function_score, then it either boosts the score or it doesn't. What you probably want is degradation of the score as the distance from the origin grows. It's the yes/no nature that stops them from impacting the score at all. There's no improvement to relevancy implied by matching a filter -- it just means that it's part of the answer, but it doesn't make sense to say that it should be closer to the top/bottom as a result.

This is where the Decay function score helps. It works with numbers, dates, and -- most helpfully here -- geo_points. In addition to the types of data it accepts, it can decay using either gaussian, exponential, or linear decay functions. The one that you want to choose is honestly arbitrary and you should give the one that chooses the best "experience". I would suggest to start with gauss.

"function_score": {
  "functions": [
    "gauss": {
      "my_geo_point_field": {
        "origin": "0, 1",
        "scale": "5km",
        "offset": "500m",
        "decay": 0.5
      }
    }
  ]
}

Note that origin is in x, y format (due to standard GeoJSON), which is longitude, latitude.

Decay

Each one of the values impacts how the score decays based on the graph (taken wholesale from the documentation). If you would use an offset of 0, then the score begins to drop once it's not exactly at the origin. With the offset, it allows it some buffer to be considered just as good.

The scale is directly associated with the decay in that the score will be chopped down by the decay value once it is scale-distance away from the origin (+/- the offset). In my above example, anything 5km from the origin would get half of the score as anything at the origin.

Again, just note that the different types of decay functions change the shape of scoring.

I'd like the order of these results to be a combination of both.

This is the purpose of the bool / should compound query. You get OR behavior with score improvement based on each match. Combining this with the above, you'd want something like:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": { ... }
        },
        {
          "function_score": {
            "functions": [
              "gauss": {
                "my_geo_point_field": {
                  "origin": "0, 1",
                  "scale": "5km",
                  "offset": "500m",
                  "decay": 0.5
                }
              }
            ]
          }
        }
      ]
    }
  }
}

NOTE: If you add a must, then the should behavior changes from literal OR-like behavior (at least 1 must match) to completely optional behavior (none must match).

I'm working with ElasticSearch v1.4, but I can upgrade if necessary.

Starting with Elasticsearch 2.0, every filter is a query and every query is also a filter. The only difference is the context that it's used in. This doesn't change my answer here, but it's something that may help you in the future in addition to what I say next.

Geo-related performance increased dramatically in ES 2.2+. You should upgrade (and recreate your geo-related indices) to take advantage of those changes. ES 5.0 will have similar benefits!

like image 99
pickypg Avatar answered Sep 28 '22 19:09

pickypg