
Diversified results on Elasticsearch search

I've built a complex query that uses popularity to improve the ranking of social media documents in Elasticsearch. The query works well: the top results are always relevant to the query and contain interesting items.

However, it has a problem: for some queries, the first results are all from the same user.

I would like to downscore a document if the same user already appears in a higher-ranked document. This way I expect to get more diverse results.

Note that I don't want those documents removed, since in some cases it may still be interesting to find more documents from the same user, but I would like them to appear in a lower position.

Can anybody suggest a way to make it work?


As suggested in some comments, here is a simplified version of my query:

query = {"function_score": {
  "functions": [
    {"gauss": {"createdAt":
        {"origin": "now", "scale": "30d", "offset": "7d", "decay": 0.9}
    }},
    {"gauss": {"shares.last.twitter_retweets_log":
        {"origin": 4.52, "scale": 2.61, "decay": 0.9}
    }}
  ],
  "query": {"bool": {"must": [
    {"exists": {"field": "images"}},
    {"multi_match": {"query": "foo boo", "fields": ["text", "link.title"]}}
  ]}},
  "score_mode": "multiply"
}};
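
To make the scoring above easier to reason about, here is a minimal Python sketch of the Gaussian decay as described in the Elasticsearch function_score documentation, and of how "score_mode": "multiply" combines the two functions. The helper names gauss_decay and combined_factor are hypothetical, and the date field is simplified to "age in days since now"; this is an illustration of the formula, not Elasticsearch's implementation.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset` of `origin`, `decay` at offset + scale."""
    # Only the distance beyond the offset is penalized.
    distance = max(0.0, abs(value - origin) - offset)
    # sigma^2 is chosen so that the result equals `decay` when distance == scale.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


def combined_factor(age_days, retweets_log):
    """Product of both decays, mirroring "score_mode": "multiply" in the query."""
    # Assumption: createdAt is expressed as age in days, so origin "now" becomes 0.
    recency = gauss_decay(age_days, origin=0.0, scale=30.0, offset=7.0, decay=0.9)
    popularity = gauss_decay(retweets_log, origin=4.52, scale=2.61, decay=0.9)
    return recency * popularity
```

With the default boost_mode, this factor is then multiplied with the text relevance score. Note that nothing in it depends on which user wrote the document, which is why several documents from one user can end up adjacent at the top.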

P.S.: some documents that may be relevant, as they talk about diversity, but I'm not sure how to apply them:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-sampler-aggregation.html?q=sampler
  • https://lucene.apache.org/core/5_1_0/misc/org/apache/lucene/search/DiversifiedTopDocsCollector.html
David Mabodo asked Dec 11 '15 10:12


1 Answer

You can couple the sampler with the top_hits aggregation to get diversified results.

{
    "query": {
        "match": {
            "query": "iphone"
        }
    },
    "size":0,
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200,
                "field" : "user.id"                
            },
            "aggs": {
                "diversifiedMatches": {
                    "top_hits": {
                        "size":10
                    }
                }
            }
        }
    }
}
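
Because the request above sets "size": 0, the diversified documents come back inside the aggregation rather than in the top-level hits. A minimal Python sketch of building the same body and pulling the hits out of a response follows. The function names are hypothetical, the search_field default ("text") is borrowed from the question and is an assumption, and the response is treated as a plain dict so no particular client library is assumed. As far as I know, later Elasticsearch releases moved the "field" option to a separate diversified_sampler aggregation, so the aggregation name may need adjusting for your version.

```python
def build_diversified_request(text, search_field="text", user_field="user.id",
                              shard_size=200, size=10):
    """Build the sampler + top_hits request body shown above."""
    return {
        "query": {"match": {search_field: text}},
        "size": 0,  # regular hits are suppressed; results come from the aggregation
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size, "field": user_field},
                "aggs": {"diversifiedMatches": {"top_hits": {"size": size}}},
            }
        },
    }


def extract_diversified_hits(response):
    """Return the top_hits documents nested under the "sample" aggregation."""
    sample = response.get("aggregations", {}).get("sample", {})
    return sample.get("diversifiedMatches", {}).get("hits", {}).get("hits", [])
```

The body returned by build_diversified_request can be sent with whichever client you already use; extract_diversified_hits only assumes the usual nesting of a top_hits result under its parent aggregation.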

There are some caveats, e.g.:

1) Deduplication is per shard, not global.

2) The diversification field must be a single-value field.

3) There is no support for pagination.

4) There is no support for sorting on anything other than score.

Addressing the above issues would be hard and would require expensive/complex co-ordination internally plus more guidance from the client about when and where "duplicate" results can be re-introduced (page 2? page 3? how many?) etc.
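
Given these caveats, a different technique (not the sampler approach above) is to fetch more hits than needed with the normal query and re-rank them on the client, multiplying each score by a penalty for every higher-ranked document from the same user. The sketch below is one possible way to do that in Python; the function name, the penalty value of 0.5, and the assumption that the user id lives at _source.user.id are all hypothetical.

```python
from collections import defaultdict


def rerank_with_user_penalty(hits, penalty=0.5,
                             user_of=lambda hit: hit["_source"]["user"]["id"]):
    """Downscore, but keep, hits whose user already appeared higher in the list.

    `hits` must already be sorted by descending "_score", as Elasticsearch returns them.
    """
    seen = defaultdict(int)  # how many higher-ranked hits each user already has
    adjusted = []
    for hit in hits:
        user = user_of(hit)
        # Each earlier hit by the same user multiplies the score by `penalty` once more.
        adjusted.append((hit["_score"] * penalty ** seen[user], hit))
        seen[user] += 1
    # sorted() is stable, so ties keep their original relative order.
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in adjusted]
```

Nothing is removed, so later documents from the same user remain available further down the list. Pagination then has to be done over the re-ranked list on the client, which means the over-fetch size limits how deep you can page.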

MarkH answered Sep 18 '22 13:09