I've built a complex Elasticsearch query that uses popularity to improve the ranking of social media documents. The query works well: the top results are always relevant to the query and contain interesting items.
However, it has one problem: for some queries, the first results all come from the same user.
I would like to downscore a document if a document by the same user has already been retrieved at a higher position. This way I expect the results to be more diverse.
Note that I don't want them to be removed, as in some cases it may still be interesting to find more documents of the same user, but I would like them to be in a lower position.
Can anybody suggest a way to make it work?
As suggested in the comments, here is a simplified version of my query:
query = {"function_score": {
    "functions": [
        {"gauss": {"createdAt":
            {"origin": "now", "scale": "30d", "offset": "7d", "decay": 0.9}
        }},
        {"gauss": {"shares.last.twitter_retweets_log":
            {"origin": 4.52, "scale": 2.61, "decay": 0.9}
        }}
    ],
    "query": {"bool": {"must": [
        {"exists": {"field": "images"}},
        {"multi_match": {"query": "foo boo", "fields": ["text", "link.title"]}}
    ]}},
    "score_mode": "multiply"
}};
P.S. Some documents that may be relevant, as they discuss diversity, although I'm not sure how to apply them:
You can couple the sampler with the top_hits aggregation to get diversified results.
{
    "query": {
        "match": {
            "query": "iphone"
        }
    },
    "size": 0,
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200,
                "field": "user.id"
            },
            "aggs": {
                "diversifiedMatches": {
                    "top_hits": {
                        "size": 10
                    }
                }
            }
        }
    }
}
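Note that with "size": 0 the top-level hits are empty; the diversified documents come back under the aggregation names instead. A minimal Python sketch for pulling them out, assuming the response has already been parsed into a dict and uses the aggregation names from the request above (`extract_diversified_hits` is a hypothetical helper name, not part of any client library):

```python
def extract_diversified_hits(response, sampler_name="sample",
                             top_hits_name="diversifiedMatches"):
    """Return the list of hit dicts produced by the nested top_hits aggregation."""
    aggs = response.get("aggregations", {})
    top_hits = aggs.get(sampler_name, {}).get(top_hits_name, {})
    # top_hits wraps its results in the usual hits.hits envelope
    return top_hits.get("hits", {}).get("hits", [])
```

Each returned item is a regular hit with `_id`, `_score` and `_source`, so it can be handled like an ordinary search result.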
There are some caveats to the sampler approach, e.g.:
1) Deduplication is per-shard not global
2) Choice of diversification field must be a single-value field
3) No support for pagination
4) No support for sorting on anything other than score
Addressing the above issues would be hard and would require expensive/complex co-ordination internally plus more guidance from the client about when and where "duplicate" results can be re-introduced (page 2? page 3? how many?) etc.
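If those caveats are a blocker, another option is to over-fetch with the original function_score query (say, the top 100 hits) and penalise repeated users on the client, which demotes them rather than removing them. This is only a sketch and not an Elasticsearch feature: the `user.id` field path, the `penalty` factor of 0.5 and the function name are all assumptions for illustration.

```python
from collections import defaultdict

def demote_repeated_users(hits, penalty=0.5, user_path=("user", "id")):
    """Re-rank hits so each further document by the same user scores lower.

    hits: list of Elasticsearch hit dicts, already sorted by _score descending.
    penalty: multiplier applied once per higher-ranked hit by the same user
             (0.5 is an assumed value; tune it for your data).
    user_path: path to the user id inside _source (assumed to be user.id).
    """
    seen = defaultdict(int)  # user id -> number of higher-ranked hits seen so far
    rescored = []
    for hit in hits:
        user = hit["_source"]
        for key in user_path:  # walk the nested path, e.g. _source.user.id
            user = user[key]
        adjusted = hit["_score"] * (penalty ** seen[user])
        seen[user] += 1
        rescored.append((adjusted, hit))
    # list.sort is stable, so hits with equal adjusted scores keep their order
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in rescored]
```

Because the penalty compounds, a third document by the same user is multiplied by penalty squared. Nothing is dropped, so further documents from a prolific user remain reachable lower in the list, and pagination can be done by slicing the re-ranked list.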