I need to obtain a random sample from an Elasticsearch index, i.e. to issue a query that retrieves some documents from a given index with weighted probability Wj/ΣWi (where Wj is the weight of row j and ΣWi is the sum of the weights of all documents matched by this query).
Currently, I have the following query:
GET products/_search?pretty=true
{
  "size": 5,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "term": {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}
          }
        }
      },
      "functions": [{"random_score": {}}]
    }
  },
  "sort": [{"_score": {"order": "desc"}}]
}
It returns 5 random items from the selected category. Each item has a field weight. So, I probably have to use
"script_score": {
"script": "weight = data['weight'].value / SUM; if (_score.doubleValue() > weight) {return 1;} else {return 0;}"
}
as described here.
I have the following issues:
Thanks a lot for your help!
In Elasticsearch's weighted average aggregation, by default, if the value field is missing the document is ignored and the aggregation moves on to the next document. If the weight field is missing, it is assumed to have a weight of 1 (like a normal average).
Without Elasticsearch, it might be easier to see that selecting documents involves the total number of documents and the sum of the weights assigned to them. With Elasticsearch, I was unable to find a way to assign scores that rely on a sum of weights to pick a top hit that would act as the featured product.
Just as easily, we can take the sum of the weights, which is the size of that artificial set, and select a random number in the range 0 to sum-1. Map this number back to an index within the original set and we have our featured product!
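The idea above (draw a number between 0 and sum-1, then map it back to a document) can be sketched outside Elasticsearch roughly as follows. This is only an illustration under assumptions: the weights are non-negative integers, and the name pick_weighted is hypothetical, not something from Elasticsearch or from the original post.

```python
import bisect
import itertools
import random


def pick_weighted(items, weights, rng=random):
    """Return one item, chosen with probability weight / sum(weights).

    Assumes non-negative integer weights (hypothetical helper for illustration).
    """
    if not items or len(items) != len(weights):
        raise ValueError("items and weights must be non-empty and of equal length")
    # Running totals: cumulative[i] is the sum of weights[0..i].
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("the sum of weights must be positive")
    # Random position in the "artificial set", in the range 0 .. total-1.
    r = rng.randrange(total)
    # Map the position back to an index: first running total that exceeds r.
    index = bisect.bisect_right(cumulative, r)
    return items[index]


featured = pick_weighted(["product-a", "product-b"], [1, 3])
```

With weights 1 and 3, position 0 maps to the first product and positions 1 to 3 map to the second, so the second is picked three times out of four on average. An item with weight 0 owns no positions and is never picked.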
In case it helps anyone, here is how I recently implemented a weighted shuffling.
In this example, we shuffle companies. Each company has a "company_score" between 0 and 100. With this simple weighted shuffling, a company with score 100 gets, on average, a combined score 5 times as high as a company with score 20, so it is considerably more likely to appear on the first page.
json_body = {
    "sort": ["_score"],
    "query": {
        "function_score": {
            "query": main_query,  # put your main query here (must be defined beforehand)
            "functions": [
                {
                    "random_score": {},
                },
                {
                    "field_value_factor": {
                        "field": "company_score",
                        "modifier": "none",
                        "missing": 0,
                    }
                },
            ],
            # How to combine the results of the two functions 'random_score' and 'field_value_factor'.
            # This way, on average the combined _score of a company having score 100 will be 5 times as much
            # as the combined _score of a company having score 20, and thus it will be considerably
            # more likely to appear on the first page.
            "score_mode": "multiply",
            # How to combine the result of function_score with the original _score from the query.
            # We overwrite it, as our combined _score (random x company_score) is all we need.
            "boost_mode": "replace",
        }
    },
}
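To get a feel for what "random x company_score" does to the ordering, here is a small simulation sketch. It does not talk to Elasticsearch; it assumes random_score behaves like a uniform number in [0, 1), and the function name first_place_rate is made up for this illustration. Note that it measures how often one company outranks another head to head, which is a different quantity from the 5:1 ratio of average scores.

```python
import random


def first_place_rate(score_a, score_b, trials=100_000, seed=0):
    """Estimate how often company A outranks company B when each
    combined score is uniform_random * company_score."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    wins = 0
    for _ in range(trials):
        combined_a = rng.random() * score_a
        combined_b = rng.random() * score_b
        if combined_a > combined_b:
            wins += 1
    return wins / trials


rate = first_place_rate(100, 20)
print(f"score 100 outranks score 20 in about {rate:.0%} of draws")
```

Under the uniform assumption, the head-to-head probability for scores 100 and 20 works out to 0.9, so the simulation should land close to that. If you need selection probabilities that are exactly proportional to the weights, this multiply scheme is an approximation rather than an exact weighted sample.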
In addition to other answers:
You may also consider the case where the source documents are not uniformly distributed over the feature you want to balance on. For example, you want to retrieve 100 randomly mixed news items, 50% on sports and 50% on politics, from an index with 10,000 news items on sports and 1,000,000 on politics.
In this case, you may combine a custom script_score function with random_score to transform the source distribution into the desired 50/50 distribution in the results:
GET objects/_search
{
  "size": 100,
  "sort": [
    "_score"
  ],
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "random_score": {}
        },
        {
          "script_score": {
            "script": {
              "source": """
                double boost = 0.0;
                // Keep roughly 50 of the 1,000,000 politics documents.
                if (params._source['labels'] != null && params._source['labels']['topics'] != null && params._source['labels']['topics'].contains('politics') && Math.random()*1000000 <= 50) {
                  boost += 1.0;
                }
                // Keep roughly 50 of the 10,000 sports documents.
                if (params._source['labels'] != null && params._source['labels']['topics'] != null && params._source['labels']['topics'].contains('sports') && Math.random()*10000 <= 50) {
                  boost += 1.0;
                }
                return boost;
              """
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}
Note that the source documents in the above example are nested as shown below:
{
"title": "...",
"body": "...",
"labels": {
"genres": ["news"],
"topics": ["sports", "celebrities"]
}
}
but you might have a simpler data model with plain fields; in this case just use doc['topic'].contains('sports') instead of params._source[].
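The constants in the script (50 out of 1,000,000 and 50 out of 10,000) are just "wanted count divided by category size". A minimal sketch of that arithmetic, in case your category sizes differ; the helper name acceptance_probability is hypothetical and the category counts are the ones from the example above:

```python
def acceptance_probability(wanted, available):
    """Chance with which each document of a category should pass the
    random filter so that about `wanted` of `available` documents survive."""
    if available <= 0:
        raise ValueError("available must be positive")
    # Cap at 1.0: you cannot keep more documents than the category holds.
    return min(1.0, wanted / available)


category_sizes = {"politics": 1_000_000, "sports": 10_000}
wanted_per_category = 50  # 50% of the 100 requested hits

for name, size in category_sizes.items():
    p = acceptance_probability(wanted_per_category, size)
    print(f"{name}: keep each document with probability {p}")
```

Since every document is accepted independently, the number kept per category only averages 50; a single query may return a somewhat different mix.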