
Weighted random sampling in Elasticsearch

I need to obtain a random sample from an Elasticsearch index, i.e. to issue a query that retrieves documents from a given index with weighted probability Wj/ΣWi (where Wj is the weight of document j and ΣWi is the sum of the weights of all documents matching the query).
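To make the target distribution concrete, here is a minimal sketch in plain Python, outside Elasticsearch. The document ids and weights are made up for illustration, and the draws are independent (with replacement), which is the simplest reading of Wj/ΣWi.

```python
import random
from collections import Counter

def sample_weighted(weights, draws, seed=0):
    """Draw `draws` ids independently, each with probability w_j / sum(w_i)."""
    rng = random.Random(seed)
    ids = list(weights)
    picks = rng.choices(ids, weights=[weights[i] for i in ids], k=draws)
    return Counter(picks)

# Hypothetical documents and weights, for illustration only.
weights = {"a": 1, "b": 3, "c": 6}
total = sum(weights.values())
draws = 10_000
counts = sample_weighted(weights, draws)

for doc_id, w in weights.items():
    print(doc_id, "target", w / total, "observed", counts[doc_id] / draws)
```

The observed frequencies should land close to 0.1, 0.3 and 0.6, which is the behaviour the query is meant to reproduce.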

Currently, I have the following query:

GET products/_search?pretty=true

{"size":5,
  "query": {
    "function_score": {
      "query": {
        "bool":{
          "must": {
            "term":
              {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}
          }
        }
      },
      "functions":
        [{"random_score":{}}]
    }
  },
  "sort": [{"_score":{"order":"desc"}}]
}

It returns 5 random items from the selected category. Each item has a weight field, so I probably have to use

"script_score": {
  "script": "weight = data['weight'].value / SUM; if (_score.doubleValue() > weight) {return 1;} else {return 0;}"
}

as described here.

I have the following issues:

  • What is the correct way to do this?
  • Do I need to enable dynamic scripting?
  • How do I calculate the sum of the weights of all documents matching the query?

Thanks a lot for your help!

dpaluy asked Dec 07 '15 07:12




2 Answers

In case it helps anyone, here is how I recently implemented a weighted shuffling.

In this example, we shuffle companies. Each company has a "company_score" between 0 and 100. With this simple weighted shuffling, a company with score 100 gets, on average, a combined score 5 times higher than a company with score 20, so it is much more likely to appear on the first page.

json_body = {
    "sort": ["_score"],
    "query": {
        "function_score": {
            "query": main_query,  # put your main query here
            "functions": [
                {
                    "random_score": {},
                },
                {
                    "field_value_factor": {
                        "field": "company_score",
                        "modifier": "none",
                        "missing": 0,
                    }
                }
            ],
            # How to combine the results of the two functions 'random_score' and 'field_value_factor'.
            # With "multiply", the combined _score of a company with score 100 is on average 5 times
            # that of a company with score 20, so it is much more likely to appear on the first page.
            "score_mode": "multiply",
            # How to combine the result of function_score with the original _score from the query.
            # We overwrite it as our combined _score (random x company_score) is all we need.
            "boost_mode": "replace",
        }
    }
}
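To get a feel for how this ranking behaves, here is a small simulation in plain Python. It is only a sketch: it assumes random_score yields a uniform value in [0, 1), reduces the problem to two companies with scores 100 and 20, and counts how often each one ranks first. For comparison it also tries a different keying, random ** (1 / weight) (the Efraimidis-Spirakis scheme, which is not part of the query above), because that one makes the first-place probability proportional to the weight. The function and variable names are illustrative.

```python
import random

def first_place_rates(weights, key, trials=20_000, seed=0):
    """Estimate how often each item gets the highest key over many shuffles."""
    rng = random.Random(seed)
    wins = [0] * len(weights)
    for _ in range(trials):
        # One uniform draw per item, turned into a sort key by `key`.
        keys = [key(rng.random(), w) for w in weights]
        wins[keys.index(max(keys))] += 1
    return [n / trials for n in wins]

weights = [100, 20]  # company_score values from the example

# Same combination as score_mode "multiply": random * weight.
multiply = first_place_rates(weights, lambda u, w: u * w)
# Alternative keying, shown only for comparison: random ** (1 / weight).
power = first_place_rates(weights, lambda u, w: u ** (1.0 / w))

print("random * weight:       ", multiply)
print("random ** (1 / weight):", power)
```

In this two-company case the multiply keying puts the score-100 company first in roughly 90% of the trials, while the power keying gives roughly 100/120, about 83%. So the multiplied score favours high weights more strongly than a strict weight-proportional draw would, which may or may not be what you want.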
Vermeer Grange answered Oct 16 '22 07:10


In addition to other answers:

You may also want to handle the case where the source documents are unevenly distributed across the feature you want to balance on. For example, you want to retrieve 100 randomly mixed news items, 50% on sports and 50% on politics, from an index that holds 10,000 sports items and 1,000,000 politics items.

In this case, you can combine a custom script_score function with random_score to transform the source distribution into the desired 50/50 distribution in the results:

GET objects/_search
{
  "size": 100,
  "sort": [
    "_score"
  ],
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "random_score": {}
        },
        {
          "script_score": {
            "script": {
              "source": """
                double boost = 0.0;
                if (params._source['labels'] != null && params._source['labels']['topics'] != null && params._source['labels']['topics'].contains('politics') && Math.random()*1000000 <= 50) {
                  boost += 1.0;
                }
                if (params._source['labels'] != null && params._source['labels']['topics'] != null && params._source['labels']['topics'].contains('sports') && Math.random()*10000 <= 50) {
                  boost += 1.0;
                }
                return boost;
              """
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}

Note that the source documents in the above example are nested as follows:

{
  "title": "...",
  "body": "...",
  "labels": {
    "genres": ["news"],
    "topics": ["sports", "celebrities"]
  }
}

but you might have a simpler data model with plain fields; in this case just use doc['topic'].contains('sports') instead of params._source[].
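The thresholds in the script can be checked with a quick simulation in plain Python. This is a sketch under the answer's own numbers (10,000 sports items, 1,000,000 politics items, a target of 50 each). The function name and structure are illustrative and do not call Elasticsearch.

```python
import random

def simulate_boosted_counts(class_sizes, target_per_class, seed=0):
    """Count, per class, how many documents pass `random() * size <= target`."""
    rng = random.Random(seed)
    hits = {}
    for label, size in class_sizes.items():
        # Each document is boosted with probability of about target / size,
        # mirroring the `Math.random() * N <= 50` checks in the script.
        hits[label] = sum(
            1 for _ in range(size) if rng.random() * size <= target_per_class
        )
    return hits

sizes = {"sports": 10_000, "politics": 1_000_000}
hits = simulate_boosted_counts(sizes, target_per_class=50)
print(hits)
```

Each class ends up with about 50 boosted documents, but the exact counts vary from run to run, so the 50/50 split holds only on average. If fewer than 100 documents are boosted in total, the rest of the page is presumably filled with zero-score documents.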

denpost answered Oct 16 '22 08:10