
Elasticsearch random selection based on weighting out of 100

I have been running a Rails site for a couple of years and some articles are being pulled from the DB based on a weight field. The data structure is:

{name: 'Content Piece 1', weight: 50}
{name: 'Content Piece 2', weight: 25}
{name: 'Content Piece 3', weight: 25}

The Ruby code that I originally wrote looks like so:

choices = []
# Sum all the weights (100 for this data set)
sum = articles.inject(0.0) { |total, article|
  total + article['weight']
}
pick = rand(sum)
# Walk the articles, subtracting each weight from the pick until it
# falls inside an article's slice of the total
choices << articles.detect { |listing|
  if pick <= listing['weight']
    true
  else
    pick -= listing['weight']
    false
  end
}

This works well at pulling out each content piece while respecting the weights. After running this code 100 times across the data set (and repeating that experiment a few times), I get the content pieces distributed fairly well according to the weights:

counts = Hash.new(0)
100.times do
  sum = articles.inject(0.0) { |total, article|
    total + article['weight']
  }
  pick = rand(sum)
  choice = articles.detect { |listing|
    if pick <= listing['weight']
      true
    else
      pick -= listing['weight']
      false
    end
  }
  counts[choice['name']] += 1
end

{:total_runs=>100, "Content Piece 1"=>51, "Content Piece 2"=>22, "Content Piece 3"=>27}
{:total_runs=>100, "Content Piece 1"=>53, "Content Piece 2"=>30, "Content Piece 3"=>17}

I am starting to use Elasticsearch more frequently, and I was hoping I could index the data in ES and pull the content out based on the weights.

I found a SO post discussing something very similar, which can be found here:

Weighted random sampling in Elasticsearch

I have pulled the search query across and changed it to match my data structure:

{
  "sort": ["_score"],
  "size": 1, 
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {}
        },
        {
          "field_value_factor": {
            "field": "weight",
            "modifier": "none",
            "missing": 0
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}

This query definitely respects the weighting and pulls out the content piece with weight 50 a lot more often than the other two pieces with weight 25, but it doesn't distribute the content in proportion to the weights out of a total of 100, if that makes sense. I run this query 100 times and get results like so:

{:total_runs=>100, "Content Piece 1"=>70, "Content Piece 2"=>22, "Content Piece 3"=>8}
{:total_runs=>100, "Content Piece 1"=>81, "Content Piece 2"=>7, "Content Piece 3"=>12}
{:total_runs=>100, "Content Piece 1"=>90, "Content Piece 2"=>3, "Content Piece 3"=>7}

As I am new to ES and still learning the ins and outs of querying, scoring, etc., I was wondering if anyone could help with a solution that more closely mimics my Ruby code and distributes the content based on the weights out of 100. Would Painless scripting work for porting the Ruby code?

I hope this makes sense; let me know if you have any questions and I can explain further what I am trying to achieve. Thanks!

Asked Jul 15 '19 by Arthur


1 Answer

Your Elasticsearch query is correct and you don't need scripts to do what you want; it is just a matter of probabilities. As a short answer: use 40 instead of 50 as the multiplier (i.e., the field_value_factor) for the weight-50 document and 30 instead of 25 for the weight-25 documents, and you will get the expected distribution.

Basically, the problem is that multiplying a random value by a weight does not produce a distribution in which the weight is the probability of being selected. The multiplier can be derived from the weight, but they are not the same thing.

I can illustrate with your case. For the weight of 50, if the random value is above 0.5, that document will necessarily have the highest score (0.5 * 50 >= 1 * 25). Since the random value is above 0.5 half of the time, you know for sure that the item with weight 50 will be returned at least 50% of the time.

But even when the random value for the weight-50 document is below 0.5, that document can still be selected; in that case its probability of being selected is 1/3.

I'm actually a bit surprised by your results, because the probability for the weight-50 document should be closer to 66% (i.e., 50% + 50%/3) and the other two should each be around 16.7%. Maybe increase the number of runs to be sure.
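
For what it's worth, the effect is easy to reproduce outside Elasticsearch. The following Ruby snippet (an illustration, not code from the question) imitates the random_score * field_value_factor scoring by giving every document a score of rand * multiplier and keeping the highest one; with multipliers 50/25/25 it settles around 66%/17%/17%, while the suggested 40/30/30 settles around the desired 50%/25%/25%:

# Imitates "random_score * field_value_factor" with score_mode multiply:
# every document scores rand * multiplier and the highest score wins.
def simulate(multipliers, runs = 100_000)
  counts = Hash.new(0)
  runs.times do
    winner = multipliers.max_by { |_name, multiplier| rand * multiplier }
    counts[winner.first] += 1
  end
  counts.transform_values { |count| (100.0 * count / runs).round(1) }
end

puts simulate({ 'Content Piece 1' => 50, 'Content Piece 2' => 25, 'Content Piece 3' => 25 }).inspect
# => roughly {"Content Piece 1"=>66.7, "Content Piece 2"=>16.7, "Content Piece 3"=>16.7}

puts simulate({ 'Content Piece 1' => 40, 'Content Piece 2' => 30, 'Content Piece 3' => 30 }).inspect
# => roughly {"Content Piece 1"=>50.0, "Content Piece 2"=>25.0, "Content Piece 3"=>25.0}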

Solution for any weight using script_score

With this solution you do not need to compute a multiplier, but you must store a range, e.g., min_value and max_value, on each document. max_value is the sum of min_value and the document's weight, and min_value is the cumulative sum of the weights of the previous documents (a small sketch of this bookkeeping follows the list below).

If you have, for example, 4 documents with weights 5, 15, 30, 50, then the ranges could be:

  • Documents with weight 5: min_value = 0, max_value = 5
  • Documents with weight 15: min_value = 5, max_value = 5+15 = 20
  • Documents with weight 30: min_value = 20, max_value = 20+30 = 50
  • Documents with weight 50: min_value = 50, max_value = 50+50 = 100
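
If the documents are prepared in Ruby before indexing, the ranges are just a running total over the weights. A minimal sketch (the min_value/max_value field names are the ones used in this answer; the data is hard-coded for illustration):

articles = [
  { 'name' => 'Content Piece 1', 'weight' => 50 },
  { 'name' => 'Content Piece 2', 'weight' => 25 },
  { 'name' => 'Content Piece 3', 'weight' => 25 }
]

# Assign each document the half-open range [min_value, max_value)
# by accumulating the weights in order.
running_total = 0
articles.each do |article|
  article['min_value'] = running_total
  running_total += article['weight']
  article['max_value'] = running_total
end

# running_total is now the sum of all weights (100 here); the random
# value sent with each query must be drawn from [0, running_total).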

The corresponding Elasticsearch query is:

{
  "sort": ["_score"],
  "size": 1, 
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
               "script" : {
                    "params": {
                        "random": <RANDOM_VALUE>,
                    },
                    "source": "params.random >= doc['min_value'].value && params.random < doc['max_value'].value ? 1 : 0"
                }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}

The random parameter in the query should be computed for each request and must be between 0 and the sum of all your weights (in your case 100, but it does not have to be).
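
As a sketch of what a single request could look like in Ruby (the elasticsearch-ruby client and the articles index name are assumptions, not something this answer prescribes), the random value is drawn fresh each time and passed in as the script parameter:

require 'elasticsearch'  # assumed client library

client = Elasticsearch::Client.new(url: 'http://localhost:9200')  # placeholder URL
total_weight = 100  # must equal the sum of all document weights

# Draw a fresh random value in [0, total_weight) for every request.
random_value = rand * total_weight

query = {
  sort: ['_score'],
  size: 1,
  query: {
    function_score: {
      functions: [
        {
          script_score: {
            script: {
              params: { random: random_value },
              source: "params.random >= doc['min_value'].value && params.random < doc['max_value'].value ? 1 : 0"
            }
          }
        }
      ],
      score_mode: 'multiply',
      boost_mode: 'replace'
    }
  }
}

response = client.search(index: 'articles', body: query)
puts response['hits']['hits'].first['_source']['name']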

The problem with this approach is that you will have to update the ranges of all documents whenever you change a weight, because the cumulative sums will have changed. If you have at most 20 documents and you do not update the weights often, this should not be a problem.

Answered Nov 17 '22 by Pierre-Nicolas Mougel