Weighted random sampling in Elasticsearch

Tags: random, elasticsearch, weighted, random-sample

I need to obtain a random sample from an Elasticsearch index, i.e. to issue a query that retrieves some documents from a given index with weighted probability Wj/ΣWi (where Wj is the weight of row j and ΣWi is the sum of the weights of all documents matching the query).

Currently, I have the following query:

GET products/_search?pretty=true

{"size":5,
  "query": {
    "function_score": {
      "query": {
        "bool":{
          "must": {
            "term":
              {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}
          }
        }
      },
      "functions":
        [{"random_score":{}}]
    }
  },
  "sort": [{"_score":{"order":"desc"}}]
}

It returns 5 random items from the selected category. Each item has a field named weight, so I probably have to use

"script_score": {
  "script": "weight = data['weight'].value / SUM; if (_score.doubleValue() > weight) {return 1;} else {return 0;}"
}

as described here.

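I have the following questions:

What is the correct way to do this?
Do I need to enable Dynamic Scripting (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-scripting.html#enable-dynamic-scripting)?
How do I calculate the total sum of the weights for the query?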

Thanks a lot for your help!


asked Dec 07 '15 07:12

dpaluy

2 Answers

In case it helps anyone, here is how I recently implemented weighted shuffling.

In this example, we shuffle companies. Each company has a "company_score" between 0 and 100. With this simple weighted shuffling, a company with score 100 gets, on average, a combined score 5 times as high as a company with score 20, so it is considerably more likely to appear on the first page.

json_body = {
    "sort": ["_score"],
    "query": {
        "function_score": {
            "query": main_query,  # put your main query here
            "functions": [
                {
                    "random_score": {},
                },
                {
                    "field_value_factor": {
                        "field": "company_score",
                        "modifier": "none",
                        "missing": 0,
                    }
                }
            ],
            # How to combine the results of the two functions 'random_score' and 'field_value_factor'.
            # This way, on average the combined _score of a company with score 100 will be 5 times as high
            # as the combined _score of a company with score 20, so it will be considerably more likely
            # to appear on the first page.
            "score_mode": "multiply",
            # How to combine the result of function_score with the original _score from the query.
            # We overwrite it as our combined _score (random x company_score) is all we need.
            "boost_mode": "replace",
        }
    }
}
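
To get a feel for how strongly the weights skew the ordering, the scoring can be simulated locally. The sketch below is not part of the original answer: the function names are hypothetical, and plain uniform draws stand in for random_score. It compares the "random times weight" key used above with the Efraimidis-Spirakis key u ** (1 / w), a different technique whose first pick is drawn with probability Wj/ΣWi. For two items with weights 100 and 20, the multiply key should put the heavier item first about 90% of the time, while the power key should do so about 83% of the time (100/120).

```python
import random


def top_rate(weights, key, trials=20000, seed=0):
    """Estimate how often each item ranks first when sorted by key(u, w)."""
    rng = random.Random(seed)
    wins = [0] * len(weights)
    for _ in range(trials):
        # One uniform draw per item stands in for random_score.
        scores = [key(rng.random(), w) for w in weights]
        wins[scores.index(max(scores))] += 1
    return [n / trials for n in wins]


def multiply_key(u, w):
    # Mirrors score_mode "multiply": random value times the weight.
    return u * w


def power_key(u, w):
    # Efraimidis-Spirakis key: the item with the largest u ** (1 / w)
    # is drawn with probability w / sum(weights). Requires w > 0.
    return u ** (1.0 / w)


weights = [100, 20]
multiply_rates = top_rate(weights, multiply_key)
power_rates = top_rate(weights, power_key)
print("multiply:", multiply_rates)
print("power:   ", power_rates)
```

If exact Wj/ΣWi proportions matter, the power key may be worth porting to a script_score; note that it does not need the total sum of weights at all. If "heavier items usually come first" is enough, the multiply approach is simpler.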


answered Oct 16 '22 07:10

Vermeer Grange

In addition to the other answers:

You may also need to handle the case where the source documents are unevenly distributed across the feature you want to balance on. For example, suppose you want to retrieve 100 randomly mixed news items, 50% on sports and 50% on politics, from an index that holds 10,000 sports items and 1,000,000 politics items.

In this case, you can combine a custom script_score function with random_score to transform the source distribution into the desired 50/50 distribution in the results:

GET objects/_search
{
  "size": 100,
  "sort": [
    "_score"
  ],
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "random_score": {}
        },
        {
          "script_score": {
            "script": {
              "source": """
                double boost = 0.0;
                // The sample document below keeps 'sports' under labels.topics.
                def labels = params._source['labels'];
                def topics = labels != null ? labels['topics'] : null;
                // Keep about 50 of the 1,000,000 politics documents.
                if (topics != null && topics.contains('politics') && Math.random() * 1000000 <= 50) {
                  boost += 1.0;
                }
                // Keep about 50 of the 10,000 sports documents.
                if (topics != null && topics.contains('sports') && Math.random() * 10000 <= 50) {
                  boost += 1.0;
                }
                return boost;
              """
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}

Note that the source documents in the above example are nested, like this:

{
  "title": "...",
  "body": "...",
  "labels": {
    "genres": ["news"],
    "topics": ["sports", "celebrities"]
  }
}

You might have a simpler data model with plain fields; in that case, just use doc['topic'].contains('sports') instead of the params._source lookups.
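
The constants in the script follow from a simple rule: each document in a class is kept with probability (target count / class size), which is what Math.random() * 10000 <= 50 expresses for the sports class. Here is a minimal local sketch of that arithmetic, using the figures from the example above; the helper names are hypothetical and nothing here talks to Elasticsearch.

```python
import random


def keep_probability(target_count, class_size):
    """Chance of keeping one document so that about target_count survive."""
    return min(1.0, target_count / class_size)


def simulate_kept(class_sizes, target_count, seed=0):
    """Count, per class, the documents that would receive boost 1.0."""
    rng = random.Random(seed)
    kept = {}
    for label, size in class_sizes.items():
        p = keep_probability(target_count, size)
        # Each document is kept independently, as in the script.
        kept[label] = sum(1 for _ in range(size) if rng.random() < p)
    return kept


class_sizes = {"sports": 10_000, "politics": 1_000_000}
kept = simulate_kept(class_sizes, target_count=50)
print(kept)
```

Because every document is kept independently, the number of boosted documents per class is only about 50 and varies from run to run, so the final mix is close to 50/50 rather than exact. If fewer than 100 documents receive a boost, the remaining slots are filled by documents with a score of 0.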

answered Oct 16 '22 08:10

denpost
