Retrieve docs that contain only allowed tags (exact match)

Tags:

elasticsearch

For each search request I have a list of allowed tags. For example:

["search", "open_source", "freeware", "linux"]

I want to retrieve only documents whose tags are all in this list. For example, I want to retrieve:

{
    "tags": ["search", "freeware"]
}

and exclude

{
    "tags": ["search", "windows"]
}

because the list doesn't contain the windows tag.

There is an example of exact matching in the Elasticsearch documentation:

https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html

Firstly, we include a field that maintains the number of tags:

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
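A count field like this is typically filled in by the indexing code. A minimal sketch in Python, assuming documents are plain dicts and that duplicate tags should be counted once; the helper name with_tag_count is hypothetical:

```python
def with_tag_count(doc):
    """Return a copy of doc with de-duplicated tags and a matching tag_count."""
    # dict.fromkeys keeps the first occurrence of each tag and preserves order.
    tags = list(dict.fromkeys(doc.get("tags", [])))
    return {**doc, "tags": tags, "tag_count": len(tags)}


doc = with_tag_count({"tags": ["search", "open_source", "search"]})
print(doc)
```

De-duplicating before counting matters because the count is later compared against the number of distinct tag terms that match.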

Secondly, we query with the required tag_count:

GET /my_index/my_type/_search
{
    "query": {
        "filtered" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tags" : "open_source" } }, 
                        { "term" : { "tag_count" : 2 } } 
                    ]
                }
            }
        }
    }
}

The problem is that I don't know tag_count.

I have also tried writing a query with a script_field tags_count, putting each allowed tag in a terms query and setting minimum_should_match to tags_count, but I can't use a script variable in minimum_should_match.

What can I investigate?

asked Oct 29 '15 13:10

Ivan

2 Answers

So I admit this is not a great solution, but maybe it will inspire other better solutions?

Assuming the records you are searching look like the ones in your post, with a tag_count field:

"tags" : ["search"],
"tag_count" : 1

"tags" : ["search", "open_source"],
"tag_count" : 2

And you have a list of allowed tags like:

["search", "open_source", "freeware"]

Then you might programmatically generate a query like:

{
    "query" : {
        "bool" : {
            "should" : [
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 1 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 1
                    }
                },
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 2 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 2
                    }
                },
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 3 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 3
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    }
}

The number of nested bool queries equals the number of query tags (not great for a number of reasons, but with smaller queries / smaller indices you can perhaps get away with this?). Each clause handles one possible value of tag_count: the tag_count term is required through must, and minimum_should_match is set to that same count, so a document with N tags matches only when all N of its tags are in the allowed list.
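A query of this shape can be generated rather than written by hand. The following is a minimal sketch in Python, not a definitive implementation; the function name build_allowed_tags_query is hypothetical, and it assumes each document's tag_count equals its number of distinct tags. In this sketch each clause requires the tag_count term through must and sets minimum_should_match to that count.

```python
import json


def build_allowed_tags_query(allowed_tags):
    """Build one bool clause per possible tag_count (1..len(allowed_tags))."""
    tags = list(dict.fromkeys(allowed_tags))  # drop duplicates, keep order
    tag_terms = [{"term": {"tags": tag}} for tag in tags]
    clauses = []
    for count in range(1, len(tags) + 1):
        clauses.append({
            "bool": {
                # The document must have exactly `count` tags ...
                "must": [{"term": {"tag_count": count}}],
                # ... and at least `count` of them must be allowed tags.
                "should": tag_terms,
                "minimum_should_match": count,
            }
        })
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


query = build_allowed_tags_query(["search", "open_source", "freeware"])
print(json.dumps(query, indent=2))
```

Documents with no tags at all (tag_count 0) are not matched by this sketch; whether they should be is a decision for the application.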

answered Oct 18 '22 21:10

eemp

If the index is medium-sized and the tag cardinality is rather low, I would just use a terms aggregation to get the distinct tags and build must and must_not filters to exclude docs which contain tags you don't "allow". There are many ways to cache the list of all tags in an in-memory database like Redis; here are a few that came to mind:

1. Have a time-to-live of a few minutes or hours, and regenerate the list when the cache has expired
2. Have a background process refresh the list at regular intervals
3. Update the list when new docs are inserted; doc deletions then need to be handled as well
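The must / must_not construction described above could be sketched as follows. This is an illustration under assumptions: build_tag_filter is a hypothetical helper, and known_tags stands for whatever list of all tags the cache currently holds.

```python
def build_tag_filter(allowed_tags, known_tags):
    """Require at least one allowed tag and reject every other known tag."""
    allowed = sorted(set(allowed_tags))
    disallowed = sorted(set(known_tags) - set(allowed))
    bool_filter = {"must": [{"terms": {"tags": allowed}}]}
    if disallowed:
        # Any document carrying one of these tags is excluded.
        bool_filter["must_not"] = [{"terms": {"tags": disallowed}}]
    return {"bool": bool_filter}


tag_filter = build_tag_filter(
    allowed_tags=["search", "freeware"],
    known_tags=["search", "freeware", "windows", "linux"],
)
print(tag_filter)
```

The result is only as accurate as known_tags: a tag that is missing from the cache is not excluded by must_not.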

A more performant and 100% accurate method could look like this:

1. Query all documents which have the requested tags but exclude docs with known other tags (as with the first solution)
2. Go through the list of returned docs
3. If a doc contains a tag which is not "allowed", that tag wasn't in the known-tags cache and must be added there; exclude this doc from the result set
4. Tags in Redis could have a TTL of, for example, one day or one week; this way old tags are pruned automatically and you get simpler ES queries

This way you don't need a background process to maintain the list of tags or the possibly heavy terms aggregation (which hits all docs), and you always get the correct result set and fairly performant queries.
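The client-side pruning step might look like the sketch below. It assumes hits are shaped like the entries of an Elasticsearch hits list (each with a _source containing tags); prune_hits is a hypothetical name, and writing the newly discovered tags to the cache with a TTL is left to the caller.

```python
def prune_hits(hits, allowed_tags):
    """Split hits into valid ones and the set of newly discovered tags."""
    allowed = set(allowed_tags)
    kept, new_tags = [], set()
    for hit in hits:
        unexpected = set(hit["_source"].get("tags", [])) - allowed
        if unexpected:
            # The doc slipped through because these tags were not cached yet.
            new_tags |= unexpected
        else:
            kept.append(hit)
    return kept, new_tags


hits = [
    {"_id": "1", "_source": {"tags": ["search", "freeware"]}},
    {"_id": "2", "_source": {"tags": ["search", "macos"]}},
]
kept, new_tags = prune_hits(hits, ["search", "open_source", "freeware", "linux"])
print([hit["_id"] for hit in kept], new_tags)
```

Because dropped documents shrink the page, a caller that needs a fixed page size would have to request more hits or re-query.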

This wouldn't work if aggregations are computed over the same query, because ES might include false-positive documents that are only pruned on the client side. However, this can be detected by adding a terms aggregation on tags and confirming that it contains no unexpected tags. If it does, those need to be added to the tag cache and to the must_not filter, and the query has to be re-executed. This isn't ideal if new tags are being created frequently.
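The aggregation check could be sketched like this. It assumes the search request included a terms aggregation on the tags field under the hypothetical name "tags", and that its size is large enough to return every distinct tag (the default is 10 buckets).

```python
def unexpected_tags(response, allowed_tags, agg_name="tags"):
    """Return tags seen in the aggregation that are not in the allowed list."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"] for bucket in buckets} - set(allowed_tags)


response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "search", "doc_count": 12},
                {"key": "freeware", "doc_count": 7},
                {"key": "macos", "doc_count": 1},
            ]
        }
    }
}
extra = unexpected_tags(response, ["search", "open_source", "freeware", "linux"])
print(extra)
```

A non-empty result means the tags in it should go into the cache and the must_not filter before the query is executed again.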

answered Oct 18 '22 22:10

NikoNyrh
