
Retrieve docs that contain only allowed tags (exact match)

For each search request I have a list of allowed tags. For example:

["search", "open_source", "freeware", "linux"]

I want to retrieve only documents whose tags are all in this list. For example, I want to retrieve:

{
    "tags": ["search", "freeware"]
}

and exclude

{
    "tags": ["search", "windows"]
}

because the list doesn't contain the windows tag.

There is an example of exact matching in the Elasticsearch documentation:

https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html

First, we add a field that stores the number of tags:

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

Second, we search with the required tag_count:

GET /my_index/my_type/_search
{
    "query": {
        "filtered" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tags" : "open_source" } }, 
                        { "term" : { "tag_count" : 2 } } 
                    ]
                }
            }
        }
    }
}

The problem is that I don't know which tag_count to query for.

I have also tried writing a query with a script_field tags_count, putting each allowed tag in a terms query, and setting minimum_should_match to tags_count, but I can't use a script variable in minimum_should_match.

What else can I look into?

Ivan asked Oct 29 '15 13:10



2 Answers

I admit this is not a great solution, but maybe it will inspire better ones.

Assume the records you are searching include the tag_count field from your post, for example:

"tags" : ["search"],
"tag_count" : 1

or

"tags" : ["search", "open_source"],
"tag_count" : 2

And the list of allowed tags for a query is:

["search", "open_source", "freeware"]

Then you might programmatically generate a query like:

{
    "query" : {
        "bool" : {
            "should" : [
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 1 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 1
                    }
                },
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 2 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 2
                    }
                },
                {
                    "bool" : {
                        "must" : [
                            { "term" : { "tag_count" : 3 } }
                        ],
                        "should" : [
                            { "term" : { "tags" : "search" } },
                            { "term" : { "tags" : "open_source" } },
                            { "term" : { "tags" : "freeware" } }
                        ],
                        "minimum_should_match" : 3
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    }
}

The tag_count term sits in must so that it is always required. If it were just one more should clause counted toward minimum_should_match, a document such as ["search", "open_source", "windows"] would satisfy the first clause through its two allowed tags alone. In each clause minimum_should_match equals tag_count, so a document matches only when every one of its tags is in the allowed list (assuming the tags within a document are distinct).

The number of nested bool queries equals the number of tags in the query, one clause per possible tag_count. That is not great for a number of reasons, but with smaller queries or smaller indices you can perhaps get away with it.
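
Generating that query programmatically could look roughly like the sketch below. The function name build_allowed_tags_query is made up for illustration, and the field names tags and tag_count are the ones from the question. In this sketch the tag_count term is placed under must so that it is always required, and it assumes the tags within a document are distinct.

```python
import json
from typing import Any, Dict, List


def build_allowed_tags_query(allowed_tags: List[str]) -> Dict[str, Any]:
    """Build one bool clause per possible tag_count (1..number of allowed tags).

    Hypothetical helper: it only builds the request body, it does not talk
    to Elasticsearch.
    """
    # Drop duplicate tags while keeping the caller's order.
    tags = list(dict.fromkeys(allowed_tags))

    clauses = []
    for count in range(1, len(tags) + 1):
        clauses.append({
            "bool": {
                # The document must have exactly `count` tags ...
                "must": [{"term": {"tag_count": count}}],
                # ... and at least `count` of them must be allowed tags.
                "should": [{"term": {"tags": tag}} for tag in tags],
                "minimum_should_match": count,
            }
        })

    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


if __name__ == "__main__":
    body = build_allowed_tags_query(["search", "open_source", "freeware"])
    print(json.dumps(body, indent=4))
```

The resulting dictionary can be serialized with json.dumps and sent as the search request body by whatever client you already use.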

eemp answered Oct 18 '22 21:10


If the index is medium-sized and the cardinality of tags is fairly low, I would just use a terms aggregation to get the distinct tags, then build must and must_not filters to exclude docs that contain tags you don't "allow". The list of all tags can be cached in an in-memory database like Redis. There are many ways to maintain that cache; here are a few that came to mind:

  1. Give the cache a time-to-live of a few minutes or hours, and regenerate the list once it has expired
  2. Have a background process refresh the list at regular intervals
  3. Update the list when new docs are inserted; doc deletions then need to be handled as well
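
A minimal sketch of building the filters described above, assuming the list of known tags has already been loaded from the cache. The function name build_must_not_query is hypothetical, and the sketch uses the filter and must_not clauses of the bool query together with terms queries on the tags field.

```python
from typing import Any, Dict, Iterable


def build_must_not_query(allowed_tags: Iterable[str],
                         known_tags: Iterable[str]) -> Dict[str, Any]:
    """Require at least one allowed tag and exclude every other known tag.

    Hypothetical helper: `known_tags` stands for the cached list of all
    distinct tags (for example, loaded from Redis).
    """
    allowed = set(allowed_tags)
    # Every known tag that is not allowed goes into must_not.
    disallowed = sorted(set(known_tags) - allowed)

    bool_query: Dict[str, Any] = {
        # Matches docs that have at least one of the allowed tags.
        "filter": [{"terms": {"tags": sorted(allowed)}}],
    }
    if disallowed:
        bool_query["must_not"] = [{"terms": {"tags": disallowed}}]

    return {"query": {"bool": bool_query}}
```

The result is only as accurate as the cached tag list: a doc carrying a tag that is not yet in known_tags still passes the must_not filter.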

A more performant and 100% accurate method could look like this:

  1. Query all documents that have the requested tags, but exclude docs with other known tags (as in the first solution)
  2. Go through the list of returned docs
  3. If a doc contains a tag that is not "allowed", that tag was missing from the known-tags cache and must be added to it; exclude this doc from the result set
  4. Tags in Redis could have a TTL of, for example, one day or one week. This way old tags are pruned automatically and you get simpler ES queries

This way you don't need a background process to maintain the list of tags, and you avoid the possibly heavy terms aggregation, which hits all docs. You still always get the correct result set and fairly performant queries.
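
The client-side pass over the returned docs could be sketched as follows. The function name prune_hits is hypothetical, and a plain Python set stands in for the Redis cache of known tags; each hit is assumed to be a dict with a tags list.

```python
from typing import Any, Dict, Iterable, List, Set


def prune_hits(hits: Iterable[Dict[str, Any]],
               allowed_tags: Iterable[str],
               known_tags: Set[str]) -> List[Dict[str, Any]]:
    """Drop docs that carry a non-allowed tag and remember such tags.

    Hypothetical helper: `known_tags` is an in-memory stand-in for the
    tag cache and is updated in place.
    """
    allowed = set(allowed_tags)
    kept = []
    for doc in hits:
        unexpected = set(doc.get("tags", [])) - allowed
        if unexpected:
            # These tags were missing from the cache, otherwise the
            # must_not filter would have excluded this doc already.
            known_tags.update(unexpected)
            continue
        kept.append(doc)
    return kept
```

With a real Redis cache, the known_tags.update call would become a write with the TTL mentioned in step 4.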

This wouldn't work if aggregations are run on the same query, because ES might include false positives in them, docs that are only pruned afterwards on the client side. However, this can be detected by adding a terms aggregation on tags as well and checking that it contains no unexpected tags. If it does, those tags need to be added to the tag cache and to the must_not filter, and the query has to be re-executed. This isn't ideal if new tags are created frequently.

NikoNyrh answered Oct 18 '22 22:10