For each search request I have a list of allowed tags. For example:
["search", "open_source", "freeware", "linux"]
I want to retrieve only documents whose tags are all in this list. So I want to retrieve:
{
"tags": ["search", "freeware"]
}
and exclude
{
"tags": ["search", "windows"]
}
because the list doesn't contain the windows tag.
There is an example of exact matching in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html
Firstly, we include a field that maintains the number of tags:
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
Secondly, we query with the required tag_count:
GET /my_index/my_type/_search
{
  "query": {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "tags" : "search" } },
            { "term" : { "tags" : "open_source" } },
            { "term" : { "tag_count" : 2 } }
          ]
        }
      }
    }
  }
}
The problem is that I don't know tag_count.
I have also tried writing a query with a script_field tags_count, putting each allowed tag in a terms query and setting minimum_should_match to tags_count, but I can't use a script variable in minimum_should_match.
What can I investigate?
So I admit this is not a great solution, but maybe it will inspire other better solutions?
Given that the records you are searching look like the ones in your post, with a tag_count field:
"tags" : ["search"],
"tag_count" : 1
or
"tags" : ["search", "open_source"],
"tag_count" : 2
And you have a query like:
["search", "open_source", "freeware"]
Then you might programmatically generate a query like:
{
  "query" : {
    "bool" : {
      "should" : [
        {
          "bool" : {
            "must" : [
              { "term" : { "tag_count" : 1 } }
            ],
            "should" : [
              { "term" : { "tags" : "search" } },
              { "term" : { "tags" : "open_source" } },
              { "term" : { "tags" : "freeware" } }
            ],
            "minimum_should_match" : 1
          }
        },
        {
          "bool" : {
            "must" : [
              { "term" : { "tag_count" : 2 } }
            ],
            "should" : [
              { "term" : { "tags" : "search" } },
              { "term" : { "tags" : "open_source" } },
              { "term" : { "tags" : "freeware" } }
            ],
            "minimum_should_match" : 2
          }
        },
        {
          "bool" : {
            "must" : [
              { "term" : { "tag_count" : 3 } }
            ],
            "should" : [
              { "term" : { "tags" : "search" } },
              { "term" : { "tags" : "open_source" } },
              { "term" : { "tags" : "freeware" } }
            ],
            "minimum_should_match" : 3
          }
        }
      ],
      "minimum_should_match" : 1
    }
  }
}
The number of nested bool queries equals the number of tags in the query. That is not great for a number of reasons, but with smaller queries and smaller indices you can perhaps get away with it. Each clause handles one possible value of tag_count. The tag_count term sits in must rather than should, because otherwise a document tagged ["search", "open_source", "windows"] would satisfy the first clause through two tag terms alone. With tag_count pinned, minimum_should_match is set to the same number, so a document matches only when every one of its tags is in the allowed list.
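Generating such a query programmatically could look roughly like the Python below. This is a minimal sketch, not a definitive implementation: the function name build_allowed_tags_query is hypothetical, the allowed tags are assumed to be distinct, and tag_count is assumed to equal the number of distinct tags on each document. Each generated clause pins tag_count in a must clause and requires that many tag matches, so tag terms alone cannot satisfy a clause.

```python
import json


def build_allowed_tags_query(allowed_tags):
    """Build one bool clause per possible tag_count (1..len(allowed_tags))."""
    tag_terms = [{"term": {"tags": tag}} for tag in allowed_tags]
    clauses = []
    for count in range(1, len(allowed_tags) + 1):
        clauses.append({
            "bool": {
                # Pin tag_count so tag terms alone cannot satisfy the clause.
                "must": [{"term": {"tag_count": count}}],
                "should": tag_terms,
                # Every one of the document's `count` tags must be allowed.
                "minimum_should_match": count,
            }
        })
    # A document is a hit if it satisfies any one of the per-count clauses.
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


query = build_allowed_tags_query(["search", "open_source", "freeware"])
print(json.dumps(query, indent=2))
```

Documents with no tags at all (tag_count 0) are not matched by this sketch; whether they should be is a separate decision.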
If the index is medium-sized and tag cardinality is rather low, I would just use a terms aggregation to get the distinct tags and create must and must_not filters to filter out docs that contain tags you don't "allow". There are many ways to cache the list of all tags in an in-memory database like Redis; here are a few that came to my mind:
A more performant and 100% accurate method could look like this:
This way you don't need a backup process to maintain the list of tags, nor the possibly heavy terms aggregation (it hits all docs), and you always get the correct result set with fairly performant queries.
This wouldn't work if subsequent aggregations are used, as ES might return false documents that are only pruned on the client side. However, this could be detected by adding a terms aggregation as well and confirming that it doesn't contain unexpected tags. If it does, those tags need to be added to the tag cache and to the must_not filter, and the query has to be re-executed. This isn't ideal if new tags are being created frequently.
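One possible reading of the cached-tags approach, as a rough Python sketch. Everything here is an assumption for illustration: the function names build_must_not_query and prune_hits are hypothetical, the tag cache is modelled as a plain Python set rather than Redis, and hits are assumed to be dicts with a "tags" list. It builds a must_not filter from the cached tags that are not allowed, then prunes on the client side any document carrying a tag the cache did not know about.

```python
def build_must_not_query(allowed_tags, known_tags):
    """Exclude documents carrying any cached tag outside the allowed list."""
    disallowed = sorted(set(known_tags) - set(allowed_tags))
    bool_query = {"must": [{"terms": {"tags": list(allowed_tags)}}]}
    if disallowed:
        # Only add must_not when there is something to exclude.
        bool_query["must_not"] = [{"terms": {"tags": disallowed}}]
    return {"query": {"bool": bool_query}}


def prune_hits(hits, allowed_tags):
    """Split hits into accepted documents and tags missing from the cache."""
    allowed = set(allowed_tags)
    accepted, unexpected = [], set()
    for doc in hits:
        extra = set(doc["tags"]) - allowed
        if extra:
            unexpected |= extra  # tags the cache did not know about
        else:
            accepted.append(doc)
    return accepted, unexpected


known_tags = {"search", "open_source", "freeware", "linux", "windows"}
allowed = ["search", "open_source", "freeware", "linux"]
query = build_must_not_query(allowed, known_tags)

# Pretend these came back from Elasticsearch; "macos" is not in the cache yet.
hits = [{"tags": ["search", "freeware"]}, {"tags": ["search", "macos"]}]
accepted, unexpected = prune_hits(hits, allowed)
known_tags |= unexpected  # update the cache so the next query excludes them
```

After pruning, any tags in unexpected go into the cache, so the next generated query excludes them on the server side.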