
How to exclude a field from getting searched by elasticsearch 6.1?

I have an index with multiple fields. I want to match documents on the presence of the search string in any field except one, user_comments. The search request I am sending is:

{
  "from": offset,
  "size": limit,
  "_source": [
    "document_title"
  ],
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "query_string": {
              "query": "#{query}"
            }
          }
        }
      }
    }
  }
}

The query string searches through all the fields, so it also returns documents whose only match is in the user_comments field. I want to query against all the fields except user_comments. The whitelist would be very long and the field names are dynamic, so it is not feasible to list the whitelisted fields with the fields parameter, like this:

"query_string": {
  "query": "#{query}",
  "fields": [
    "document_title",
    "field2"
  ]
}

Can anybody please suggest an idea on how to exclude a field from being searched?

Richa Sinha asked Oct 11 '18 09:10


1 Answer

There is a way to make it work. It's not pretty, but it will do the job. You can achieve your goal by using the boost and multi-field type parameters of query_string, a bool query to combine the scores, and min_score to drop the unwanted hits:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "#{query}",
            "type": "most_fields",
            "boost": 1
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "#{query}",
            "boost": -1
          }
        }
      ]
    }
  },
  "min_score": 0.00001
}

So what happens under the hood?

Let's assume you have the following set of documents:

PUT my-query-string/doc/1
{
  "title": "Prodigy in Bristol",
  "text": "Prodigy in Bristol",
  "comments": "Prodigy in Bristol"
}
PUT my-query-string/doc/2
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham",
  "comments": "And also in Bristol"
}
PUT my-query-string/doc/3
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham and Bristol",
  "comments": "And also in Cardiff"
}
PUT my-query-string/doc/4
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham",
  "comments": "And also in Cardiff"
}

When searching for "Bristol", you would like to see only documents 1 and 3, but your original query returns 1, 2 and 3.

In Elasticsearch, search results are sorted by relevance _score: the higher the score, the better the match.

So let's try to boost down the "comments" field so that its contribution to the relevance score is cancelled out. We can do this by combining two queries in a should clause and giving the second one a negative boost:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "Bristol"
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  }
}

This will give us the following output:

{
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "2",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham",
          "comments": "And also in Bristol"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      }
    ]
  }
}

Document 2 got penalized, but so did document 1, although it is a match we want. Why did this happen?

Here's how Elasticsearch computed _score in this case:

_score = max(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

Document 1 matches the comments:"Bristol" part, which also happens to be its best-scoring field. According to the formula, the resulting score is 0.

What we actually want is for the first clause (the one over "all" fields) to score higher when more fields match.
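The formula can be checked by hand. The Python sketch below is only an illustration, not how Elasticsearch computes scores internally: it assumes every field containing "Bristol" scores 0.2876821 (the value seen in the output above) and models the first clause as a max over the fields. The names FIELD_SCORE, DOCS, field_score and best_fields_score are invented for this example.

```python
# Assumption: every field that contains "bristol" contributes this score.
FIELD_SCORE = 0.2876821

# The four sample documents from above.
DOCS = {
    1: {"title": "Prodigy in Bristol", "text": "Prodigy in Bristol",
        "comments": "Prodigy in Bristol"},
    2: {"title": "Prodigy in Birmigham", "text": "Prodigy in Birmigham",
        "comments": "And also in Bristol"},
    3: {"title": "Prodigy in Birmigham", "text": "Prodigy in Birmigham and Bristol",
        "comments": "And also in Cardiff"},
    4: {"title": "Prodigy in Birmigham", "text": "Prodigy in Birmigham",
        "comments": "And also in Cardiff"},
}


def field_score(value):
    """Score of a single field: FIELD_SCORE if it mentions Bristol, else 0."""
    return FIELD_SCORE if "bristol" in value.lower() else 0.0


def best_fields_score(doc):
    """max over all fields, minus the negatively boosted comments clause."""
    all_fields = max(field_score(value) for value in doc.values())
    return all_fields - field_score(doc["comments"])


scores = {doc_id: best_fields_score(doc) for doc_id, doc in DOCS.items()}
for doc_id, score in scores.items():
    print(doc_id, score)
```

Under these assumptions documents 1 and 2 both end up at 0 and only document 3 keeps a positive score. Document 4 also gets 0 in the sketch, but in Elasticsearch it would not be returned at all because it matches neither clause.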

Can we boost query_string matching more fields?

We can: query_string in multi-field mode has a type parameter that does exactly that. The query will look like this:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "type": "most_fields",
            "query": "Bristol"
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  }
}

This will give us the following output:

{
  "hits": {
    "total": 3,
    "max_score": 0.57536423,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0.57536423,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "2",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham",
          "comments": "And also in Bristol"
        }
      }
    ]
  }
}

As you can see, the undesired document 2 is at the bottom and has a score of 0. Here's how the score was computed this time:

_score = sum(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

So documents matching "Bristol" in any field are still selected, but the relevance contribution of comments:"Bristol" is cancelled out, and only documents matching title:"Bristol" or text:"Bristol" get a _score > 0.

Can we filter out the results with an undesired score?

Yes, we can, using min_score:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "Bristol",
            "type": "most_fields",
            "boost": 1
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  },
  "min_score": 0.00001
}

This will work (in our case) since a document's score will be 0 if and only if "Bristol" matched the "comments" field only and no other field.

The output will be:

{
  "hits": {
    "total": 2,
    "max_score": 0.57536423,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0.57536423,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      }
    ]
  }
}
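The combined effect of most_fields and min_score can be reproduced by hand. The Python sketch below is an illustration under assumptions, not Elasticsearch's actual scoring code: each matching field is assumed to contribute 0.2876821 (the value seen in the outputs above), and MATCHES, most_fields_score and kept are names invented for this example.

```python
# Assumption: every field that matches "Bristol" contributes this score.
FIELD_SCORE = 0.2876821
MIN_SCORE = 0.00001  # same threshold as min_score in the request

# Which fields of each sample document contain "Bristol".
MATCHES = {
    1: {"title", "text", "comments"},
    2: {"comments"},
    3: {"text"},
    4: set(),
}


def most_fields_score(matched_fields):
    """sum over all matching fields, minus the negatively boosted comments clause."""
    total = FIELD_SCORE * len(matched_fields)
    penalty = FIELD_SCORE if "comments" in matched_fields else 0.0
    return total - penalty


# Keep only documents whose score reaches the threshold.
kept = [doc_id for doc_id, fields in sorted(MATCHES.items())
        if most_fields_score(fields) >= MIN_SCORE]
print(kept)
```

With these numbers only documents 1 and 3 stay above the threshold, which agrees with the response above. The approach relies on the comments contribution cancelling exactly, so it is worth re-checking if the two clauses ever use different analyzers or boosts.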

Can it be done in a different way?

Sure. I wouldn't actually advise going with _score tweaking, since scoring is a pretty complex matter.

I would advise fetching the existing mapping and building the list of fields to run the query against beforehand. This will make the code much simpler and more straightforward.
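A minimal sketch of that idea in Python, under stated assumptions: the mapping has already been fetched (for example with GET my-query-string/_mapping) and its "properties" object is passed in as a plain dict. The helper names searchable_fields and build_query, and the choice to keep only text and keyword fields, are hypothetical and not part of any Elasticsearch client library.

```python
# Assumption: only these field types accept a free-text query string.
TEXT_TYPES = {"text", "keyword"}


def searchable_fields(properties, exclude, prefix=""):
    """Collect dotted names of text-like fields, skipping excluded ones."""
    fields = []
    for name, spec in properties.items():
        path = prefix + name
        if path in exclude:
            continue
        if "properties" in spec:
            # Object field: descend into its sub-fields.
            fields.extend(searchable_fields(spec["properties"], exclude, path + "."))
        elif spec.get("type") in TEXT_TYPES:
            fields.append(path)
    return fields


def build_query(query, fields):
    """Request body that runs query_string against an explicit field list."""
    return {"query": {"query_string": {"query": query, "fields": fields}}}


# Hypothetical "properties" section of the sample index mapping.
properties = {
    "title": {"type": "text"},
    "text": {"type": "text"},
    "comments": {"type": "text"},
    "discount_percent": {"type": "long"},
}

fields = searchable_fields(properties, exclude={"comments"})
body = build_query("Bristol", fields)
print(body)
```

Because the field names are dynamic, the list would have to be rebuilt (or cached and refreshed) whenever the mapping changes. Skipping non-text types also sidesteps the error that a textual query string causes on numeric fields.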

Original solution proposed in the answer (kept for history)

Originally it was proposed to use this kind of query with exactly the same intent as the solution above:

POST my-query-string/doc/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "query_string": {
              "fields" : ["*", "comments^0"],
              "query": "#{query}"
            }
          }
        }
      }
    }
  },
  "min_score": 0.00001
}

The only problem is that if the index contains any numeric fields, this part:

"fields": ["*"]

raises an error, since a textual query string cannot be applied to a number.

Nikolay Vasiliev answered Oct 22 '22 03:10