
How to query a phrase with stopwords in ElasticSearch

I am indexing some text with stopwords enabled, and I would like to search it with a "match phrase" query without slop. However, it looks like the removed stopwords are still taken into account when term positions are computed.

Building index:

PUT /fr_articles
{
   "settings": {
      "analysis": {
         "analyzer": {
            "stop": {
               "type": "standard",
               "stopwords" : ["the"]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "title": {
               "type": "string",
               "analyzer": "stop"
            }
         }
      }
   }
}

Add a document:

POST /fr_articles/test/1
{
    "title" : "Tom the king of Toulon!"
}

Search:

POST /fr_articles/_search
{
   "fields": [
      "title"
   ],
   "explain": true,
   "query": {
      "match": {
         "title": {
            "query": "tom king",
            "type" : "phrase"
         }
      }
   }
}

Nothing found ;-(

Is there a way to fix this? Perhaps with multiple span queries, but I want the terms to be next to each other.

Thank you,

Thomas Decaux asked Jul 30 '15 08:07



1 Answer

Yes, position increments cause this. The stop word is removed and is no longer searchable, but removing it does not pull the two surrounding words next to each other, so the query "tom the king" finds neither "tom king" nor "such that tom will not be their king".

Often, when you remove something in analysis with a filter, it's not quite as if it was never there. The intent of StopFilter, in particular, is to remove search hits resulting from uninteresting terms. It is not to change the structure of the document or a sentence.
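To make the effect of position increments concrete, here is a minimal sketch in plain Python. It is a toy model, not Lucene's implementation: `analyze` and `phrase_matches` are hypothetical helpers written for illustration, and the only stopword is "the", as in the question's analyzer.

```python
import re

STOPWORDS = {"the"}  # same single stopword as the "stop" analyzer in the question


def analyze(text):
    """Toy analyzer: lowercase, split into words, drop stopwords.

    Each surviving token keeps the position it had before stopword removal,
    which is the effect position increments have.
    """
    words = re.findall(r"\w+", text.lower())
    return [(word, pos) for pos, word in enumerate(words) if word not in STOPWORDS]


def phrase_matches(doc_tokens, query_tokens):
    """Slop-0 phrase match: gaps between query terms must equal gaps in the document."""
    if not query_tokens:
        return False
    doc_positions = {}
    for word, pos in doc_tokens:
        doc_positions.setdefault(word, set()).add(pos)
    first_word, first_pos = query_tokens[0]
    for start in doc_positions.get(first_word, ()):
        offset = start - first_pos
        if all(qpos + offset in doc_positions.get(qword, ())
               for qword, qpos in query_tokens[1:]):
            return True
    return False


doc = analyze("Tom the king of Toulon!")
print(doc)  # [('tom', 0), ('king', 2), ('of', 3), ('toulon', 4)]
print(phrase_matches(doc, analyze("tom king")))      # False: "king" is expected at position 1
print(phrase_matches(doc, analyze("tom the king")))  # True: the gap left by "the" lines up
```

In this model the hole left by "the" stays in the document, so "tom" and "king" sit two positions apart and an exact phrase for "tom king" does not line up.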

You used to be able to disable position increments on StopFilter, but that option was removed as of Lucene 4.4.


Okay, forget that CharFilter tomfoolery. Ugly hack, don't do that.

To query without using position increments, you need to configure that in your query parser, not in the analysis. In Elasticsearch, this can be done with a Query String Query that has enable_position_increments set to false.

Something like:

{
    "query_string" : {
        "default_field" : "title",
        "query" : "\"tom king\"",
        "enable_position_increments" : false
    }
}
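As a rough illustration of what turning off position increments on the query side means, here is a toy model in plain Python. It is not Lucene or Elasticsearch code: `analyze` and `phrase_matches` are hypothetical helpers, and the `enable_position_increments` argument only mimics the idea behind the real setting.

```python
import re

STOPWORDS = {"the"}  # assumed stopword list, matching the question's analyzer


def analyze(text, enable_position_increments=True):
    """Toy analyzer: lowercase, split into words, drop stopwords."""
    words = re.findall(r"\w+", text.lower())
    kept = [(word, pos) for pos, word in enumerate(words) if word not in STOPWORDS]
    if enable_position_increments:
        return kept  # removed stopwords leave gaps in the positions
    # Without increments, tokens are renumbered as if the stopwords never existed.
    return [(word, i) for i, (word, _) in enumerate(kept)]


def phrase_matches(doc_tokens, query_tokens):
    """Slop-0 phrase match: gaps between query terms must equal gaps in the document."""
    if not query_tokens:
        return False
    doc_positions = {}
    for word, pos in doc_tokens:
        doc_positions.setdefault(word, set()).add(pos)
    first_word, first_pos = query_tokens[0]
    for start in doc_positions.get(first_word, ()):
        offset = start - first_pos
        if all(qpos + offset in doc_positions.get(qword, ())
               for qword, qpos in query_tokens[1:]):
            return True
    return False


doc = analyze("Tom king of Toulon")  # indexed text without the stopword
with_gaps = analyze("tom the king")
without_gaps = analyze("tom the king", enable_position_increments=False)

print(with_gaps)                        # [('tom', 0), ('king', 2)]
print(without_gaps)                     # [('tom', 0), ('king', 1)]
print(phrase_matches(doc, with_gaps))     # False
print(phrase_matches(doc, without_gaps))  # True
```

Note that in this sketch the flag only renumbers the query's tokens; positions recorded at index time stay as they were.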

As a point of interest, a similar solution is available in raw Lucene by calling QueryParser.setEnablePositionIncrements.

femtoRgon answered Oct 11 '22 04:10
