Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch wildcard search on not_analyzed field

I have an index like following settings and mapping;

{
  "settings":{
     "index":{
        "analysis":{
           "analyzer":{
              "analyzer_keyword":{
                 "tokenizer":"keyword",
                 "filter":"lowercase"
              }
           }
        }
     }
  },
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "analyzer":"analyzer_keyword",
              "type":"string",
              "index": "not_analyzed"
           }
        }
     }
  }
}

I am struggling with making an implementation for wildcard search on name field. My example data like this;

[
{"name": "SVF-123"},
{"name": "SVF-234"}
]

When I perform following query;

http://localhost:9200/my_index/product/_search -d '
{
    "query": {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "query": "*SVF-1*"
                }
            }
        }

    }
}'

It returns SVF-123,SVF-234. I think, it still tokenizes data. It must return only SVF-123.

Could you please help on this?

Thanks in advance

like image 521
Hüseyin BABAL Avatar asked Jan 16 '14 11:01

Hüseyin BABAL


2 Answers

There's a couple of things going wrong here.

First, you are saying that you don't want terms analyzed index time. Then, there's an analyzer configured (that's used search time) that generates incompatible terms. (They are lowercased)

By default, all terms end up in the _all-field with the standard analyzer. That is where you end up searching. Since it tokenizes on "-", you end up with an OR of "*SVF" and "1*".

Try to do a terms facet on _all and on name to see what's going on.

Here's a runnable Play and gist: https://www.found.no/play/gist/3e5fcb1b4c41cfc20226 (https://gist.github.com/alexbrasetvik/3e5fcb1b4c41cfc20226)

You need to make sure the terms you index is compatible with what you search for. You probably want to disable _all, since it can muddy what's going on.

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "SVF-123",
                "SVF-234"
            ],
            "analyzer": {
                "analyzer_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "type": {
            "properties": {
                "name": {
                    "type": "string",
                    "index": "not_analyzed",
                    "analyzer": "analyzer_keyword"
                }
            }
        }
    }
}'


# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-123"}
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-234"}
'

# Do searches

# See all the generated terms.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "facets": {
        "name": {
            "terms": {
                "field": "name"
            }
        },
        "_all": {
            "terms": {
                "field": "_all"
            }
        }
    }
}
'

# Analyzed, so no match
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "match": {
            "name": {
                "query": "SVF-123"
            }
        }
    }
}
'

# Not analyzed according to `analyzer_keyword`, so matches. (Note: term, not match)
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "term": {
            "name": {
                "value": "SVF-123"
            }
        }
    }
}
'


curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "term": {
            "_all": {
                "value": "svf"
            }
        }
    }
}
'
like image 97
Alex Brasetvik Avatar answered Sep 30 '22 21:09

Alex Brasetvik


My solution adventure

I have started my case as you can see in my question. Whenever, I have changed a part of my settings, one part started to work, but another part stop working. Let me give my solution history:

1.) I have indexed my data as default. This means, my data is analyzed as default. This will cause problem on my side. For example;

When user started to search a keyword like SVF-1, system run this query:

{
    "query": {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "analyze_wildcard": true,
                    "query": "*SVF-1*"
                }
            }
        }

    }
}

and results;

SVF-123
SVF-234

This is normal, because name field of my documents are analyzed. This splits query into tokens SVF and 1, and SVF matches my documents, although 1 does not match. I have skipped this way. I have create a mapping for my fields make them not_analyzed

{
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "type":"string",
              "index": "not_analyzed"
           },
           "site":{
              "type":"string",
              "index": "not_analyzed"
           } 
        }
     }
  }
}

but my problem continued.

2.) I wanted to try another way after lots of research. Decided to use wildcard query. My query is;

{
    "query": {
        "wildcard" : {
            "name" : {
                "value" : *SVF-1*"
             }
          }
      },
            "filter":{
                    "term": {"site":"pro_en_GB"}
            }
    }
}

This query worked, but one problem here. My fields are not_analyzed anymore, and I am making wildcard query. Case sensitivity is problem here. If I search like svf-1, it returns nothing. Since, user can input lowercase version of query.

3.) I have changed my document structure to;

{
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "type":"string",
              "index": "not_analyzed"
           },
           "nameLowerCase":{
              "type":"string",
              "index": "not_analyzed"
           }
           "site":{
              "type":"string",
              "index": "not_analyzed"
           } 
        }
     }
  }
}

I have adde one more field for name called nameLowerCase. When I am indexing my document, I am setting my document like;

{
    name: "SVF-123",
    nameLowerCase: "svf-123",
    site: "pro_en_GB"
}

Here, I am converting query keyword to lowercase and make search operation on new nameLowerCase index. And displaying name field.

Final version of my query is;

{
    "query": {
        "wildcard" : {
            "nameLowerCase" : {
                "value" : "*svf-1*"
             }
          }
      },
            "filter":{
                    "term": {"site":"pro_en_GB"}
            }
    }
}

Now it works. There is also one way to solve this problem by using multi_field. My query contains dash(-), and faced some problems.

Lots of thanks to @Alex Brasetvik for his detailed explanation and effort

like image 37
Hüseyin BABAL Avatar answered Sep 30 '22 23:09

Hüseyin BABAL