
Elasticsearch: splitting words on underscore; search finds nothing

I'm configuring an analyzer that splits words on the underscore character as well as on all other punctuation characters. I decided to use the word_delimiter filter for this. Then I set my analyzer as the default for the desired field.

I have two issues with it:

  • The analyzer splits strings into words but doesn't preserve the original string, despite the preserve_original option. See the analyze query.
  • Searching by substrings split on the underscore still produces no results.

Here is my template, data object, analyzer test and search requests:

PUT simple
{
  "template" : "simple",
  "settings" : {
    "index" : {
      "analysis" : {
          "analyzer" : {
              "underscore_splits_words" : {
                  "tokenizer" : "standard",
                  "filter" : ["word_delimiter"],
                  "generate_word_parts" : true,
                  "preserve_original" : true
              }
          }
      }
    },
    "mappings": {
        "_default_": {
             "properties" : {
                "request" : { "type" : "string", "analyzer" : "underscore_splits_words" }
            }
        }
    }
  }
}

Data object:

POST simple/0 
{ "request" : "GET /queue/1/under_score-hyphenword/poll?ttl=300&limit=10" }

This returns the tokens "under", "score", "hyphenword", but the original unsplit string is not among them:

POST simple/_analyze?analyzer=underscore_splits_words
{"/queue/1/under_score-hyphenword/poll?ttl=300&limit=10"}

Search results

Hit:

GET simple/_search?q=hyphenword

Hit:

POST simple/_search
{ 
"query": {
        "query_string": {
          "query": "hyphenword"
        }
      }
}

Miss:

GET simple/_search?q=score

Miss:

POST simple/_search
{ 
"query": {
        "query_string": {
          "query": "score"
        }
      }
}

Please suggest a correct way to achieve my goal. Thanks!

Volodymyr Linevych asked Aug 05 '15 16:08


1 Answer

You should be able to use the "simple" analyzer for this. There's no need for a custom analyzer, because the simple analyzer is built on the lowercase tokenizer, which combines the letter tokenizer with a lowercase filter (so any non-letter character, including an underscore, starts a new token). The reason you are not getting any hits is that you are not specifying the field in your query, so you are querying the _all field, which is mainly for convenient full-text searching. By default the _all field is analyzed with the standard analyzer, which does not split on underscores, so "score" is never indexed there as a separate token.
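To make the difference concrete, here is a rough Python sketch that imitates the two tokenization behaviours on the request string from the question. It does not call Elasticsearch; the function names `simple_analyzer` and `standard_like_analyzer` are hypothetical, and the regular expressions are only approximations that happen to match the analyzers' behaviour on this particular input.

```python
import re


def simple_analyzer(text):
    """Approximate the "simple" analyzer: lowercase, then split on any non-letter."""
    # [^\W\d_]+ matches runs of letters only (digits and underscore are excluded).
    return re.findall(r"[^\W\d_]+", text.lower())


def standard_like_analyzer(text):
    """Rough stand-in for the standard analyzer on this input (an assumption)."""
    # \w+ keeps letters, digits and underscore together,
    # so "under_score" survives as a single token.
    return re.findall(r"\w+", text.lower())


request = "GET /queue/1/under_score-hyphenword/poll?ttl=300&limit=10"

simple_tokens = simple_analyzer(request)
standard_tokens = standard_like_analyzer(request)

print("simple:       ", simple_tokens)
print("standard-like:", standard_tokens)
print("'score' in simple tokens:       ", "score" in simple_tokens)
print("'score' in standard-like tokens:", "score" in standard_tokens)
```

In this sketch "hyphenword" appears in both token lists, while "score" appears only in the simple-analyzer list. That is consistent with the hits and misses reported in the question.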

Create index

PUT myindex
{
    "mappings":     {
        "mytype": {
            "properties": {
                "request": {
                    "type": "string",
                    "analyzer": "simple"
                }
            }
        }
    }
}

Insert a document

POST myindex/mytype/1 
{ "request" : "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10" }

Query for the document

GET myindex/mytype/_search?q=request:key

Query using the Query DSL:

POST myindex/mytype/_search
 {
     "query": {
         "query_string": {
             "default_field": "request", 
             "query": "key"
         }
     }
 }

Another query using the query DSL:

POST myindex/mytype/_search
{
    "query": {
        "bool": {
            "must": [
                { "match": { "request": "key"}}
            ]
        }
    }
}

The output from the queries looks correct:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 0.095891505,
            "_source": {
               "request": "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10"
            }
         }
      ]
   }
}

If you want to omit the specific field you're searching (NOT RECOMMENDED), you can set the default analyzer for all mappings in the index when you create the index. (Note: this feature is deprecated, and you shouldn't use it, for performance and stability reasons.)

Create index with default mapping to make the _all field analyzed using the "simple" analyzer

PUT myindex
{
    "mappings":     {
        "_default_": {
            "index_analyzer": "simple"
        }
    }
}

Insert a document

POST myindex/mytype/1 
{ "request" : "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10" }

Query the index without specifying the field

GET myindex/mytype/_search?q=key

You will get the same result (1 hit).
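If you still want the word_delimiter approach from the question, note that options such as preserve_original belong to a filter definition, not to the analyzer itself. The following is a minimal sketch of such an index body, built as a Python dict and printed as JSON. The filter name `underscore_delimiter` is hypothetical, the type name `mytype` is borrowed from the examples above, and the mapping syntax assumes the same Elasticsearch version used in this thread.

```python
import json

# Hypothetical filter name; any name works as long as the analyzer references it.
FILTER_NAME = "underscore_delimiter"

index_body = {
    "settings": {
        "analysis": {
            # The word_delimiter options are set on a custom filter definition...
            "filter": {
                FILTER_NAME: {
                    "type": "word_delimiter",
                    "generate_word_parts": True,
                    "preserve_original": True,
                }
            },
            # ...and the analyzer only refers to that filter by name.
            "analyzer": {
                "underscore_splits_words": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [FILTER_NAME],
                }
            },
        }
    },
    # "mappings" is a sibling of "settings", not nested inside it.
    "mappings": {
        "mytype": {
            "properties": {
                "request": {
                    "type": "string",
                    "analyzer": "underscore_splits_words",
                }
            }
        }
    },
}

# Print the JSON that could be used as the body of the index creation request.
print(json.dumps(index_body, indent=2))
```

As with the other examples, the search still needs to target the request field (for example q=request:score) for this analyzer to apply.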

Jona answered Nov 15 '22 10:11