
Exclude from CamelCase tokenizer in Elasticsearch

I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I'm indexing source code, I definitely need a CamelCase tokenizer, but it breaks iPhone into two terms, so iphone can't be found.

Does anyone know of a way to add exceptions to the splitting of camelCase words into tokens (camel + case)?

UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Any other solution?

UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is a superior solution.
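For the record, an analyzer that produces that token set looks roughly like this (a sketch of the idea, not necessarily ChintanShah's exact settings): a word_delimiter filter chained with a shingle filter that uses an empty token separator.

{
  "settings": {
    "analysis": {
      "filter": {
        "camel_split": {
          "type": "word_delimiter"
        },
        "subword_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "token_separator": "",
          "output_unigrams": true
        }
      },
      "analyzer": {
        "code_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_split",
            "subword_shingles",
            "lowercase"
          ]
        }
      }
    }
  }
}

With this chain, NullPointerException comes out as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], and iPhone comes out as [i, phone, iphone], so a search for iphone still matches without needing a protected-words list.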

asked Jan 02 '16 by tishma

1 Answer

You can achieve your requirements with the word_delimiter token filter. This is my setup:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}

This will split words on case changes, so NullPointerException will be tokenized as null, pointer and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which will help you a lot.
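For example, a variant of the filter above with preserve_original turned on (a sketch, reusing the same filter name) keeps the whole token alongside its parts:

"filter": {
  "camel_filter": {
    "type": "word_delimiter",
    "generate_number_parts": false,
    "stem_english_possessive": false,
    "split_on_numerics": false,
    "preserve_original": true,
    "protected_words": [
      "iPhone",
      "WiFi"
    ]
  }
}

With preserve_original, NullPointerException is indexed as nullpointerexception in addition to null, pointer and exception (after the lowercase filter runs).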

GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "iphone",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      }
   ]
}

Now with

GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "null",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "pointer",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "exception",
         "start_offset": 11,
         "end_offset": 20,
         "type": "word",
         "position": 3
      }
   ]
}

Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
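If you do want the two-analyzer route, a multi-field mapping is the usual way to set it up. A sketch (the field name code is just an example, and older Elasticsearch versions use string instead of text and nest properties under the type name):

"mappings": {
  "properties": {
    "code": {
      "type": "text",
      "analyzer": "camel_analyzer",
      "fields": {
        "exact": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

You can then query code for the camel-split tokens and code.exact for the whole (merely lowercased) terms.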

Does this help?

answered Nov 18 '22 by ChintanShah25