
Exclude from CamelCase tokenizer in Elasticsearch

I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I'm indexing source code, I definitely need a CamelCase tokenizer, but it breaks iPhone into two terms, so iphone can't be found.

Does anyone know of a way to add exceptions to the splitting of camelCase words into tokens (camel + case)?

UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Any other solution?

UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is a superior solution.
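For the record, an analyzer that produces that token set looks roughly like this (a sketch of the idea, not necessarily ChintanShah's exact settings): a word_delimiter filter chained with a shingle filter that uses an empty token separator.

{
  "settings": {
    "analysis": {
      "filter": {
        "camel_split": {
          "type": "word_delimiter"
        },
        "subword_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "token_separator": "",
          "output_unigrams": true
        }
      },
      "analyzer": {
        "code_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_split",
            "subword_shingles",
            "lowercase"
          ]
        }
      }
    }
  }
}

With this chain, NullPointerException comes out as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], and iPhone comes out as [i, phone, iphone], so a search for iphone still matches without needing a protected-words list.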

asked Jan 02 '16 by tishma

1 Answer

You can achieve your requirements with the word_delimiter token filter. This is my setup:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}

This will split words on case changes, so NullPointerException will be tokenized as null, pointer and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which will help you a lot.
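For example, a variant of the filter above with preserve_original turned on (a sketch, reusing the same filter name) keeps the whole token alongside its parts:

"filter": {
  "camel_filter": {
    "type": "word_delimiter",
    "generate_number_parts": false,
    "stem_english_possessive": false,
    "split_on_numerics": false,
    "preserve_original": true,
    "protected_words": [
      "iPhone",
      "WiFi"
    ]
  }
}

With preserve_original, NullPointerException is indexed as nullpointerexception in addition to null, pointer and exception (after the lowercase filter runs).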

GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "iphone",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      }
   ]
}

Now with

GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "null",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "pointer",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "exception",
         "start_offset": 11,
         "end_offset": 20,
         "type": "word",
         "position": 3
      }
   ]
}

Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
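If you do want the two-analyzer route, a multi-field mapping is the usual way to set it up. A sketch (the field name code is just an example, and older Elasticsearch versions use string instead of text and nest properties under the type name):

"mappings": {
  "properties": {
    "code": {
      "type": "text",
      "analyzer": "camel_analyzer",
      "fields": {
        "exact": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

You can then query code for the camel-split tokens and code.exact for the whole (merely lowercased) terms.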

Does this help?

answered Nov 18 '22 by ChintanShah25