
How to perform "lowercase filter" along with "char_filter"?

As far as I read in the ES documentation:

  1. "Character filters are used to “tidy up” a string before it is tokenized."
  2. "After tokenization, the resulting token stream is passed through any specified token filters"

( source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html )

From those two statements, I understand that the following steps are executed:

  1. char_filter;
  2. tokenization;
  3. filter.

Problem:

I may have a char_filter that replaces several letters at once.

Example: ph -> f.

However, "PH" won't be turned into "f", because "PH" is not part of the mapping.

So, the analysis of "philipp" retrieves "filipp", whereas "Philipp" retrieves "philipp".
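The behaviour described above can be reproduced with a toy model of the analysis chain. This is only a sketch in plain Python, not Elasticsearch code: the function names are made up for illustration, and the char filter is approximated with a case-sensitive string replace, which is enough for a single mapping.

```python
# Toy model of the order: char_filter -> tokenizer -> token filter.
# All names here are hypothetical; nothing below calls Elasticsearch.
CHAR_MAPPINGS = {"ph": "f"}  # only the lowercase mapping is defined


def char_filter(text):
    # Like the mapping char filter, this replacement is case-sensitive.
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    return text


def tokenize(text):
    # Stand-in for the whitespace tokenizer.
    return text.split()


def lowercase(tokens):
    # Stand-in for the lowercase token filter, which runs last.
    return [t.lower() for t in tokens]


def analyze(text):
    return lowercase(tokenize(char_filter(text)))


print(analyze("philipp"))  # ['filipp']
print(analyze("Philipp"))  # ['philipp'] because "Ph" never matched "ph"
```

Since lowercasing only happens after the char filter has already run, the capitalised input slips past the mapping.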

To handle both upper and lowercase (and get the same result in both cases), each pattern needs one mapping per case combination, that is 2^(number of characters) mappings.

Example: ph -> f; Ph -> F; pH -> f; PH -> F.

It wouldn't be a problem if I only had 4 mappings, but with more or longer patterns the char_filter tends to become a big mess.
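To see how quickly the list grows, here is a small sketch that generates every upper/lower combination of a mapping key. The helper name `case_variants` and the three-letter example `sch -> sh` are assumptions for illustration only; the capitalisation rule for the replacement simply mirrors the four mappings listed above.

```python
from itertools import product


def case_variants(src, dst):
    """Return one 'key=>value' string per upper/lower combination of src.

    Assumes src contains only letters, so each position has two cases.
    """
    mappings = []
    for combo in product(*((c.lower(), c.upper()) for c in src)):
        key = "".join(combo)
        # Convention taken from the question: capitalise the replacement
        # when the key starts with an uppercase letter.
        value = dst.capitalize() if key[0].isupper() else dst
        mappings.append(f"{key}=>{value}")
    return mappings


print(case_variants("ph", "f"))
print(len(case_variants("sch", "sh")))  # hypothetical three-letter key
```

A two-letter key produces 4 entries and a three-letter key already produces 8, so the mapping list doubles with every extra letter in a key.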

Example of index:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase"
                        ],
                        "char_filter" : [
                            "misc_simplifications"
                        ]
                    }
                },
                "char_filter" : {
                    "misc_simplifications" : {
                        "type" : "mapping",
                        "mappings" : [
                            "ph=>f","Ph=>F","pH=>f","PH=>F"
                        ]
                    }
                }
            }
        }
    }
}

Philosophical question:

I understand that I may want to treat "ph" and "Ph" equally, but "pH" could mean something totally different. But is there a way of turning the characters into lowercase before the char_filter phase? Does it make sense?

Because that big mapping gives me the feeling that I am doing something wrong or that I can find an easier (more elegant) solution.

dan asked Nov 13 '14 17:11


1 Answer

You're correct about the sequence of steps:

  1. CharFilter
  2. Tokenizer
  3. TokenFilter

However, the main purpose of a CharFilter is to clean up the data to make tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.

So - I would put misc_simplifications as a TokenFilter to be applied after the Lowercase filter.

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase",
                            "misc_simplifications"
                        ]
                    }
                },
                "filter" : {
                    "misc_simplifications" : {
                        "type" : "pattern_replace",
                        "pattern" : "ph",
                        "replacement" : "f"
                    }
                }
            }
        }
    }
}

Note that I've used the pattern_replace token filter instead of mappings. You could also modify the regexp to replace "ph" only where it appears at the beginning of the token.
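As a rough illustration of why this ordering works, the following Python sketch mimics the chain whitespace tokenizer, lowercase, pattern replace. It is an approximation only: Elasticsearch uses Java regular expressions, while this uses Python's `re` module, and the `analyze` function and the sample inputs are assumptions made for the example. The second call shows the anchored variant, with `^ph` as one possible pattern.

```python
import re


def analyze(text, pattern="ph", replacement="f"):
    """Toy stand-in for: whitespace tokenizer -> lowercase -> pattern replace."""
    tokens = text.split()                  # whitespace tokenizer
    tokens = [t.lower() for t in tokens]   # lowercase token filter
    # Pattern replacement applied per token, after lowercasing,
    # so a single lowercase pattern covers every case combination.
    return [re.sub(pattern, replacement, t) for t in tokens]


print(analyze("Philipp PHILIPP pHilipp"))
print(analyze("Philipp Stephen"))                 # replaces "ph" anywhere
print(analyze("Philipp Stephen", pattern="^ph"))  # only at the token start
```

With the anchored pattern, "Stephen" is left as "stephen", whereas the unanchored pattern turns it into "stefen".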

Also, your mappings look like phonetic replacements. I'm not sure of your requirements, but the phonetic token filter might help you.

Olly Cruickshank answered Sep 28 '22 01:09