How to perform "lowercase filter" along with "char_filter"?

Question

As far as I read in the ES documentation:

"Character filters are used to “tidy up” a string before it is tokenized."
"After tokenization, the resulting token stream is passed through any specified token filters"

( source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html )

From those two statements, I understand that the following steps are executed:

char_filter;
tokenization;
filter.

Problem:

I may have a char_filter that replaces several letters at once.

Example: ph -> f.

However, "PH" won't be turned into "f", because "PH" is not part of the mapping.

So, the analysis of "philipp" yields "filipp", whereas "Philipp" yields "philipp".
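The ordering problem can be reproduced outside Elasticsearch with a rough stand-in for the three stages. This is only a sketch: the function names are hypothetical, and the real "mapping" char_filter does longest-match replacement rather than a plain string replace, which makes no difference for a single rule.

```python
# Rough stand-in for the analysis chain described above.
# All function names are hypothetical; this is not Elasticsearch code.

def mapping_char_filter(text, mappings):
    """Apply case-sensitive literal replacements to the raw string."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def whitespace_tokenizer(text):
    """Split the string on whitespace, like the "whitespace" tokenizer."""
    return text.split()


def lowercase_filter(tokens):
    """Lowercase every token, like the "lowercase" token filter."""
    return [token.lower() for token in tokens]


def analyze(text, mappings):
    # Order matters: char_filter, then tokenizer, then token filters.
    filtered = mapping_char_filter(text, mappings)
    tokens = whitespace_tokenizer(filtered)
    return lowercase_filter(tokens)


mappings = {"ph": "f"}
print(analyze("philipp", mappings))  # ['filipp']
print(analyze("Philipp", mappings))  # ['philipp']
```

Because the replacement runs before lowercasing, "Ph" never matches the "ph" rule.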

To handle both upper and lowercase (so that both cases give the same result), each rule in the char_filter has to be written 2^(number of characters) times: 4 times for a two-letter source such as "ph".

Example: ph -> f; Ph -> F; pH -> f; PH -> F.

It wouldn't be a problem if I only had 4 mappings, but with many more rules, each expanded into all of its case variants, the char_filter tends to become a big mess.
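To get a feel for how quickly the list grows, here is a small sketch that generates every case variant of a source string. The helper names are hypothetical, and it assumes the source contains only letters and that the replacement follows the case of the first source letter, as in the example above. A source of n letters produces 2^n rules.

```python
from itertools import product


def case_variants(source):
    """Return every upper/lower-case spelling of `source` (letters only)."""
    choices = [(ch.lower(), ch.upper()) for ch in source]
    return ["".join(combo) for combo in product(*choices)]


def build_mappings(source, target):
    """Build "source=>target" rules for all case variants of `source`.

    Assumption: the replacement is capitalised when the first source
    letter is, mirroring ph -> f; Ph -> F; pH -> f; PH -> F.
    """
    rules = []
    for variant in case_variants(source):
        replacement = target.upper() if variant[0].isupper() else target.lower()
        rules.append(f"{variant}=>{replacement}")
    return rules


print(build_mappings("ph", "f"))   # ['ph=>f', 'pH=>f', 'Ph=>F', 'PH=>F']
print(len(case_variants("sch")))   # 8
```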

Example of index:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase"
                        ],
                        "char_filter" : [
                            "misc_simplifications"
                        ]
                    }
                },
                "char_filter" : {
                    "misc_simplifications" : {
                        "type" : "mapping",
                        "mappings" : [
                            "ph=>f","Ph=>F","pH=>f","PH=>F"
                        ]
                    }
                }
            }
        }
    }
}

Philosophical question:

I understand that I may want to treat "ph" and "Ph" equally, but "pH" could mean something totally different. But is there a way of turning the characters into lowercase before the char_filter phase? Does it make sense?

Because that big mapping gives me the feeling that I am doing something wrong or that I can find an easier (more elegant) solution.

Olly Cruickshank · Accepted Answer

You're correct about the sequence of steps:

CharFilter
Tokenizer
TokenFilter

However, the main purpose of the CharFilter is to clean up the data to make tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.

So - I would put misc_simplifications as a TokenFilter to be applied after the Lowercase filter.

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "default_index" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : [
                            "lowercase",
                            "misc_simplifications"
                        ]
                    }
                },
                "filter" : {
                    "misc_simplifications" : {
                        "type" : "pattern_replace",
                        "pattern" : "ph",
                        "replacement" : "f"
                    }
                }
            }
        }
    }
}

Note that I've used a pattern_replace token filter instead of a mapping. You could also modify the regexp to replace "ph" only where it occurs at the beginning of the token (for example "^ph").
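The effect of lowercasing first and then replacing per token can be sketched in plain Python. This only imitates the analyzer above with hypothetical helper names; Elasticsearch uses Java regular expressions, but "ph" and "^ph" behave the same way in Python's re module.

```python
import re


def pattern_replace(tokens, pattern, replacement):
    """Replace every match of `pattern` inside each token."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, token) for token in tokens]


def analyze_with_token_filter(text, pattern, replacement="f"):
    # whitespace tokenizer -> lowercase filter -> pattern replacement
    tokens = text.split()
    tokens = [token.lower() for token in tokens]
    return pattern_replace(tokens, pattern, replacement)


# Replace "ph" anywhere in the token.
print(analyze_with_token_filter("Philipp Stephen", "ph"))   # ['filipp', 'stefen']

# Replace "ph" only at the beginning of the token.
print(analyze_with_token_filter("Philipp Stephen", "^ph"))  # ['filipp', 'stephen']
```

Since the tokens are already lowercase when the replacement runs, a single rule covers "ph", "Ph", "pH" and "PH".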

Also - your mappings look like phonetic replacements. I'm not sure of your requirements, but it looks like possibly the phonetic token filter would help you.
