I've got data coming in from Logstash that's being analyzed in an overeager manner. Essentially, the field "OS X 10.8" would be broken into "OS", "X", and "10.8". I know I could just change the mapping and re-index for existing data, but how would I change the default analyzer (either in Elasticsearch or Logstash) to avoid this problem in future data?
Concrete Solution: I created a mapping for the type before I sent data to the new cluster for the first time.
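As a sketch of what such a mapping can look like (the index name logstash-example and field name os are placeholders; on Elasticsearch 5.x and later you would use the keyword type, while on the 1.x/2.x versions current at the time of this question the equivalent was a string field with "index": "not_analyzed"):

```json
PUT logstash-example
{
  "mappings": {
    "properties": {
      "os": {
        "type": "keyword"
      }
    }
  }
}
```

With this mapping, "OS X 10.8" is stored as a single un-analyzed term instead of being tokenized.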
Solution from IRC: Create an Index Template
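An index template applies the mapping automatically to every new index matching a pattern, which is what you want for time-based Logstash indices. A minimal sketch (the template name, the logstash-* pattern, and the os field are assumptions; this uses the composable _index_template API from Elasticsearch 7.8+, older versions used _template instead):

```json
PUT _index_template/logstash_os_keyword
{
  "index_patterns": ["logstash-*"],
  "template": {
    "mappings": {
      "properties": {
        "os": { "type": "keyword" }
      }
    }
  }
}
```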
By default, Elasticsearch uses the standard analyzer for all text analysis. The standard analyzer gives you out-of-the-box support for most natural languages and use cases. If you choose to use the standard analyzer as-is, no further configuration is needed.
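You can see exactly why "OS X 10.8" gets split by running the text through the _analyze API with the standard analyzer:

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "OS X 10.8"
}
```

This returns the three lowercased tokens os, x, and 10.8, which is the behaviour described in the question.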
To add an analyzer, you must close the index, define the analyzer, and reopen the index. You cannot close the write index of a data stream. To update the analyzer for a data stream's write index and future backing indices, update the analyzer in the index template used by the stream.
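The close/update/reopen sequence described above looks like this (my-index is a placeholder, and the whitespace tokenizer is just an illustrative choice of analyzer):

```json
POST my-index/_close

PUT my-index/_settings
{
  "analysis": {
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "whitespace"
      }
    }
  }
}

POST my-index/_open
```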
Configuration: max_token_length — the maximum token length. If a token is seen that exceeds this length, it is split at max_token_length intervals. Defaults to 255.
According to this page, analyzers can be specified per-query, per-field, or per-index.
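For instance, per-query, you can override the analyzer directly inside a full-text query (the index name, field name, and choice of whitespace analyzer here are illustrative):

```json
GET my-index/_search
{
  "query": {
    "match": {
      "os": {
        "query": "OS X 10.8",
        "analyzer": "whitespace"
      }
    }
  }
}
```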
At index time, Elasticsearch will look for an analyzer in this order:

1. The analyzer defined in the field mapping.
2. The default analyzer in the index settings.
3. The standard analyzer.

At query time, there are a few more layers:

1. The analyzer defined in a full-text query.
2. The search_analyzer defined in the field mapping.
3. The analyzer defined in the field mapping.
4. The default_search analyzer in the index settings.
5. The default analyzer in the index settings.
6. The standard analyzer.

On the other hand, this page points to an important thing:
An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.
So the only way to define a custom analyzer as the default is to override one of the pre-defined analyzers, in this case the default analyzer. This means we cannot use an arbitrary name for our analyzer; it must be named default.
Here is a simple example of index settings:
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "char_filter": {
        "charMappings": {
          "type": "mapping",
          "mappings": [
            "\\u200C => "
          ]
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        }
      },
      "analyzer": {
        "default": {          <--------- analyzer name must be "default"
          "tokenizer": "standard",
          "char_filter": [
            "charMappings"
          ],
          "filter": [
            "lowercase",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
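Once an index has been created with these settings, you can verify that your custom analyzer is actually being picked up as the default by calling _analyze on the index without naming an analyzer (my-index is a placeholder; when no analyzer is specified, Elasticsearch uses the index's default analyzer):

```json
POST my-index/_analyze
{
  "text": "OS X 10.8"
}
```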