I'm trying to make Elasticsearch ignore hyphens. I don't want it to split the text on either side of a hyphen into separate words. It seems simple, but I'm banging my head against the wall.
I want the string "Roland JD-Xi" to produce the following terms: [ roland jd-xi, roland, jd-xi, jdxi, roland jdxi ]
I haven't been able to achieve this easily. Most people will just type 'jdxi', so my initial thought was to simply remove the hyphen. I'm using the following field definition:
"name": {
  "type": "string",
  "analyzer": "language",
  "include_in_all": true,
  "boost": 5,
  "fields": {
    "my_standard": {
      "type": "string",
      "analyzer": "my_standard"
    },
    "my_prefix": {
      "type": "string",
      "analyzer": "my_text_prefix",
      "search_analyzer": "my_standard"
    },
    "my_suffix": {
      "type": "string",
      "analyzer": "my_text_suffix",
      "search_analyzer": "my_standard"
    }
  }
}
And the relevant analysers and filters are defined as follows:
{
  "number_of_replicas": 0,
  "number_of_shards": 1,
  "analysis": {
    "analyzer": {
      "std": {
        "tokenizer": "standard",
        "char_filter": "html_strip",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "length", "strip_hyphens"]
      },
      ...
      "my_text_prefix": {
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_front"]
      },
      "my_text_suffix": {
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_back"]
      },
      "my_standard": {
        "type": "custom",
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase"]
      }
    },
    "char_filter": {
      "my_filter": {
        "type": "mapping",
        "mappings": ["- => ", ". => "]
      }
    },
    "filter": {
      "edge_ngram_front": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 20,
        "side": "front"
      },
      "edge_ngram_back": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 20,
        "side": "back"
      },
      "strip_spaces": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "strip_dots": {
        "type": "pattern_replace",
        "pattern": "\\.",
        "replacement": ""
      },
      "strip_hyphens": {
        "type": "pattern_replace",
        "pattern": "-",
        "replacement": ""
      },
      "stop": {
        "type": "stop",
        "stopwords": "_none_"
      },
      "length": {
        "type": "length",
        "min": 1
      }
    }
  }
}
I've been able to test this (i.e. with _analyze), and the string "Roland JD-Xi" is tokenised as [ roland, jdxi ].
It's not exactly what I want, but it's close enough, since it should match 'jdxi'.
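For reference, a call along these lines shows the tokens produced by the my_standard sub-field (my_index is just a placeholder for my actual index name, and the query-parameter _analyze syntax here is the one used by older ES versions):

curl -XGET 'localhost:9200/my_index/_analyze?field=name.my_standard&text=Roland+JD-Xi&pretty'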
But that's my problem. If I do a simple "index/_search?q=jdxi" it doesn't bring back the document, yet "index/_search?q=roland+jdxi" does bring it back.
So at least I know the hyphens are being removed, but if the tokens "roland" and "jdxi" are being created, how come "index/_search?q=jdxi" doesn't match the document?
As an aside, the key difference between analyzers and normalizers is that normalizers can only emit a single token, while analyzers can emit many. Since they only emit one token, normalizers do not use a tokenizer. They do use character filters and token filters, but are limited to those that work on one character at a time.
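For illustration only (this is not part of your setup, and normalizers only apply to keyword fields in later ES versions), a custom normalizer is declared much like an analyzer, just without a tokenizer:

"analysis": {
  "normalizer": {
    "my_normalizer": {
      "type": "custom",
      "char_filter": [],
      "filter": ["lowercase", "asciifolding"]
    }
  }
}

It would then be referenced from a keyword field with "normalizer": "my_normalizer".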
In a nutshell, an analyzer tells Elasticsearch how text should be indexed and searched. What you're using to test is the Analyze API, which is a very handy tool for understanding how analyzers behave: you provide the text directly to the API, and it doesn't have to come from any indexed document.
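For example, you can call it without referencing any index at all (shown here with the query-parameter syntax of older ES versions):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&text=Roland+JD-Xi&pretty'

With the standard analyzer this returns [ roland, jd, xi ], which is exactly the hyphen splitting you're trying to avoid.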
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace.
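You can check that in isolation too, for instance (again assuming the older query-parameter syntax):

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&text=Roland+JD-Xi&pretty'

Because only the whitespace tokenizer runs here, with no filters, this should return [ Roland, JD-Xi ]: the hyphen is left alone, and it's your mapping char filter (my_filter) that actually strips it before tokenization.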
I've reproduced your case on ES 6, and searching with index/_search?q=jdxi returns the document.
The issue could be that when you search with index/_search?q=jdxi without specifying a field, the query basically runs against _all, which contains whatever was in the name field (so it's essentially the same as index/_search?q=name:jdxi). Since that field was not analyzed with your my_standard analyzer, you don't get any results.
What you should do instead is search against the my_standard sub-field, i.e. index/_search?q=name.my_standard:jdxi, and I'm pretty sure you'll get the document.
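If you prefer the query DSL over URI search, the equivalent would be something along these lines (my_index is a placeholder for your actual index name):

curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "name.my_standard": "jdxi"
    }
  }
}'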