I have a token filter and analyzer defined as follows, but I can't get the original token to be preserved. For example, if I _analyze the word saint-louis, I get back only saintlouis, whereas I expected both saintlouis and saint-louis, since I have preserve_original set to true. I am using ES 6.3.2 with Lucene 7.3.1.
"analysis": {
  "filter": {
    "hyphenFilter": {
      "pattern": "-",
      "type": "pattern_replace",
      "preserve_original": "true",
      "replacement": ""
    }
  },
  "analyzer": {
    "whitespace_lowercase": {
      "filter": [
        "lowercase",
        "asciifolding",
        "hyphenFilter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
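For reference, this is the kind of request I am testing with (my_index is just a placeholder for my actual index name; a custom analyzer has to be tested against the index that defines it):

POST /my_index/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "saint-louis"
}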
So it looks like preserve_original is not supported by the pattern_replace token filter, at least not in the version I am using.
I made a workaround using the word_delimiter token filter instead:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "type": "custom",
          "filter": [
            "lowercase",
            "hyphen_filter"
          ]
        }
      },
      "filter": {
        "hyphen_filter": {
          "type": "word_delimiter",
          "preserve_original": "true",
          "catenate_words": "true"
        }
      }
    }
  }
}
This would, for example, tokenize a word like anti-spam into antispam (hyphen removed via catenate_words), anti-spam (original preserved via preserve_original), and the parts anti and spam.
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "anti-spam"
}

{
  "tokens": [
    {
      "token": "anti-spam",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "anti",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "antispam",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "spam",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }
  ]
}