I am using Elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that breaks tokens into fixed-size chunks. For example, with a chunk size of 4, the token "abcdefghij" would be split into: "abcd efgh ij".
I am wondering if I can implement this logic without having to code a custom Lucene Tokenizer class?
Thanks in advance.
For your requirement, if you can't do it using the pattern tokenizer, then you'll need to code up a custom Lucene Tokenizer class yourself and package it as a custom Elasticsearch plugin. You can refer to this for examples of how Elasticsearch plugins are created for custom analyzers.
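If you do go that route, the tokenizer class itself is small. Below is a minimal, untested sketch of what it could look like, assuming the Lucene 4.x Tokenizer API bundled with Elasticsearch 1.x; the class name FixedLengthTokenizer, the chunkSize parameter and the whole-input buffering are all illustrative, and you would still need the usual plugin wiring (a TokenizerFactory and the plugin registration) to expose it to Elasticsearch.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical example class; not part of Lucene or Elasticsearch.
public class FixedLengthTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    private final int chunkSize;
    private final StringBuilder text = new StringBuilder();
    private int pos = 0;

    public FixedLengthTokenizer(Reader input, int chunkSize) {
        super(input); // Lucene 4.x-style constructor taking the Reader
        this.chunkSize = chunkSize;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // For simplicity this sketch buffers the whole input up front;
        // a production tokenizer would stream from the Reader instead.
        text.setLength(0);
        pos = 0;
        char[] buf = new char[1024];
        int read;
        while ((read = input.read(buf)) != -1) {
            text.append(buf, 0, read);
        }
    }

    @Override
    public boolean incrementToken() {
        clearAttributes();
        if (pos >= text.length()) {
            return false; // no more chunks to emit
        }
        int end = Math.min(pos + chunkSize, text.length());
        termAtt.append(text, pos, end); // emit the next fixed-length chunk
        offsetAtt.setOffset(correctOffset(pos), correctOffset(end));
        pos = end;
        return true;
    }

    @Override
    public void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(text.length());
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}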
The Pattern Tokenizer supports a "group" parameter.
It defaults to -1, which means the pattern is used for splitting; that is the behaviour you saw.
However, by putting a capturing group in your pattern and setting the "group" parameter to its index (>= 0), the matched group itself is emitted as the token. E.g. the following tokenizer will split the input into 4-character tokens:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}
Analyzing a document via the following:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
Results in the following tokens:
{
  "tokens": [
    {
      "token": "comm",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "a,se",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "para",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "ted,",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "valu",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    }
  ]
}
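Note that the pattern "(.{4})" only matches complete groups of four characters, so a trailing remainder shorter than four is dropped: the final "es" of "values" never shows up above, just as the "ij" from the question's example wouldn't. If you also need that last short chunk, a greedy pattern such as "(.{1,4})" should still prefer full 4-character chunks while also emitting the remainder, along these lines:

"my_tokenizer": {
  "type": "pattern",
  "pattern": "(.{1,4})",
  "group": "1"
}

This variant isn't part of the original answer, so it's worth verifying the resulting tokens with the _analyze API on your own data.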