
elasticsearch custom tokenizer - split token by length

I am using Elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that breaks tokens into chunks of a fixed length, keeping any shorter remainder. For example, with a chunk length of 4, the token "abcdefghij" would be split into: "abcd efgh ij".

Is there a way to implement this logic without having to write a custom Lucene Tokenizer class?

Thanks in advance.

asked Feb 08 '15 by ybensimhon


2 Answers

If the pattern tokenizer can't cover your requirement, you'll need to code up a custom Lucene Tokenizer class yourself and wrap it in a custom Elasticsearch plugin. Existing analysis plugins are good examples of how custom analyzers are packaged for Elasticsearch.
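
For reference, a minimal sketch of what such a tokenizer might look like (this assumes a recent Lucene API; the Lucene 4.x versions bundled with Elasticsearch 1.x take a Reader in the Tokenizer constructor, and the class name and chunk-size parameter here are made up for illustration):

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical sketch: emits consecutive fixed-length chunks of the input,
// e.g. "abcdefghij" -> "abcd", "efgh", "ij" for chunkSize = 4.
public final class FixedLengthTokenizer extends Tokenizer {

  private final int chunkSize;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int offset = 0;

  public FixedLengthTokenizer(int chunkSize) {
    this.chunkSize = chunkSize;
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    char[] buffer = new char[chunkSize];
    int read = 0;
    // Fill the buffer with up to chunkSize characters from the input.
    while (read < chunkSize) {
      int n = input.read(buffer, read, chunkSize - read);
      if (n == -1) break;
      read += n;
    }
    if (read == 0) return false; // end of stream, no more tokens
    termAtt.copyBuffer(buffer, 0, read);
    offsetAtt.setOffset(correctOffset(offset), correctOffset(offset + read));
    offset += read;
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();
    // Report the final offset once the stream is exhausted.
    offsetAtt.setOffset(correctOffset(offset), correctOffset(offset));
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    offset = 0;
  }
}

Wiring this into Elasticsearch additionally requires a TokenizerFactory and plugin registration, which is the part the plugin examples cover.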

answered Oct 04 '22 by bittusarkar


The Pattern Tokenizer supports a "group" parameter.

It defaults to -1, which means the pattern is used for splitting the input; that is the behavior you saw.

However, by defining a capture group in your pattern and pointing the "group" parameter at it, the pattern is used for matching instead, and this can be done. E.g. the following tokenizer will split the input into 4-character tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{4})",
          "group": "1"
        }
      }
    }
  }
}

Analyzing a document via the following:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

Results in the following tokens:

{
  "tokens": [
    {
      "token": "comm",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "a,se",
      "start_offset": 4,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "para",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "ted,",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "valu",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    }
  ]
}
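
One caveat: "(.{4})" only matches complete 4-character chunks, so any trailing remainder shorter than 4 characters is silently dropped. In the output above, the final "es" of "values" never appears (the input is 22 characters, but the last token ends at offset 20). If the short tail should be kept, as in the question's "abcd efgh ij" example, a bounded quantifier such as "(.{1,4})" should do it. A sketch of the same settings with only the pattern changed:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{1,4})",
          "group": "1"
        }
      }
    }
  }
}

Because Java's quantifiers are greedy, each match still grabs 4 characters whenever 4 remain, and only the final fragment comes out shorter.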
answered Oct 04 '22 by centic