
Line breaks or punctuation marks as position gaps in Elasticsearch

In Elasticsearch, is there a way to set up an analyzer that produces position gaps between tokens whenever a line break or punctuation mark is encountered?

Let's say I index an object with the following nonsensical string (with line break) as one of its fields:

The quick brown fox runs after the rabbit.
Then comes the jumpy frog.

The standard analyzer will yield the following tokens with respective positions:

0 the
1 quick
2 brown
3 fox
4 runs
5 after
6 the
7 rabbit
8 then
9 comes
10 the
11 jumpy
12 frog

This means that a match_phrase query for "the rabbit then comes" will match this document, even though the phrase spans a sentence boundary and a line break. Is there a way to introduce a position gap between "rabbit" and "then" so that the phrase doesn't match unless a slop is specified?
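
To make the problem concrete, here is a minimal Python sketch of position-based phrase matching. It is only a rough stand-in for the standard analyzer (a regex that lowercases the text and keeps runs of word characters), and the helper names analyze and phrase_matches are made up for this illustration; they are not Elasticsearch APIs.

```python
import re

def analyze(text):
    """Rough approximation of the standard analyzer: lowercase word tokens.
    The list index of each token plays the role of its position."""
    return re.findall(r"\w+", text.lower())

def phrase_matches(doc_tokens, phrase_tokens):
    """True if the phrase tokens occur at consecutive positions (slop 0)."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
doc_tokens = analyze(doc)

for position, token in enumerate(doc_tokens):
    print(position, token)

# "rabbit" (7) and "then" (8) are adjacent, so the phrase matches
# across the sentence boundary.
matched = phrase_matches(doc_tokens, analyze("the rabbit then comes"))
print(matched)  # True
```

Because the period and the newline produce no tokens and no position increment, nothing separates the two sentences as far as the phrase query is concerned.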

One workaround would be to split the string into an array (one line per entry) and set position_offset_gap in the field mapping, but I would much rather keep a single string with newlines. Ideally, the solution would also allow larger position gaps for newlines than for, say, punctuation marks.
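
For reference, the array workaround mentioned above could look roughly like the following Python sketch, which only builds the request bodies and does not talk to a cluster. The field name body and the gap value 100 are arbitrary choices for illustration. The "string" type and the position_offset_gap parameter are the Elasticsearch 1.x spellings; in 2.0 the parameter was renamed to position_increment_gap.

```python
import json

text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."

# Hypothetical field name "body": one array entry per non-empty line.
document = {"body": [line for line in text.splitlines() if line.strip()]}

# Mapping fragment (ES 1.x syntax). The gap is added between the last
# token of one array entry and the first token of the next entry.
mapping = {
    "properties": {
        "body": {
            "type": "string",
            "position_offset_gap": 100  # arbitrary illustrative value
        }
    }
}

print(json.dumps(document, indent=2))
print(json.dumps(mapping, indent=2))
```

This approach gives a single gap size for every line break and does nothing for punctuation within a line, which is why it falls short of what is being asked.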

Shadocko asked Sep 16 '15 12:09


1 Answer

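I eventually figured out a solution using a char_filter that inserts extra marker tokens at line breaks and periods (other punctuation marks can be mapped the same way). Each marker occupies a position of its own, which pushes the words on either side apart: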

PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Testing with the example string:

POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.

yields the following result:

{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
... snip ...
    "token" : "rabbit",
    "start_offset" : 35,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "_period_",
    "start_offset" : 41,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "_newline_",
    "start_offset" : 42,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "then",
    "start_offset" : 43,
    "end_offset" : 47,
    "type" : "<ALPHANUM>",
    "position" : 11
... snip ...
  } ]
}
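
The effect on phrase queries can be checked with a small Python simulation. This is a sketch under simplifying assumptions, not the real analysis chain: a single-pass regex substitution stands in for the mapping char_filter, and a word-character regex plus lowercasing stands in for the standard tokenizer and the lowercase filter. The function names are hypothetical.

```python
import re

# Same replacements as the "my_mapping" char_filter above.
MAPPINGS = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}

def char_filter(text):
    """Apply all mappings in one pass, so inserted newlines are not remapped."""
    return re.sub(r"[.\n]", lambda m: MAPPINGS[m.group()], text)

def analyze(text):
    """Approximate my_analyzer: char_filter, word tokens, lowercase."""
    return re.findall(r"\w+", char_filter(text).lower())

def phrase_matches(doc_tokens, phrase_tokens):
    """True if the phrase tokens occur at consecutive positions (slop 0)."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(doc)
print(tokens[7:11])  # ['rabbit', '_period_', '_newline_', 'then']

# The marker tokens now sit between "rabbit" and "then".
across_boundary = phrase_matches(tokens, analyze("the rabbit then comes"))
within_sentence = phrase_matches(tokens, analyze("then comes the jumpy frog"))
print(across_boundary)  # False
print(within_sentence)  # True
```

Since a period followed by a newline yields two marker tokens while a period alone yields one, line breaks end up separating words by more positions than punctuation does, which lines up with the wish for larger gaps at newlines.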
Shadocko answered Nov 08 '22 04:11
