In Elasticsearch, is there a way to set up an analyzer that would produce position gaps between tokens when line breaks or punctuation marks are encountered?
Let's say I index an object with the following nonsensical string (with line break) as one of its fields:
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
The standard analyzer will yield the following tokens with respective positions:
0 the
1 quick
2 brown
3 fox
4 runs
5 after
6 the
7 rabbit
8 then
9 comes
10 the
11 jumpy
12 frog
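For reference, this token list can be obtained with the _analyze API (here against no particular index):

POST /_analyze?analyzer=standard&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.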
This means that a match_phrase query for "the rabbit then comes" will match this document as a hit.
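For illustration, the query I have in mind looks like this (the field name body is just a placeholder):

POST /index/_search
{
  "query": {
    "match_phrase": {
      "body": "the rabbit then comes"
    }
  }
}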
Is there a way to introduce a position gap between "rabbit" and "then" so that the phrase doesn't match unless a slop is introduced?
Of course, a workaround could be to transform the single string into an array (one line per entry) and use position_offset_gap in the field mapping (sketched below), but I would really rather keep a single string with newlines (and an ultimate solution would involve larger position gaps for newlines than, say, for punctuation marks).
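For the record, that workaround would look roughly like this; the type and field names (doc, body) are only placeholders, and in more recent Elasticsearch versions the parameter is called position_increment_gap:

PUT /index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "position_offset_gap": 100
        }
      }
    }
  }
}

The field would then be indexed as an array, one line per entry:

{
  "body": [
    "The quick brown fox runs after the rabbit.",
    "Then comes the jumpy frog."
  ]
}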
I eventually figured out a solution using a char_filter to introduce extra tokens on line breaks and punctuation marks:
PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
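The analyzer still has to be assigned to the relevant field in the mapping; a minimal sketch, assuming a string field named body in a type named doc:

PUT /index/_mapping/doc
{
  "properties": {
    "body": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}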
Testing with the example string
POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
yields the following result:
{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    ... snip ...
    "token" : "rabbit",
    "start_offset" : 35,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "_period_",
    "start_offset" : 41,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "_newline_",
    "start_offset" : 42,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "then",
    "start_offset" : 43,
    "end_offset" : 47,
    "type" : "<ALPHANUM>",
    "position" : 11
    ... snip ...
  } ]
}
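The _period_ and _newline_ tokens push "then" two positions away from "rabbit", so a match_phrase query for "the rabbit then comes" no longer matches unless a slop of at least 2 is allowed. Assuming the field is called body and uses my_analyzer, something like:

POST /index/_search
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "the rabbit then comes",
        "slop": 2
      }
    }
  }
}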