
Word boundary in Lucene regex

I'd like to make a regex query in Elasticsearch with word boundaries, but it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?

dimid asked Jan 30 '18 09:01



1 Answer

The Elasticsearch (Lucene) regex flavor has no direct equivalent of a word boundary. A leading \b behaves like (^|[^A-Za-z0-9_]) when the word starts with a word char, and a trailing \b behaves like ($|[^A-Za-z0-9_]) when the word ends with a word char.

Thus, we need to make sure that the word is preceded and followed by either a non-word char or the start/end of the string. Since a Lucene regex is anchored by default (the pattern must match the whole string), all it takes to make [^A-Za-z0-9_] optional at the start/end of the string is to put .* beside it and wrap both in an optional group:

(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?
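To see that the pattern above behaves like \bword\b, here is a minimal Python sketch. It is only an emulation: Python's re module is not the Lucene engine, so re.fullmatch stands in for Lucene's implicit anchoring. The helper names are made up for illustration, and the word is assumed to start and end with a word char and to contain no regex metacharacters.

```python
import re

NON_WORD = "[^A-Za-z0-9_]"


def lucene_word_boundary_pattern(word: str) -> str:
    """Wrap a plain word in the optional-group workaround (hypothetical helper)."""
    return f"(.*{NON_WORD})?{word}({NON_WORD}.*)?"


def matches_like_lucene(pattern: str, text: str) -> bool:
    # A Lucene regex must match the whole string; re.fullmatch imitates that.
    return re.fullmatch(pattern, text) is not None


pattern = lucene_word_boundary_pattern("word")

samples = ["word", "a word here", "word.", "-word-", "password word",
           "sword", "words", "my_word", "nothing"]

for text in samples:
    # Reference behaviour: an ordinary \b search with ASCII word chars.
    expected = re.search(r"\bword\b", text, re.ASCII) is not None
    assert matches_like_lucene(pattern, text) == expected, text
```

The loop passes silently when the workaround and the \b search agree on every sample string.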

Details

  • (.*[^A-Za-z0-9_])? - an optional group: either nothing (the word sits at the start of the string) or any 0+ chars (other than line break chars; use (.|\n)* to allow them) followed by one non-word char
  • word - the literal word to find
  • ([^A-Za-z0-9_].*)? - an optional group: a non-word char followed by any 0+ chars, up to the end of the string (the end anchor is implicit in Lucene regex).
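As a rough sketch of how this pattern could be sent to Elasticsearch, the snippet below builds a regexp query body as a plain Python dict. The field name message.keyword and the helper name are assumptions for illustration. The regexp query matches against indexed terms, so the workaround is mostly useful on a keyword-style field that stores the whole string as one term. The word is assumed to contain no regex metacharacters.

```python
import json


def build_regexp_query(field: str, word: str) -> dict:
    """Build a regexp query body using the word-boundary workaround (hypothetical helper)."""
    non_word = "[^A-Za-z0-9_]"
    pattern = f"(.*{non_word})?{word}({non_word}.*)?"
    return {"query": {"regexp": {field: {"value": pattern}}}}


# "message.keyword" is a made-up field name; substitute a field from your own mapping.
body = build_regexp_query("message.keyword", "word")
print(json.dumps(body, indent=2))
```

The resulting JSON can be used as the body of a search request against the index.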
Wiktor Stribiżew answered Sep 19 '22 04:09