I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (@nimal)
But the result is instead the opposite,
> 1 (@nimal)
> false
In the first, I would expect it to fail, as the group will eat the @, leaving nimal to match against @nimal, which obviously it doesn't. Instead, the group matches an empty string, so @nimal is matched, meaning @ is considered to be part of the word.
In the second, I would expect the group to eat the !, leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,
preg_match("/g\b!@\bn/i", "something!@nimal", $match);
Any ideas why the regular expression engine does this?
I'd just love a page that clearly documents how word boundaries are determined; I just can't find one for the life of me.
The following three positions are qualified as word boundaries: Before the first character in a string if the first character is a word character. After the last character in a string if the last character is a word character. Between two characters in a string if one is a word character and the other is not.
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
A word boundary is a zero-width test between two characters. To pass the test, there must be a word character on one side, and a non-word character on the other side. It does not matter which side each character appears on, but there must be one of each.
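To see that rule applied to the strings from the question, here is a small sketch that lists every offset where \b matches. The helper name wordBoundaryOffsets is made up for illustration and is not part of any library.

```php
<?php
// Hypothetical helper (not from the original post): returns the byte offsets
// at which the zero-width assertion \b matches in $subject.
function wordBoundaryOffsets(string $subject): array
{
    // Each match is an empty string; PREG_OFFSET_CAPTURE records where it sits.
    preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

    // $matches[0] is a list of [matchedText, offset] pairs; keep the offsets.
    return array_column($matches[0], 1);
}

// "something" occupies offsets 0-8, so "@" is at offset 9 in the first string.
echo implode(', ', wordBoundaryOffsets('something@nimal')), "\n";  // 0, 9, 10, 15
echo implode(', ', wordBoundaryOffsets('something!@nimal')), "\n"; // 0, 9, 11, 16
```

In the second string there is no boundary at offset 10, between ! and @, because both are non-word characters. That is why \b@nimal cannot match there.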
In PHP, regular expressions are strings composed of delimiters, a pattern and optional modifiers. $exp = "/w3schools/i"; In the example above, / is the delimiter, w3schools is the pattern that is being searched for, and i is a modifier that makes the search case-insensitive.
The word boundary \b matches at a change from a \w (a word character) to a \W (a non-word character), or the other way round. You want to match if there is a \b before your @, which is a \W character. So to match, you need a word character before your @.
something@nimal
        ^^
==> Match because of the word boundary between g and @.
something!@nimal
         ^^
==> NO match because between ! and @ there is no word boundary; both characters are \W.
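If what you actually want is "not preceded by a word character" and "not followed by a word character", one option is to use lookarounds instead of \b, since they only look at one side. This is a minimal sketch, not the asker's library: buildWordPattern and its parameters are names I made up for illustration.

```php
<?php
// Hypothetical helper: compiles $word into a case-insensitive pattern.
// $mustStart / $mustEnd say whether the word must begin / end a word.
function buildWordPattern(string $word, bool $mustStart, bool $mustEnd): string
{
    // Escape regex metacharacters and the "/" delimiter in the literal word.
    $quoted = preg_quote($word, '/');

    // (?<!\w) succeeds only if the previous character is NOT a word character
    // (or there is no previous character at all).
    $prefix = $mustStart ? '(?<!\w)' : '';

    // (?!\w) succeeds only if the next character is NOT a word character.
    $suffix = $mustEnd ? '(?!\w)' : '';

    return '/' . $prefix . $quoted . $suffix . '/i';
}

$pattern = buildWordPattern('@nimal', true, false);
var_dump(preg_match($pattern, 'something@nimal'));  // int(0): preceded by "g"
var_dump(preg_match($pattern, 'something!@nimal')); // int(1): preceded by "!"
```

Unlike \b, the lookbehind does not care whether the first character of the word itself is a word character, so it gives the results the question expected for both test strings.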
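One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe is considered a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should exclude the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear). Note that a class such as [\b^'] will not do this: inside a character class, \b means a backspace character, and a ^ that is not the first character is a literal caret. Lookarounds such as (?<![\w']) and (?![\w']) in place of \b are one way to do it.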
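As a rough sketch of the apostrophe point, the following treats the ASCII apostrophe and the two typographic variants (U+2018 and U+2019) as if they were word characters when checking the edges of a match. The function name containsWholeWord is hypothetical, and the u modifier assumes the text is valid UTF-8.

```php
<?php
// Hypothetical helper: true if $word occurs in $text with no word character
// and no apostrophe variant directly before or after it.
function containsWholeWord(string $word, string $text): bool
{
    $quoted = preg_quote($word, '/');

    // Characters that should count as "inside a word" at the edges:
    // \w, the ASCII apostrophe, and the typographic quotes U+2018 / U+2019.
    $inside = '[\w\'\x{2018}\x{2019}]';

    $pattern = '/(?<!' . $inside . ')' . $quoted . '(?!' . $inside . ')/iu';

    return preg_match($pattern, $text) === 1;
}

var_dump(preg_match('/\bcan\b/i', "I can't go"));  // int(1): \b sees a boundary at the apostrophe
var_dump(containsWholeWord('can', "I can't go"));  // bool(false)
var_dump(containsWholeWord("can't", "I can't go")); // bool(true)
```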
You might also experience problems with UTF-8 characters that are genuinely part of the word (i.e. what we humans mean by a word). For example, test your regex against a word such as Svašek, encoded the same way your content is.
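A quick way to check this is to ask whether a fragment in the middle of such a word is wrongly treated as starting a word. The sketch below assumes UTF-8 input and relies on my understanding that PHP's u modifier makes \w and \b Unicode-aware; startsWord is a made-up name for illustration.

```php
<?php
// Hypothetical helper: true if $needle appears in $text at the start of a word,
// using \b in UTF-8 mode (the "u" modifier).
function startsWord(string $needle, string $text): bool
{
    return preg_match('/\b' . preg_quote($needle, '/') . '/u', $text) === 1;
}

$name = "Sva\u{0161}ek"; // "Svašek", written with an escape so the file encoding does not matter

var_dump(startsWord('Sva', $name)); // bool(true)
var_dump(startsWord('ek', $name));  // bool(false): "š" counts as a word character here
```

Without the u modifier the pattern works on bytes, and the result for the second call can depend on the PHP version and locale, so it is worth running both variants against your real data.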
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as space characters (not just literally spaces, but the full class including newlines and tabs), commas, colons, full-stops, etc (and angle-brackets if you are parsing HTML). YMMV.