How exactly do Regular Expression word boundaries work in PHP?

Tags:

regex

php

I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.

A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.

I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.

Take the following,

preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);

In the statements above I would expect the following results,

> false
> 1 (@nimal)

But the result is instead the opposite,

> 1 (@nimal)
> false

In the first, I would expect it to fail as the group will eat the @, leaving nimal to match against @nimal, which obviously it doesn't. Instead, the group matches an empty string, so @nimal is matched, meaning @ is considered to be part of the word.

In the second, I would expect the group to eat the ! leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,

preg_match("/g\b!@\bn/i", "something!@nimal", $match);

Any ideas why the regular expression behaves this way?

I'd love a page that clearly documents how word boundaries are determined; I can't find one for the life of me.

asked Jun 30 '11 08:06

Stephen Melrose

2 Answers

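The word boundary \b matches at a position where a \w (a word character) is on one side and a \W (a non-word character), or the start or end of the string, is on the other. It works in both directions and consumes no characters. In your pattern the \b comes directly before the @, which is itself a \W character, so the pattern can only match when there is a word character immediately before the @.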

something@nimal
        ^^

==> Match because of the word boundary between g and @.

something!@nimal
         ^^

==> NO match because there is no word boundary between ! and @; both characters are \W.
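The rule above can be checked directly. The following is a minimal sketch: the first pattern is the one from the question, while the lookbehind variant (?<!\w) is an assumed alternative, not something proposed in the original answer. It asks only that the previous character is not a word character, regardless of what the search term starts with. Note that preg_match returns the integer 0, not false, when nothing matches.

```php
<?php
// Pattern from the question: \b directly in front of a non-word character (@).
$pattern = '/(^|\b)@nimal/i';

// g is \w and @ is \W, so a boundary sits between them: the pattern matches.
$first = preg_match($pattern, 'something@nimal');

// ! and @ are both \W, so there is no boundary between them: no match.
$second = preg_match($pattern, 'something!@nimal');

// Assumed alternative: a negative lookbehind that only requires the previous
// character (if any) not to be a word character.
$startsWord = '/(?<!\w)@nimal/i';
$third  = preg_match($startsWord, 'something@nimal');   // preceded by g
$fourth = preg_match($startsWord, 'something!@nimal');  // preceded by !

var_dump($first, $second, $third, $fourth); // int(1) int(0) int(0) int(1)
```

With the lookbehind, the two results come out the way the question originally expected.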

answered Nov 07 '22 19:11

stema

One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe creates a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and the variants such as ’ and ‘ that sometimes appear) as part of the word. Note that \b cannot be used inside a character class for this, because there it means a backspace character; use lookarounds instead, e.g. (?<![\w']) before the word and (?![\w']) after it.
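To illustrate the apostrophe issue, here is a minimal sketch using lookarounds in place of \b. The helper name wholeWordPattern is hypothetical and only covers the straight apostrophe; handling ’ and ‘ as well would additionally need the u modifier and those characters added to both classes.

```php
<?php
// Hypothetical helper (not part of any library): builds a pattern in which
// the apostrophe counts as part of a word on both sides of the term.
function wholeWordPattern(string $term): string
{
    return "/(?<![\\w'])" . preg_quote($term, '/') . "(?![\\w'])/i";
}

// Plain \b sees a boundary between n and ', so "can" is found inside "can't".
$plain = preg_match('/\bcan\b/i', "I can't go");

// The lookahead rejects the apostrophe that follows, so "can't" is skipped.
$custom = preg_match(wholeWordPattern('can'), "I can't go");

// A standalone "can" followed by a full stop is still found.
$alone = preg_match(wholeWordPattern('can'), 'Yes, I can.');

var_dump($plain, $custom, $alone); // int(1) int(0) int(1)
```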

You might also experience problems with UTF-8 characters that are genuinely part of the word (i.e. what we humans mean by a word). For example, test your regex against a word such as Svašek in whatever encoding you use, because without Unicode support the bytes of š are not treated as word characters.
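A small sketch of the UTF-8 issue, assuming the source file and the subject string are UTF-8 encoded. The u modifier makes PCRE treat the pattern and subject as UTF-8, and in current PHP builds it also makes \w and \b Unicode-aware. The byte-mode result depends on the locale in use, so no output is asserted for it here.

```php
<?php
$text = 'Svašek'; // assumed to be UTF-8 encoded

// Without u the engine works on bytes. In the default "C" locale the two
// bytes of š are not word characters, so a boundary typically appears in
// the middle of the name, right before "ek".
$byteMode = preg_match('/\bek\b/', $text);

// With u, š counts as a letter, so there is no boundary before "ek".
$utf8Mode = preg_match('/\bek\b/u', $text);

// The whole name is still matched as one word.
$whole = preg_match('/\bSvašek\b/u', 'Dr. Svašek wrote this.');

var_dump($utf8Mode, $whole); // int(0) int(1)
```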

It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as whitespace characters (not just literal spaces, but the full class including newlines and tabs), commas, colons, full stops, etc. (and angle brackets if you are parsing HTML). YMMV.
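One way to express such "linguistic" boundaries is with an explicit delimiter class inside lookarounds. This is only a sketch: the delimiter list and the helper name delimitedPattern are assumptions chosen for illustration, and the list would need tuning for real text.

```php
<?php
// Assumed delimiter set: whitespace, common punctuation and angle brackets.
$delims = '\s,.:;!?<>';

// Hypothetical helper: the term must be preceded and followed by a delimiter
// or by the start/end of the string. The double negative ("not preceded by a
// non-delimiter") is what lets the string edges count as boundaries.
function delimitedPattern(string $term, string $delims): string
{
    $quoted = preg_quote($term, '/');
    return "/(?<![^{$delims}]){$quoted}(?![^{$delims}])/iu";
}

$afterBang   = preg_match(delimitedPattern('@nimal', $delims), 'something!@nimal');
$afterLetter = preg_match(delimitedPattern('@nimal', $delims), 'something@nimal');
$apostrophe  = preg_match(delimitedPattern('can', $delims), "I can't go");
$accented    = preg_match(delimitedPattern('Svašek', $delims), 'Ask Svašek, please.');

var_dump($afterBang, $afterLetter, $apostrophe, $accented); // int(1) int(0) int(0) int(1)
```

Because the apostrophe and accented letters are simply absent from the delimiter list, both of the earlier problems are avoided without special cases.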

answered Nov 07 '22 19:11

Coder
