Why do special characters like = or " break a PHP regex when using the \b word boundary?

Tags:

php

This is a follow-up after reading How to specify "Space or end of string" and "space or start of string"?

That question shows ways to match a whole word in a phrase, and I can even add a few other solutions. But as soon as a = or " is added to the search term, they stop working. Why?

I am going to search for stackoverflow and replace it with OK using preg_replace():

preg_replace('/\bstackoverflow\b/', 'OK', $input_line)

input:
1: stackoverflow xxx
2: xxx stackoverflow xxx
3: xxx stackoverflow
result:
1: OK xxx
2: xxx OK xxx
3: xxx OK

Now, if I change it to match stackoverflow="", it stops working.

preg_replace('/\bstackoverflow=""\b/', 'OK', $input_line)

input:
1: stackoverflow="" xxx
2: xxx stackoverflow="" xxx
3: xxx stackoverflow=""
result:
1: stackoverflow="" xxx
2: xxx stackoverflow="" xxx
3: xxx stackoverflow=""

The same happens if I use /\bstackoverflow=\b/ or /\bstackoverflow"\b/ as the regex. I already checked the manual to see whether = or " are special characters; they are not. I even tried escaping them anyway: /\bstackoverflow\=\"\"\b/

Why is that?

In that example, removing \b also solves it, but then it also matches nostackoverflow=""not, which I do not want.

I also tried alternatives to \b such as [ ^] and ( |^). Interestingly, [ ^] (space or beginning of line) does not work for the beginning of the line, only for a space. But ( |^) works fine for both.

asked Nov 23 '15 23:11

gcb

2 Answers

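The problem is your use of \b, which is a "word boundary." It is a zero-width assertion that matches at a position with a "word" character \w ([A-Za-z0-9_]) on one side and a non-word character \W, or the start or end of the string, on the other. These are roughly the positions described by (^\w|\w$|\W\w|\w\W), except that \b itself consumes no characters. A " is not a "word" character, and neither is the space or end of string that follows it in your input, so the boundary condition after the closing " is not met.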
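As a small illustration of that point, the sketch below runs the pattern from the question against two inputs. The sample strings are made up for the example; only the pattern comes from the question.

```php
<?php
// \b needs a word character on exactly one side of the position.
$pattern = '/\bstackoverflow=""\b/';

// After the closing quote comes a space: quote and space are both non-word
// characters, so the trailing \b cannot match.
var_dump(preg_match($pattern, 'xxx stackoverflow="" xxx')); // int(0)

// If a word character follows the quote directly, the trailing \b is satisfied,
// which is the opposite of what the question wants.
var_dump(preg_match($pattern, 'xxx stackoverflow=""xxx')); // int(1)
```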

Try using \s instead, which matches any whitespace character, together with the ^ and $ anchors for the start and end of the string:

(?:^|\s)stackoverflow=""(?:\s|$)

Most characters inside a character class are taken literally. The main exceptions are ^, which negates the class when it comes first, - as a range operator, and \ for escapes. This is why [ ^] wouldn't work for you: it was searching for a literal ^.
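To make the difference between the class and the group concrete, here is a minimal sketch. The variable names and the test string are chosen for illustration only.

```php
<?php
$line = 'stackoverflow="" xxx';   // keyword at the very start of the string

// [ ^] is a character class: one literal space or one literal caret.
$withClass = preg_match('/[ ^]stackoverflow=""/', $line);

// ( |^) is an alternation: a space, or the start-of-string anchor.
$withGroup = preg_match('/( |^)stackoverflow=""/', $line);

// The class version only succeeds when a real "^" character precedes the keyword.
$literalCaret = preg_match('/[ ^]stackoverflow=""/', '^' . $line);

var_dump($withClass, $withGroup, $literalCaret); // int(0) int(1) int(1)
```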

$ php -a
Interactive shell

php > $input_line='
php ' stackoverflow="" xxx
php ' xxx stackoverflow="" xxx
php ' xxx stackoverflow=""
php ' ';
php > echo preg_replace('/(?:^|\s)stackoverflow=""(?:\s|$)/', 'OK', $input_line);
OKxxx
xxxOKxxx
xxxOK
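Note that in the output above the whitespace around the keyword is replaced too, because \s consumes it. If the spaces should survive, one possible variation (an assumption about what is wanted, not part of the original answer) is to capture the leading whitespace and only look ahead at the trailing one:

```php
<?php
$input_line = "stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"";

// (^|\s) captures the leading whitespace (or nothing at the start) so it can be
// put back with $1; (?=\s|$) only checks what follows without consuming it.
$result = preg_replace('/(^|\s)stackoverflow=""(?=\s|$)/', '$1OK', $input_line);

echo $result, "\n";
// OK xxx
// xxx OK xxx
// xxx OK
```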

https://regex101.com/r/nP2aB8/1

answered Oct 05 '22 23:10

miken32

Background

From the regular-expressions.info Word boundaries page:

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
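Applied to the search term from the question, these three rules can be checked with a short sketch that lists the offset of every \b match. The approach (preg_match_all with PREG_OFFSET_CAPTURE) is just one way to inspect this.

```php
<?php
$subject = 'stackoverflow=""';

// Collect the offset of every zero-length \b match in the subject.
preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

// Offsets 0 and 13: before "s", and between "w" and "=".
// There is no boundary after the final quote, which is why a trailing \b fails.
print_r($offsets);
```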

A very good explanation from nhahtdh's post:

A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:

Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).

OR

Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
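The equivalence can be checked empirically on a sample string. The helper name match_offsets below is hypothetical and exists only for this sketch.

```php
<?php
$subject = 'xxx stackoverflow="" xxx';

// Hypothetical helper: offsets of all zero-length matches of $pattern.
function match_offsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
}

$viaBoundary   = match_offsets('/\b/', $subject);
$viaLookaround = match_offsets('/(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/', $subject);

// Both patterns match at exactly the same positions.
var_dump($viaBoundary === $viaLookaround); // bool(true)
```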

What's wrong with your regex

The reason \b is not suitable is that whether it matches depends on the characters on both sides of it: it requires a word character on one side and a non-word character (or the edge of the string) on the other. When you build a regex dynamically, you do not know in advance which one to use, \B or \b. For your case, you could use '/\bstackoverflow=""\B/', but in general that means choosing \b or \B based on the first and last characters of the keyword. However, there is an easier way: use negative lookarounds.
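For comparison, the \b or \B selection described above could be sketched as follows. The function adaptive_boundary_pattern is a hypothetical helper written for this example; it assumes a non-empty keyword.

```php
<?php
// Hypothetical helper (not from the answer): choose \b or \B for each end of
// the keyword depending on whether that end is a word character.
function adaptive_boundary_pattern(string $keyword): string
{
    $lead  = preg_match('/^\w/', $keyword) ? '\b' : '\B';
    $trail = preg_match('/\w\z/', $keyword) ? '\b' : '\B';
    return '/' . $lead . preg_quote($keyword, '/') . $trail . '/';
}

$input_line = "stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"";

// For this keyword the helper picks \b in front and \B behind.
$pattern = adaptive_boundary_pattern('stackoverflow=""');
$result  = preg_replace($pattern, 'OK', $input_line);
echo $result, "\n";

// A word character glued to either side still prevents a match.
var_dump(preg_match($pattern, 'nostackoverflow=""not')); // int(0)
```

This works, but the extra branching is what makes the lookaround version simpler for dynamically built patterns.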

Solution

(?<!\w)stackoverflow=""(?!\w)

See regex demo

The regex contains negative lookarounds instead of word boundaries. The (?<!\w) lookbehind fails the match if there is a word character before stackoverflow="", and (?!\w) lookahead fails the match if stackoverflow="" is followed by a word character.

What the word shorthand character class \w matches depends on whether you enable the Unicode modifier /u. Without it, \w matches just [a-zA-Z0-9_] (in the default locale). You can add further restrictions using the lookarounds.
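A minimal sketch of the /u difference, assuming the default "C" locale and PHP 7 or later (for the "\u{e9}" escape). The test string is made up for the example.

```php
<?php
// "\u{e9}" is the letter é (two bytes in UTF-8), placed directly before the keyword.
$subject = "\u{e9}stackoverflow=\"\"";

// Without /u the bytes of é are not \w (assuming the default "C" locale),
// so the lookbehind passes and the keyword matches.
$bytes = preg_match('/(?<!\w)stackoverflow=""(?!\w)/', $subject);

// With /u, é counts as a letter, so the lookbehind rejects the match.
$unicode = preg_match('/(?<!\w)stackoverflow=""(?!\w)/u', $subject);

var_dump($bytes, $unicode); // int(1) int(0) under the assumptions above
```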

Demo

PHP demo:

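// Negative lookarounds on each side of the keyword instead of \b
$re = '/(?<!\w)stackoverflow=""(?!\w)/';
// Four lines; on the first one the keyword directly follows a comma
$str = ",stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"\nstackoverflow=\"\" xxx";
echo preg_replace($re, "NEW=\"\"", $str);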

NOTE: If you pass your string as a variable, remember to escape all special characters in it with preg_quote:

$re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';

Here, notice the second argument to preg_quote, which is /, the regex delimiter char.
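Put together, this could be wrapped in a small function. The name replace_whole_keyword and the sample keyword are hypothetical, chosen to show why the quoting matters.

```php
<?php
// Hypothetical helper wrapping the lookaround pattern and preg_quote.
function replace_whole_keyword(string $keyword, string $replacement, string $subject): string
{
    // preg_quote escapes regex metacharacters, plus the "/" delimiter passed as
    // the second argument.
    $re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';

    // preg_replace returns null on error; fall back to the unchanged subject.
    return preg_replace($re, $replacement, $subject) ?? $subject;
}

// The keyword contains "." and "/": unquoted, "." would also match the "X" in
// aXb/c, and "/" would end the pattern early.
echo replace_whole_keyword('a.b/c', 'OK', 'x a.b/c y aXb/c'), "\n"; // x OK y aXb/c
```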

answered Oct 06 '22 01:10

Wiktor Stribiżew
