PHP regex and adjacent capturing groups

I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.

I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:

HelloWorldThisIsATest => hello-world-this-is-a-test

My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:

mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));

The result:

hello-world-this-is-atest

This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.

What am I doing wrong?

rink.attendant.6 asked Jun 23 '14 06:06


1 Answer

The Reason your Regex will Not Work: Overlapping Matches

  • Your regex matches sA in IsATest, allowing you to insert a - between the s and the A.
  • In order to insert a - between the A and the T, the regex would have to match AT.
  • This is impossible because the A has already been consumed as part of the sA match. Each match starts where the previous one ended, so matches in a single preg_replace pass cannot overlap.
  • Is all hope lost? No! This is a perfect situation for lookarounds.
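To make the overlap problem concrete, here is a minimal sketch that lists what the pattern from the question actually matches. The helper name listMatches is hypothetical and only there for illustration; preg_match_all is the standard PHP function.

```php
<?php
// Hypothetical helper (not from the original post): returns every full match
// of $pattern in $subject, in the order the engine finds them.
function listMatches(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[0]; // index 0 holds the full-pattern matches
}

// The pattern from the question: one letter followed by one upper-case letter.
$matches = listMatches('/([A-Za-z])([A-Z])/', 'HelloWorldThisIsATest');

// Each match consumes two characters, so the "A" in "sA" is no longer
// available to start an "AT" match.
echo implode(', ', $matches), PHP_EOL; // prints "oW, dT, sI, sA"
```

There is no AT in the list, which is why no hyphen ends up between the A and the T.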

Do it in Two Easy Lines

Here's the easy way to do it with regex:

$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));

See the output at the bottom of the php demo:

Output: hello-world-this-is-a-test
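If you need this in more than one place, the two lines can be wrapped in a small function. This is only a sketch: the name camelToHyphenated is made up for this example, and it assumes the input contains ASCII letters only, as stated in the question.

```php
<?php
// Hypothetical wrapper name; assumes ASCII letters only, as in the question.
function camelToHyphenated(string $input): string
{
    // Zero-width match: after any letter and before an upper-case letter.
    $withHyphens = preg_replace('~(?<=[a-zA-Z])(?=[A-Z])~', '-', $input);

    // Lower-case the whole string after the hyphens are in place.
    return strtolower($withHyphens);
}

echo camelToHyphenated('HelloWorldThisIsATest'), PHP_EOL; // prints "hello-world-this-is-a-test"
echo camelToHyphenated('Hello'), PHP_EOL;                 // prints "hello"
```

Note that a run of capitals is split letter by letter, so an input such as ABTest becomes a-b-test.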

Explanation:

  • The regex doesn't match any characters. Rather, it targets positions in the string: every position that has a letter before it and an upper-case letter after it. To do so, it uses a lookbehind and a lookahead.
  • The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter.
  • The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
  • We just replace these positions with a -, and convert the lot to lowercase.
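The idea of matching positions instead of characters can be checked with a short sketch that prints where the lookaround pattern matches. The helper name matchOffsets is hypothetical; PREG_OFFSET_CAPTURE is the standard preg_match_all flag that reports each match together with its byte offset.

```php
<?php
// Hypothetical helper: returns the byte offsets at which $pattern matches.
function matchOffsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

    // With PREG_OFFSET_CAPTURE each match is a pair [matchedText, offset];
    // array_column picks the offset (index 1) out of every pair.
    return array_column($matches[0], 1);
}

$offsets = matchOffsets('~(?<=[a-zA-Z])(?=[A-Z])~', 'HelloWorldThisIsATest');

// Every match is an empty string, so only the offsets are interesting.
echo implode(', ', $offsets), PHP_EOL; // prints "5, 10, 14, 16, 17"
```

Offsets 16 and 17 are the positions directly before and after the A. Because nothing is consumed, both can match, which is exactly what the two-character pattern could not do.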

If you look carefully on this regex101 screen, you can see lines between the words, where the regex matches.

Reference

  • Lookahead and Lookbehind Zero-Length Assertions
  • Mastering Lookahead and Lookbehind
zx81 answered Oct 07 '22 07:10