How to change what PCRE regexp thinks are newlines in multi-line mode?

With PCRE regular expressions in PHP, multi-line mode (/m) enables ^ and $ to match the start and end of lines (separated by newlines) in the source text, as well as the start and end of the source text.

This appears to work great on Linux with \n (LF) being the newline separator, but fails on Windows with \r\n (CRLF).

Is there any way to change what PCRE thinks are newlines? Or to perhaps allow it to match either CRLF or LF in the same way that $ matches the end of line/string?

EXAMPLE:

$EOL = "\n";    // Linux LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';    // <<< RESULT
} else {
    echo 'Did not find match!';
}

RESULT: Success

$EOL = "\r\n";    // Windows CR+LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';
} else {
    echo 'Did not find match!';    // <<< RESULT
}

RESULT: Fail

MrWhite asked Jul 25 '11 09:07


2 Answers

Did you try (*CRLF) and the related newline modifiers? They are detailed on Wikipedia here (under Newline/linebreak options) and seem to do the right thing in my testing. For example, '/(*CRLF)^two$/m' should match the Windows \r\n newlines. (*ANYCRLF) should match both Linux and Windows newlines, but I haven't tested this.
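As a rough check of the verbs mentioned above, the following sketch runs the question's pattern against both LF and CRLF input. It assumes a PHP build whose PCRE library defaults to LF as the newline (the usual case); the variable names are only illustrative. Note that the verb has to appear at the very start of the pattern.

```php
<?php
// Same sample text as in the question, once per newline style.
$lf   = "one\ntwo\nthree\nfour";
$crlf = "one\r\ntwo\r\nthree\r\nfour";

// The newline verb must be the first thing inside the delimiters.
$patterns = [
    'default' => '/^two$/m',            // build default (assumed LF)
    'CRLF'    => '/(*CRLF)^two$/m',     // only \r\n counts as a newline
    'ANYCRLF' => '/(*ANYCRLF)^two$/m',  // \r, \n or \r\n count as newlines
];

$results = [];
foreach ($patterns as $name => $pattern) {
    $results[$name] = [
        'lf'   => preg_match($pattern, $lf),
        'crlf' => preg_match($pattern, $crlf),
    ];
}

foreach ($results as $name => $r) {
    echo "{$name}: LF={$r['lf']} CRLF={$r['crlf']}\n";
}
```

With (*ANYCRLF) the same pattern should work for text coming from either platform, which is probably the safer choice when the origin of the input is unknown.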

Ben Holland answered Sep 29 '22 12:09


Note: This answer applies only to older PHP versions. When I wrote it, I was not aware of the sequences and modifiers that are available: \R, (*BSR_ANYCRLF) and (*BSR_UNICODE). See also the answer to: How to replace different newline styles in PHP the smartest way?
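A minimal sketch of how \R can be used, assuming a PHP version whose PCRE supports it. The sample string and variable names are made up for illustration. \R treats \r\n as a single line break, so mixed input splits cleanly; (*BSR_ANYCRLF) narrows \R to just \r, \n and \r\n.

```php
<?php
// Hypothetical input mixing Windows, Unix and old Mac line endings.
$mixed = "one\r\ntwo\nthree\rfour";

// Split on any line break; \r\n is consumed as one separator.
$lines = preg_split('/\R/', $mixed);

// Normalize every line break to \n.
$normalized = preg_replace('/\R/', "\n", $mixed);

// Same normalization, but \R is limited to CR, LF and CRLF.
$normalizedAnyCrlf = preg_replace('/(*BSR_ANYCRLF)\R/', "\n", $mixed);

print_r($lines);
```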

In PHP it's not possible to specify the newline character sequence(s) for PCRE regex patterns. The m modifier looks for \n only, as documented. There is no runtime setting to change this; that would be possible in Perl, but it's not an option in PHP.

I normally just modify the string prior to using it with preg_match and the like:

$subject = str_replace("\r\n", "\n", $subject);

This might not be exactly what you're looking for, but it will probably help.
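The normalization approach can be sketched as a small helper followed by the original pattern. The function name normalize_newlines is hypothetical, and handling a lone \r as well is an addition beyond the one-line str_replace shown above.

```php
<?php
// Hypothetical helper: convert CRLF (and, as an extra, lone CR) to LF.
function normalize_newlines(string $text): string
{
    // CRLF is listed first so each \r\n becomes a single \n
    // before any remaining lone \r is converted.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$subject = normalize_newlines("one\r\ntwo\r\nthree\r\nfour");

// The unmodified pattern from the question now sees LF-only text.
$found = preg_match('/^two$/m', $subject) === 1;

echo $found ? 'Found match.' : 'Did not find match!';
```

This keeps the patterns themselves simple, at the cost of copying the subject string once before matching.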

Edit: Regarding the Windows EOL example you've added to your question:

$EOL = "\r\n";    // Windows CR+LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';
} else {
    echo 'Did not find match!';    // <<< RESULT
}

This fails because, in the text, there is a \r after two. So two is not at the end of a line; there is an additional character, \r, before the end of the line ($).

The PHP manual explains that only \n is treated as the character that ends a line. $ considers \n only, so if the text has two\r at the end of a line, you need to change your pattern to account for the \r. That's the other option (instead of converting the text as suggested above).
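One way to change the pattern, as a sketch rather than the only possible form, is to allow an optional \r just before $. The capture group is an illustrative addition so the matched text can be read without the trailing \r.

```php
<?php
// \r? lets the line end with or without a carriage return before \n.
$pattern = '/^(two)\r?$/m';

$matchesCrlf = preg_match($pattern, "one\r\ntwo\r\nthree\r\nfour", $mCrlf);
$matchesLf   = preg_match($pattern, "one\ntwo\nthree\nfour", $mLf);

// Group 1 holds the text without any trailing \r.
echo "CRLF: {$matchesCrlf}, LF: {$matchesLf}, captured: {$mCrlf[1]}\n";
```

Keep in mind that the full match ($mCrlf[0]) still includes the \r on CRLF input, so use the capture group or trim the result when the matched text is needed.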

hakre answered Sep 29 '22 11:09