
Error when removing CSS comments via regex

Tags: regex, php, php-7.3

It turns out that both of these patterns, which used to work,

"`([\n\A;]+)\/\*(.+?)\*\/`ism" => "$1",     // error
"`([\n\A;\s]+)//(.+?)[\n\r]`ism" =>"$1\n",  // error

now fail in PHP 7.3 with this warning:

Warning: preg_replace(): Compilation failed: escape sequence is invalid in character class offset 4

CONTEXT: consider this snippet, which removes CSS comments from a string:

$buffer = ".selector {color:#fff; } /* some comment to remove*/";
$regex = array(
"`^([\t\s]+)`ism"=>'',
"`^\/\*(.+?)\*\/`ism"=>"",
"`([\n\A;]+)\/\*(.+?)\*\/`ism"=>"$1",     // 7.3 error
"`([\n\A;\s]+)//(.+?)[\n\r]`ism"=>"$1\n", // 7.3 error
"`(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+`ism"=>"\n"
);
$buffer = preg_replace(array_keys($regex),$regex,$buffer);
//returns cleaned up $buffer value with pure css and no comments

Refer to: https://stackoverflow.com/a/1581063/1293658

Q1 - Any ideas what's wrong with the regex in this case? This thread seems to suggest it's simply a misplaced backslash: https://github.com/thujohn/twitter/issues/250

Q2 - Is this a PHP 7.3 bug or a problem with the regex patterns in this code?

Christian Žagarskas asked Sep 07 '19 01:09


1 Answer

Do not use zero-width assertions inside character classes.

  • Anchors and (non-)word boundaries such as ^, $, \A, \b, \B, \Z, \z and \G make no sense inside a character class, since they do not match any character. ^ and \b take on a different meaning there: ^ negates the class when it comes right after the opening [ and is a literal ^ anywhere else, while \b matches a backspace character.

  • You can't use \R (any line break sequence) there either.
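To see the two points above in action, here is a minimal sketch that asks PCRE whether a pattern compiles at all. The helper name pattern_compiles() is made up for this illustration, and the results in the comments assume PHP 7.3 or later (PCRE2); older versions may accept all three patterns.

```php
<?php
// Hypothetical helper (not a PHP built-in): true when PCRE accepts the pattern.
function pattern_compiles(string $pattern): bool
{
    // preg_match() returns false when compilation fails;
    // "@" hides the E_WARNING that it raises in that case.
    return @preg_match($pattern, '') !== false;
}

var_dump(pattern_compiles('/[\n\A;]+/'));      // \A inside a class: bool(false) on PHP 7.3+
var_dump(pattern_compiles('/[\R]/'));          // \R inside a class: bool(false) on PHP 7.3+
var_dump(pattern_compiles('/(?:\A|[\n;]+)/')); // \A as an alternative: bool(true)
```

Running a check like this over your existing patterns is a quick way to spot the ones that need the same fix.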

The two patterns with \A inside a character class must be re-written as a grouping construct, (...), with an alternation operator |:

"`(\A|[\n;]+)/\*.+?\*/`s"=>"$1", 
"`(\A|[;\s]+)//.+\R`"=>"$1\n", 

I removed the redundant modifiers and capturing groups you are not using, and replaced [\r\n] with \R. The "`(\A|[\n;]+)/\*.+?\*/`s"=>"$1" can also be re-written in a more efficient way:

"`(\A|[\n;]+)/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`"=>"$1"

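As a rough check, the sketch below applies the corrected patterns to a made-up CSS string and then repeats the run with the unrolled block-comment pattern. The sample input and the variable names are assumptions for illustration only. The patterns are written in single quotes here so that PHP passes the backslashes to PCRE untouched.

```php
<?php
// Made-up sample: a block comment on its own lines and a trailing // note.
$css = "a {color:#fff;}\n/* block\n comment */\nb {margin:0;} // trailing note\n";

// The two corrected patterns, with \A moved out of the character class.
$rules = array(
    '`(\A|[\n;]+)/\*.+?\*/`s' => '$1',
    '`(\A|[;\s]+)//.+\R`'     => "$1\n",
);
$clean = preg_replace(array_keys($rules), array_values($rules), $css);

// Same run, with the unrolled block-comment pattern instead of the lazy one.
$unrolled = array(
    '`(\A|[\n;]+)/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`' => '$1',
    '`(\A|[;\s]+)//.+\R`'                          => "$1\n",
);
$cleanUnrolled = preg_replace(array_keys($unrolled), array_values($unrolled), $css);

var_dump($clean);
var_dump($clean === $cleanUnrolled);
```

On this input both runs should give the same string: the comment text is gone, while the whitespace captured in group 1 is put back.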
Note that, according to the "Upgrade history of the bundled PCRE library" table, the regex library in PHP 7.3 is PCRE2 10.32. See "PCRE to PCRE2 migration":

Until PHP 7.2, PHP used the 8.x versions of the legacy PCRE library, and from PHP 7.3, PHP will use PCRE2. Note that PCRE2 is considered to be a new library although it's based on and largely compatible with PCRE (8.x).

According to this resource, the updated library is stricter about regex patterns and now treats user errors that it used to tolerate as real errors:

  • Modifier S is now on by default. PCRE does some extra optimization.
  • Option X is disabled by default. It makes PCRE do more syntax validation than before.
  • Unicode 10 is used, while it was Unicode 7. This means more emojis, more characters, and more sets. Unicode regex may be impacted.
  • Some invalid patterns may be impacted.

Simply put, PCRE2 validates patterns more strictly, so some of your existing patterns may no longer compile after the upgrade.

Wiktor Stribiżew answered Oct 04 '22 20:10