
What is a safe PCRE regex delimiter to use on HTML5 pattern input element attribute?

Tags: html, regex, php, pcre

It seems that the HTML5 spec (and therefore ECMA262) allows <input type="text" pattern="[0-9]/[0-9]" /> to match the string '0/0' even though the forward slash is not escaped. Web applications like Drupal would like to provide server-side validation for browsers that don't support HTML5, with something like:

<?php
preg_match('/^(' . $pattern . ')$/', $value);
?>

Unfortunately, wrapping the string '[0-9]/[0-9]' in / delimiters does not produce a valid PCRE regex, because the unescaped slash ends the pattern early. It appears that most, if not all, HTML5-capable browsers support both pattern="[0-9]/[0-9]" and pattern="[0-9]\/[0-9]", which raises the question: what can we use as a delimiter to run this pattern against a Perl-style regex?
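To make the failure concrete, here is a minimal sketch of what happens with the / delimiter. The variable names are only for illustration, and the @ operator is used solely to silence the warning in this demo.

```php
<?php
$pattern = '[0-9]/[0-9]';

// PHP stops reading the pattern at the first unescaped "/" and treats
// the remainder, "[0-9])$/", as modifiers. preg_match() then raises an
// "Unknown modifier" warning and returns false.
$result = @preg_match('/^(' . $pattern . ')$/', '0/0');
var_dump($result); // bool(false)

// The escaped form of the same pattern compiles and matches.
$escaped = '[0-9]\/[0-9]';
$escapedResult = preg_match('/^(' . $escaped . ')$/', '0/0');
var_dump($escapedResult); // int(1)
```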

We've filed a bug report against the W3C spec, but are the browsers wrong here? Does the HTML5 spec need to be clarified? Is there a workaround we can use in PHP?

Dave Reid asked Mar 02 '12 01:03


3 Answers

It is a valid regex if you use # instead of / for the delimiter. Example:

preg_match('#^('.$pattern.')$#', $value);
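A short sketch of this in use, with one caveat worth keeping in mind: the same collision returns if the pattern itself contains a #. The sample patterns and variable names below are assumptions for illustration.

```php
<?php
$pattern = '[0-9]/[0-9]';

// With # as the delimiter, the unescaped slash is just a literal character.
$hashResult = preg_match('#^(' . $pattern . ')$#', '0/0');
var_dump($hashResult); // int(1)

// A pattern that contains "#" ends the regex early, just as "/" did before.
// The @ only silences the resulting "Unknown modifier" warning.
$withHash = 'a#b';
$collision = @preg_match('#^(' . $withHash . ')$#', 'a#b');
var_dump($collision); // bool(false)
```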
Alec Gorge answered Sep 18 '22 12:09


I recommend using the "\xFF" byte as the pattern delimiter, because it is not allowed in a UTF-8 string, so we can be sure it will not occur in the pattern. And because preg_match reads the delimiters as plain bytes, without interpreting them as UTF-8, it will cause no trouble.

Example: preg_match("\xFF$pattern\$\xFFADmsu", $subject);

Note the ADmsu modifiers and the appended $. The u modifier requires valid UTF-8 only in the pattern itself, not in the delimiters around it.
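A minimal sketch of this approach, under a few assumptions that are not in the answer above: the helper name is hypothetical, the pattern is wrapped in ^(?:...)$ so that alternations stay fully anchored, and only the D and u modifiers are kept, since the PHP manual notes that D is ignored when m is set.

```php
<?php
// Hypothetical helper; the name and signature are assumptions for illustration.
function html5_pattern_matches(string $pattern, string $value): bool
{
    // The 0xFF byte never occurs in valid UTF-8, so it cannot collide
    // with anything inside a UTF-8 pattern.
    $delimiter = "\xFF";

    // D: "$" matches only at the very end of the subject.
    // u: pattern and subject are treated as UTF-8.
    $regex = $delimiter . '^(?:' . $pattern . ')$' . $delimiter . 'Du';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $value) === 1;
}

var_dump(html5_pattern_matches('[0-9]/[0-9]', '0/0'));
var_dump(html5_pattern_matches('[0-9]/[0-9]', '0/0x'));
```

Invalid UTF-8 in the pattern or the value makes preg_match() return false under the u modifier, which this sketch reports as a non-match.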

Josef Kufner answered Sep 22 '22 12:09


One of the problems with PCRE is that almost any delimiter is legal for the start and end markers, depending on what makes the rest of the escaping easier. So #foo# is legal, /foo/ is legal, !foo! is legal (I think), etc. Undelimited regexes, I'd say, are extremely dangerous for exactly that reason. It sounds like an HTML5 spec bug that no delimiter is specified.

Maybe in PHP, scan the string and pick a delimiter from a whitelist that is not present in the string? (E.g., if there's no /, use that; if there is, use #; if that's there too, use %; etc.)
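The whitelist idea could be sketched roughly as follows. Both function names and the candidate list are assumptions for illustration; the sketch gives up and returns null if every candidate already appears in the pattern.

```php
<?php
// Hypothetical helper: returns the first candidate delimiter that does not
// appear anywhere in the pattern, or null if all of them do.
function pick_delimiter(string $pattern, array $candidates = ['/', '#', '%', '~', '!', '@']): ?string
{
    foreach ($candidates as $delimiter) {
        if (strpos($pattern, $delimiter) === false) {
            return $delimiter;
        }
    }
    return null;
}

// Hypothetical wrapper: true/false for match/no match, null if no safe
// delimiter was found.
function validate_with_picked_delimiter(string $pattern, string $value): ?bool
{
    $delimiter = pick_delimiter($pattern);
    if ($delimiter === null) {
        return null;
    }
    // D makes "$" match only at the very end of the value.
    $regex = $delimiter . '^(?:' . $pattern . ')$' . $delimiter . 'D';
    return preg_match($regex, $value) === 1;
}

var_dump(pick_delimiter('[0-9]/[0-9]'));
var_dump(validate_with_picked_delimiter('[0-9]/[0-9]', '0/0'));
```

Bracket-style delimiters such as ( and { are left out of the candidate list on purpose, because PHP pairs them with their closing counterpart, which complicates the presence check.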

Crell answered Sep 20 '22 12:09