
What does \x in PHP PCRE mean?

Tags: regex, php, pcre

From the manual:

After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...} is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

So what does this mean?

The code point of "ä" is E4 while the UTF-8 representation is C3 A4, but neither of those matches:

$t = 'ä'; // same as "\xC3\xA4";

preg_match('/\\xC3A4/u', $t); // doesn't match
preg_match('/\\x00E4/u', $t); // doesn't match

With the curly braces it does match when I give the code point:

preg_match('/\\x{00E4}/u', $t); // matches
AndreKR asked Aug 29 '13 23:08



1 Answer

The syntax is a way to specify a character by value:

  • \xAB specifies a code point in the range 0-FF (at most two hex digits are read).
  • \x{ABCD} specifies a code point with any number of hex digits, up to 10FFFF in UTF-8 mode.
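As a minimal sketch of the two forms (variable names are illustrative, and the subjects are written as explicit UTF-8 bytes so the result does not depend on how the source file is saved):

```php
<?php
// "ä" (U+00E4) is the two bytes C3 A4 in UTF-8.
$aUmlaut = "\xC3\xA4";
// "€" (U+20AC) is the three bytes E2 82 AC in UTF-8.
$euro = "\xE2\x82\xAC";

// Two-digit form: in /u mode \xE4 means code point U+00E4, not the byte E4.
$short = preg_match('/^\xE4$/u', $aUmlaut);

// Braced form with the same code point; leading zeros are allowed.
$braced = preg_match('/^\x{00E4}$/u', $aUmlaut);

// Braced form with a code point that does not fit in two hex digits.
$wide = preg_match('/^\x{20AC}$/u', $euro);

var_dump($short, $braced, $wide); // int(1) int(1) int(1)
```

Single-quoted PHP strings pass `\x` through to PCRE untouched, so the escape is interpreted by the regex engine rather than by PHP.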

The quoted wording from the manual is a bit confusing, perhaps in an attempt to be precise. Code points 128-255 (and everything else up to 7FF) are encoded as two bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII but will not match input in other encodings or code pages (e.g. CP437) that use byte values above 127. The manual is, in a roundabout way, saying that a Unicode regular expression is only suitable for correctly encoded UTF-8 input.
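The point about correctly encoded input can be checked with a short sketch. ISO-8859-1 is used here as the mismatched encoding purely as an assumption for illustration; the variable names are hypothetical.

```php
<?php
$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte, not valid UTF-8 on its own)

// With /u the pattern is a code point and the subject must be valid UTF-8.
$okUtf8 = preg_match('/\xE4/u', $utf8);      // 1
$badSubject = preg_match('/\xE4/u', $latin1); // false (an error, not "no match")
$error = preg_last_error();                   // PREG_BAD_UTF8_ERROR

// Without /u both pattern and subject are treated as raw bytes.
$byteLatin1 = preg_match('/\xE4/', $latin1); // 1: the byte E4 is present
$byteUtf8   = preg_match('/\xE4/', $utf8);   // 0: the bytes are C3 A4, there is no E4

var_dump($okUtf8, $badSubject, $error === PREG_BAD_UTF8_ERROR, $byteLatin1, $byteUtf8);
```

Note that `preg_match` returns `false` rather than `0` for the badly encoded subject, so a strict comparison is needed to tell an encoding error apart from a failed match.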

The manual's wording does not mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) followed by the literal text CD (two characters) [1]. The braces resolve this parsing ambiguity:

After \x, up to two hexadecimal digits are read .. In UTF-8 mode, \x{...} is allowed ..

Some other languages use \u instead of \x for the longer form.


[1] Consider that this matches:

preg_match('/\xC3A4/u', "\xC3\x83" . "A4"); // U+00C3 "Ã" (UTF-8 bytes C3 83) followed by the literal text "A4"
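A slightly fuller sketch of the same parsing rule, with anchors added so that each pattern has to account for the whole subject. The variable names are illustrative, and the byte sequences are the UTF-8 encodings of the code points named in the comments.

```php
<?php
$threeChars = "\xC3\x83" . "A4"; // U+00C3 "Ã", then the letters "A" and "4"
$aUmlaut    = "\xC3\xA4";        // U+00E4 "ä", a single character
$hangul     = "\xEC\x8E\xA4";    // U+C3A4, a single character

// \xC3A4 reads only two hex digits: U+00C3, then literal "A4".
$unbraced = preg_match('/^\xC3A4$/u', $threeChars);     // 1

// The braced spelling of the same three-character pattern.
$explicit = preg_match('/^\x{C3}A4$/u', $threeChars);   // 1

// It is not "the bytes C3 A4", so it does not match "ä".
$notBytes = preg_match('/^\xC3A4$/u', $aUmlaut);        // 0

// With braces, all four digits form one code point, U+C3A4.
$oneCodePoint = preg_match('/^\x{C3A4}$/u', $hangul);   // 1

var_dump($unbraced, $explicit, $notBytes, $oneCodePoint);
```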

user2246674 answered Sep 20 '22 17:09