Maximum Hex value in regex

Without the u flag, the hex range that can be used is [\x{00}-\x{ff}], but with the u flag it goes up to the 4-byte value \x{7fffffff} ([\x{00000000}-\x{7fffffff}]).

So if I execute the code below:

preg_match("/[\x{00000000}-\x{80000000}]+/u", $str, $match);

I get this error:

Warning: preg_match(): Compilation failed: character value in \x{...} sequence is too large

So I can't match a letter like 𡃁, whose equivalent hex value is f0 a1 83 81. The question is not how to match these letters, but where this range and this boundary come from, since the u modifier should treat strings as UTF-16.

PCRE supports UTF-16 since v8.30

echo PCRE_VERSION;

PCRE version with PHP 5.3.24 - 5.3.28, 5.4.14 - 5.5.7:

8.32 2012-11-30

PCRE version with PHP 5.3.19 - 5.3.23, 5.4.9 - 5.4.13:

8.31 2012-07-06

http://3v4l.org/CrPZ8

revo asked Jan 06 '14 16:01


2 Answers

Unicode and UTF-8, UTF-16, UTF-32 encoding

Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.

In Unicode, a character maps to a single code point, but it can have different representation depending on how it is encoded.

I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16 and 000210C1 in UTF-32.

To be precise, the example above shows how to map a code point to code units (character encoding form). How the code units are mapped to an octet sequence is another matter. See Unicode encoding model.
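The numbers above can be checked by hand. Here is a minimal sketch in plain PHP (no mbstring or iconv needed). The helper names utf8_to_code_point and utf16_code_units are made up for this illustration, and the decoder assumes well-formed input, so it performs no validation.

```php
<?php
// Decode a single UTF-8 sequence (1 to 4 bytes) into its code point.
// Hypothetical helper: assumes well-formed input, performs no validation.
function utf8_to_code_point(string $bytes): int
{
    $lead = ord($bytes[0]);
    if ($lead < 0x80) {
        return $lead; // ASCII: the byte is the code point
    }
    // The lead byte tells how many bytes the sequence has.
    $length = $lead >= 0xF0 ? 4 : ($lead >= 0xE0 ? 3 : 2);
    // Keep only the payload bits of the lead byte.
    $cp = $lead & (0xFF >> ($length + 1));
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $length; $i++) {
        $cp = ($cp << 6) | (ord($bytes[$i]) & 0x3F);
    }
    return $cp;
}

// Split a code point into UTF-16 code units (a surrogate pair above U+FFFF).
// Hypothetical helper for illustration only.
function utf16_code_units(int $cp): array
{
    if ($cp < 0x10000) {
        return [$cp];
    }
    $v = $cp - 0x10000;
    return [0xD800 | ($v >> 10), 0xDC00 | ($v & 0x3FF)];
}

$cp = utf8_to_code_point("\xF0\xA1\x83\x81");
[$high, $low] = utf16_code_units($cp);

printf("code point: U+%X\n", $cp);               // U+210C1
printf("UTF-16:     %04X %04X\n", $high, $low);  // D844 DCC1
printf("UTF-32:     %08X\n", $cp);               // 000210C1
```

The same code point comes out as four 8-bit code units, two 16-bit code units or one 32-bit code unit, which is exactly the distinction the 8-bit, 16-bit and 32-bit PCRE libraries are built around.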

PCRE 8-bit, 16-bit and 32-bit library

At the time of writing, PHP hasn't adopted PCRE2 yet (version 10.10), so the quoted text is from the documentation of the original PCRE.

Support for 16-bit and 32-bit library

PCRE supports 16-bit strings from version 8.30 and 32-bit strings from version 8.32, in addition to the default 8-bit library.

As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. [...]

Meaning of 8-bit, 16-bit, 32-bit

8-bit, 16-bit and 32-bit here refer to the width of the data unit (code unit).

References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.

This means that the 8-bit/16-bit/32-bit library expects the pattern and the input string to be sequences of 8-bit/16-bit/32-bit data units, or valid UTF-8/UTF-16/UTF-32 strings.

Different APIs for different width of data unit

PCRE provides three parallel sets of API functions for the 8-bit, 16-bit and 32-bit libraries, differentiated by the prefix (pcre_, pcre16_ and pcre32_ respectively).

The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.

In PCRE2, a similar function naming convention is used, where the 8-bit/16-bit/32-bit functions have the suffix _8, _16 or _32 respectively. Applications which use only one code unit width can define PCRE2_CODE_UNIT_WIDTH to use the generic function names without the suffix.

UTF mode vs. non-UTF mode

When UTF mode is set (via the in-pattern options (*UTF), (*UTF8), (*UTF16), (*UTF32) [1] or the compile options PCRE_UTF8, PCRE_UTF16, PCRE_UTF32), all sequences of data units are interpreted as sequences of Unicode characters, which consist of all code points from U+0000 to U+10FFFF, except for surrogates and BOM.

[1] The in-pattern options (*UTF8), (*UTF16), (*UTF32) are only available in the corresponding library. You can't use (*UTF16) in the 8-bit library, nor any other mismatched combination, since it simply doesn't make sense. (*UTF) is available in all libraries, and provides a portable way to specify UTF mode in-pattern.

In UTF mode, the pattern (which is a sequence of data units) is interpreted and validated as a sequence of Unicode code points by decoding the sequence as UTF-8/UTF-16/UTF-32 data (depending on the API used), before it is compiled. The input string is also interpreted and optionally validated as a sequence of Unicode code points during the matching process. In this mode, a character class matches one valid Unicode code point.

On the other hand, when UTF mode is not set (non-UTF mode), all operations work directly on the data unit sequences. In this mode, a character class matches one data unit, and the only restriction on a data unit's value is the maximum that can be stored in a single data unit. This mode can be used for matching structure in binary data. However, do not use it when you are dealing with Unicode text, unless you are fine with handling only ASCII and ignoring every other script.

Constraints on character values

Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:

8-bit non-UTF mode    less than 0x100
8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
16-bit non-UTF mode   less than 0x10000
16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
32-bit non-UTF mode   less than 0x100000000
32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint

Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

PHP and PCRE

The PCRE functions in PHP are implemented by a wrapper which translates PHP-specific flags and calls into PCRE API (as seen in PHP 5.6.10 branch).

The source code calls into the PCRE 8-bit library API (pcre_), so any string passed into a preg_ function is interpreted as a sequence of 8-bit data units (bytes). Therefore, even if the PCRE 16-bit and 32-bit libraries are built, they are not accessible via the API on the PHP side at all.

As a result, PCRE functions in PHP expect:

  • ... an array of bytes in non-UTF mode (default), which the library reads in 8-bit "characters" and compiles to match strings of 8-bit "characters".
  • ... an array of bytes which contains a Unicode string UTF-8 encoded, which the library reads in Unicode characters and compiles to match UTF-8 Unicode strings.

This explains the behavior as seen in the question:

  • In non-UTF mode (without u flag), the maximum value in hexadecimal regex escape sequence is FF (as shown in [\x{00}-\x{ff}])
  • In UTF mode, any value beyond 0x10ffff (like \x{7fffffff}) in a hexadecimal regex escape sequence is simply nonsense.
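These limits can be probed from PHP itself. The sketch below uses a made-up helper, pattern_compiles, which relies on preg_match returning false when PCRE refuses to compile a pattern; the @ operator only hides the compilation warning, whose exact wording may differ between PCRE versions.

```php
<?php
// Hypothetical helper: true when PCRE accepts the pattern.
// preg_match() returns false (and raises a warning) on a compilation failure.
function pattern_compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// Single-quoted strings, so PCRE (not PHP) interprets the \x{...} escapes.
var_dump(pattern_compiles('/\x{ff}/'));      // bool(true):  fits in one 8-bit data unit
var_dump(pattern_compiles('/\x{100}/'));     // bool(false): too large for non-UTF mode
var_dump(pattern_compiles('/\x{10ffff}/u')); // bool(true):  highest Unicode code point
var_dump(pattern_compiles('/\x{110000}/u')); // bool(false): beyond U+10FFFF
var_dump(pattern_compiles('/\x{d800}/u'));   // bool(false): surrogate code point
```

The pattern from the question fails for the same reason as the fourth case: \x{80000000} is far beyond U+10FFFF.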

Example code

This example code demonstrates:

  • PHP strings are just arrays of bytes and don't understand anything about encoding.
  • The differences between UTF mode and non-UTF mode in PCRE function.
  • PCRE functions call into the 8-bit library.

// NOTE: Save this file as UTF-8

// Take note of double-quoted string literal, which supports escape sequence and variable expansion
// The code won't work correctly with single-quoted string literal, which has restrictive escape syntax
// Read more at: https://php.net/language.types.string
$str_1 = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81";
$str_2 = "𡃁𡃁";
$str_3 = "\xf0\xa1\x83\x81\x81\x81\x81\x81\x81";

echo ($str_1 === $str_2)."\n";

var_dump($str_3);

// Test 1a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_1, $match);
print_r($match); // Only match 𡃁

// Test 1b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1a)

// Test 1c
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81

// Test 2a
$match = null;
preg_match("/𡃁+/", $str_1, $match);
print_r($match); // Only match 𡃁 (same as 1a)

// Test 2b
$match = null;
preg_match("/𡃁+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1b and 2a)

// Test 2c
$match = null;
preg_match("/𡃁+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81 (same as 1c)

// Test 3a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_1, $match);
print_r($match); // Match two 𡃁

// Test 3b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3a)

// Test 4a
$match = null;
preg_match("/𡃁+/u", $str_1, $match);
print_r($match); // Match two 𡃁 (same as 3a)

// Test 4b
$match = null;
preg_match("/𡃁+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3b and 4a)

Since PHP strings are simply an array of bytes, as long as the file is saved correctly in some ASCII-compatible encoding, PHP will just happily read the bytes without caring about what encoding it was originally in. The programmer is fully responsible for encoding and decoding the strings correctly.

  • What every programmer absolutely, positively needs to know about encodings and character sets to work with text (section "Using and abusing PHP's handling of encodings")
  • Character Sets / Character Encoding Issues
  • PHP Charset FAQ - Which encoding should I use for my source files?

For the above reason, if you save the file above in UTF-8 encoding, you will see that $str_1 and $str_2 are the same string. $str_1 is decoded from the escape sequences, while $str_2 is read verbatim from the source code. As a result, "/\xf0\xa1\x83\x81+/u" and "/𡃁+/u" are the same string underneath (also the case for "/\xf0\xa1\x83\x81+/" and "/𡃁+/").

The difference between UTF mode and non-UTF mode is clearly shown in the example above:

  • "/𡃁+/" is seen as a sequence of characters F0 A1 83 81 2B where "character" is one byte. Therefore, the resulting regex matches the sequence F0 A1 83 followed by byte 81 repeating once or more.
  • "/𡃁+/u" is validated and interpreted as a sequence of UTF-8 characters U+210C1 U+002B. Therefore, the resulting regex matches the code point U+210C1 repeated once or more in the UTF-8 string.
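A shorter way to see the same difference is to count what "." matches in each mode. This is only a sketch; the subject is written with byte escapes so that it does not depend on how the file is saved.

```php
<?php
$subject = "\xF0\xA1\x83\x81"; // the UTF-8 bytes of U+210C1

// Non-UTF mode: "." consumes one byte per match.
$byteMatches = preg_match_all('/./s', $subject, $bytes);

// UTF mode: "." consumes one code point per match.
$charMatches = preg_match_all('/./su', $subject, $chars);

printf("non-UTF mode: %d matches\n", $byteMatches); // 4
printf("UTF mode:     %d matches\n", $charMatches); // 1
echo bin2hex($chars[0][0]), "\n";                   // f0a18381
```

In non-UTF mode the quantifier in "/𡃁+/" therefore applies to the last byte only, while in UTF mode it applies to the whole character.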

Matching Unicode character

Unless the input contains other binary data, it's strongly recommended to always turn u mode on. The pattern has access to all facilities to properly match Unicode characters, and both the input and pattern are validated as valid UTF strings.

Again, using 𡃁 as example, the example above shows two ways to specify the regex:

"/\xf0\xa1\x83\x81+/u"
"/𡃁+/u"

The first method doesn't work with a single-quoted string: since the \x escape sequence is not recognized in single quotes, the library receives the string \xf0\xa1\x83\x81+, which, combined with UTF mode, matches U+00F0 U+00A1 U+0083 followed by U+0081 repeated once or more. Apart from that, it's also confusing to the next person reading the code: how are they supposed to know that it's a single Unicode character repeated once or more?

The second method works well and can even be used with a single-quoted string, but you need to save the file in UTF-8 encoding. This matters especially for characters like ÿ, which are also valid in single-byte encodings. This method is an option if you want to match a single character or a sequence of characters. However, when such characters are used as the end points of a character range, it may not be clear what you are trying to match. Compare a-z, A-Z, 0-9, א-ת with 一-龥 (which matches most of the CJK Unified Ideographs block (4E00–9FFF) except for unassigned code points at the end) or 一-十 (which is an incorrect attempt to match the Chinese characters for the numbers from 1 to 10).
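To see why 一-十 is wrong, it helps to look at the code points, since a range is ordered by code point and not by numeric meaning. The snippet below is a small sketch; it contains literal characters, so the file has to be saved as UTF-8.

```php
<?php
// Code points of the characters for 1 to 10:
// 一 U+4E00, 二 U+4E8C, 三 U+4E09, 四 U+56DB, 五 U+4E94,
// 六 U+516D, 七 U+4E03, 八 U+516B, 九 U+4E5D, 十 U+5341
$range = '/^[一-十]$/u'; // equivalent to [\x{4E00}-\x{5341}]

var_dump(preg_match($range, '二')); // int(1): U+4E8C lies inside the range
var_dump(preg_match($range, '四')); // int(0): U+56DB lies above U+5341
var_dump(preg_match($range, '人')); // int(1): U+4EBA ("person") is inside as well

// Listing the characters explicitly states what is actually meant.
$digits = '/^[一二三四五六七八九十]$/u';
var_dump(preg_match($digits, '四')); // int(1)
var_dump(preg_match($digits, '人')); // int(0)
```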

The third method is to specify the code point in hexadecimal escape directly:

"/\x{210C1}/u"
'/\x{210C1}/u'

This works when the file is saved in any ASCII-compatible encoding, works with both single- and double-quoted strings, and also gives clear code points in a character range. The disadvantage is that the reader can't see what the character looks like, and it is also hard to read when specifying a sequence of Unicode characters.
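As a rough illustration of the third method, the sketch below keeps the source file pure ASCII. The range end points U+20000 and U+2A6DF are the bounds of the CJK Unified Ideographs Extension B block, which is where U+210C1 lives; the variable names are only for this example.

```php
<?php
// Two copies of U+210C1 surrounded by ASCII, written as byte escapes.
$subject = "abc\xF0\xA1\x83\x81\xF0\xA1\x83\x81def";

// A single code point, repeated once or more.
preg_match('/\x{210C1}+/u', $subject, $single);

// A range of code points: CJK Unified Ideographs Extension B.
preg_match('/[\x{20000}-\x{2A6DF}]+/u', $subject, $block);

var_dump(bin2hex($single[0]));     // string(16) "f0a18381f0a18381"
var_dump($single[0] === $block[0]); // bool(true)
```

Because the patterns are single-quoted, PHP passes the \x{...} escapes through untouched and PCRE resolves them to code points.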

nhahtdh answered Oct 14 '22 03:10


So I can't match a letter like 𡃁, whose equivalent hex value is f0 a1 83 81. The question is not how to match these letters, but where this range and this boundary come from, since the u modifier should treat strings as UTF-16.

You are mixing two concepts which is causing this confusion.

F0 A1 83 81 isn't the hex value of the character 𡃁. This is the way UTF-8 encodes the code point for that character in the byte stream.

It is correct that PHP accepts values above FF in the \x{} syntax when the u flag is set, but the values inside { and } are Unicode code points, not the actual bytes used to encode the given character in the byte stream.

So the largest possible value you can use with \x{} is actually 10FFFF.

And to match 𡃁 with PHP you need to use its code point, which, as suggested by @minitech in his comment, is \x{0210c1}.

Further explanation quoted from section "Validity of strings" from the PCRE documentation.

The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.

Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)
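The validity check described in the quote is observable from PHP: with the u flag, a malformed subject makes preg_match return false and preg_last_error() report PREG_BAD_UTF8_ERROR. The following is a minimal sketch; utf8_status is a made-up helper name, and the empty pattern with the u flag is used only to trigger the check.

```php
<?php
// Hypothetical helper: reports whether PCRE accepts the subject as UTF-8.
function utf8_status(string $subject): string
{
    if (preg_match('//u', $subject) === false) {
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'invalid UTF-8'
            : 'other error';
    }
    return 'valid';
}

// U+210C1 encoded correctly.
echo utf8_status("\xF0\xA1\x83\x81"), "\n"; // valid

// U+D800 (a surrogate) encoded as if it were an ordinary code point.
echo utf8_status("\xED\xA0\x80"), "\n";     // invalid UTF-8

// A 4-byte sequence with its last byte missing.
echo utf8_status("\xF0\xA1\x83"), "\n";     // invalid UTF-8
```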

Ibrahim Najjar answered Oct 14 '22 03:10