Without the u flag, the hex range that can be used is [\x{00}-\x{ff}], but with the u flag it goes up to a 4-byte value, \x{7fffffff} ([\x{00000000}-\x{7fffffff}]).
So if I execute the below code:
preg_match("/[\x{00000000}-\x{80000000}]+/u", $str, $match);
I get this error:
Warning: preg_match(): Compilation failed: character value in \x{...} sequence is too large
So I can't match a letter like 𡃁, with the equivalent hex value of f0 a1 83 81. The question is not how to match these letters, but where this range and this boundary come from, as the u modifier should treat strings as UTF-16.
PCRE supports UTF-16 since v8.30
echo PCRE_VERSION;
PCRE version with PHP 5.3.24 - 5.3.28, 5.4.14 - 5.5.7:
8.32 2012-11-30
PCRE version with PHP 5.3.19 - 5.3.23, 5.4.9 - 5.4.13:
8.31 2012-07-06
http://3v4l.org/CrPZ8
Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.
In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.
I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16 and 000210C1 in UTF-32.
To be precise, the example above shows how to map a code point to code units (the character encoding form). How the code units are mapped to an octet sequence is another matter. See the Unicode encoding model.
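The three encodings of U+210C1 listed above can be reproduced with a few lines of arithmetic. This is only a minimal sketch: it assumes PHP 7 or later (for the "\u{...}" string escape), and the variable names are made up for the illustration.

```php
<?php
// Code point of the example character 𡃁 (U+210C1).
$codePoint = 0x210C1;

// UTF-8: a PHP 7+ "\u{...}" escape emits the UTF-8 bytes of the code point.
$utf8 = bin2hex("\u{210C1}");

// UTF-16: code points above U+FFFF are split into a surrogate pair.
$offset = $codePoint - 0x10000;
$high   = 0xD800 + ($offset >> 10);   // high (lead) surrogate
$low    = 0xDC00 + ($offset & 0x3FF); // low (trail) surrogate
$utf16  = sprintf('%04x %04x', $high, $low);

// UTF-32: the code point itself, stored in a single 32-bit unit.
$utf32 = sprintf('%08x', $codePoint);

echo $utf8, "\n";  // f0a18381
echo $utf16, "\n"; // d844 dcc1
echo $utf32, "\n"; // 000210c1
```

The same code point ends up as four 8-bit units, two 16-bit units, or one 32-bit unit, which is why "hex value" is ambiguous until the encoding is named.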
Since PHP hasn't adopted PCRE2 yet (version 10.10), the quoted text is from the documentation of the original PCRE.
PCRE includes support for 16-bit strings from version 8.30 and 32-bit strings from version 8.32, in addition to the default 8-bit library.
As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. [...]
8-bit, 16-bit and 32-bit here refer to the size of the data unit (code unit).
References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
This means that the 8-bit/16-bit/32-bit library expects the pattern and the input string to be sequences of 8-bit/16-bit/32-bit data units or, in UTF mode, valid UTF-8/UTF-16/UTF-32 strings.
PCRE provides 3 sets of identical APIs for the 8-bit, 16-bit and 32-bit libraries, differentiated by the prefix (pcre_, pcre16_ and pcre32_ respectively).
The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.
In PCRE2, a similar function naming convention is used, where the 8-bit/16-bit/32-bit functions have the _8, _16 and _32 suffix respectively. Applications which use only one code unit width can define PCRE2_CODE_UNIT_WIDTH to use the generic function names without the suffix.
When the UTF mode is set (via in-pattern options (*UTF), (*UTF8), (*UTF16), (*UTF32) [1] or compile options PCRE_UTF8, PCRE_UTF16, PCRE_UTF32), all sequences of data units are interpreted as sequences of Unicode characters, which consist of all code points from U+0000 to U+10FFFF, except for surrogates and BOM.
[1] The in-pattern options (*UTF8), (*UTF16), (*UTF32) are only available in the corresponding library. You can't use (*UTF16) in the 8-bit library, nor any other mismatched combination, since it simply doesn't make sense. (*UTF) is available in all libraries, and provides a portable way to specify UTF mode in-pattern.
In UTF mode, the pattern (which is a sequence of data units) is interpreted and validated as a sequence of Unicode code points by decoding the sequence as UTF-8/UTF-16/UTF-32 data (depending on the API used), before it is compiled. The input string is also interpreted and optionally validated as a sequence of Unicode code points during the matching process. In this mode, a character class matches one valid Unicode code point.
On the other hand, when the UTF mode is not set (non-UTF mode), all operations work directly on the data unit sequences. In this mode, a character class matches one data unit, and except for the maximum value that can be stored in a single data unit, there is no restriction on the value of a data unit. This mode can be used for matching structure in binary data. However, do not use this mode when you are dealing with Unicode characters, well, unless you are fine with ASCII and ignoring the rest of the languages.
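The difference between "one data unit" and "one code point" can be observed with the dot metacharacter. The following is a small sketch using PHP's preg_match (8-bit library); the variable names are only illustrative.

```php
<?php
// The example character U+210C1, written as its UTF-8 bytes F0 A1 83 81.
$char = "\xf0\xa1\x83\x81";

// Non-UTF mode: "." matches one byte, so a single "." cannot cover the string.
$oneUnitNonUtf = preg_match('/^.$/', $char);

// Non-UTF mode: four "." are needed, one per byte.
$fourUnitsNonUtf = preg_match('/^.{4}$/', $char);

// UTF mode: "." matches one code point, so a single "." covers the string.
$oneUnitUtf = preg_match('/^.$/u', $char);

var_dump($oneUnitNonUtf);   // int(0)
var_dump($fourUnitsNonUtf); // int(1)
var_dump($oneUnitUtf);      // int(1)
```

The subject is byte-for-byte identical in all three calls; only the u flag changes what the library counts as one character.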
Constraints on character values
Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
  8-bit non-UTF mode    less than 0x100
  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
  16-bit non-UTF mode   less than 0x10000
  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
  32-bit non-UTF mode   less than 0x100000000
  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.
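The 8-bit rows of these constraints can be probed from PHP. The sketch below uses a hypothetical helper, compiles(), which is not part of PHP or PCRE; it only reports whether preg_match accepted the pattern. The exact warning text differs between PCRE versions, so it is suppressed here.

```php
<?php
// Hypothetical helper: true if PCRE accepts the pattern, false if
// compilation fails. "@" hides the compile warning for this demo only.
function compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

$results = [
    'non-UTF \x{ff}'     => compiles('/\x{ff}/'),      // fits in one byte
    'non-UTF \x{100}'    => compiles('/\x{100}/'),     // too large for a byte
    'UTF-8   \x{10ffff}' => compiles('/\x{10ffff}/u'), // last code point
    'UTF-8   \x{110000}' => compiles('/\x{110000}/u'), // beyond Unicode
    'UTF-8   \x{d800}'   => compiles('/\x{d800}/u'),   // surrogate
];

foreach ($results as $label => $ok) {
    echo str_pad($label, 22), $ok ? 'compiles' : 'rejected', "\n";
}
```

As far as I can tell, the upper bound is inclusive in practice: \x{10ffff} is accepted in UTF mode even though the quoted table says "less than".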
The PCRE functions in PHP are implemented by a wrapper which translates PHP-specific flags and calls into PCRE API (as seen in PHP 5.6.10 branch).
The source code calls into the PCRE 8-bit library API (pcre_), so any string passed into a preg_ function is interpreted as a sequence of 8-bit data units (bytes). Therefore, even if the PCRE 16-bit and 32-bit libraries are built, they are not accessible via the API on the PHP side at all.
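As a result, PCRE functions in PHP expect both the pattern and the subject to be either plain sequences of bytes (non-UTF mode, the default) or valid UTF-8 strings (UTF mode, enabled by the u flag).
This explains the behavior as seen in the question:
- In non-UTF mode (without the u flag), the maximum value in a hexadecimal regex escape sequence is FF (as shown in [\x{00}-\x{ff}]).
- In UTF mode (with the u flag), a value beyond 10FFFF (such as \x{7fffffff}) in a hexadecimal regex escape sequence is simply nonsense.
This example code demonstrates: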
// NOTE: Save this file as UTF-8
// Take note of double-quoted string literal, which supports escape sequence and variable expansion
// The code won't work correctly with single-quoted string literal, which has restrictive escape syntax
// Read more at: https://php.net/language.types.string
$str_1 = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81";
$str_2 = "𡃁𡃁";
$str_3 = "\xf0\xa1\x83\x81\x81\x81\x81\x81\x81";
echo ($str_1 === $str_2)."\n";
var_dump($str_3);
// Test 1a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_1, $match);
print_r($match); // Only match 𡃁
// Test 1b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1a)
// Test 1c
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81
// Test 2a
$match = null;
preg_match("/𡃁+/", $str_1, $match);
print_r($match); // Only match 𡃁 (same as 1a)
// Test 2b
$match = null;
preg_match("/𡃁+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1b and 2a)
// Test 2c
$match = null;
preg_match("/𡃁+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81 (same as 1c)
// Test 3a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_1, $match);
print_r($match); // Match two 𡃁
// Test 3b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3a)
// Test 4a
$match = null;
preg_match("/𡃁+/u", $str_1, $match);
print_r($match); // Match two 𡃁 (same as 3a)
// Test 4b
$match = null;
preg_match("/𡃁+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3b and 4a)
Since PHP strings are simply an array of bytes, as long as the file is saved correctly in some ASCII-compatible encoding, PHP will just happily read the bytes without caring about what encoding it was originally in. The programmer is fully responsible for encoding and decoding the strings correctly.
For this reason, if you save the file above in UTF-8 encoding, you will see that $str_1 and $str_2 are the same string. $str_1 is decoded from the escape sequence, while $str_2 is read verbatim from the source code. As a result, "/\xf0\xa1\x83\x81+/u" and "/𡃁+/u" are the same string underneath (also the case for "/\xf0\xa1\x83\x81+/" and "/𡃁+/").
The difference between UTF mode and non-UTF mode is clearly shown in the example above:
- "/𡃁+/" is seen as a sequence of characters F0 A1 83 81 2B where "character" is one byte. Therefore, the resulting regex matches the sequence F0 A1 83 followed by byte 81 repeated once or more.
- "/𡃁+/u" is validated and interpreted as a sequence of UTF-8 characters U+210C1 U+002B. Therefore, the resulting regex matches the code point U+210C1 repeated once or more in the UTF-8 string.
Unless the input contains other binary data, it's strongly recommended to always turn u mode on. The pattern has access to all facilities to properly match Unicode characters, and both the input and pattern are validated as valid UTF strings.
Again, using 𡃁 as an example, the code above shows two ways to specify the regex:
"/\xf0\xa1\x83\x81+/u"
"/𡃁+/u"
The first method doesn't work with a single-quoted string: since the \x escape sequence is not recognized in single quotes, the library receives the string \xf0\xa1\x83\x81+, which, combined with UTF mode, matches U+00F0 U+00A1 U+0083 followed by U+0081 repeated once or more. Apart from that, it's also confusing to the next person reading the code: how are they supposed to know that it's a single Unicode character repeated once or more?
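The single-quote pitfall described here can be checked directly. This is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escapes; $lookalike is a made-up name for the four-character string that the single-quoted pattern actually describes.

```php
<?php
$target = "\xf0\xa1\x83\x81"; // U+210C1 as UTF-8 bytes

// Double-quoted: PHP decodes \x.. first, so PCRE receives the raw bytes
// F0 A1 83 81 and reads them as the single code point U+210C1.
$double = preg_match("/\xf0\xa1\x83\x81+/u", $target);

// Single-quoted: PCRE receives the backslashes and reads four separate
// code points, U+00F0 U+00A1 U+0083 U+0081, none of which is in $target.
$single = preg_match('/\xf0\xa1\x83\x81+/u', $target);

// The single-quoted pattern matches this four-character string instead.
$lookalike = "\u{F0}\u{A1}\u{83}\u{81}";
$singleOnLookalike = preg_match('/\xf0\xa1\x83\x81+/u', $lookalike);

var_dump($double);            // int(1)
var_dump($single);            // int(0)
var_dump($singleOnLookalike); // int(1)
```

The two patterns look identical in the source file but reach the regex library as different strings, so the quoting style silently changes what is matched.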
The second method works well and it can even be used with a single-quoted string, but you need to save the file in UTF-8 encoding. This matters especially for characters like ÿ, since the character is also valid in single-byte encodings. This method is an option if you want to match a single character or a sequence of characters. However, as end points of a character range, it may not be clear what you are trying to match. Compare a-z, A-Z, 0-9, א-ת, as opposed to 一-龥 (which matches most of the CJK Unified Ideographs block (4E00–9FFF) except for unassigned code points at the end) or 一-十 (which is an incorrect attempt to match the Chinese characters for the numbers from 1 to 10).
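Why 一-十 is not a range of numerals becomes visible once the end points are written as code points (U+4E00 and U+5341). The sketch below assumes PHP 7 or later for the "\u{...}" escapes; the two test characters were picked for illustration.

```php
<?php
// The range 一-十 written with explicit code points: U+4E00 to U+5341.
$pattern = '/^[\x{4E00}-\x{5341}]$/u';

$four   = "\u{56DB}"; // 四 (four): its code point lies above U+5341
$person = "\u{4EBA}"; // 人 (person): not a numeral, but inside the range

$matchesFour   = preg_match($pattern, $four);
$matchesPerson = preg_match($pattern, $person);

var_dump($matchesFour);   // int(0): a numeral that the range misses
var_dump($matchesPerson); // int(1): a non-numeral that the range accepts
```

A range only follows code point order, so listing the intended characters one by one is the safer choice when they are not contiguous.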
The third method is to specify the code point in hexadecimal escape directly:
"/\x{210C1}/u"
'/\x{210C1}/u'
This works when the file is saved in any ASCII-compatible encoding, works with both single and double-quoted string, and also gives clear code point in character range. This method has the disadvantage of not knowing how the character looks like, and it is also hard to read when specifying a sequence of Unicode characters.
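A short sketch of the third method, showing that both quoting styles give the same result and that the escape can also serve as a range end point. The range 20000 to 2A6DF used here is an assumption made for the illustration (it corresponds to CJK Unified Ideographs Extension B, which contains U+210C1).

```php
<?php
$subject = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81"; // two copies of U+210C1

// Single quotes: PCRE itself interprets \x{...}.
preg_match('/\x{210C1}+/u', $subject, $single);

// Double quotes: PHP only decodes \x followed directly by hex digits,
// so \x{ is passed through unchanged and PCRE sees the same pattern.
preg_match("/\x{210C1}+/u", $subject, $double);

// The escape also gives readable end points in a character range.
$inRange = preg_match('/^[\x{20000}-\x{2A6DF}]+$/u', $subject);

var_dump($single[0] === $subject); // bool(true): both characters matched
var_dump($double[0] === $subject); // bool(true)
var_dump($inRange);                // int(1)
```

Because the pattern source is pure ASCII, the result does not depend on the encoding the PHP file was saved in.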
So I can't match a letter like 𡃁 with equivalent hex value of f0 a1 83 81. The question is not how to match these letters, but how this range & this boundary came from as u modifier should treat strings as UTF-16
You are mixing two concepts which is causing this confusion.
F0 A1 83 81 isn't the hex value of the character 𡃁. This is the way UTF-8 encodes the code point for that character in the byte stream.
It is correct that PHP supports code points in the \x{} pattern, but the values inside { and } represent Unicode code points (independent of UTF-8 or UTF-16), not the actual bytes used to encode the given character in the byte stream.
So the largest possible value you can use with \x{} is actually 10FFFF.
And to match 𡃁 with PHP you need to use its code point, which, as suggested by @minitech in his comment, is \x{0210c1}.
Further explanation quoted from section "Validity of strings" from the PCRE documentation.
The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)
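The validity check quoted above also applies to the subject string passed to preg_match with the u flag. The following sketch assumes a reasonably recent PCRE (the quoted rule about non-characters); the byte sequences are hand-written for the illustration.

```php
<?php
// U+D800 (a surrogate) forced into a UTF-8-style byte sequence.
$surrogate = "\xed\xa0\x80";
// U+FFFF (a "non-character") encoded as UTF-8.
$nonCharacter = "\xef\xbf\xbf";

// The surrogate makes the whole subject invalid: no matching is attempted.
$surrogateResult = @preg_match('/./u', $surrogate);
$surrogateError  = preg_last_error();

// The non-character is accepted as an ordinary code point.
$nonCharacterResult = preg_match('/./u', $nonCharacter);

var_dump($surrogateResult === false);                 // bool(true)
var_dump($surrogateError === PREG_BAD_UTF8_ERROR);    // bool(true)
var_dump($nonCharacterResult);                        // int(1)
```

Checking preg_last_error() after a false return is the way to tell a rejected subject apart from a pattern that simply did not match.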