
What does the expression \X match when inside a RegEx?

Tags: regex, unicode

According to http://www.regular-expressions.info,

You can consider \X the Unicode version of the dot in regex engines that use plain ASCII.

Does this mean that it will match any possible Unicode code point?

federico-t asked Mar 29 '12 15:03


2 Answers

From the Perl regex manual:

This matches a Unicode extended grapheme cluster.

\X matches quite well what normal (non-Unicode-programmer) usage would consider a single character. As an example, consider a G with some sort of diacritic mark, such as an arrow. There is no such single character in Unicode, but one can be composed by using a G followed by a Unicode "COMBINING UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it were a single character.

Mnemonic: eXtended Unicode character.

And from PCRE man pages (2012):

PCRE implements a simpler version of \X than Perl, which changed to make \X match what Unicode calls an "extended grapheme cluster". This is more complicated than an extended Unicode sequence, which is what PCRE matches.

[...]

\X an extended Unicode sequence

[...]

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to

(?>\PM\pM*)

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character. None of them have codepoints less than 256, so in 8-bit non-UTF-8 mode \X matches any one character.

Note that recent versions of Perl have changed \X to match what Unicode calls an "extended grapheme cluster", which has a more complicated definition.
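The older definition quoted above, a non-mark character followed by zero or more mark characters, can be approximated without a regex engine. The following is only a minimal sketch: it assumes that the "mark" property corresponds to Unicode general categories beginning with `M`, as reported by Python's standard `unicodedata` module, and the function name `legacy_sequences` is made up for illustration. It does not implement the newer extended grapheme cluster rules.

```python
import unicodedata


def legacy_sequences(text):
    """Group each non-mark character with the mark characters that follow it."""
    pieces = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M" (assumed to be the "mark" property).
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and pieces:
            pieces[-1] += ch   # a mark attaches to the preceding piece
        else:
            # A non-mark starts a new piece. A leading mark with no base is kept
            # as its own piece here, whereas the pattern would not match it at all.
            pieces.append(ch)
    return pieces


decomposed = "a\u0300b"        # 'a' + COMBINING GRAVE ACCENT, then 'b'
g_with_arrow = "G\u034e"       # 'G' + COMBINING UPWARDS ARROW BELOW

print(len(legacy_sequences(decomposed)))    # 2
print(len(legacy_sequences(g_with_arrow)))  # 1
```

The G example is the one from the Perl manual: two code points, but a single piece under the older definition.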

A later version of the PCRE man pages (2015):

Extended grapheme clusters

The \X escape matches any number of Unicode characters that form an "extended grapheme cluster", and treats the sequence as an atomic group (see below). Up to and including release 8.31, PCRE matched an earlier, simpler definition that was equivalent to

(?>\PM\pM*)

That is, it matched a character without the "mark" property, followed by zero or more characters with the "mark" property. Characters with the "mark" property are typically non-spacing accents that affect the preceding character.

This simple definition was extended in Unicode to include more complicated kinds of composite character by giving each character a grapheme breaking property, and creating rules that use these properties to define the boundaries of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches one of these clusters.

\X always matches at least one character. Then it decides whether to add additional characters according to the following rules for ending a cluster:

  1. End at the end of the subject string.

  2. Do not end between CR and LF; otherwise end after any control character.

  3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters are of five types: L, V, T, LV, and LVT. An L character may be followed by an L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be followed only by a T character.

  4. Do not end before extending characters or spacing marks. Characters with the "mark" property always have the "extend" grapheme breaking property.

  5. Do not end after prepend characters.

  6. Otherwise, end the cluster.
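Rule 3 above is essentially a small transition table, which can be sketched as follows. This is an illustration under stated assumptions, not PCRE's implementation: the names `hangul_type`, `ALLOWED_NEXT` and `may_continue` are hypothetical, and the code point ranges cover only the main Hangul Jamo block (U+1100 to U+11FF) and the precomposed syllables (U+AC00 to U+D7A3), ignoring the extended Jamo blocks.

```python
def hangul_type(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for a character outside the assumed ranges."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"    # leading consonants
    if 0x1160 <= cp <= 0x11A7:
        return "V"    # vowels
    if 0x11A8 <= cp <= 0x11FF:
        return "T"    # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


# Allowed followers for each type, as listed in rule 3.
ALLOWED_NEXT = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def may_continue(prev, nxt):
    """True if rule 3 says the cluster must not end between prev and nxt."""
    return hangul_type(nxt) in ALLOWED_NEXT.get(hangul_type(prev), set())


# A syllable written as three separate Jamo: L (U+1112), V (U+1161), T (U+11AB).
syllable = "\u1112\u1161\u11ab"
print(may_continue(syllable[0], syllable[1]))  # True: L may be followed by V
print(may_continue(syllable[1], syllable[2]))  # True: V may be followed by T
print(may_continue(syllable[2], syllable[0]))  # False: T may be followed only by T
```

Under this rule the three Jamo stay in a single cluster, while a following L character would start a new one.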

Qtax answered Nov 15 '22 21:11


The site's description is pretty good:

\X Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character". \X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.

So, the thing that makes it Unicode-aware is that it can match several code points when those combine into a single visible "thing" (grapheme).
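The two encodings of à mentioned above can be inspected with Python's standard library. This is a small sketch of the difference between code points and graphemes; note that Python's built-in `re` module does not support `\X`, so the example only shows what an ordinary dot does with the decomposed form.

```python
import re
import unicodedata

decomposed = "a\u0300"   # U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT
precomposed = "\u00e0"   # U+00E0 LATIN SMALL LETTER A WITH GRAVE

# Both render as the same grapheme but differ in code point count.
print(len(decomposed), len(precomposed))  # 2 1

# Normalizing to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# A plain dot matches one code point, so it consumes only the 'a'
# and leaves the combining accent behind.
dot_match = re.match(r".", decomposed).group()
print(dot_match == "a")  # True
```

An engine that supports `\X` would instead consume both code points of the decomposed form as one match.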

See Wikipedia's page on Combining Characters for more detail; it lists the U+0300 code point mentioned above, for instance.

unwind answered Nov 15 '22 21:11