
Why is there no definition for std::regex_traits<char32_t> (and thus no std::basic_regex<char32_t>) provided?

Tags:

c++

regex

c++11

I would like to use regular expressions on UTF-32 codepoints and found this reference stating that std::regex_traits has to be defined by the user before std::basic_regex can be used at all. There seem to be no changes planned for this in the future.

  1. Why is this even the case?

  2. Does this have to do with the fact that Unicode says combined codepoints have to be treated as equal to the single-codepoint representation (like the umlaut 'ä', which can be represented either as a single codepoint or as the 'a' and the dots as two separate codepoints)?

  3. Given the simplification that only single-codepoint characters would be supported, could this trait be defined easily, or would it be non-trivial nevertheless or require further limitations?

asked Nov 14 '15 by Ident

1 Answer

  1. Some aspects of regex matching are locale-aware, with the result that a std::regex_traits object includes or references an instance of a std::locale object. The C++ standard library only provides locales for char and wchar_t characters, so there is no standard locale for char32_t (unless it happens to be the same as wchar_t), and this restriction carries over into regexes.
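    A minimal illustration of this restriction (not from the original answer): the two instantiations the library does provide traits for work out of the box, while the char32_t one does not.

    ```cpp
    #include <iostream>
    #include <regex>

    int main() {
        // The library ships std::regex_traits<char> and
        // std::regex_traits<wchar_t>, so these two work out of the box:
        std::regex  narrow("a+");
        std::wregex wide(L"a+");
        std::cout << std::regex_match("aaa", narrow) << '\n';  // prints 1
        std::cout << std::regex_match(L"aaa", wide) << '\n';   // prints 1

        // std::basic_regex<char32_t> re(U"a+");  // ill-formed: no
        // std::regex_traits<char32_t> specialization is provided.
    }
    ```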

  2. Your description is imprecise. Unicode defines a canonical equivalence relationship between two strings, which is based on normalizing the two strings, using either NFC or NFD, and then comparing the normalized values codepoint by codepoint. It does not define canonical equivalence simply as an equivalence between a codepoint and a codepoint sequence, because normalization cannot simply be done character by character. Normalization may require reordering combining characters into the canonical order (after canonical (de)composition). As such, it does not fit easily into the C++ model of locale transformations, which are generally single-character.

    The C++ standard library does not implement any Unicode normalization algorithm; in C++, as in many other languages, the two strings L"\u00e4" (ä) and L"\u0061\u0308" (ä) will compare as different, although they are canonically equivalent and look to the human reader like the same grapheme. (On the machine I'm writing this answer, the rendering of those two graphemes is subtly different; if you look closely, you'll see that the umlaut in the second one is slightly displaced from its visually optimal position. That violates the Unicode requirement that canonically equivalent strings have precisely the same rendering.)
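    The point is easy to demonstrate (an illustration added here, not part of the original answer): the two spellings of ä compare unequal and do not even have the same length.

    ```cpp
    #include <iostream>
    #include <string>

    int main() {
        std::wstring precomposed = L"\u00e4";        // ä as one codepoint
        std::wstring decomposed  = L"\u0061\u0308";  // 'a' + combining diaeresis
        // Canonically equivalent, but compared codepoint by codepoint:
        std::cout << (precomposed == decomposed) << '\n';  // prints 0
        std::cout << precomposed.size() << ' '
                  << decomposed.size() << '\n';            // prints 1 2
    }
    ```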

    If you want to check for canonical equivalence of two strings, you need to use a Unicode normalisation library. Unfortunately, the C++ standard library does not include any such API; you could look at ICU (which also includes Unicode-aware regex matching).

    In any case, regular expression matching -- to the extent that it is specified in the C++ standard -- does not normalize the target string. This is permitted by the Unicode Technical Report on regular expressions, which recommends that the target string be explicitly normalized to some normalization form and the pattern written to work with strings normalized to that form:

    For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters.… In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases… It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:

    • Putting the text to be matched into a defined normalization form (NFD or NFKD).
    • Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
    • Applying the matching algorithm on a code point by code point basis, as usual.
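    The recipe quoted above can be sketched with the standard wide-character regex engine (an illustrative example, not from the original answer): if the target text is already in NFD and the pattern is written in NFD too, plain codepoint-by-codepoint matching suffices; an NFC pattern against the same NFD text does not match.

    ```cpp
    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // Target text already normalized to NFD: "Bär" with a decomposed umlaut.
        std::wstring nfd_text = L"B\u0061\u0308r";

        // Pattern written for NFD: 'a' followed by the combining diaeresis.
        std::wregex nfd_pattern(L"a\u0308");
        std::cout << std::regex_search(nfd_text, nfd_pattern) << '\n';  // prints 1

        // The precomposed (NFC) pattern fails against the NFD text:
        std::wregex nfc_pattern(L"\u00e4");
        std::cout << std::regex_search(nfd_text, nfc_pattern) << '\n';  // prints 0
    }
    ```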
  3. The bulk of the work in creating a char32_t specialization of std::regex_traits would be creating a char32_t locale object. I've never tried doing either of these things; I suspect it would require a fair amount of attention to detail, because there are a lot of odd corner cases.
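    To give a feel for the scope of that work, here is a hypothetical, heavily simplified skeleton of what such a traits class would have to provide (the name u32_regex_traits and the ASCII-only shortcuts are my own assumptions; the member names follow the std::regex_traits requirements). Everything locale-dependent is exactly the part that is hard to supply for char32_t.

    ```cpp
    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <locale>
    #include <string>

    // Hypothetical sketch of the interface a regex_traits for char32_t would
    // need. Only trivial, ASCII-only pieces are filled in; a real
    // implementation would need char32_t-aware locale data, which the
    // standard library does not provide.
    struct u32_regex_traits {
        using char_type       = char32_t;
        using string_type     = std::u32string;
        using locale_type     = std::locale;
        using char_class_type = std::ctype_base::mask;

        static std::size_t length(const char32_t* s) {
            return std::char_traits<char32_t>::length(s);
        }
        char32_t translate(char32_t c) const { return c; }
        char32_t translate_nocase(char32_t c) const {
            // ASCII-only case folding; real Unicode case folding is richer.
            return (c < 128) ? static_cast<char32_t>(
                                   std::tolower(static_cast<int>(c)))
                             : c;
        }
        // transform, transform_primary, lookup_collatename, lookup_classname,
        // isctype, value, imbue and getloc would also be required -- and all
        // of those depend on locale data that does not exist for char32_t.
    };

    int main() {
        u32_regex_traits tr;
        std::cout << (tr.translate_nocase(U'A') == U'a') << '\n';  // prints 1
    }
    ```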


The C++ standard is somewhat vague about the details of regular expression matching, leaving them to external documentation about each flavour of regular expression (and without fully explaining how to apply those external specifications to character types other than the one each flavour is specified for). However, it is possible to deduce that matching is character-by-character. For example, in § 28.3, Requirements [re.req], Table 136 includes the traits method responsible for the character-by-character equivalence algorithm:

Expression: v.translate(c)
Return type: X::char_type
Assertion: Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d).

Similarly, in the description of regular expression matching for the default "Modified ECMAScript" flavour (§ 28.13), the standard describes how the regular expression engine matches two characters (one from the pattern and one from the target) (paragraph 14.1):

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:

  1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);

  2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);

  3. otherwise, the two characters are equal if c == d.

answered Nov 14 '22 by rici