Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Will [a-z] ever match accented characters in PREG/PCRE?

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with \w?

like image 688
Garrett Albright Avatar asked Dec 18 '09 20:12

Garrett Albright


3 Answers

Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.

The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish based locale, ñ would be caught by a-z). The semantic meaning of a-z is "all the letters between a and z, and ñ is a separate letter in Spanish.

However, the way PHP blindly handles strings as collections of bytes rather than a collection of UTF code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.

I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.

Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) that [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said — framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale specific strings) that PHP doesn't handle as gracefully as you'd like, and servers the developers have no control over. While the anonymous Drupal developer's comments are incorrect — it wasn't a matter of "getting [a-z] confused with \w", but instead a Drupal developer being unclear/unsure of how PCRE handled [a-z], and choosing the more specific form of abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.

like image 139
Alan Storm Avatar answered Sep 30 '22 18:09

Alan Storm


The comment in Drupal's code is WRONG.

It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].

If, e.g., you have the German locale available, you can check it like this:

setlocale(LC_ALL, 'de_DE'); // German locale (not needed, but you never know...)
echo preg_match('/^[a-z]+$/', 'abc') ? "yes\n" : "no\n";
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // äbc in ISO-8859-1
echo preg_match('/^[a-z]+$/',  "\xC3\xA4bc") ? "yes\n" : "no\n"; // äbc in UTF-8
echo preg_match('/^[a-z]+$/u', "\xC3\xA4bc") ? "yes\n" : "no\n"; // w/ PCRE_UTF8

Output (will not change if you replace de_DE with de_DE.UTF-8):

yes
no
no
no

The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both encodings the PCRE understands: ASCII-derived monobyte and UTF-8 (which is ASCII-derived too). In both of these encodings [a-z] is the same as [\x61-\x7A].

Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).

like image 25
Walter Tross Avatar answered Sep 30 '22 19:09

Walter Tross


Just an addition to both the already excellent, if contradicting, answers.

The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values". Which is somewhat vague, and yet very precise.

It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of char value (tolower(i)/toupper(i))

In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that makes it appear outside the a-z range in all the common character encodings used for German (ISO-8859-x, unicode encodings etc.) In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.

PHP has mostly copied PCRE's documentation verbatim in their docs. However, they've actually gone to pains changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.

In spite of the above, I'm not quite sure it's true, however.

Well, not in all cases, at least.

The one call PHP makes to pcre_maketables... From the PHP source:

#if HAVE_SETLOCALE
    if (strcmp(locale, "C"))
        tables = pcre_maketables();
#endif

In other words, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, the runtime environment's POSIX/C locale's character order is used. Otherwise, the default PCRE tables are used - which are generated (by pcre_maketables) when PCRE is compiled - based on the compiler's locale:

This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.

While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.

Unless PCRE does some magic when using EBCDIC (and it might), while it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition), you might, in the case of EBCDIC, include other unintended characters. And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.

ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:

Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.

They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.

like image 32
JimmiTh Avatar answered Sep 30 '22 18:09

JimmiTh