Compilation failed: POSIX collating elements are not supported

Question

I've just installed a website & legacy CMS onto our server and I'm getting a POSIX compilation error. Luckily it's only appearing in the backend; however, the client's keen to get rid of it.

Warning: preg_match_all() [function.preg-match-all]: Compilation failed: 
POSIX collating elements are not supported at offset 32 in
/home/kwecars/public_html/webEdition/we/include/we_classes/SEEM/we_SEEM.class.php
on line 621

From what I can tell it's the newer version of PHP causing the issue. Here's the code:

function getAllHrefs($code){

$trenner = "[\040|\n|\t|\r]*";

$pattern = "/<(a".$trenner."[^>]+href".$trenner."[=\"|=\'|=\\|=]*".$trenner.")([^\'\">\040? \\]*)([^\"\' \040\\>]*)(".$trenner."[^>]*)>/sie";

preg_match_all($pattern, $code, $allLinks); // ---- line 621
return $allLinks;

}

How can I tweak this to work on the newer version of php on this server?

Thanks in advance, my voodoo just isn't strong enough ;)

tchrist · Accepted Answer

Your error message that “POSIX collating elements are not supported” deserves some explanation. After all, what in the world is a POSIX collating element anyway, and how can I avoid it?

The short answer is that you have an equals sign inside your square brackets in a position that is reserved for future use, assuming we ever get around to implementing it, which is anything but certain. You can tickle this in Perl on the command line this way, which gives a much better error message than PHP is providing:

% perl -le 'print "abc" =~ /[=foo=]/ || "Fail"'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[=foo=] <-- HERE / at -e line 1.
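
Applied to the pattern in the question, the culprit is the class [=\"|=\'|=\\|=]* that follows href: it opens with [= and closes with =], which is exactly the reserved form, and by my count it sits at offset 32 of the pattern, where the warning points. Inside square brackets a | is a literal pipe rather than alternation, so the class can list each character once with the = kept away from both ends. What follows is a minimal sketch of that repair, not a tested drop-in for webEdition. It assumes the class was meant to allow an equals sign, either quote, a pipe and a backslash; it switches to single-quoted strings so the escaping is easier to read; and it drops the e modifier, which only ever had meaning for preg_replace() and has since been removed from PHP. The sample anchor tag is invented for illustration.

```php
<?php
// Sketch only: a repaired version of the question's getAllHrefs().
// The intended meaning of each character class is an assumption.
function getAllHrefs($code)
{
    // Optional separator run: space, newline, tab, carriage return.
    $trenner = '[ \n\t\r]*';

    // Was [=\"|=\'|=\\|=]* : the leading "[=" and trailing "=]" made PCRE
    // read it as a POSIX collating element. Here "=" is neither the first
    // nor the last character, so this is an ordinary character class.
    $quoteOrEquals = '["\'=|\\\\]*';

    // Groups: 1 = tag up to the URL, 2 = URL path, 3 = rest of the URL,
    // 4 = remaining attributes. The "e" modifier has been dropped.
    $pattern = '/<(a' . $trenner . '[^>]+href' . $trenner . $quoteOrEquals . $trenner . ')'
             . '([^\'"> ?\\\\]*)([^"\' \\\\>]*)(' . $trenner . '[^>]*)>/si';

    preg_match_all($pattern, $code, $allLinks);
    return $allLinks;
}

// Hypothetical input, just to show which group captures what.
$links = getAllHrefs('<p><a class="x" href="/cars/list.php?id=7" target="_blank">Cars</a></p>');
echo $links[2][0], "\n"; // prints /cars/list.php
echo $links[3][0], "\n"; // prints ?id=7
```

If the CMS relies on the exact contents of the four captured groups, it is worth comparing the output against the old server on a copy of the site before deploying.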

That’s the short answer; the longer answer follows.

Fancy POSIX Character Classes

Inside a square-bracketed character class, POSIX admits three different nested bracketed forms, each marked by a pair of extra symbols just inside the brackets:

Named POSIX character classes, which are basically like Unicode properties, use an extra colon flanking: [:PROPERTY:], as in [:alpha:].
Collating elements intended to be treated as equivalent to each other, use an extra equals sign flanking them: [=ELEMENTS=], as in [=eéèëê=] in English or French, and [=vw=] in Swedish.
Polygraphs (digraphs, trigraphs, tetragraphs, etc), which are multicharacter elements meant to count as a single character, have an extra dot flanking them: [.DIGRAPH.], as in [.ch.] or [.ll.] per the traditional Spanish alphabet. These are sometimes known as contractions because two or more code points count as though that sequence were a single code point.

Perl supports only the first of these, not the second and third.

They are all awkward to use, because they must be nested inside an extra set of brackets, as in [[:punct:]] to mean \pP or \p{punct}. You only need extra brackets with Unicode properties when you are selecting one of many, as in [\pL\pN\pM\p{Pc}].
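
In PHP terms, the nesting rule looks roughly like the sketch below. It assumes PCRE was built with Unicode property support, as PHP's bundled library normally is, and that the source file is saved as UTF-8; the sample strings are invented.

```php
<?php
// The POSIX name sits in its own [: :] pair, nested inside the outer class.
var_dump(preg_match('/^[[:punct:]]+$/', '?!.'));     // int(1)

// A single Unicode property needs no surrounding class at all.
var_dump(preg_match('/^\p{P}+$/u', '?!.'));          // int(1)

// Brackets are needed only to combine several properties.
var_dump(preg_match('/^[\pL\pN]+$/u', 'niño2011'));  // int(1)
```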

The Intent

The other two were an attempt to support locale-specific linguistic elements in a pre‐Unicode environment under legacy 8‑bit locales. For example, to express the traditional Spanish alphabet, which counts acute accents over vowels and diaereses over u’s as the same letter yet which counts a tilde over an n as a different letter altogether, and which furthermore has two digraphs each counting as a distinct letter, you would have to write this in POSIX:

[[=aá=]bc[.ch.]d[=eé=]fgh[=ií=]jkl[.ll.]mnñ[=oó=]pqrst[=uúü=]vwxyz]

You can and sometimes must combine these. For example, in German phonebooks, where the three i‑mutated vowels can be spelt without diacritics by inserting a following e:

[a[=ä[.ae.]=]bcdefghijklmno[=ö[.oe.]=]pqrs[=ß[.ss.]=]tu[=ü[.ue.]=]vwxyz]

That way, assuming $ES and $DE are those languages’ respective alphabets, you could say something like

[$ES]{4}

and have it match words like guía, niño, llave, and choco in Spanish; or in German have

[$DE]{6}

and have it match words like tschüß or its uppercase undiacriticked equivalent, TSCHUESS.

The Unicode Way

This is awkward for various reasons, and not just those that are obvious from the two alphabets listed above. It does not admit the notion of combining characters, so you have to add those explicitly for non-normalized text, as in [=e\xE9[.e\x{301}.]=].

Unicode has taken another path in how to implement linguistic elements like this. Fortunately, Unicode regular expressions per UTS#18 do not need to support language features tailored for specific languages or locales until Level 3. This is something no one has yet implemented.

Note that having SS and ß have the same casefold is not considered a locale tailoring. It is the full casefold for that code point no matter the linguistic context. So those are the same when case is ignored. Strange but true. Given that ß is code point U+00DF, we see that these are the same no matter the locale:

$ perl5.14.0 -E 'say "SS" =~ /^\xDF$/i ? "Pass" : "Fail"'
Pass
$ perl5.14.0 -E 'say "\xDF" =~ /^SS$/i ? "Pass" : "Fail"'
Pass

Although locale tailoring for patterns is still beyond us, collation has been implemented, including with locale support, and you can access it from Perl just fine.

However, PHP has no Unicode collation in its core string functions; it is available only through the Collator class of the optional intl extension.

References for Unicode collation include:

ICU’s Collation Concepts document
UTS#10: Unicode Collation Algorithm
Perl’s Unicode::Collate module
Perl’s Unicode::Collate::Locale module


Tags: regex, php, posix, pcre, preg-match

philm
