After some confusion in the comments to a linked question, I thought I'd make this into a question. According to the PHP manual, a valid class name should match [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*. But apparently this is not enforced, nor does it apply to anything else:
define('π', pi());
var_dump(π);

class ␀
{
    private $␀ = TRUE;

    public function ␀()
    {
        return $this->␀;
    }
}

$␀ = new ␀;
var_dump($␀);
var_dump($␀->␀());
works fine (even though my IDE cannot show ␀). Can some erudite person clear this up for me? Can we use any Unicode? And if so, since when? Not that I would actually want to use anything but A-Za-z_, but I'm curious.
Clarification: I am not after a regex to validate class names, nor do I know whether PHP internally uses the regex it suggests in the manual. The thing that confused me (and apparently the other guys in the linked question) is why things like $☂ = 1 can be used in PHP at all. PHP6 was supposed to be the Unicode release, but PHP6 is on hiatus. But if there is no Unicode support, why can I do this then?
The title of this question mentions class names, but the example goes on to include exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones.
For case-insensitive identifiers (class, function, and method names), the general guideline is to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:
<?php
function func_á() { echo "worked"; }
func_Á();
Will this script work? Maybe. It depends on what tolower(193) returns, which is locale-dependent:
$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php
Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3
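The difference between the two runs can be modeled without touching the real locale. The following is only a minimal sketch, not PHP's actual lookup code: lowerLatin1, lowerUtf8Locale, and normalize are hypothetical helpers that imitate per-byte lowercasing, under the assumption that a UTF-8 locale leaves every byte above 0x7f untouched.

```php
<?php
// Hypothetical model of tolower() in an ISO-8859-1 locale.
function lowerLatin1(int $b): int
{
    $isAsciiUpper  = $b >= 0x41 && $b <= 0x5A;
    // 0xC0-0xDE are uppercase letters in ISO-8859-1, except 0xD7 (multiplication sign).
    $isLatin1Upper = $b >= 0xC0 && $b <= 0xDE && $b !== 0xD7;
    return ($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b;
}

// Hypothetical model of tolower() in a UTF-8 locale.
// Assumption: a lone byte above 0x7F is not a character, so it is returned unchanged.
function lowerUtf8Locale(int $b): int
{
    return ($b >= 0x41 && $b <= 0x5A) ? $b + 0x20 : $b;
}

// Lowercase a name one byte at a time with the given rule.
function normalize(string $name, callable $lower): string
{
    return implode('', array_map(fn($c) => chr($lower(ord($c))), str_split($name)));
}

$declared = "func_\xE1"; // func_á as ISO-8859-1 bytes
$called   = "func_\xC1"; // func_Á as ISO-8859-1 bytes (0xC1 is 193)

$foundLatin1 = normalize($called, 'lowerLatin1') === normalize($declared, 'lowerLatin1');
$foundUtf8   = normalize($called, 'lowerUtf8Locale') === normalize($declared, 'lowerUtf8Locale');

var_dump($foundLatin1, $foundUtf8); // bool(true) bool(false)
```

Under the Latin-1 rule both names normalize to the same bytes, so the call resolves; under the UTF-8 rule byte 193 stays as it is and the lookup fails.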
Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.
In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.
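To make the collision scenario concrete, here is a sketch under stated assumptions. It does not call the real locale machinery; lowerBytewise is a hypothetical helper that hard-codes two lowercase mappings taken from Windows-1252 (0xC4 to 0xE4 and 0x8A to 0x9A), chosen only because they are enough to show the effect.

```php
<?php
// Hypothetical helper: lowercase each byte using a tiny, hand-written subset
// of Windows-1252 rules (assumption: only the two mappings needed here).
function lowerBytewise(string $s): string
{
    $map = [
        "\xC4" => "\xE4", // Ä -> ä in Windows-1252
        "\x8A" => "\x9A", // Š -> š in Windows-1252
    ];
    return strtr($s, $map);
}

$a = "\xC4\x8A"; // UTF-8 bytes of U+010A (Ċ)
$b = "\xC4\x9A"; // UTF-8 bytes of U+011A (Ě)

// Two different characters end up as the same byte stream.
$collide = ($a !== $b) && (lowerBytewise($a) === lowerBytewise($b));

var_dump(bin2hex(lowerBytewise($a))); // string(4) "e49a"
var_dump(bin2hex(lowerBytewise($b))); // string(4) "e49a"
var_dump($collide);                   // bool(true)
```

With such a rule, identifiers containing Ċ and Ě would normalize to the same name even though the characters are unrelated.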
The problem is less serious for the case-sensitive identifiers (variables, constants, and fields). However, they are just interpreted as byte streams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.
In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4, and the trail bytes are in the range 0x80 to 0xbf, all of which fall in the allowed range per the manual. Now let's say we use the character "Ġ" in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the second byte will be interpreted as a space.
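The byte-level difference between the two encodings of "Ġ" can be inspected with a small sketch. identifierBytes is a hypothetical helper, and the byte strings are written out by hand rather than produced by a conversion function; it simply tests each byte against the character classes quoted from the manual.

```php
<?php
// Hypothetical helper: return [hex byte, allowed?] pairs, testing the first
// byte against the "start" class and the rest against the "continue" class.
function identifierBytes(string $s): array
{
    $out = [];
    foreach (str_split($s) as $i => $byte) {
        $class = $i === 0 ? '[a-zA-Z_\x7f-\xff]' : '[a-zA-Z0-9_\x7f-\xff]';
        // No /u modifier: the pattern is applied to raw bytes.
        $out[] = [bin2hex($byte), preg_match('/\A' . $class . '\z/', $byte) === 1];
    }
    return $out;
}

$utf8    = "\xC4\xA0"; // U+0120 (Ġ) in UTF-8
$utf16be = "\x01\x20"; // U+0120 (Ġ) in UTF-16BE

var_dump(identifierBytes($utf8));    // c4 and a0 are both allowed
var_dump(identifierBytes($utf16be)); // 01 is not allowed; 20 is a space
```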
Having multi-byte characters read as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:
<?php
declare(encoding='ISO-8859-1');
// code here
?>
It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:
Finally, there is the problem of lack of normalization – the same character may be represented with different Unicode code points (independently of the encoding). This may lead to some very difficult to track bugs.
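As an illustration of the normalization problem, the sketch below builds "é" in two ways, precomposed and as "e" plus a combining accent, and uses them as variable names through variable variables. The byte strings are spelled out with escapes so that the example does not depend on how the source file is saved.

```php
<?php
$composed   = "\xC3\xA9";  // U+00E9, precomposed é
$decomposed = "e\xCC\x81"; // U+0065 followed by U+0301 (combining acute accent)

// The two names look identical on screen but are different byte strings.
$sameBytes = ($composed === $decomposed);

// Assign through one spelling, then look the variable up through the other.
${$composed} = 'set';
$visible = isset(${$decomposed});

var_dump($sameBytes); // bool(false)
var_dump($visible);   // bool(false)
```

Because identifiers are compared byte by byte, the second spelling refers to a different, undefined variable.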
Your character is encoded as 0xe2 0x90 0x80 in UTF-8, so every byte falls in the \x7f-\xff range and the name matches your regexp when it is applied to single bytes rather than to Unicode characters.
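This can be checked with a short sketch. It assumes the pattern quoted from the manual, anchored at both ends and applied without the /u modifier, so that preg_match works on raw bytes; the variable names are only for illustration.

```php
<?php
// The manual's pattern, anchored, matched byte-wise (no /u modifier).
$pattern = '/\A[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/';

$name = "\xE2\x90\x80"; // UTF-8 bytes of U+2400 (␀)

$matches = preg_match($pattern, $name) === 1;

var_dump(bin2hex($name)); // string(6) "e29080"
var_dump($matches);       // bool(true)
```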