Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source is itself in considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort
for sort
).
All byte functions (eg, strlen
, strstr
, strpos
, and substr
) work like the corresponding character functions (eg, mb_strlen
, mb_strstr
, mb_strpos
, and mb_substr
).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u
tacked on implicitly, and things like \w
and \b
and \s
all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen
, grapheme_strstr
, grapheme_strpos
, and grapheme_substr
), and the regex stuff works on proper graphemes (ie, .
— or even [^abc]
— matches a Unicode grapheme cluster no matter how many code points it contains, etc).
That full-unicode thing was precisely the idea of PHP 6 -- which has been canceled more than one year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with you fourth point, though, is the Function Overloading Feature of the mbstring
extension (quoting) :
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example,mb_substr()
is called instead ofsubstr()
if function overloading is enabled.
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob)
and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen
and strlen
, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With