Is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1?

Tags: php, encoding, ascii, utf-8, non-ascii-characters

I came across the following text on the "Details of the String Type" page of the PHP Manual:

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. String will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. However, this does not apply if Zend Multibyte is enabled; in that case, the script may be written in an arbitrary encoding (which is explicitly declared or is detected) and then converted to a certain internal encoding, which is then the encoding that will be used for the string literals. Note that there are some constraints on the encoding of the script (or on the internal encoding, should Zend Multibyte be enabled) – this almost always means that this encoding should be a compatible superset of ASCII, such as UTF-8 or ISO-8859-1.

So my question is: is it true that string literals in PHP can only be encoded in an encoding which is a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, and not in an encoding which is not a compatible superset of ASCII?

Is it possible to encode string literals in PHP in some non-ASCII-compatible encoding like UTF-16 or UTF-32? If yes, will string literals encoded in such a non-ASCII-compatible encoding work with the mb_string_* functions? If no, what's the reason?

Suppose Zend Multibyte is enabled and I've set the internal encoding either to a compatible superset of ASCII, such as UTF-8 or ISO-8859-1, or to some non-ASCII-compatible encoding. Can I then declare an encoding which is not a compatible superset of ASCII, such as UTF-16 or UTF-32, in the script file?

If yes, in which encoding would the string literals be encoded in this case? If no, what's the reason?

Also, how does encoding work for string literals when Zend Multibyte is enabled?

How do I enable Zend Multibyte? What's the main purpose of turning it on, and when is it required?

It would be helpful if you could answer with suitable examples.

Thank you.

asked Sep 23 '18 16:09

PHPNut

1 Answer

String literals in PHP source code files are taken literally as the raw bytes which are present in the source code file. If you have bytes in your source code which represent a UTF-16 string or anything else really, then you can use them directly:

$ echo -n '<?php echo "' > test.php
$ echo -n 日本語 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php 
<?php echo "??e?g,??";
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 65e5  <?php echo "..e.
00000010: 672c 8a9e 223b 0a                        g,..";.
$ php test.php 
??e?g,??$ 
$ php test.php | iconv -f UTF-16
日本語

This demonstrates a source code file ostensibly written in ASCII, but containing a UTF-16 string literal in the middle, which is output as is.

The bigger problem with this kind of source code is that it's difficult to work with. It's somewhere between a pain in the neck and impossible to get a text editor to treat the PHP code in one encoding and string literals in another. So typically, you want to keep the entire source code, including string literals, in one and the same encoding throughout.

You can also easily get into trouble:

$ echo -n '<?php echo "' > test.php
$ echo -n 漢字 | iconv -t UTF-16 >> test.php 
$ echo '";' >> test.php 
$ cat test.php | xxd
00000000: 3c3f 7068 7020 6563 686f 2022 feff 6f22  <?php echo "..o"
00000010: 5b57 223b 0a                             [W";.

"漢字" here is encoded to feff 6f22 5b57, which contains the byte 22, i.e. ", the string literal terminator. The literal therefore ends after the first three bytes, and you now have a syntax error.
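
To see the collision without depending on any editor's encoding, the UTF-16 bytes can be spelled out with escapes. This is only a minimal sketch: the variable names are made up for illustration, and the bytes are the big-endian code units from the hex dump above, without the BOM.

```php
<?php
// UTF-16BE code units of 漢字 (U+6F22, U+5B57), written as escapes so this file stays pure ASCII.
$utf16be = "\x6f\x22\x5b\x57";

// Look for the byte 0x22, which PHP's parser reads as a double quote.
$quotePos = strpos($utf16be, '"');

echo bin2hex($utf16be), PHP_EOL; // 6f225b57
var_dump($quotePos);             // int(1): the second byte of the first character is a double quote
```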

By default the PHP interpreter expects the PHP code to be ASCII compatible, so if you want to keep your string literals and the rest of the source code in the same encoding, you're pretty much limited to ASCII-compatible encodings. However, Zend Multibyte (an engine feature switched on with the zend.multibyte setting, rather than a separate extension) allows you to use other encodings if you declare the encoding used accordingly (in php.ini, via zend.script_encoding, if it's not ASCII compatible). So you could write your source code in, say, Shift-JIS throughout; probably even with string literals in some other encoding*.

* (At which point I'll quit going into details because what is wrong with you?!)
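
As a rough sketch of how Zend Multibyte is switched on: it is controlled by the zend.multibyte directive in php.ini, and zend.script_encoding names the encoding the source files are written in. The snippet below only reads those two settings. The php.ini lines in its comment are illustrative assumptions (SJIS is just an example value), and as far as I know the mbstring extension has to be loaded for the conversion to work.

```php
<?php
// Assumed php.ini configuration for scripts written in Shift-JIS (example values):
//   zend.multibyte = On
//   zend.script_encoding = SJIS
// For ASCII-compatible encodings, the script itself can instead start with:
//   declare(encoding='ISO-8859-1');

// ini_get() returns the current value as a string, or false if the directive is unknown.
$multibyte      = ini_get('zend.multibyte');
$scriptEncoding = ini_get('zend.script_encoding');

echo 'zend.multibyte: ', var_export($multibyte, true), PHP_EOL;
echo 'zend.script_encoding: ', var_export($scriptEncoding, true), PHP_EOL;
```

You would typically only need this when the source files are in an encoding such as Shift-JIS, where bytes inside multibyte characters can look like ASCII syntax characters.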

Summary:

- PHP must understand all the PHP code; by default it understands ASCII, with Zend Multibyte it can understand other encodings as well.
- The string literals in your source code can contain any bytes you want, as long as PHP doesn't interpret them as special characters in the string literal (e.g. the 22 example above), in which case you need to escape them (with a backslash in the encoding of the general source code).
- The string value at runtime will be the raw byte sequence PHP read from the string literal.
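
The escaping rule in the second point can be checked with a small experiment. This sketch assumes Zend Multibyte is off and uses eval() purely as a stand-in for a source file that contains those raw bytes; the variable names are hypothetical.

```php
<?php
// Source text of a double-quoted literal holding the UTF-16BE bytes of 漢字,
// with an ASCII backslash (0x5c) inserted before the troublesome 0x22 byte.
$source = 'return "' . "\x6f" . '\\' . "\x22" . "\x5b\x57" . '";';

// eval() parses $source like the body of a PHP file (no opening tag needed).
$value = eval($source);

echo bin2hex($source), PHP_EOL; // shows the 5c byte sitting in front of 22
echo bin2hex($value), PHP_EOL;  // 6f225b57: the backslash is consumed, the raw bytes remain
```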

Having said all this, it is typically a pain in the neck to diverge from ASCII-compatible encodings. It's a pain in text editors and easily leads to mojibake if some tool in your workflow treats the file incorrectly. At most I'd advise using ASCII-compatible encodings, e.g.:

echo "日本語";  // UTF-8 encoded (let's hope)

If you must have a non-ASCII-compatible string literal, you should use byte notation:

echo "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e";

Or conversion:

echo iconv('UTF-8', 'UTF-16', '日本語');
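
One caveat with the conversion approach: the bare UTF-16 name leaves the byte order and the BOM up to the iconv implementation, so the output may differ between systems. Here is a sketch that pins both down, assuming the iconv extension is available. The UTF-8 input is written as escapes only to keep the example independent of how the file is saved.

```php
<?php
// UTF-8 bytes of 日本語, spelled out so the result does not depend on this file's encoding.
$utf8 = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

// UTF-16BE fixes the byte order and emits no BOM; prepend one by hand if it is wanted.
$utf16be = iconv('UTF-8', 'UTF-16BE', $utf8);
$withBom = "\xfe\xff" . $utf16be;

echo bin2hex($withBom), PHP_EOL;                           // feff65e5672c8a9e
var_dump($withBom === "\xfe\xff\x65\xe5\x67\x2c\x8a\x9e"); // bool(true)
```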

[..] will the strings literals encoded in such one of the non-ASCII compatible encoding work with mb_string_* functions?

Sure, strings in PHP are raw byte arrays for all intents and purposes. It doesn't matter how you obtained that string. If you have a UTF-16 string obtained with any of the methods demonstrated above, including by hardcoding it in UTF-16 into the source code, you have a UTF-16 encoded string and you can put that through any and all string functions that know how to deal with it.
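
For example, the mbstring functions (their names start with mb_) take the encoding as an argument, so a UTF-16 string works as long as you tell them what it is. A minimal sketch, assuming the mbstring extension is loaded:

```php
<?php
// 日本語 as UTF-16BE without a BOM, written in byte notation.
$utf16be = "\x65\xe5\x67\x2c\x8a\x9e";

$byteLength = strlen($utf16be);                          // counts bytes
$charLength = mb_strlen($utf16be, 'UTF-16BE');           // counts characters in the named encoding
$firstTwo   = mb_substr($utf16be, 0, 2, 'UTF-16BE');     // first two characters, still UTF-16BE
$asUtf8     = mb_convert_encoding($utf16be, 'UTF-8', 'UTF-16BE');

var_dump($byteLength);            // int(6)
var_dump($charLength);            // int(3)
echo bin2hex($firstTwo), PHP_EOL; // 65e5672c
echo bin2hex($asUtf8), PHP_EOL;   // e697a5e69cace8aa9e
```

Byte-oriented functions such as strlen() still work on the same value; they simply count bytes rather than characters.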

answered Oct 22 '22 12:10

deceze
