In PHP 5.6 onwards the default_charset string is set to "UTF-8", as explained e.g. in the php.ini documentation. It says that the string is empty for earlier versions.
As I am creating a Java library to communicate with PHP, I need to know which values I should expect when a string is handled as bytes internally. What happens if the default_charset string is empty and a (literal) string contains characters outside the range of ASCII? Should I expect the default character encoding of the platform, or the character encoding used for the source file?
For literal strings: always the source file encoding. The default_charset value does nothing here.
PHP strings are "binary safe", meaning they do not carry any internal string encoding. Basically, strings in PHP are just buffers of bytes.
For literal strings, e.g. $s = "Ä", this means that the string will contain whatever bytes were saved in the file between the quotes. If the file was saved in UTF-8, this will be equivalent to $s = "\xc3\x84"; if the file was saved in ISO-8859-1 (latin1), this will be equivalent to $s = "\xc4".
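Seen from the Java side of such a library, the same point can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the character is written as a Unicode escape so that the encoding of the Java source file itself plays no role. It shows which bytes the character Ä turns into under the two encodings mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any PHP or Java API.
class LiteralBytesDemo {

    // Returns the lowercase hex form of the bytes a string has in a given charset.
    static String hexOf(String text, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Capital A with diaeresis, written as an escape on purpose.
        String s = "\u00c4";
        System.out.println(hexOf(s, StandardCharsets.UTF_8));      // c384
        System.out.println(hexOf(s, StandardCharsets.ISO_8859_1)); // c4
    }
}
```

A PHP script saved as UTF-8 would hand over the first byte sequence for that literal, and one saved as ISO-8859-1 the second, regardless of default_charset.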
Setting the default_charset value does not affect the bytes stored in strings in any way.
What does default_charset do then?
Some functions that have to deal with strings as text and are encoding-aware accept $encoding as an argument (usually optional). This tells the function which encoding the text in the string uses.
Before PHP 5.6, the default values of these optional $encoding arguments were either fixed in the function definition (e.g. htmlspecialchars()) or configurable in various php.ini settings for each extension separately (e.g. mbstring.internal_encoding, iconv.input_encoding).
In PHP 5.6 the php.ini setting default_charset was given the default value "UTF-8" and became the common default for these functions. The old per-extension settings were deprecated, and all functions that accept an optional $encoding argument should now default to the default_charset value when the encoding is not specified explicitly.
However, the developer remains responsible for making sure that the text in a string is actually encoded in the encoding that was specified.
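For a Java client this responsibility translates into decoding the received bytes with the charset agreed on with the PHP side, and ideally failing loudly when the bytes do not match it. A minimal sketch, assuming the bytes have already been read from PHP (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class StrictDecodeDemo {

    // Decodes bytes with the given charset and throws instead of silently
    // substituting replacement characters when the bytes are not valid in it.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = {(byte) 0xc3, (byte) 0x84}; // bytes of the literal from a UTF-8 source file
        byte[] latin1 = {(byte) 0xc4};            // bytes of the literal from an ISO-8859-1 source file

        System.out.println(decodeStrict(utf8, StandardCharsets.UTF_8).equals("\u00c4"));        // true
        System.out.println(decodeStrict(latin1, StandardCharsets.ISO_8859_1).equals("\u00c4")); // true

        try {
            // Wrong assumption about the encoding: a lone 0xc4 is not valid UTF-8.
            decodeStrict(latin1, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8");
        }
    }
}
```

Note that the check only works in one direction: arbitrary bytes always decode successfully as ISO-8859-1, so a strict decoder cannot detect UTF-8 data that was mislabeled as latin1.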
Links:
[...] default_charset at the time of writing).
default_charset option in PHP 5.6 release notes.
default_charset option.

It seems you should not rely on the internal encoding. The internal character encoding can be read or set with mb_internal_encoding().
file1.php
<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1
file2.php
<?php
$string = "É";
echo mb_internal_encoding(); //ISO-8859-1
Both files will output ISO-8859-1 if you do not change the internal encoding manually.
<?php
echo bin2hex("ö"); //c3b6 (utf-8)
Getting the hex of this character returns its UTF-8 encoding. If you save the file as UTF-8, the string in this example will contain 2 bytes, even if the internal encoding is not set to UTF-8. Therefore you should rely on the character encoding used for the source file.