
What is the character set if default_charset is empty

From PHP 5.6 onwards, the default_charset setting defaults to "UTF-8", as explained e.g. in the php.ini documentation. It says that the value is empty for earlier versions.

As I am creating a Java library to communicate with PHP, I need to know which values I should expect when a string is handled as bytes internally. What happens if the default_charset string is empty and a (literal) string contains characters outside the range of ASCII? Should I expect the default character encoding of the platform, or the character encoding used for the source file?

Maarten Bodewes asked Dec 26 '22 01:12



2 Answers

Short answer

For literal strings: always the encoding of the source file. The default_charset value has no effect here.

Longer answer

PHP strings are "binary safe", meaning they do not have any internal character encoding. A string in PHP is just a buffer of bytes.

For a literal string, e.g. $s = "Ä", this means that the string will contain whatever bytes were saved in the file between the quotes. If the file was saved as UTF-8, this is equivalent to $s = "\xc3\x84"; if the file was saved as ISO-8859-1 (latin1), this is equivalent to $s = "\xc4".
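Since the question is about a Java library, the same byte sequences can be reproduced on the Java side. This is only a minimal sketch: the class name LiteralBytes and the helper toHex are made up for illustration, and the character is written as a Unicode escape so that the encoding of the Java source file itself plays no role.

```java
import java.nio.charset.StandardCharsets;

public class LiteralBytes {
    // Hypothetical helper: lowercase hex dump of a byte array, similar to PHP's bin2hex().
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00C4 (capital A with diaeresis), the character from the PHP example.
        String s = "\u00c4";

        // Bytes a PHP literal would hold if the PHP file was saved as UTF-8.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));      // c384

        // Bytes a PHP literal would hold if the PHP file was saved as ISO-8859-1.
        System.out.println(toHex(s.getBytes(StandardCharsets.ISO_8859_1))); // c4
    }
}
```

Which of the two byte sequences arrives from PHP depends only on how the PHP source file was saved, not on default_charset.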

Setting default_charset does not affect the bytes stored in strings in any way.

What does default_charset do then?

Some functions that have to treat strings as text and are encoding aware accept an $encoding argument (usually optional). This tells the function which encoding the text in the string uses.

Before PHP 5.6, the default value of these optional $encoding arguments was either fixed in the function definition (e.g. htmlspecialchars()) or configurable through separate php.ini settings for each extension (e.g. mbstring.internal_encoding, iconv.input_encoding).

In PHP 5.6 the php.ini setting default_charset received the default value "UTF-8" and became the common default. The old per-extension settings were deprecated, and all functions that accept an optional $encoding argument should now fall back to the default_charset value when no encoding is specified explicitly.

However, the developer remains responsible for making sure that the text in a string is actually encoded in the encoding that was specified.
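The same responsibility applies to whatever receives the bytes. As a hedged sketch of the Java side (the class DecodeFromPhp and its decode method are hypothetical, not part of any library), the bytes have to be decoded with the charset that the PHP side actually used, and that charset has to be agreed on separately because the bytes themselves do not carry it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeFromPhp {
    // Hypothetical helper: decodes raw bytes received from PHP with a charset
    // that both sides agreed on out of band.
    static String decode(byte[] raw, Charset agreed) {
        return new String(raw, agreed);
    }

    public static void main(String[] args) {
        // Bytes of a PHP literal holding U+00C4, saved in a UTF-8 source file.
        byte[] raw = {(byte) 0xc3, (byte) 0x84};

        // Decoding with the charset PHP actually used yields one character.
        String right = decode(raw, StandardCharsets.UTF_8);
        System.out.println(right.length()); // 1

        // Decoding with the wrong charset yields two unrelated characters
        // (U+00C3 and U+0084), i.e. mojibake, without any error being raised.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 2
    }
}
```

Note that the wrong decoding does not fail; it silently produces different text, which is why the encoding should be fixed explicitly rather than inferred.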


Links:

  • Details of the String Type
    More details on the nature of PHP strings (does not mention default_charset at the time of writing).
  • New features in PHP 5.6: Default character encoding
    Short introduction to the default_charset option in the PHP 5.6 release notes.
  • Deprecated features in PHP 5.6: iconv and mbstring encoding settings
    List of php.ini options deprecated in favour of the default_charset option.
Giedrius D answered Dec 31 '22 10:12



It seems you should not rely on the internal encoding. The internal character encoding can be read or set with mb_internal_encoding().

Example phpinfo() output:

  • PHP Version 5.5.9-1ubuntu4.5
  • default_charset no value

file1.php

<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1

file2.php

<?php
$string = "É";
echo mb_internal_encoding(); //ISO-8859-1

Both files output ISO-8859-1 unless you change the internal encoding manually.

<?php
echo bin2hex("ö"); //c3b6 (utf-8)

The hex dump of this character shows its UTF-8 byte sequence. If you save the file as UTF-8, the string in this example has 2 bytes, even though the internal encoding is not set to UTF-8. Therefore you should expect the character encoding used for the source file.

oshell answered Dec 31 '22 10:12
