
Detecting the right character encoding in PHP?

I'm trying to detect the character encoding of a string but I can't get the right result.
For example:

$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;";
$str = mb_convert_encoding($str, 'Windows-1252', 'HTML-ENTITIES');
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str, 'Windows-1252, ISO-8859-1, UTF-8');

That code outputs ISO-8859-1 but it should be Windows-1252.

What's wrong with this?

EDIT:
Updated example, in response to @raina77ow.

$str = "&euro;&sbquo;&fnof;&bdquo;&hellip;"; // no whitespace
$str = mb_convert_encoding($str, 'Windows-1252', 'HTML-ENTITIES');
$str = "Hello $str"; // let's add some ASCII characters
echo mb_detect_encoding($str, 'Windows-1252, ISO-8859-1, UTF-8');

I get the wrong result again.

GetFree asked Nov 04 '22 00:11


1 Answer

The problem with Windows-1252 in PHP is that it is almost never detected: as soon as your text contains any byte outside the range 0x80 to 0x9F, mb_detect_encoding() rejects Windows-1252 as a candidate.

This means that if your string contains an ordinary ASCII letter like "A", or even a space character, PHP decides that it is not valid Windows-1252 and, in your case, falls back to the next candidate in the list, which is ISO-8859-1. This is a PHP bug; see https://bugs.php.net/bug.php?id=64667.
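If you need a usable guess in the meantime, one possible workaround is to skip mb_detect_encoding() for the single-byte candidates and look at the bytes yourself. The sketch below is only a heuristic, not an mbstring feature: guess_single_byte_encoding() is a hypothetical helper name, and it assumes the input can only be ASCII, UTF-8, Windows-1252 or ISO-8859-1. It relies on the fact that bytes 0x80 to 0x9F are mostly printable characters in Windows-1252 but C1 control codes in ISO-8859-1, where they rarely appear in real text.

```php
<?php
// Hypothetical helper (not part of PHP or mbstring): guesses between
// ASCII, UTF-8, Windows-1252 and ISO-8859-1 by inspecting raw bytes.
function guess_single_byte_encoding(string $bytes): string
{
    // Pure 7-bit input is identical in all of the candidate encodings.
    if (preg_match('/^[\x00-\x7F]*$/', $bytes) === 1) {
        return 'ASCII';
    }

    // Well-formed multi-byte UTF-8 is very unlikely to occur by accident
    // in single-byte text, so check for it first.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    // Assumption: a byte in 0x80-0x9F means Windows-1252, because the same
    // range holds rarely used C1 control codes in ISO-8859-1.
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';
    }

    // Every remaining byte sequence is valid ISO-8859-1.
    return 'ISO-8859-1';
}

// "Hello " followed by the Windows-1252 bytes for the five characters
// used in the question.
$str = "Hello \x80\x82\x83\x84\x85";
echo guess_single_byte_encoding($str), "\n";
```

Any byte string is technically valid ISO-8859-1, so this can only ever be a guess; if the source of the data tells you its encoding (an HTTP header, a database setting), trust that over detection.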

scy answered Nov 09 '22 07:11