
UTF8 Encoding problem - With good examples

I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and output below show two sample strings and how they are displayed. One of them needs to be converted to UTF8; the other already is UTF8.

How should I go about checking whether a string needs to be encoded? I need each string to be output correctly, so how do I check whether it is already UTF8 or needs to be converted?

I am using PHP 5.2 with MySQL MyISAM tables:

CREATE TABLE IF NOT EXISTS `entities` (
  ....
  `title` varchar(255) NOT NULL
  ....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>

Output 1:

Original : France Télécom
UTF8 Encode : France Télécom
UTF8 Decode : France T�l�com
TRANSLIT : France Télécom
IGNORE TRANSLIT : France Télécom
IGNORE : France Télécom
Plain : France Télécom

Output 2:

Original : Cond� Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications

Thanks for your time on this one. Character encoding and I don't get on very well!

UPDATE:

echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";

23|24|Cond� Nast Publications
23|21|Cond� Nast Publications

16|20|France Télécom
16|14|France Télécom
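One observation on the lengths above, offered as a sketch rather than a fix: utf8_encode() treats its input as ISO-8859-1, where every byte from 0x80 to 0xFF becomes a two-byte UTF-8 sequence. The length therefore grows for any string containing non-ASCII bytes, whether or not it was already UTF8, so comparing strlen() values cannot tell the two cases apart. The helper name below is hypothetical, and the sample strings are written as byte escapes on the assumption that they match the two titles in the question.

```php
<?php
// Hypothetical helper: predicts strlen(utf8_encode($bytes)) without calling it.
// Each byte >= 0x80 turns into two bytes; ASCII bytes stay one byte.
function predicted_utf8_encode_length($bytes)
{
    $highBytes = preg_match_all('/[\x80-\xFF]/', $bytes);
    return strlen($bytes) + $highBytes;
}

// Assumed byte content of the two titles from the question.
$latin1Title = "Cond\xE9 Nast Publications";   // ISO-8859-1, 23 bytes
$utf8Title   = "France T\xC3\xA9l\xC3\xA9com"; // UTF-8, 16 bytes

echo strlen($latin1Title), '|', predicted_utf8_encode_length($latin1Title), "\n"; // 23|24
echo strlen($utf8Title), '|', predicted_utf8_encode_length($utf8Title), "\n";     // 16|20
```

Both predictions match the first number pair reported for each string above, which is why the length comparison picks the same branch for both strings.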
Lizard asked Nov 04 '10 10:11


2 Answers

This may be a job for the mb_detect_encoding() function.

In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer": it checks for the presence of certain characters and byte values to make an educated guess. In this narrow case, though, where it only needs to distinguish between UTF-8 and ISO-8859-1, it should work.

<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");

echo 'Detected encoding '.$enc."<br />";

echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";

?>

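You may get incorrect detection results for strings that contain no special characters, but that is not a problem: plain ASCII bytes are identical in both encodings, so the conversion leaves such strings unchanged.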
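The same decision can also be made by validating the bytes rather than detecting an encoding. The sketch below assumes, as above, that every string is either UTF-8 or ISO-8859-1; ensure_utf8 is a hypothetical name. It relies on preg_match() with the /u modifier failing on a subject that is not well-formed UTF-8; mb_check_encoding($text, 'UTF-8') would be an alternative where mbstring is available.

```php
<?php
// Hypothetical helper, assuming the input is either UTF-8 or ISO-8859-1.
function ensure_utf8($text)
{
    // With the /u modifier, preg_match() returns false when the subject
    // is not well-formed UTF-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8 (or plain ASCII): leave untouched
    }
    // Otherwise treat the bytes as ISO-8859-1 and convert them.
    return iconv('ISO-8859-1', 'UTF-8', $text);
}

echo ensure_utf8("France T\xC3\xA9l\xC3\xA9com"), "\n"; // UTF-8 input, returned as-is
echo ensure_utf8("Cond\xE9 Nast Publications"), "\n";   // ISO-8859-1 input, converted
```

An ISO-8859-1 string whose bytes happen to form valid UTF-8 would be left unconverted by this check; for ordinary text that should be rare, but it is a limit of any approach that only looks at the bytes.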

Pekka answered Oct 04 '22 19:10


I made a function that addresses all these issues. It's called Encoding::toUTF8().

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>

Output:

Original : France Télécom
Encoding::toUTF8 : France Télécom

Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications

You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them too.

Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
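For readers curious how a mixed string can be handled at all, here is a minimal sketch of one possible approach. It is not the code of Encoding::toUTF8(); the function name is hypothetical, and it assumes stray bytes are ISO-8859-1 (the Windows-1252 range 0x80-0x9F is not treated specially). Well-formed UTF-8 sequences are kept, and any other byte from 0x80 upward is re-encoded on its own. The anonymous function needs PHP 5.3 or later.

```php
<?php
// Hypothetical sketch, not the actual Encoding::toUTF8() implementation.
function mixed_to_utf8_sketch($bytes)
{
    // Alternatives are tried in order: a well-formed multi-byte UTF-8
    // sequence first, then any single remaining byte >= 0x80.
    $pattern = '/
          [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | [\x80-\xFF]
    /x';

    return preg_replace_callback($pattern, function ($m) {
        if (strlen($m[0]) > 1) {
            return $m[0]; // valid UTF-8 sequence: keep as-is
        }
        // Stray byte, assumed ISO-8859-1: its code point equals the byte
        // value, so it encodes as a two-byte UTF-8 sequence.
        $b = ord($m[0]);
        return chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }, $bytes);
}

// One string mixing a UTF-8 "é" (C3 A9) with an ISO-8859-1 "é" (E9).
echo mixed_to_utf8_sketch("T\xC3\xA9l\xE9com"), "\n";
```

The ordering of the alternatives is the design choice here: bytes that already form valid UTF-8 are always given the benefit of the doubt, so two adjacent Latin1 characters that happen to look like one UTF-8 sequence would be kept unchanged.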

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string); 

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
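Garbling of this kind commonly appears when text that is already UTF8 gets converted to UTF8 again as if it were Latin1, once or several times. As an illustration only, and not the code of Encoding::fixUTF8(), the sketch below undoes such layers one at a time: it converts UTF-8 back to ISO-8859-1 and keeps the result only while it is still well-formed UTF-8. The function name and the round limit are hypothetical, and Windows-1252 characters are not handled.

```php
<?php
// Hypothetical sketch, not the actual Encoding::fixUTF8() implementation.
function undo_double_encoding_sketch($text, $maxRounds = 5)
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Pure ASCII cannot be double-encoded; nothing to undo.
        if (preg_match('/[\x80-\xFF]/', $text) !== 1) {
            break;
        }
        // Reverse one assumed "ISO-8859-1 to UTF-8" step. iconv() returns
        // false if the text is not valid UTF-8 or contains characters
        // that do not exist in ISO-8859-1.
        $candidate = @iconv('UTF-8', 'ISO-8859-1', $text);
        // Accept the result only if it is itself well-formed UTF-8.
        if ($candidate === false || preg_match('//u', $candidate) !== 1) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

$clean   = "F\xC3\xA9d\xC3\xA9ration";          // correct UTF-8
$garbled = iconv('ISO-8859-1', 'UTF-8', $clean); // encoded a second time
echo undo_double_encoding_sketch($garbled), "\n";
```

A correctly encoded string stops at the first round, because converting it to ISO-8859-1 yields bytes that are no longer valid UTF-8.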
Sebastián Grignoli answered Oct 04 '22 19:10