I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and output below show two sample strings and how they are displayed. One of them needs to be converted to UTF8 and the other already is UTF8.
How do/should I go about checking whether I should encode the string or not? I need each string to be output correctly, so how do I check whether it is already UTF8 or whether it needs to be converted?
I am using PHP 5.2, mysql myisam tables:
CREATE TABLE IF NOT EXISTS `entities` (
  ....
  `title` varchar(255) NOT NULL
  ....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>

Original : France Télécom
UTF8 Encode : France Télécom
UTF8 Decode : France T�l�com
TRANSLIT : France Télécom
IGNORE TRANSLIT : France Télécom
IGNORE : France Télécom
Plain : France Télécom

Original : Cond� Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications
Thanks for your time on this one. Character encoding and I don't get on very well!
UPDATE:
echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";

23|24|Cond� Nast Publications
23|21|Cond� Nast Publications
16|20|France Télécom
16|14|France Télécom
This may be a job for the mb_detect_encoding() function.
In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer": it checks for the presence of certain characters and byte values to make an educated guess. In this narrow case, though (it only needs to distinguish between UTF-8 and ISO-8859-1), it should work.
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected encoding '.$enc."<br />";
echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";
?>
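You may get incorrect detection results for strings that do not contain special characters, but that is not a problem: pure ASCII text is byte-for-byte identical in UTF-8 and ISO-8859-1, so the conversion leaves it unchanged.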
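Since only two encodings are in play here, the same idea can also be sketched without guessing: check whether the bytes are valid UTF-8 and convert only when they are not. This is a minimal sketch, assuming the mbstring extension is loaded and that anything that is not valid UTF-8 really is ISO-8859-1; the helper name ensure_utf8() is made up for illustration.

```php
<?php
// Hypothetical helper (not part of PHP or any library).
// Assumption: a string that is not valid UTF-8 is ISO-8859-1.
function ensure_utf8($text)
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise treat every byte as one Latin-1 character and re-encode it.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// The two kinds of rows from the question, written as raw bytes:
echo ensure_utf8("France T\xC3\xA9l\xC3\xA9com"), "<br />";   // already UTF-8
echo ensure_utf8("Cond\xE9 Nast Publications"), "<br />";     // Latin-1 bytes
```

The limitation is the same as with detection: a Latin-1 string whose bytes happen to form valid UTF-8 sequences will be left as it is.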
I made a function that addresses all these issues. It's called Encoding::toUTF8().
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>
Output:
Original : France Télécom
Encoding::toUTF8 : France Télécom
Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications
You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them too.
Encoding::toUTF8() will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
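To give an idea of how a mixed string can be handled at all, here is a rough sketch of the general technique. It is not the code inside Encoding::toUTF8(); the function name mixed_latin1_to_utf8() is hypothetical, it assumes the mbstring extension is loaded, and it only covers Latin1 (no Windows-1252 special cases). It walks the bytes, keeps anything that already forms a valid UTF8 sequence, and re-encodes every other high byte as a single Latin1 character.

```php
<?php
// Hypothetical illustration only, not the forceUTF8 implementation.
function mixed_latin1_to_utf8($s)
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $byte = ord($s[$i]);

        // ASCII is identical in Latin1 and UTF8: copy it through.
        if ($byte < 0x80) {
            $out .= $s[$i];
            $i++;
            continue;
        }

        // How long would a UTF8 sequence starting with this byte be?
        if ($byte >= 0xC2 && $byte <= 0xDF) {
            $n = 2;
        } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
            $n = 3;
        } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
            $n = 4;
        } else {
            $n = 0; // cannot start a UTF8 sequence
        }

        $seq = ($n > 0) ? substr($s, $i, $n) : '';
        if ($n > 0 && strlen($seq) === $n && mb_check_encoding($seq, 'UTF-8')) {
            // Already a well-formed UTF8 character: keep it as it is.
            $out .= $seq;
            $i += $n;
        } else {
            // Stray high byte: assume it is one Latin1 character.
            $out .= mb_convert_encoding($s[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
    }
    return $out;
}

// One string mixing a UTF8 "é" (C3 A9) and a Latin1 "é" (E9):
echo mixed_latin1_to_utf8("T\xC3\xA9l\xE9com");
```

The trade-off is the usual ambiguity: two Latin1 characters whose bytes happen to look like one UTF8 sequence are kept as that UTF8 character.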
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip
I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
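Garbling of this kind usually comes from UTF8 text being run through a Latin1-to-UTF8 conversion one or more times too many. As a rough illustration of how that can be undone (again a sketch under assumptions, not what Encoding::fixUTF8() actually does), the hypothetical undo_double_utf8() below peels off one conversion layer at a time and stops as soon as another step would lose data or produce invalid UTF8. It assumes the mbstring extension and handles Latin1 layers only, not Windows-1252.

```php
<?php
// Hypothetical illustration only, not the forceUTF8 implementation.
function undo_double_utf8($s, $maxRounds = 5)
{
    for ($round = 0; $round < $maxRounds; $round++) {
        if (!mb_check_encoding($s, 'UTF-8')) {
            break; // not UTF8 at all, nothing to peel off
        }
        // Undo one layer: read as UTF8, write as single Latin1 bytes.
        $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        if ($candidate === $s) {
            break; // pure ASCII, nothing changed
        }
        if (!mb_check_encoding($candidate, 'UTF-8')) {
            break; // $s was the real text, one more step would break it
        }
        // Guard against characters outside Latin1 having been replaced.
        if (mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') !== $s) {
            break;
        }
        $s = $candidate;
    }
    return $s;
}

// "Fédération" encoded to UTF8 twice, written as raw bytes:
echo undo_double_utf8("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration");
```

A string that genuinely contains text such as "Ã©" would also be collapsed by this loop, so it is only safe on data known to be garbled.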