I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.
In my database I have a few instances of bad encodings that print like: î
I need some sort of function that will help me map the instances of î, ÃÂ, ü and others like it to their proper accented UTF-8 characters.
This error is created when the uploaded file is not in a UTF-8 format. UTF-8 is the dominant character encoding format on the World Wide Web. This error occurs because the software you are using saves the file in a different type of encoding, such as ISO-8859, instead of UTF-8.
The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.
If you have double-encoded UTF8 characters (various smart quotes, dashes, apostrophe ’, quotation mark “, etc), in mysql you can dump the data, then read it back in to fix the broken encoding.
Like this:
mysqldump -h DB_HOST -u DB_USER -p DB_PASSWORD --opt --quote-names \ --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql mysql -h DB_HOST -u DB_USER -p DB_PASSWORD \ --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql
This was a 100% fix for my double encoded UTF-8.
Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/
If you utf8_encode()
on a string that is already UTF-8 then it looks garbled when it is encoded multiple times.
I made a function toUTF8()
that converts strings into UTF-8.
You don't need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a mix of these three.
I used this myself on a feed with mixed encodings in the same string.
Usage:
$utf8_string = Encoding::toUTF8($mixed_string); $latin1_string = Encoding::toLatin1($mixed_string);
My other function fixUTF8()
fixes garbled UTF8 strings if they were encoded into UTF8 multiple times.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football"); echo Encoding::fixUTF8("Fédération Camerounaise de Football"); echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football"); echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football Fédération Camerounaise de Football Fédération Camerounaise de Football Fédération Camerounaise de Football
Download:
https://github.com/neitanod/forceutf8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With