
Fixing broken UTF-8 encoding

I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.

In my database I have a few instances of bad encodings that print like: î

  • The database collation is utf8_general_ci
  • PHP is using a proper UTF-8 header
  • Notepad++ is set to use UTF-8 without BOM
  • Database management is handled in phpMyAdmin
  • Not all accented characters are broken

I need a function that will map instances of î, í, ü and others like them to their proper accented UTF-8 characters.

Jayrox asked Aug 28 '09 02:08



2 Answers

If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophe ’, quotation mark “, etc.) in MySQL, you can dump the data with the connection character set forced to latin1, then load it back in as utf8 to fix the broken encoding.

Like this:

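# Dump with the connection forced to latin1. -p on its own prompts for the
# password; "-p DB_PASSWORD" with a space would be read as a database name.
mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names \
    --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

# Load the dump back in with the connection set to utf8.
mysql -h DB_HOST -u DB_USER -p \
    --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql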

This was a 100% fix for my double-encoded UTF-8 data.
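
To see why the round trip helps, here is a rough byte-level model of it in PHP. It is only a sketch under assumptions: dumpAsLatin1ReloadAsUtf8() is a made-up name, the mbstring extension is assumed to be loaded, and MySQL's latin1 is treated as Windows-1252 (MySQL describes latin1 as "cp1252 West European").

```php
<?php
// Hypothetical helper: models "dump as latin1, reload as utf8" for one value.
function dumpAsLatin1ReloadAsUtf8($stored)
{
    // A latin1 connection turns every stored character into a single byte.
    // A utf8 connection then reads those same bytes back as UTF-8, unchanged,
    // so the conversion below is the whole round trip in this model.
    return mb_convert_encoding($stored, 'Windows-1252', 'UTF-8');
}

$doubleEncoded = "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2"; // displays as "â€™"
$alreadyFine   = "\xC3\xAE";                           // a correct "î"

$fixed  = dumpAsLatin1ReloadAsUtf8($doubleEncoded);
$broken = dumpAsLatin1ReloadAsUtf8($alreadyFine);

var_dump($fixed === "\xE2\x80\x99");           // bool(true): a real "’"
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false): lone byte 0xEE
```

In this model the double-encoded value comes back clean, while a value that was already correct turns into invalid UTF-8. If only some rows are broken, it is probably worth trying this on a copy of the database first.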

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

jsdalton answered Oct 10 '22 07:10


If you apply utf8_encode() to a string that is already UTF-8, the result looks garbled, and it gets worse every time the string is encoded again.
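
A small sketch of that effect, using mb_convert_encoding() from ISO-8859-1 to UTF-8, which is the same conversion utf8_encode() performs. The variable names are just for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// "î" as raw UTF-8 bytes, written with escapes so the encoding
// of this source file does not matter.
$correct = "\xC3\xAE";

// Reading those two bytes as ISO-8859-1 and converting to UTF-8
// encodes each byte as its own character ("Ã" and "®").
$garbled = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), "\n"; // c3ae
echo bin2hex($garbled), "\n"; // c383c2ae
```

One character has become two, which is why "î" shows up as "Ã®" after one extra round of encoding.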

I made a function toUTF8() that converts strings into UTF-8.

You don't need to know the encoding of your strings. It can be Latin-1 (ISO 8859-1), Windows-1252 or UTF-8, or a mix of these three.

I used this myself on a feed with mixed encodings in the same string.

Usage:

$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);
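
For intuition only, here is a much simpler whole-string version of the same idea. It is not the library's code: toUtf8IfNeeded() is a hypothetical name, it assumes mbstring is loaded, and it assumes anything that is not valid UTF-8 is Windows-1252. Unlike toUTF8(), it cannot handle a mix of encodings inside one string.

```php
<?php
// Hypothetical helper, not part of the forceutf8 library.
function toUtf8IfNeeded($s)
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8 (plain ASCII included), leave it alone
    }
    // Assumption: everything else is Windows-1252.
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

echo bin2hex(toUtf8IfNeeded("\xEE")), "\n";     // c3ae  (Latin-1 "î" converted)
echo bin2hex(toUtf8IfNeeded("\xC3\xAE")), "\n"; // c3ae  (UTF-8 "î" untouched)
```

Checking before converting is what avoids the double encoding in the first place.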

My other function, fixUTF8(), fixes garbled UTF-8 strings that were encoded into UTF-8 multiple times.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string); 

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
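
If you only want to see the principle, repeated encoding can be undone by peeling off one layer at a time for as long as the result is still valid UTF-8. The sketch below is not how fixUTF8() is implemented: undoRepeatedUtf8() and its parameters are hypothetical, mbstring is assumed, and the extra layers are assumed to come from ISO-8859-1 (pass 'Windows-1252' for text that went through that code page, such as smart quotes).

```php
<?php
// Hypothetical helper, not the forceutf8 implementation.
// $legacy is the single-byte encoding the text was wrongly read as (an assumption).
function undoRepeatedUtf8($s, $legacy = 'ISO-8859-1', $maxRounds = 5)
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Peel one layer: turn each character back into a single byte.
        $candidate = mb_convert_encoding($s, $legacy, 'UTF-8');

        $unchanged = ($candidate === $s); // ASCII only, nothing to undo
        $lossy     = (mb_convert_encoding($candidate, 'UTF-8', $legacy) !== $s);
        $invalid   = !mb_check_encoding($candidate, 'UTF-8');

        if ($unchanged || $lossy || $invalid) {
            break; // keep $s, the last layer that was still good
        }
        $s = $candidate;
    }
    return $s;
}

$good  = "F\xC3\xA9d\xC3\xA9ration";                        // "Fédération"
$once  = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1'); // encoded twice
$twice = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1'); // encoded three times

var_dump(undoRepeatedUtf8($good) === $good);  // bool(true)
var_dump(undoRepeatedUtf8($once) === $good);  // bool(true)
var_dump(undoRepeatedUtf8($twice) === $good); // bool(true)
```

One caveat with this approach: text that legitimately contains a sequence like "Ã©" would be rewritten as well, since it cannot be told apart from a garbled "é".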

Download:

https://github.com/neitanod/forceutf8

Sebastián Grignoli answered Oct 10 '22 08:10