Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to fix double-encoded UTF8 characters (in an utf-8 table)

A previous LOAD DATA INFILE was run under the assumption that the CSV file is latin1-encoded. During this import the multibyte characters were interpreted as two single character and then encoded using utf-8 (again).

This double-encoding created anomalies like ñ instead of ñ.

How to correct these strings?

like image 273
vbence Avatar asked Jul 11 '12 15:07

vbence


People also ask

What characters are not allowed in UTF-8?

0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.

Is UTF-8 a character set?

UTF-8 is a character encoding system. It lets you represent characters as ASCII text, while still allowing for international characters, such as Chinese characters.


1 Answers

The following MySQL function will return the correct utf8 string after double-encoding:

CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8) 

It can be used with an UPDATE statement to correct the fields:

UPDATE tablename SET     field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8); 
like image 86
vbence Avatar answered Sep 20 '22 21:09

vbence