A previous LOAD DATA INFILE
was run under the assumption that the CSV file is latin1
-encoded. During this import the multibyte characters were interpreted as two single character and then encoded using utf-8 (again).
This double-encoding created anomalies like ñ
instead of ñ
.
How to correct these strings?
0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
UTF-8 is a character encoding system. It lets you represent characters as ASCII text, while still allowing for international characters, such as Chinese characters.
The following MySQL function will return the correct utf8 string after double-encoding:
CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)
It can be used with an UPDATE
statement to correct the fields:
UPDATE tablename SET field = CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With