How to decode a string that has been UTF-8 encoded twice to simple UTF-8?

Tags:

c#

mysql

utf-8

I have a huge MySQL table whose rows have been encoded in UTF-8 twice. For example, "Újratárgyalja" is stored as "ÃšjratÃ¡rgyalja".

The MySQL .NET connector returns them this way. I tried many combinations with System.Text.Encoding.Convert(), but none of them worked.

Sending SET NAMES 'utf8' (or another charset) doesn't solve it.

How can I decode them from double UTF-8 to UTF-8?

RoliSoft asked Sep 19 '09 18:09


1 Answer

Peculiar problem, but I think I can reproduce it with a suitably unholy mix of UTF-8 and Latin-1 (two uses of UTF-8 alone, without an interspersed mis-step in Latin-1, would not produce it). Here's the whole weird round trip, "there and back again" (Python 2.* or IronPython should both be able to reproduce this):

# -*- coding: utf-8 -*-
uni = u'Újratárgyalja'
enc1 = uni.encode('utf-8')
enc2 = enc1.decode('latin-1').encode('utf-8')
dec3 = enc2.decode('utf-8')
dec4 = dec3.encode('latin-1').decode('utf-8')

for x in (uni, enc1, enc2, dec3, dec4):
  print repr(x), x

This is the interesting output...:

u'\xdajrat\xe1rgyalja' Újratárgyalja
'\xc3\x9ajrat\xc3\xa1rgyalja' Újratárgyalja
'\xc3\x83\xc2\x9ajrat\xc3\x83\xc2\xa1rgyalja' Ãjratárgyalja
u'\xc3\x9ajrat\xc3\xa1rgyalja' Ãjratárgyalja
u'\xdajrat\xe1rgyalja' Újratárgyalja

The weird string starting with Ã appears as enc2, i.e. two UTF-8 encodings WITH an interspersed Latin-1 decoding thrown into the mix. And as you can see, it can be undone by the exactly converse sequence of operations: decode as UTF-8, re-encode as Latin-1, re-decode as UTF-8 again, and the original string is back (yay!).
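For readers on Python 3, where text and bytes are separate types, the same repair can be sketched as below. This is only an illustration under assumptions: `fix_double_utf8` is a hypothetical helper name, the input is assumed to be already decoded from UTF-8 once (as a database driver would normally hand it over), and the stray step is assumed to be strict Latin-1.

```python
def fix_double_utf8(text: str) -> str:
    """Undo one round of 'UTF-8 bytes misread as Latin-1' (hypothetical helper)."""
    # Map each character back to the single byte it was misread from,
    # then decode those bytes as the UTF-8 they originally were.
    return text.encode('latin-1').decode('utf-8')


original = 'Újratárgyalja'
# Reproduce the damage: UTF-8 bytes misread as Latin-1.
mangled = original.encode('utf-8').decode('latin-1')
repaired = fix_double_utf8(mangled)
print(repaired == original)
```

Whether the intermediate step was strict ISO-8859-1 or Windows-1252 is an assumption here. Characters such as š (U+0161) exist only in the latter, and this helper raises UnicodeEncodeError when it meets one.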

I believe that the normal round-trip properties of both Latin-1 (aka ISO-8859-1) and UTF-8 should guarantee that this sequence will work (sorry, no C# around to try in that language right now, but I would expect that the encoding/decoding sequences should not depend on the specific programming language in use).
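The round-trip claim can be checked directly. The Python 3 sketch below is a spot check under assumptions, not a proof: the sample strings are arbitrary, and the middle store-and-fetch step is modelled as a plain UTF-8 encode and decode. It first confirms that Latin-1 maps all 256 byte values to characters and back without loss, then runs a few strings through the full mangle-and-repair cycle.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point U+00NN, so decoding never fails
# and re-encoding gives back exactly the same bytes.
as_text = all_bytes.decode('latin-1')
latin1_round_trips = as_text.encode('latin-1') == all_bytes

# Consequence: any UTF-8 byte string survives the mangle/unmangle cycle.
samples = ['Újratárgyalja', 'naïve café', '日本語']
cycle_ok = all(
    s.encode('utf-8').decode('latin-1')   # the mis-step: bytes read as Latin-1
     .encode('utf-8').decode('utf-8')     # stored and fetched as UTF-8
     .encode('latin-1').decode('utf-8')   # the repair
    == s
    for s in samples
)
print(latin1_round_trips, cycle_ok)
```

The key property is that Latin-1 is a one-to-one mapping over all byte values, so the misreading loses no information.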

Alex Martelli answered Oct 21 '22 02:10