I have a huge MySQL table which has its rows encoded in UTF-8 twice. For example "Újratárgyalja" is stored as something like "ÃjratÃ¡rgyalja".
The MySQL .NET connector returns them this way. I tried lots of combinations with System.Text.Encoding.Convert(), but none of them worked.
Sending set names 'utf8' (or another charset) doesn't solve it.
How can I decode them from double UTF-8 to UTF-8?
Peculiar problem, but I think I can reproduce it with a suitably unholy mix of UTF-8 and Latin-1 (two UTF-8 encodings alone won't do it; there has to be a Latin-1 mis-step in between). Here's the whole weird round trip, "there and back again" (Python 2.x or IronPython should be able to reproduce this):
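# -*- coding: utf-8 -*-
uni = u'Újratárgyalja'
# Correct single UTF-8 encoding.
enc1 = uni.encode('utf-8')
# The mis-step: read the UTF-8 bytes as Latin-1, then encode as UTF-8 again.
enc2 = enc1.decode('latin-1').encode('utf-8')
# Undo it: decode as UTF-8, re-encode as Latin-1, decode as UTF-8 again.
dec3 = enc2.decode('utf-8')
dec4 = dec3.encode('latin-1').decode('utf-8')
for x in (uni, enc1, enc2, dec3, dec4):
    print repr(x), x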
This is the interesting output...:
u'\xdajrat\xe1rgyalja' Újratárgyalja
'\xc3\x9ajrat\xc3\xa1rgyalja' Újratárgyalja
'\xc3\x83\xc2\x9ajrat\xc3\x83\xc2\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xc3\x9ajrat\xc3\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xdajrat\xe1rgyalja' Újratárgyalja
The weird string starting with Ã appears as enc2, i.e. two UTF-8 encodings with an interspersed Latin-1 decoding thrown into the mix. As you can see, it can be undone by the exactly converse sequence of operations: decode as UTF-8, re-encode as Latin-1, and decode as UTF-8 again, and the original string is back (yay!).
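For anyone on Python 3, where strings are already Unicode, the same undo step can be sketched as a small helper. This assumes the connector has already done the first UTF-8 decode, so only the "re-encode as Latin-1, decode as UTF-8" half is left. The name undo_double_utf8 is made up for illustration, not something from a library:

```python
def undo_double_utf8(text):
    """Reverse a UTF-8 -> Latin-1 -> UTF-8 mix-up on an already-decoded string.

    Hypothetical helper name; assumes the intermediate mis-step was Latin-1.
    """
    # Each character of the garbled string stands for one byte of the
    # original UTF-8 encoding, so map the characters back to bytes...
    raw = text.encode('latin-1')
    # ...and decode those bytes as the UTF-8 they were all along.
    return raw.decode('utf-8')


original = 'Újratárgyalja'
# Simulate what comes back from the database: UTF-8 bytes misread as Latin-1.
garbled = original.encode('utf-8').decode('latin-1')

print(repr(garbled))
print(undo_double_utf8(garbled))
```

If the garbled text contains characters outside Latin-1, the encode step will raise UnicodeEncodeError, which would suggest the intermediate charset was something else.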
I believe that the normal round-trip properties of both Latin-1 (aka ISO-8859-1) and UTF-8 should guarantee that this sequence will work (sorry, no C# around to try in that language right now, but I would expect that the encoding/decoding sequences should not depend on the specific programming language in use).
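The round-trip property I'm relying on can be checked directly, and it also suggests a cautious way to run the repair over a whole table where some rows may already be fine. This is only a sketch in Python 3; repair_if_garbled is a hypothetical name, and the "leave it alone on error" policy is my assumption about what you'd want:

```python
# Latin-1 maps each of the 256 byte values to the code point with the same
# number, so decoding and re-encoding arbitrary bytes loses nothing.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes


def repair_if_garbled(text):
    """Hypothetical helper: undo the mix-up when possible, else return text as-is."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # treat the value as not double-encoded and leave it untouched.
        return text


garbled = 'Újratárgyalja'.encode('utf-8').decode('latin-1')
print(repair_if_garbled(garbled))           # a double-encoded value
print(repair_if_garbled('Újratárgyalja'))   # an already-correct value
```

The check is a heuristic: a correctly stored value that happens to look like valid double-encoded text would be "repaired" too, so it's worth testing on a copy of the data first.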