Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to detect and fix incorrect character encoding

A upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8.

The upstream service is out of my control. They may fix it, it may never be fixed.

I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?

Is there any way to detect this issue and fix the encoding only when I find a bad encoding?

I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is perl so that encoding makes sense and each sample I've tried decoded correctly when I apply ISO-8859-1 encoding.


When the source sends e4 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).

  • utf-8 string as bytes: e4 9c 94
  • bytes e4 9c 94 as latin1 string: â
  • utf-8 string â as bytes: c3 a2 c2 9c c2 94

I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.

like image 977
everett1992 Avatar asked Dec 30 '25 14:12

everett1992


1 Answers

Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.

Here's an example in Python (you didn't mention a language...):

    # Assume bytes are latin1, but return encoded UTF-8.
    def bad(b):
        return b.decode('latin1').encode('utf8')
    
    # Assume bytes are UTF-8, and pass them along.
    def good(b):
        return b
    
    def decoder(b):
        try:
            return b.decode('utf8').encode('latin1').decode('utf8')
        except UnicodeError:
            return b.decode('utf8')
    
    b = '✔'.encode('utf8')
    print(decoder(bad(b)))
    print(decoder(good(b)))

Output:

✔
✔
like image 89
Mark Tolonen Avatar answered Jan 05 '26 03:01

Mark Tolonen



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!