Detecting bad UTF-8 encoding: list of bad characters to sniff?

I have a SQL Server 2010 database shared between two applications: one application we control, and the other is a third-party app that created the database in the first place. Our app is a CRM built on top of the third-party webmail app.

The database contains varchar columns and uses a Latin-1 collation. The third-party app is written in PHP and doesn't care about correctly encoding the data, so it stuffs UTF-8 encoded bytes into the varchar columns, where they are interpreted as Latin-1 and look like garbage.

Our CRM app is written in .NET, which automagically detects that the database collation differs from the encoding of the string in memory, so when .NET writes to the database, it converts the bytes to match the database encoding.

So... data written to the db from our app looks correct, but data written by the third-party app doesn't.

When our app writes FirstName = Céline, it is stored in the db as Céline.

When the webmail app writes FirstName = Céline, it is stored in the db as CÃ©line.
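
To make the corruption concrete, here is a minimal sketch of what effectively happens (the same round trip my code below performs, using System.Text):

    // é is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
    // Read back as Latin-1, those bytes become the two characters 'Ã' and '©'.
    byte[] utf8Bytes = Encoding.UTF8.GetBytes("Céline");             // 43 C3 A9 6C 69 6E 65
    string mojibake = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    Console.WriteLine(mojibake);                                     // CÃ©line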

Our CRM app needs to display contacts that were created in either system, so I'm writing an EncodingSniffer class that looks for flagged characters indicating a poorly encoded string and converts them.

Currently I have:

private static string[] _flaggedChars = new string[] {
            "Ã©"
        };

which works great for displaying CÃ©line as Céline, but I need to add to the list.

Does anyone know of a resource listing all the possible ways UTF-8 special characters could be misinterpreted as ISO-8859-1?

Thanks

Clarification: since I am working in .NET, a string loaded into memory from the database is converted to UTF-16. So, regardless of whether it was encoded correctly in the database, it is now represented as UTF-16 code units. I need to be able to analyze those UTF-16 code units and determine whether they are screwed up due to UTF-8 bytes being stuffed into an ISO-8859-1 database... clear as mud, right?
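
To illustrate what that means in practice: for mojibake produced via ISO-8859-1, each UTF-16 code unit of the loaded string is exactly one of the original UTF-8 byte values, which is what makes the damage detectable at all.

    string fromDb = "Ã©";                      // what 'é' looks like after the damage
    foreach (char c in fromDb)
        Console.Write("{0:X4} ", (int)c);      // prints: 00C3 00A9 — the UTF-8 bytes of 'é'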

Here is what I have so far. It has cleaned up the display of most misencoded characters, but I am still having trouble with É, for instance: Éric is stored in the db by webmail as Ã‰ric, but after detecting the bad encoding and converting it back, it displays as �?ric. Looking at a user who has 2,500 contacts, hundreds of which had encoding issues, É is the only thing that isn't displaying correctly...

public static Regex CreateRegex()
{
    string specials = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö";

    List<string> flags = new List<string>();
    foreach (char c in specials)
    {
        // Take each special character, encode it as UTF-8, and interpret the
        // resulting bytes as Latin-1.
        string interpretedAsLatin1 = Encoding.GetEncoding("iso-8859-1")
            .GetString(Encoding.UTF8.GetBytes(c.ToString()))
            .Trim();
        // Each UTF-8 char here is 2 bytes, interpreted as two single-byte Latin-1 chars.
        if (interpretedAsLatin1.Length > 0)
            flags.Add(interpretedAsLatin1);
    }

    string regex = string.Empty;
    foreach (string s in flags)
    {
        if (regex.Length > 0)
            regex += '|';
        regex += s;
    }
    return new Regex("(" + regex + ")");
}

public static string CheckUTF(string data)
{
    Match match = CreateRegex().Match(data);
    if (match.Success)
        // From ISO-8859-1 (Latin-1) back to UTF-8.
        return Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(data));
    else
        return data;
}

So: É is being converted to the two code units 195 'Ã' and 8240 '‰'.
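
Those two code units are a strong hint about what's going wrong: 8240 is U+2030 '‰', which is byte 0x89 in Windows-1252 but an unprintable C1 control character in true ISO-8859-1, and É is exactly the character whose UTF-8 encoding (0xC3 0x89) lands on that byte. SQL Server's Latin1 collations actually use code page 1252, so a likely fix (a sketch, assuming the data really is Windows-1252) is to swap iso-8859-1 for windows-1252 in the conversion:

    // Hedged sketch: treat the db text as Windows-1252 (a superset of ISO-8859-1
    // that maps 0x89 to '‰') so that É survives the round trip.
    string stored = "Ã‰ric";                                   // webmail's version of "Éric"
    string repaired = Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes(stored));
    Console.WriteLine(repaired);                               // Éric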

asked May 07 '12 by Michael



1 Answer

You should probably just try to decode the byte string as UTF-8, and if you get an error, assume it's ISO-8859-1 instead.

Text that is encoded as ISO-8859-1 rarely "happens" to also be valid UTF-8... unless it's ISO-8859-1 that actually contains only ASCII, but in that case you don't have a problem at all, of course. So this method is reasonably robust.
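
In .NET this can be done without scanning for flagged characters at all. A sketch of the idea (the fallback classes are standard System.Text; the helper name is mine): make the UTF-8 decoder throw instead of silently substituting U+FFFD, and fall back on failure.

    // Sketch: strict UTF-8 decode with ISO-8859-1 fallback.
    static string DecodeSmart(byte[] raw)
    {
        // This UTF-8 encoding throws DecoderFallbackException on invalid input
        // instead of quietly inserting replacement characters.
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            return strictUtf8.GetString(raw);                              // valid UTF-8
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(raw);      // assume Latin-1
        }
    }

Since your data has already been loaded into UTF-16 strings, you can recover the bytes first with Encoding.GetEncoding(1252).GetBytes(data) (matching the database's actual code page) and then run them through the same test.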

Ignoring which characters occur more frequently than others in real language, here is a naive analysis that assumes each character occurs with the same frequency. Let's try to find out how often valid ISO-8859-1 can be mistaken for UTF-8, resulting in mojibake. I also assume that C1 control characters (U+0080 through U+009F) don't occur.

Consider any given byte in the byte string. If the byte is close to the end of the string, you are even more likely to detect malformed UTF-8, because some byte sequences will be known to be too short to be valid UTF-8. But assuming the byte is not near the end of the string:

  • p(byte decodes as ASCII) = 0.57. This gives no information about whether the string is ASCII, ISO-8859-1, or UTF-8.
  • If this byte is 0x80 through 0xc1 or 0xf8 through 0xff, it can't be UTF-8, so you'll detect that immediately. p = 0.33.
  • If this first byte is 0xc2 through 0xdf (p = 0.11), then it could be valid UTF-8, but only if it's followed by a byte with a value between 0x80 and 0xbf. The probability that the next byte fails to be in that range is 192/224 = 0.86, so the probability that UTF-8 fails here is 0.11*0.86 = 0.09.
  • If the first byte is 0xe0 through 0xef, then it could be valid UTF-8, but only if it's followed by 2 continuation bytes. The probability of detecting bad UTF-8 is thus (16/224)*(1-(0.14*0.14)) = 0.07.
  • Similarly, for 0xf0 through 0xf7 the probability is (8/224)*(1-(0.14*0.14*0.14)) = 0.04.

At each byte in a long string, the probability of detecting bad UTF-8 is 0.33+0.09+0.07+0.04 = 0.53.

So for a long string, the probability that ISO-8859-1 will pass silently through a UTF-8 decoder is very small: it approximately halves for each additional character! With a per-byte pass probability of about 0.47, for example, a 10-character string slips through with probability 0.47^10 ≈ 0.0005, about one in two thousand.

This analysis of course assumes random ISO-8859-1 characters. In practice the detection rate will not be quite as good as that (mostly because most bytes in real-world text are actually ASCII), but it will still be very good.
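
The claim is easy to sanity-check empirically. A throwaway sketch, under the same assumption as above (characters drawn uniformly from ASCII plus 0xA0 through 0xFF):

    // Monte-Carlo check: how often does a random 8-byte "ISO-8859-1" string
    // happen to be valid UTF-8?
    var rng = new Random(42);
    Encoding strictUtf8 = Encoding.GetEncoding(
        "utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    int trials = 100000, passed = 0;
    for (int t = 0; t < trials; t++)
    {
        var bytes = new byte[8];
        for (int i = 0; i < bytes.Length; i++)
        {
            int b;
            do { b = rng.Next(0, 256); } while (b >= 0x80 && b <= 0x9F);   // no C1 controls
            bytes[i] = (byte)b;
        }
        try { strictUtf8.GetString(bytes); passed++; }
        catch (DecoderFallbackException) { }
    }
    // Expect only a tiny fraction to pass, shrinking roughly geometrically with length.
    Console.WriteLine("{0} of {1} random strings passed as UTF-8", passed, trials);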

answered Oct 25 '22 by Celada