
Detect if a string was double-encoded in UTF-8

I need to process a large list of short strings (mostly in Russian, though any other language is possible, including random garbage from a cat walking across the keyboard).

Some of these strings will be encoded in UTF-8 twice.

I need to reliably detect if a given string is double-encoded, and fix it. I should do this without using any external libraries, just by inspecting the bytes. The detection should be as fast as possible.

The question is: how to detect that a given string was encoded in UTF-8 twice?

Update:

Original strings are in UTF-8. Here is the AS3 code that does the second encoding (unfortunately I don't have control over the client code, so I can't fix it there):

private function toUTF8(s : String) : String {
    // Write the string's UTF-8 bytes into a ByteArray...
    var byteArray : ByteArray = new ByteArray();
    byteArray.writeUTFBytes(s);
    byteArray.position = 0;

    var res : String = "";

    // ...then turn each byte back into a separate character code,
    // so the result holds one character per UTF-8 byte.
    while (byteArray.bytesAvailable) {
        res += String.fromCharCode(byteArray.readUnsignedByte());
    }

    return res;
}

myString = toUTF8(("" + myString).toLowerCase().substr(0, 64));

Note the toLowerCase() call. Maybe that can help?
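
For illustration only (this is not my real processing code): the effect of toUTF8() is that every UTF-8 byte of the original string becomes a separate character code, which is effectively the same as decoding the UTF-8 bytes as Latin-1. In Python terms, something like:

original = "тест"                                     # client-side string (already Unicode)
mangled = original.encode("utf-8").decode("latin-1")  # roughly what toUTF8() returns
received = mangled.encode("utf-8")                    # later sent to the server as UTF-8 again
print(received)
# b'\xc3\x91\xc2\x82\xc3\x90\xc2\xb5\xc3\x91\xc2\x81\xc3\x91\xc2\x82'
# i.e. the original UTF-8 bytes, UTF-8-encoded a second time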

asked Feb 17 '11 by Alexander Gladysh


1 Answer

In principle you can't, especially allowing for cat-garbage.

You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.

Take a non-ASCII character. UTF-8 encode it. You get some bytes, and all those bytes are valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.

So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of those Russian characters. There's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters.
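
For example (my own illustration, in Python): the single letter 'я' double-encoded via CP1251 produces exactly the same bytes as the two-character string 'СЏ' encoded once, so the bytes alone can't tell you which one you received:

double_encoded = "я".encode("utf-8").decode("cp1251").encode("utf-8")
single_encoded = "СЏ".encode("utf-8")
assert double_encoded == single_encoded   # both are b'\xd0\xa1\xd0\x8f'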

If you have some control over the original data, you could put a BOM at the start of it. Then when it comes back to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.
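
A sketch of what that check could look like, assuming Python and CP1251 as the faulty intermediate interpretation (the names here are mine, nothing standard):

BOM_ONCE = "\ufeff".encode("utf-8")                    # b'\xef\xbb\xbf'
BOM_TWICE = BOM_ONCE.decode("cp1251").encode("utf-8")  # b'\xd0\xbf\xc2\xbb\xd1\x97'

def bom_says_double_encoded(data):
    if data.startswith(BOM_TWICE):
        return True     # the BOM itself got double-encoded
    if data.startswith(BOM_ONCE):
        return False    # intact UTF-8 BOM, encoded only once
    return None         # no BOM at all, so this check can't decide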

In practice you can guess - UTF-8 decode it and then:

(a) look at the character frequencies, character pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare it nonsense, and hence possibly double-encoded. With enough non-printable characters it may be so nonsensical that you couldn't realistically type it even by mashing at the keyboard, unless maybe your ALT key was stuck.

(b) attempt the second decode. That is, starting from the Unicode code points that you got by decoding your UTF-8 data, first encode them to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid sequences of bytes), then it definitely wasn't double-encoded, at least not using CP1251 as the faulty interpretation. There's a sketch of both checks below.

This is more or less what you do if you have some bytes that might be UTF-8 or might be CP1251, and you don't know which.
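
A minimal sketch of both checks, assuming Python and CP1251 as the faulty intermediate encoding (the function names are my own):

import unicodedata

def looks_like_nonsense(text):
    # Heuristic (a), very rough: count control/unassigned/private-use
    # characters; lots of them suggests the text isn't something a human
    # typed, even by mashing the keyboard.
    bad = sum(1 for ch in text if unicodedata.category(ch) in ("Cc", "Cn", "Co"))
    return bad > len(text) // 10

def passes_reverse_transform(data, intermediate="cp1251"):
    # Check (b): re-encode the decoded text to the suspected intermediate
    # encoding and see whether the result is itself valid UTF-8. If any
    # step fails, the data definitely wasn't double-encoded this way.
    try:
        text = data.decode("utf-8")
        text.encode(intermediate).decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True

def fix_double_encoding(data, intermediate="cp1251"):
    # Undo one level of encoding; only call this once you've guessed,
    # via the two checks above, that the data really was encoded twice.
    return data.decode("utf-8").encode(intermediate)

Note that for the AS3 code shown in the question, the faulty interpretation is effectively Latin-1 (each byte becomes a character code), so "latin-1" rather than "cp1251" may be the right value for intermediate in that particular case.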

You'll get some false positives for single-encoded cat-garbage indistinguishable from double-encoded data, and maybe a very few false negatives for data that was double-encoded but that after the first encode by fluke still looked like Russian.

If your original encoding has more holes in it than CP1251, you'll have fewer false positives, because fewer byte sequences admit a double-encoded interpretation in the first place.

Character encodings are hard.

answered Oct 12 '22 by Steve Jessop