I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.
My question is quite simply: is this a bad idea?
A number of sources (including this RFC) suggest that any overlong UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.
Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean UTF-8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have; it simply converts them into a normalised form.
Am I missing something? Is there a hidden danger I haven't considered?
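To make "convert to the shortest normal form" concrete, here is a minimal sketch of the idea for the two-byte case only. It is written in Python for illustration and is not the module's actual code; the helper names `decode_two_byte` and `shortest_form` are hypothetical.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode a two-byte UTF-8-style sequence WITHOUT rejecting overlong forms."""
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte sequence")
    # 5 payload bits from the lead byte, 6 from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def shortest_form(code_point: int) -> bytes:
    """Re-encode a code point in its shortest (standard) UTF-8 form."""
    return chr(code_point).encode("utf-8")


# 0xC1 0x81 is an overlong two-byte spelling of U+0041 ("A").
overlong = bytes([0xC1, 0x81])
code_point = decode_two_byte(overlong[0], overlong[1])
normalised = shortest_form(code_point)
```

Under these assumptions the two bytes collapse to the single ASCII byte for "A". A real implementation would also need to cover three- and four-byte overlong forms.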
Yes, this is a bad idea.
Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.
The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.
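A rough sketch of that failure mode, in Python for illustration. Both `looks_safe` (an upstream filter that only looks for a literal ASCII `<`) and `fix_overlong_two_byte` (a repair step that handles only the two-byte overlong case) are hypothetical stand-ins, not code from any real tool.

```python
def looks_safe(raw: bytes) -> bool:
    """Hypothetical upstream check: reject data containing an ASCII '<'."""
    return b"<" not in raw


def fix_overlong_two_byte(raw: bytes) -> bytes:
    """Hypothetical repair: rewrite overlong two-byte forms (lead 0xC0/0xC1) as ASCII."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b in (0xC0, 0xC1) and i + 1 < len(raw) and raw[i + 1] & 0xC0 == 0x80:
            # Lead bytes 0xC0/0xC1 can only ever encode values below 0x80.
            out.append(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


payload = b"\xC0\xBCscript>"
passed_check = looks_safe(payload)         # the filter sees no 0x3C byte
repaired = fix_overlong_two_byte(payload)  # the repair step creates one
```

The payload passes the check as raw bytes, yet after the repair it reads `<script>`. The data was validated in one form and used in another.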
No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed-encoding files, you should consider encountering one as a sign that you have misguessed the encoding.
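One way to act on that advice is to decode strictly and treat a failure as evidence that the bytes are not UTF-8 at all. The sketch below uses Python's built-in strict UTF-8 decoder, which rejects overlong forms. The Latin-1 fallback and the `decode_guess` name are assumptions for illustration, not a recommendation for any particular data set.

```python
from typing import Tuple


def decode_guess(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding used): strict UTF-8 first, else assume Latin-1."""
    try:
        # Python's UTF-8 codec raises on overlong sequences such as C0 BC.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Latin-1.
        return raw.decode("latin-1"), "latin-1"


text, encoding = decode_guess(b"\xC0\xBCscript>")
```

Here the overlong bytes are never turned into `<`. They are read as the two Latin-1 characters À and ¼, so no new markup appears.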
I don't think this is a bad idea from a security or usability perspective.
From a security perspective, you should be sanitizing user input before use. So you can run your clean-up routines, and then make sure the data doesn't contain greater-than/less-than symbols (<>) before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that character-encoding issues such as GBK vs Latin1 can lead to SQL injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform-specific MySQL library bindings.)
Applying one blanket sanitization to all user input is generally a terrible idea, because you don't know how a specific variable will be used. For instance, SQL injection and XSS involve very different control characters, and using the same sanitization for both often leads to vulnerabilities.
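As a small illustration of handling each context separately, the sketch below escapes for HTML at output time and uses a parameterized query for the database. It is in Python and swaps in the standard-library `html.escape` and `sqlite3` placeholders in place of mysql_real_escape_string(). The table and variable names are made up for the example.

```python
import html
import sqlite3

user_input = "<script>alert('x')</script>"

# HTML context: escape at the point of output.
safe_html = html.escape(user_input)

# SQL context: pass the value as a bound parameter instead of escaping it by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
conn.execute("INSERT INTO comments (body) VALUES (?)", (user_input,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
conn.close()
```

The stored value stays byte-for-byte what the user sent, and only the HTML rendering is escaped, so neither context relies on the other's rules.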