Should I convert overlong UTF-8 strings to their shortest normal form?

I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.

My question is quite simply "is this a bad idea"?

A number of sources (including this RFC) suggest that any overlong UTF-8 sequence should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.

Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have; it simply converts them into a normalised form.

Am I missing something? Is there a hidden danger I haven't considered?

Grant McLean asked Apr 30 '10 10:04




2 Answers

Yes, this is a bad idea.

Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.

The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.
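A minimal Python sketch of how that hole opens up. Both `looks_safe` (a byte-level filter that ran earlier) and `naive_decode` (a lenient decoder that accepts two-byte overlongs) are hypothetical names made up for illustration; neither comes from Encoding::FixLatin.

```python
def looks_safe(data: bytes) -> bool:
    """Hypothetical upstream check: reject input containing a literal '<' byte."""
    return b"<" not in data


def naive_decode(data: bytes) -> str:
    """Hypothetical lenient decoder that 'fixes' two-byte overlongs (lead 0xC0/0xC1)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0xC0, 0xC1) and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # 110xxxxx 10yyyyyy decoded without checking for the shortest form.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            # Everything else is read as Latin-1 to keep the sketch short.
            out.append(chr(b))
            i += 1
    return "".join(out)


payload = b"\xC0\xBCscript>"
passed_filter = looks_safe(payload)   # True: there is no 0x3C byte to find
repaired = naive_decode(payload)      # the overlong pair turns into '<'

try:
    payload.decode("utf-8")           # Python's strict decoder refuses the overlong
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

The filter sees nothing wrong with the raw bytes, yet the "repaired" string starts with `<script>`. A strict decoder avoids the problem by refusing the input outright.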

No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.
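As a rough illustration of treating overlongs as a warning sign rather than something to repair, here is a Python sketch that only reports where they occur. The function name `find_overlongs` is made up for this example, and the byte ranges follow the standard UTF-8 rules (lead bytes 0xC0 and 0xC1 are never valid; 0xE0 must be followed by 0xA0 or higher; 0xF0 by 0x90 or higher).

```python
from typing import List


def find_overlongs(data: bytes) -> List[int]:
    """Return byte offsets where an overlong UTF-8 sequence appears to start."""
    offsets = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        if nxt is None or not 0x80 <= nxt <= 0xBF:
            continue  # not followed by a continuation byte
        if b in (0xC0, 0xC1):
            offsets.append(i)  # 2-byte form of a code point below U+0080
        elif b == 0xE0 and nxt < 0xA0:
            offsets.append(i)  # 3-byte form of a code point below U+0800
        elif b == 0xF0 and nxt < 0x90:
            offsets.append(i)  # 4-byte form of a code point below U+10000
    return offsets


suspicious = find_overlongs(b"\xC0\xBCscript>")
clean = find_overlongs("é<".encode("utf-8"))
```

A non-empty result is a hint that the bytes are probably not UTF-8 at all (for example, Latin-1 `À` followed by another high byte), so the caller can go back and re-guess the encoding instead of normalising.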

bobince answered Sep 20 '22 15:09



I don't think this is a bad idea from a security or usability perspective.

From a security perspective, you should be sanitizing user input before use. So you can run your cleanup routines, and then make sure the data doesn't contain greater-than/less-than symbols <> before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that character encoding issues such as GBK vs Latin1 can lead to SQL injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform-specific MySQL library bindings.)

Sanitizing all user input in one blanket pass is generally a terrible idea because you don't know how the specific variable will be used. For instance, SQL injection and XSS involve very different control characters, and applying the same sanitization to both often leads to vulnerabilities.
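To illustrate handling each context separately, here is a small Python sketch. It is an assumption-laden stand-in: it uses the standard library's `sqlite3` with a bound parameter instead of MySQL and mysql_real_escape_string(), and `html.escape` for the output side. The table name `comments` is invented for the example.

```python
import html
import sqlite3

user_input = "<script>alert('x')</script>"

# HTML context: escape at the point of output, not on the way in.
html_fragment = "<p>" + html.escape(user_input) + "</p>"

# SQL context: hand the raw value to the driver as a bound parameter,
# so no SQL-specific string escaping is done by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
conn.execute("INSERT INTO comments (body) VALUES (?)", (user_input,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
conn.close()
```

The stored value is the unmodified input, and only the HTML output path has its angle brackets and quotes escaped, so neither context has to live with the other's escaping rules.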

rook answered Sep 19 '22 15:09
