Should I convert overlong UTF-8 strings to their shortest normal form?

Tags: security, encoding, utf-8, perl

I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.

My question is quite simply "is this a bad idea"?

A number of sources (including this RFC) suggest that any overlong UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.

Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean UTF-8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have; it simply converts them into a normalised form.

Am I missing something? Is there a hidden danger I haven't considered?


asked Apr 30 '10 10:04

Grant McLean

2 Answers

Yes, this is a bad idea.

Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.

The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.
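To make the hazard concrete, here is a minimal sketch of how it can play out. Both subs are hypothetical stand-ins: `shorten_overlong_2byte` is not the Encoding::FixLatin code, and `looks_safe` imitates a simplistic upstream filter that only searches for a literal `<script`.

```perl
use strict;
use warnings;

# Hypothetical "fixer": collapse overlong two-byte sequences (lead byte
# 0xC0 or 0xC1) into the single ASCII byte they encode.
sub shorten_overlong_2byte {
    my ($bytes) = @_;
    $bytes =~ s{([\xC0\xC1])([\x80-\xBF])}
               {chr(((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F))}ge;
    return $bytes;
}

# Hypothetical upstream check that only looks for a literal "<script".
sub looks_safe {
    my ($bytes) = @_;
    return $bytes !~ /<script/i;
}

my $input = "\xC0\xBCscript>";
my $fixed = shorten_overlong_2byte($input);

# The raw bytes pass the check; the "fixed" bytes would not have.
printf "before fix: %s\n", looks_safe($input) ? "passes filter" : "blocked";
printf "after fix:  %s\n", looks_safe($fixed) ? "passes filter" : "blocked";
```

If the check ran before the clean-up step, the data was approved in a form that no longer matches what the application receives.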

No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.
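One way to act on that advice is to detect overlong forms and flag them instead of rewriting them. The sketch below is an assumption about how such a check might look, not part of any existing module; `has_overlong_utf8` and `$OVERLONG` are made-up names, and the pattern only covers two-, three- and four-byte overlong forms.

```perl
use strict;
use warnings;

# Byte patterns that can only occur as overlong encodings.
my $OVERLONG = qr{
      [\xC0\xC1] [\x80-\xBF]            # 2-byte form of an ASCII character
    | \xE0 [\x80-\x9F] [\x80-\xBF]      # 3-byte form of a code point below U+0800
    | \xF0 [\x80-\x8F] [\x80-\xBF]{2}   # 4-byte form of a code point below U+10000
}x;

# Returns 1 if the byte string contains an overlong sequence, else 0.
sub has_overlong_utf8 {
    my ($bytes) = @_;
    return $bytes =~ $OVERLONG ? 1 : 0;
}

for my $sample ("caf\xC3\xA9", "\xC0\xBCscript>", "\xE0\x80\xBC") {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $sample;
    printf "%-24s %s\n", $hex,
        has_overlong_utf8($sample)
            ? "overlong: reject, or re-guess the encoding"
            : "no overlong found";
}
```

Well-formed UTF-8 never contains these byte patterns, so a match means either malformed input or text that is really in another encoding (for example, Latin-1 `À¼` is the byte pair C0 BC).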

answered Sep 20 '22 15:09

bobince

I don't think this is a bad idea from a security or usability perspective.

From a security perspective, you should be sanitizing user input before use. So you can run your clean-up routines, and then make sure the data doesn't contain greater-than/less-than symbols (< and >) before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that character encoding issues such as GBK vs Latin1 can lead to SQL injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform-specific MySQL library bindings.)

Sanitizing all user input the same way up front is generally a terrible idea, because you don't know how a specific variable will be used. For instance, SQL injection and XSS involve very different control characters, and applying the same sanitization to both often leads to vulnerabilities.
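As a rough illustration of context-specific handling, the sketch below escapes a value for HTML output only. `escape_html` is a hypothetical helper written for this example. The DBI calls in the comments are there for contrast, and the table and column names in them are made up.

```perl
use strict;
use warnings;

# Escape a value for use in HTML text or a quoted attribute value.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

my $value = q{<script>alert('x')</script> O'Brien};
print escape_html($value), "\n";

# For SQL, the same raw value would instead be passed as a bound parameter,
# for example with DBI (hypothetical table and column names):
#   my $sth = $dbh->prepare('INSERT INTO notes (body) VALUES (?)');
#   $sth->execute($value);
```

The escaping is applied at the point of output, where the context is known, and the stored value stays unmodified.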

answered Sep 19 '22 15:09

rook
