List of Unicode characters that should be filtered in output?

Recently I hit a bug caused by a data-quality problem that interacts badly with browsers, and I am looking for a safe rule for escaping strings that does not double their size unless required.

The UTF-8 byte sequence "E2-80-A8" (U+2028, LINE SEPARATOR) is a perfectly valid character in the Unicode database. However, that sequence represents a line separator (yes, one other than "0A").

Unfortunately, many browsers (including Chrome, Firefox, and Safari; I didn't test others) failed to process a JSONP callback whose string contained that Unicode character. The JSONP was included by a non-Unicode HTML page over which I had no control.

The browsers simply reported INVALID CODE/syntax error for JavaScript that looked valid in debug tools and in every text editor. My guess is that they tried to convert "E2-80-A8" to BIG-5 and broke the JS syntax.

The above is only one example of how Unicode can break your system unexpectedly. As far as I know, attackers can exploit RTL and other control characters for their own ends. And there are many "quotes", "spaces", "symbols" and "controls" in the Unicode specification.

QUESTION:

Is there a list of Unicode characters that every programmer should know about, covering hidden features (and bugs) that we might not want to take effect in our applications? (For example, Windows disables RTL in filenames.)

EDIT:

I am not asking about JSON or JavaScript specifically. I am asking for general best practices for Unicode handling in all programs.

asked May 11 '12 18:05

Dennis C

1 Answer

It breaks JavaScript because string literals cannot contain raw newlines:

var myString = "

";

//SyntaxError: Unexpected token ILLEGAL

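Now, the UTF-8 sequence "E2-80-A8" decodes to Unicode code point U+2028, which JavaScript treats like a newline (engines predating ES2019 reject it unescaped inside a string literal):

var myString = " "; // a raw U+2028 character sits between the quotes

//SyntaxError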

It is, however, safe to write

var myString = "\u2028";
//you can now log myString in console and get real representation of this character

which is what properly encoded JSON will contain. I'd look into encoding the JSON properly instead of maintaining a blacklist of unsafe characters (which are U+2028 and U+2029, AFAIK).

In PHP:

echo json_encode(chr(0xE2) . chr(0x80) . chr(0xA8));
//"\u2028"
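
If the JSON is produced in JavaScript rather than PHP, a minimal sketch of the same idea could look like the following. As far as I know, JSON.stringify does not escape U+2028 or U+2029 on its own, so the sketch replaces them by hand. The helper names toScriptSafeJson and toJsonp and the callback name "callback" are made up for illustration, and the value is assumed to be JSON-serializable.

```javascript
// Hypothetical helper (not part of any library): serialize a value as JSON
// that can also be embedded in JavaScript source, e.g. a JSONP response.
// Assumes `value` is JSON-serializable (JSON.stringify returns a string).
function toScriptSafeJson(value) {
  // JSON.stringify leaves U+2028 and U+2029 as raw characters,
  // so turn them into their \uXXXX escape sequences by hand.
  return JSON.stringify(value)
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

// Hypothetical JSONP wrapper; callbackName is whatever the page expects.
function toJsonp(callbackName, value) {
  return callbackName + "(" + toScriptSafeJson(value) + ");";
}

// The payload contains a real LINE SEPARATOR character.
const payload = { text: "line one\u2028line two" };
const body = toJsonp("callback", payload);
console.log(body);
```

The escaped output is still valid JSON and parses back to the original string, so only the two separators grow in size and nothing else is touched.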

answered Sep 23 '22 02:09

Esailija
