I'm seeing some weird behavior when setting the title of an HTML page using JavaScript. If I insert HTML character references directly into the title, the Unicode renders correctly, for instance:
    <title>&#21543;&#20986;</title>
But if I attempt to use HTML character references via JavaScript, something seems to be converting the & to &amp; and thus breaking the encoding, so the title is rendered as the literal coded string:
    function execTitleChange() {
        document.title = "&#21543;&#20986;";
    }
(I should note that this is partly speculation: when I inspect the DOM using Firebug after executing this JavaScript function, that's where I see &amp; instead of &.)
If I use \u encoded Unicode characters when setting the value from JavaScript then everything works correctly again:
    function execTitleChange() {
        document.title = "\u5427\u51fa";
    }
The fact that \u-encoded characters work kind of makes sense to me, since I think that's how JavaScript represents Unicode characters, but I'm stumped as to why the behavior would be different when using HTML character references.
You can enter any Unicode character in an HTML file by taking its decimal code point and adding an ampersand and a hash at the front and a semicolon at the end; for example, &#8212; should display as an em dash (—).
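As a rough illustration of that rule, the sketch below builds decimal numeric character references from a string. The helper name toNumericReferences is made up for this example, and it escapes every character, which is more than you would normally need.

```javascript
// Hypothetical helper: turn each character into a decimal reference (&#NNNN;).
function toNumericReferences(text) {
    // Array.from iterates by code point, so characters outside the BMP stay whole.
    return Array.from(text)
        .map(function (ch) { return "&#" + ch.codePointAt(0) + ";"; })
        .join("");
}

console.log(toNumericReferences("\u2014"));       // &#8212;
console.log(toNumericReferences("\u5427\u51fa")); // &#21543;&#20986;
```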
Unicode in JavaScript source code
In JavaScript, identifiers and string literals can contain Unicode escape sequences. The general syntax is \uXXXX, where XXXX stands for four hexadecimal digits. For example, the letter o is written as '\u006F'.
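A quick way to convince yourself of this is to compare escaped and unescaped literals, and to go the other way with a small helper. toUnicodeEscapes is a name invented here, not a built-in, and it works on UTF-16 code units rather than whole code points.

```javascript
// The JavaScript parser resolves \uXXXX before the string value exists.
console.log("\u006F" === "o");        // true
console.log("\u5427\u51fa".length);   // 2

// Hypothetical helper: render each UTF-16 code unit as a \uXXXX escape.
function toUnicodeEscapes(text) {
    var out = "";
    for (var i = 0; i < text.length; i++) {
        var hex = text.charCodeAt(i).toString(16).toUpperCase();
        out += "\\u" + ("0000" + hex).slice(-4); // left-pad to four digits
    }
    return out;
}

console.log(toUnicodeEscapes("o"));   // \u006F
```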
The Unicode Standard has become a success and is implemented in HTML, XML, Java, JavaScript, E-mail, ASP, PHP, etc. The Unicode standard is also supported in many operating systems and all modern browsers.
An HTML document is a sequence of Unicode characters.
JavaScript string constants are parsed by the JavaScript parser. Text inside HTML tags is parsed by the HTML parser. The two languages (and, by extension, their parsers) are different, and in particular they have different ways of representing characters by character code.
Thus, what you've discovered is the way reality actually is :-) Use the \u escape notation in JavaScript, and use HTML entities (&#nnnn;) in HTML/XML.
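To make the title example concrete: document.title takes plain text, so the HTML parser never sees the string and the ampersand stays literal. If the text only arrives in entity form, one option is to decode it yourself before assigning it. This is a minimal sketch, assuming only numeric references (decimal or hex) need handling; decodeNumericReferences is a hypothetical helper, and named entities such as &amp; are deliberately left alone.

```javascript
// Hypothetical helper: replace &#NNNN; and &#xHHHH; with the actual characters.
function decodeNumericReferences(text) {
    return text.replace(/&#([xX][0-9a-fA-F]+|[0-9]+);/g, function (match, code) {
        var isHex = /^[xX]/.test(code);
        var value = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
        // Leave out-of-range values untouched instead of throwing.
        return value <= 0x10FFFF ? String.fromCodePoint(value) : match;
    });
}

var title = decodeNumericReferences("&#21543;&#20986;");
console.log(title === "\u5427\u51fa"); // true
// In a browser: document.title = title;
```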
Edit: the situation can get even more confusing when you're creating or inserting HTML from JavaScript. When you use .innerHTML to update the DOM from JavaScript, you are basically handing HTML source code to the HTML parser for interpretation. For that reason, you can use either JavaScript \u escapes or HTML entities, and things will work (barring painful issues such as character encoding mismatches).
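The sketch below shows the HTML-parser path in isolation. It is browser-only and uses the common trick of letting a detached textarea element parse the markup. decodeWithHtmlParser is a name chosen for this example, and the doc parameter exists only so the function returns null instead of failing where no DOM is available.

```javascript
// Hypothetical helper: let the browser's HTML parser decode the entities.
function decodeWithHtmlParser(html, doc) {
    if (doc === undefined) {
        doc = typeof document !== "undefined" ? document : null;
    }
    if (!doc) {
        return null; // no DOM available (for example, Node.js)
    }
    var scratch = doc.createElement("textarea");
    scratch.innerHTML = html; // parsed as HTML, so entities are decoded
    return scratch.value;     // read back as plain text
}

// In a browser, both calls should return the same two characters:
//   decodeWithHtmlParser("&#21543;&#20986;")
//   decodeWithHtmlParser("\u5427\u51fa")
// By contrast, element.textContent = "&#21543;" shows the literal text.
```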
Finally, note that JavaScript also provides the String.fromCharCode() function to construct strings from numeric character codes.
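For example, the title from the question could be built from the numeric codes directly. String.fromCharCode works in UTF-16 code units; for characters above U+FFFF, String.fromCodePoint (ES2015) is usually the more convenient choice.

```javascript
// Hex and decimal forms of the same two code units.
var title = String.fromCharCode(0x5427, 0x51fa);
console.log(title === "\u5427\u51fa");                     // true
console.log(title === String.fromCharCode(21543, 20986));  // true

// Above U+FFFF, fromCharCode needs a surrogate pair; fromCodePoint does not.
var grin = String.fromCodePoint(0x1F600);
console.log(grin === String.fromCharCode(0xD83D, 0xDE00)); // true
```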