This code breaks with Node.js v0.10.21:
#!/usr/bin/env node
"use strict";
var urlEncoded = 'http://zh.wikipedia.org/wiki/%F0%A8%A8%8F';
var urlDecoded = decodeURI( urlEncoded );
var urlLeafEncoded = urlEncoded.substr( 29 );
var urlLeafDecoded = decodeURIComponent( urlLeafEncoded );
var urlLeafFirstCharacterDecoded = urlLeafDecoded.charAt( 0 );
var urlLeafFirstCharacterEncoded = encodeURIComponent( urlLeafFirstCharacterDecoded );
console.log( 'URL encoded = ' + urlEncoded );
console.log( 'URL decoded = ' + urlDecoded );
console.log( 'URL leaf encoded = ' + urlLeafEncoded );
console.log( 'URL leaf decoded = ' + urlLeafDecoded );
console.log( 'URL leaf first character encoded = ' + urlLeafFirstCharacterEncoded );
console.log( 'URL leaf first character decoded = ' + urlLeafFirstCharacterDecoded );
I get the following error:
var urlLeafFirstCharacterEncoded = encodeURIComponent( urlLeafFirstCharacterDe
^
URIError: URI malformed
at encodeURIComponent (native)
at Object.<anonymous> (/media/data/tmp/mwoffliner/test.js:9:36)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:901:3
JavaScript normally handles multibyte characters correctly, but not in this case. It seems that although "%F0%A8%A8%8F" represents a single Chinese character, JavaScript treats it as two. I can't tell whether this is a bug in the JavaScript runtime, an encoding issue, or a misunderstanding on my side.
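𨨏 (U+28A0F) lies outside the BMP, and since JavaScript strings are made of 16-bit code units, it is represented as a surrogate pair, so the decoded string has a length of 2. While encodeURIComponent can operate on surrogate pairs and produces the correct UTF-8 encoding for them, it cannot encode a surrogate on its own. charAt(0) returns only the first half of the pair (a lone high surrogate), which is not a valid character. Therefore, while encodeURIComponent("𨨏") works fine, encodeURIComponent("𨨏".charAt(0)) throws "URIError: URI malformed".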
See http://mathiasbynens.be/notes/javascript-encoding for more details. Also, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent documents this use case specifically.
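As a rough illustration of the explanation above, here is a minimal sketch of one way to take the first character without splitting a surrogate pair. The helper name firstCodePoint is hypothetical (it is not from the question or from any library), and the code sticks to old-style string methods so it should also run on older runtimes such as the Node.js version in the question.

```javascript
"use strict";

// Hypothetical helper (an assumption for illustration, not a built-in):
// returns the first full character of a string, keeping a surrogate pair together.
function firstCodePoint( str ) {
    if ( str.length === 0 ) {
        return '';
    }
    var first = str.charCodeAt( 0 );
    // A high (lead) surrogate lies in the range 0xD800-0xDBFF.
    if ( first >= 0xD800 && first <= 0xDBFF && str.length > 1 ) {
        var second = str.charCodeAt( 1 );
        // A low (trail) surrogate lies in the range 0xDC00-0xDFFF.
        if ( second >= 0xDC00 && second <= 0xDFFF ) {
            return str.substr( 0, 2 );
        }
    }
    // BMP character, or an unpaired surrogate that is returned unchanged.
    return str.charAt( 0 );
}

var leafDecoded = decodeURIComponent( '%F0%A8%A8%8F' );
var firstChar = firstCodePoint( leafDecoded );

console.log( leafDecoded.length );               // 2: one character, two code units
console.log( encodeURIComponent( firstChar ) );  // %F0%A8%A8%8F

// Taking only the first code unit reproduces the error from the question.
try {
    encodeURIComponent( leafDecoded.charAt( 0 ) );
} catch ( e ) {
    console.log( e.name );                       // URIError
}
```

Note that the helper does not repair invalid input: a string that starts with an unpaired surrogate still makes encodeURIComponent throw. On runtimes that support ES2015, String.prototype.codePointAt and string iteration (for example Array.from( str )[ 0 ]) cover the same need without a hand-written range check.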