Javascript charAt() breaking multibyte character string

This code breaks with Node.js v0.10.21:

#!/usr/bin/env node
"use strict";

var urlEncoded = 'http://zh.wikipedia.org/wiki/%F0%A8%A8%8F';
var urlDecoded = decodeURI( urlEncoded );
var urlLeafEncoded = urlEncoded.substr( 29 );
var urlLeafDecoded = decodeURIComponent( urlLeafEncoded );
var urlLeafFirstCharacterDecoded = urlLeafDecoded.charAt( 0 );
var urlLeafFirstCharacterEncoded = encodeURIComponent( urlLeafFirstCharacterDecoded );

console.log( 'URL encoded = ' + urlEncoded );
console.log( 'URL decoded = ' + urlDecoded );
console.log( 'URL leaf encoded = ' + urlLeafEncoded );
console.log( 'URL leaf decoded = ' + urlLeafDecoded );
console.log( 'URL leaf first character encoded = ' + urlLeafFirstCharacterEncoded );
console.log( 'URL leaf first character decoded = ' + urlLeafFirstCharacterDecoded );

I get the following error:

var urlLeafFirstCharacterEncoded = encodeURIComponent( urlLeafFirstCharacterDe
                               ^
URIError: URI malformed
    at encodeURIComponent (native)
    at Object.<anonymous> (/media/data/tmp/mwoffliner/test.js:9:36)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:901:3

JavaScript usually handles multibyte characters correctly, but not in this case. It seems that although "%F0%A8%A8%8F" represents a single Chinese character, JavaScript treats it as two. I can't tell whether this is a bug in the JavaScript runtime, an encoding issue, or a misunderstanding on my side.

user2949756 asked Oct 03 '22 12:10


1 Answer

𨨏 (U+28A0F) lies outside the BMP, and since JavaScript strings are sequences of 16-bit UTF-16 code units, it is stored as a surrogate pair, i.e. two code units. encodeURIComponent handles a complete surrogate pair and produces the correct UTF-8 encoding for it, but it throws a URIError when given a lone surrogate. charAt(0) returns only the first half of the pair (the high surrogate). Therefore, while encodeURIComponent("𨨏") works fine, encodeURIComponent("𨨏".charAt(0)) will fail.
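
As a rough sketch of one way around this, the first character can be extracted by checking whether the string starts with a surrogate pair and, if so, keeping both halves together. The helper name firstCharacter is hypothetical (it is not part of the question's code or of any library), and it only uses ES5 features so it should also run on old Node.js versions such as the one in the question:

```javascript
"use strict";

// Hypothetical helper (not from the original post): returns the first full
// character of a string, keeping a leading surrogate pair together.
function firstCharacter(str) {
  if (str.length === 0) {
    return "";
  }
  var high = str.charCodeAt(0);
  // 0xD800-0xDBFF is the high (leading) surrogate range.
  if (high >= 0xD800 && high <= 0xDBFF && str.length > 1) {
    var low = str.charCodeAt(1);
    // 0xDC00-0xDFFF is the low (trailing) surrogate range.
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return str.substr(0, 2);
    }
  }
  return str.charAt(0);
}

var leaf = decodeURIComponent("%F0%A8%A8%8F");
var first = firstCharacter(leaf);

console.log(leaf.length);               // 2 (two UTF-16 code units)
console.log(encodeURIComponent(first)); // %F0%A8%A8%8F
```

On runtimes that support ES2015, Array.from(str)[0] or String.fromCodePoint(str.codePointAt(0)) should give the same result without a hand-written range check.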

See http://mathiasbynens.be/notes/javascript-encoding for more details. Also, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent documents this use case specifically.

georg answered Oct 13 '22 10:10