JavaScript strings outside of the BMP

(BMP stands for Basic Multilingual Plane.)

According to JavaScript: The Good Parts:

JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.

Further investigation confirms this:

> String.fromCharCode(0x20001); 

The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
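The truncation is easy to check in a console. This is a minimal sketch, assuming any standard ECMAScript engine; the variable name `truncated` is only for illustration:

```javascript
// String.fromCharCode converts each argument to an unsigned 16-bit integer,
// so 0x20001 is reduced to 0x0001 before the string is built.
const truncated = String.fromCharCode(0x20001);

console.log(truncated.length);                      // 1
console.log(truncated.charCodeAt(0).toString(16));  // "1"
console.log(truncated === "\u0001");                // true
```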

Question: is it at all possible to handle post-BMP characters in JavaScript?


2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers issues related to this quite well.

asked Sep 19 '10 06:09 by Delan Azabani

1 Answer

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: the standard String members (length, split, slice, etc.) all deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
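To make the code-unit behaviour concrete, here is a small sketch using U+20001 written out as its surrogate pair. The variable names are illustrative and not part of any API:

```javascript
// U+20001 as a surrogate pair: high unit 0xD840, low unit 0xDC01.
const ideograph = "\uD840\uDC01";

console.log(ideograph.length);            // 2: two code units for one character
console.log(ideograph.split("").length);  // 2: split("") cuts the pair in half

// slice works on code units too, so this leaves a lone high surrogate.
const half = ideograph.slice(0, 1);
console.log(half.charCodeAt(0).toString(16));  // "d840"
```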

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

// Number of code points: each surrogate pair counts once instead of twice.
String.prototype.getCodePointLength= function() {
    return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};

// Build a string from code points, expanding any value above U+FFFF
// into a high/low surrogate pair.
String.fromCodePoint= function() {
    var chars= Array.prototype.slice.call(arguments);
    for (var i= chars.length; i-->0;) {
        var n = chars[i]-0x10000;
        if (n>=0)
            chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
    }
    return String.fromCharCode.apply(null, chars);
};
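Going the other way, a code point can be read back out of a string by reversing the same arithmetic. The helper below is a hypothetical sketch and not part of the answer's code; the name `codePointAtIndex` is made up for illustration:

```javascript
// Hypothetical helper: returns the code point that starts at code-unit
// index i, combining a high/low surrogate pair when one is present.
function codePointAtIndex(str, i) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
        var lo = str.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            // Inverse of the encoding step: 10 bits from each unit, plus 0x10000.
            return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
        }
    }
    return hi; // BMP character or unpaired surrogate: return the unit as-is
}

console.log(codePointAtIndex("\uD840\uDC01", 0).toString(16));  // "20001"
console.log(codePointAtIndex("A", 0).toString(16));             // "41"
```

Engines that implement ES2015 or later ship native String.fromCodePoint and String.prototype.codePointAt, which cover the same ground.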
answered Oct 03 '22 23:10 by bobince
