
UTF-16 to UTF-8 conversion in JavaScript

I have Base64-encoded data that is in UTF-16. I am trying to decode the data, but most libraries only support UTF-8. I believe I have to drop the null bytes, but I am unsure how.

Currently I am using David Chambers' Base64 polyfill, but I have also tried other libraries such as the one from phpjs.org; none of them support UTF-16.

One thing to point out: in Chrome the atob method works without a problem, in Firefox I get the results described here, and in IE I am only returned the first character.

Any help is greatly appreciated.

Asked Jan 29 '13 by Don P


People also ask

Does JavaScript use UTF-8 or UTF-16?

Most JavaScript engines use UTF-16 encoding, so let's look at UTF-16 in detail. UTF-16 (long name: 16-bit Unicode Transformation Format) is a variable-length encoding: code points from the BMP are encoded using a single 16-bit code unit, while code points from the astral planes are encoded using two 16-bit code units each.
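
For example, a BMP character occupies one code unit, while an astral-plane character (here U+1D11E, the treble clef) occupies two:

"€".length;           // 1
"€".charCodeAt(0);    // 8364 (U+20AC, a single BMP code unit)

"𝄞".length;           // 2
"𝄞".charCodeAt(0);    // 55348 (0xD834, high surrogate)
"𝄞".charCodeAt(1);    // 56606 (0xDD1E, low surrogate)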

Why does JavaScript use UTF-16?

JS does require UTF-16, because the surrogate pairs of non-BMP characters are separable in JS strings. Any JS implementation using UTF-8 would have to convert to UTF-16 to give proper answers for .length and array indexing on strings. That still doesn't mean it has to store the strings in UTF-16.
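
This separability is easy to see in practice, with indexing handing back a lone surrogate half:

var s = "😀";             // U+1F600, stored as the surrogate pair 0xD83D 0xDE00
s.length;                 // 2 – .length counts 16-bit code units, not characters
s[0] === "\uD83D";        // true – indexing returns half of the pair
Array.from(s).length;     // 1 – ES6 iteration is code-point aware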

What is UTF-8 in JavaScript?

UTF-8 can represent any character in the Unicode standard, is backwards compatible with ASCII, and is the preferred encoding for e-mail and web pages. UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire.
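
A quick way to see UTF-8's variable-length bytes (this assumes an environment with the Encoding API, e.g. modern browsers or Node.js):

new TextEncoder().encode("A");    // Uint8Array [65] – one byte, same as ASCII
new TextEncoder().encode("é");    // Uint8Array [195, 169] – two bytes
new TextEncoder().encode("😀");   // Uint8Array [240, 159, 152, 128] – four bytes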

Are JavaScript strings UTF-8?

While a JavaScript source file can have any kind of encoding, JavaScript will convert it internally to UTF-16 before executing it. JavaScript strings are all UTF-16 sequences, as the ECMAScript standard says: "When a String contains actual textual data, each element is considered to be a single UTF-16 code unit."


1 Answer

You want to decode UTF-16, not convert it to UTF-8. Decoding means that the result is a string of abstract characters. Of course there is an internal encoding for strings as well, UTF-16 or UCS-2 in JavaScript, but that's an implementation detail.

With strings, the goal is that you don't have to worry about encodings, just about manipulating characters "as they are". So you can write string methods that don't need to decode their input at all. Of course, there are many edge cases where this falls apart.
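
A naive per-code-unit reversal is one such edge case – it tears surrogate pairs apart:

// Works only while every character is a single code unit
function naiveReverse( str ) {
    return str.split("").reverse().join("");
}

naiveReverse("abc");    // "cba" – fine for BMP text
naiveReverse("a💩b");   // garbage – the two halves of the surrogate pair
                        // end up reversed, leaving lone surrogates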

You cannot decode UTF-16 just by removing the nulls. That will work fine for the first 256 code points of Unicode, but you will get garbage when any of the other ~110,000 Unicode characters are used. You cannot even get the most popular non-ASCII characters, like em dashes or smart quotes, working that way.
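
To see why, here is a sketch of the null-stripping idea (stripNulls is a hypothetical helper, not something from the libraries you mentioned):

function stripNulls( binaryStr ) {
    return binaryStr.replace( /\u0000/g, "" );
}

// UTF-16LE bytes of "ab" are 0x61 0x00 0x62 0x00 – stripping happens to work:
stripNulls("a\u0000b\u0000");   // "ab"

// UTF-16LE bytes of "—" (em dash, U+2014) are 0x14 0x20 – no nulls at all,
// so stripping leaves two bytes of garbage instead of the dash:
stripNulls("\u0014 ");          // "\u0014 ", not "—"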

Also, judging by your example, the data is UTF-16LE.
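
If you ever need to check the byte order rather than assume it, here's a minimal sniffing sketch (assuming the data may start with a BOM; BOM-less input can only be guessed at):

function looksLikeUTF16LE( binaryStr ) {
    var b0 = binaryStr.charCodeAt(0),
        b1 = binaryStr.charCodeAt(1);
    if( b0 === 0xFF && b1 === 0xFE ) return true;   // little-endian BOM
    if( b0 === 0xFE && b1 === 0xFF ) return false;  // big-endian BOM
    return b1 === 0;  // ASCII text in LE has the null in the high byte
}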

//Braindead decoder that assumes fully valid UTF-16LE input
function decodeUTF16LE( binaryStr ) {
    var cp = [];
    for( var i = 0; i < binaryStr.length; i+=2) {
        //Combine each pair of bytes into one 16-bit code unit
        //(low byte first, because the input is little-endian)
        cp.push( 
             binaryStr.charCodeAt(i) |
            ( binaryStr.charCodeAt(i+1) << 8 )
        );
    }
    //Surrogate pairs pass through unchanged, so astral characters work too
    return String.fromCharCode.apply( String, cp );
}

var base64decode = atob; //In chrome and firefox, atob is a native method available for base64 decoding

var base64 = "VABlAHMAdABpAG4AZwA";
var binaryStr = base64decode(base64);
var result = decodeUTF16LE(binaryStr);

Now you can even get smart quotes working:

var base64 = "HCBoAGUAbABsAG8AHSA="
var binaryStr = base64decode(base64);
var result = decodeUTF16LE(binaryStr);
//"“hello”"
Answered Sep 22 '22 by Esailija