
What is the default JavaScript character encoding?

While writing an encryption method in JavaScript, I began wondering which character encoding my strings were using, and why.

What determines character encoding in JavaScript? Is it a standard? By the browser? Determined by the header of the HTTP request? In the <META> tag of HTML that encompasses it? The server that feeds the page?

By my empirical testing (changing different settings, then using charCodeAt on a sufficiently strange character and seeing which encoding the value matches up with) it appears to always be UTF-8 or UTF-16, but I'm not sure why.

After some frantic googling, I couldn't seem to find a conclusive answer to this simple question.

asked Jun 21 '12 15:06 by Nick




2 Answers

Section 8.4 of ECMA-262:

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

That wording is kind of weaselly; it seems to mean that everything that counts treats strings as if each element were a UTF-16 code unit, but at the same time nothing ensures that the result is valid UTF-16.
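As a small illustration of the "16-bit elements" wording, here is a sketch using a character outside the BMP. The variable names are mine, not from the spec; the point is only that `length` and `charCodeAt` see code units, while `codePointAt` and the string iterator see code points.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
// so it occupies two 16-bit elements in a String.
const clef = "\u{1D11E}";

const info = {
  length: clef.length,                         // counts 16-bit elements, not characters
  first: clef.charCodeAt(0).toString(16),      // first element (high surrogate)
  second: clef.charCodeAt(1).toString(16),     // second element (low surrogate)
  codePoint: clef.codePointAt(0).toString(16), // the whole code point, decoded from the pair
  iterated: [...clef].length,                  // the string iterator walks code points
};

console.log(info);
// { length: 2, first: 'd834', second: 'dd1e', codePoint: '1d11e', iterated: 1 }
```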

To be clear, the intention is that strings consist of UTF-16 code units. In ES2015, the definition of "string value" includes this note:

A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

So a string is still a string even when it contains values that don't work as correct Unicode characters.
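A quick sketch of that point, assuming an environment where `TextEncoder` is a global (modern browsers and current Node versions). A lone surrogate is accepted as a String value, and the problem only shows up when something tries to convert it to well-formed Unicode.

```javascript
// A lone high surrogate: not valid UTF-16 text, yet a legal String value.
const lone = "\uD83D";

const isString = typeof lone === "string"; // still a string
const units = lone.length;                 // one 16-bit element
const unitValue = lone.charCodeAt(0);      // 0xD83D, stored untouched

// Conversions to UTF-8 are where the ill-formed data surfaces.
let uriError = null;
try {
  encodeURIComponent(lone); // throws on an unpaired surrogate
} catch (e) {
  uriError = e;
}

// TextEncoder does not throw; it substitutes U+FFFD (bytes EF BF BD).
const bytes = Array.from(new TextEncoder().encode(lone));

console.log(isString, units, unitValue.toString(16), uriError && uriError.name, bytes);
```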

answered Oct 15 '22 11:10 by Pointy


There is no default character encoding for JavaScript as such. A JavaScript program is, as far as specifications are concerned, a sequence of abstract characters. When transmitted over a network, or just stored in a computer, the abstract characters must be encoded somehow, but the mechanisms for it are not controlled by the ECMAScript standard.

Section 6 of the ECMAScript standard uses UTF-16 as a reference encoding, but does not designate it as default. Using UTF-16 as reference is logically unnecessary (it would suffice to refer to Unicode numbers) but it was probably assumed to help people.

This issue should not be confused with the interpretation of string literals or strings in general. A literal like 'Φ' needs to be in some encoding, along with the rest of the program; this can be any encoding, but after the encoding has been resolved, the literal will be interpreted as an integer according to the Unicode number of the character.
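To illustrate the distinction, here is a minimal sketch. It writes the literal with an escape so that the example itself does not depend on how the file is encoded, and it assumes `TextEncoder` is available as a global. The value seen inside the program is the Unicode number; the UTF-8 bytes only exist once the string is serialized.

```javascript
// 'Φ' is U+03A6 GREEK CAPITAL LETTER PHI, written here as an escape.
const phi = "\u03A6";

// Inside the program, the character is just its Unicode number.
const unicodeNumber = phi.charCodeAt(0); // 0x3A6

// Serializing to UTF-8 is a separate step and yields different values.
const utf8Bytes = Array.from(new TextEncoder().encode(phi)); // [0xCE, 0xA6]

// Going back from the number gives the same string.
const roundTrip = String.fromCharCode(0x3A6) === phi;

console.log(unicodeNumber.toString(16), utf8Bytes, roundTrip);
```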

When a JavaScript program is transmitted as such (as an “external JavaScript file”) over the Internet, RFC 4329, Scripting Media Types, applies. Clause 4 defines the mechanism: First, headers such as HTTP headers are checked, and a charset parameter found there is trusted. (In practice, web servers usually don’t specify such a parameter for JavaScript programs.) Second, BOM detection is applied. Failing that, UTF-8 is implied.

The first part of the mechanism is somewhat ambiguous. It might be interpreted as relating to a charset parameter in an actual HTTP header only, or it might be extended to charset parameters in script elements.
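The three steps can be sketched roughly as follows. `detectScriptEncoding` is a hypothetical helper of my own, not an API from any browser or library, and it is simplified: it only recognizes the UTF-8 and UTF-16 byte order marks and does not check whether the declared charset is actually supported.

```javascript
// Hypothetical helper: picks an encoding label for an external script.
// charsetParam: value of a charset parameter from the headers, or null/undefined.
// bytes: the first bytes of the script body (Uint8Array or plain array).
function detectScriptEncoding(charsetParam, bytes) {
  // Step 1: a charset parameter, when present, is trusted.
  if (charsetParam) return charsetParam.trim().toLowerCase();

  // Step 2: look for a byte order mark at the start of the body.
  const startsWith = (...sig) => sig.every((b, i) => bytes[i] === b);
  if (startsWith(0xEF, 0xBB, 0xBF)) return "utf-8";
  if (startsWith(0xFE, 0xFF)) return "utf-16be";
  if (startsWith(0xFF, 0xFE)) return "utf-16le";

  // Step 3: nothing else to go on, so UTF-8 is implied.
  return "utf-8";
}

const fromHeader = detectScriptEncoding("ISO-8859-1", Uint8Array.of(0xEF, 0xBB, 0xBF));
const fromBom = detectScriptEncoding(null, Uint8Array.of(0xFF, 0xFE, 0x61, 0x00));
const fallback = detectScriptEncoding(undefined, Uint8Array.of(0x61));

console.log(fromHeader, fromBom, fallback); // iso-8859-1 utf-16le utf-8
```

Note that in this sketch the header wins even when the body starts with a UTF-8 BOM, which is the ordering described above.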

If a JavaScript program appears embedded in HTML, either via a script element or some event attribute, then its character encoding is of course the same as that of the HTML document. The section “Specifying the character encoding” of the HTML 4.01 spec defines the resolution mechanism, in this order: charset in HTTP header, charset in meta, charset in a link that was followed to access the document, and finally heuristics (guesswork), which may involve many things; cf. the complex resolution mechanism in the HTML5 draft.
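The HTML 4.01 priority order amounts to "first declared source wins". A minimal sketch, where `resolveDocumentEncoding`, its field names, and the `guess` callback are all hypothetical names chosen for illustration:

```javascript
// Hypothetical sketch of the HTML 4.01 priority order.
// Each source is a charset label, or null/undefined when not declared.
// guess: a callback standing in for the browser's heuristics.
function resolveDocumentEncoding({ httpHeader, metaTag, linkCharset }, guess) {
  // The first declared source wins; heuristics run only as a last resort.
  return httpHeader ?? metaTag ?? linkCharset ?? guess();
}

const heuristics = () => "windows-1252"; // placeholder guess, not a real detector

const a = resolveDocumentEncoding({ httpHeader: "utf-8", metaTag: "iso-8859-1" }, heuristics);
const b = resolveDocumentEncoding({ metaTag: "iso-8859-1" }, heuristics);
const c = resolveDocumentEncoding({}, heuristics);

console.log(a, b, c); // utf-8 iso-8859-1 windows-1252
```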

answered Oct 15 '22 11:10 by Jukka K. Korpela