How good is Node.js' support for Unicode?

According to its language specification, JavaScript has some problems with Unicode (if I understand it correctly), as text is always handled internally as characters consisting of 16 bits each.

JavaScript: The Good Parts makes a similar point.

When you search Google for V8's support of UTF-8, you get contradictory statements.

So: What is the state of Unicode support in Node.js (0.10.26 was the current version when this question was asked)? Does it handle UTF-8 with all possible codepoints correctly, or doesn't it?

If not: What are possible workarounds?

asked Mar 20 '14 by Golo Roden
People also ask

Can JavaScript read Unicode?

Unicode in JavaScript source code: in JavaScript, identifiers and string literals can be expressed in Unicode via a Unicode escape sequence. The general syntax is \uXXXX, where each X denotes a hexadecimal digit. For example, the letter o is denoted as '\u006F' in Unicode.
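For illustration (this example is not part of the original snippet), a few such escapes in Node.js:

```js
// \uXXXX names a BMP code point with exactly four hex digits
console.log('\u006F' === 'o');   // true
console.log('\u00E9');           // "é" (U+00E9)
// ES2015+ also allows \u{...} escapes for code points beyond the BMP
console.log('\u{1F600}');        // "😀": one code point, two UTF-16 code units
```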

What encoding does Nodejs use?

UTF-8 (spelled 'utf8' or 'utf-8' in Node.js APIs); it is the default encoding for Buffer/string conversions.
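A minimal sketch of that default, assuming a current Node.js version (the string is just an example):

```js
// Buffer <-> string conversions default to UTF-8
const buf = Buffer.from('héllo');          // same as Buffer.from('héllo', 'utf8')
console.log(buf.length);                   // 6, because 'é' takes two bytes in UTF-8
console.log(buf.toString() === 'héllo');   // true, toString() also defaults to 'utf8'
```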

Does JavaScript support Unicode regex?

The only Unicode support in JavaScript regexes (before the ES2015 u flag) is matching specific code points with \uXXXX escapes. You can use those in ranges in character classes.
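A small illustrative example of such escapes in a character class (not from the original snippet):

```js
// Matching a range of code points with \uXXXX escapes in a character class
const asciiOnly = /^[\u0000-\u007F]*$/;
console.log(asciiOnly.test('hello'));   // true
console.log(asciiOnly.test('héllo'));   // false, 'é' is U+00E9, outside the range
```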

Does JavaScript use UTF-16?

Most JavaScript engines use UTF-16 encoding, so let's look at UTF-16 in detail. UTF-16 (long name: 16-bit Unicode Transformation Format) is a variable-length encoding: code points from the BMP are encoded using a single 16-bit code unit, while code points from the astral planes are encoded using two 16-bit code units (a surrogate pair).
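A brief sketch of the two cases, using U+1D306 as an example astral code point:

```js
// One code unit for a BMP character, two (a surrogate pair) for an astral one
console.log('A'.length);                      // 1, U+0041 is in the BMP
console.log('𝌆'.length);                      // 2, U+1D306 needs a surrogate pair
console.log('𝌆'.charCodeAt(0).toString(16));  // "d834" (high surrogate)
console.log('𝌆'.charCodeAt(1).toString(16));  // "df06" (low surrogate)
```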


1 Answer

The two sources you cite, the language specification and Crockford's “JavaScript: The Good Parts” (page 103), say the same thing, although the latter says it much more concisely (and clearly, if you already know the subject). For reference, I'll quote Crockford:

JavaScript was designed at a time when Unicode was expected to have at most 65,536 characters. It has since grown to have a capacity of more than 1 million characters.

JavaScript's characters are 16 bits. That is enough to cover the original 65,536 (which is now known as the Basic Multilingual Plane). Each of the remaining million characters can be represented as a pair of characters. Unicode considers the pair to be a single character. JavaScript thinks the pair is two distinct characters.

The language specification calls the 16-bit unit a “character” and a “code unit”. A “Unicode character”, or “code point”, on the other hand, can (in rare cases) need two 16-bit “code units” to be represented.
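To make the terminology concrete, here is a small sketch (it uses codePointAt() and code-point iteration, which were only added in ES2015, after this answer was written):

```js
const s = '𝌆';                  // one Unicode code point, U+1D306
console.log(s.length);          // 2, because length counts 16-bit code units
console.log(s.codePointAt(0));  // 119558 (0x1D306), the actual code point
console.log([...s].length);     // 1, string iteration walks code points
```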

All of JavaScript's string properties and methods, like length, substr(), etc., work with 16-bit “characters” (it would be very inefficient to work with variable-width Unicode characters, i.e., with UTF-16 code points, directly). E.g., this means that, if you are not careful, with substr() you can leave one half of a 32-bit UTF-16 Unicode character on its own. JavaScript won't complain as long as you don't display it, and maybe won't even complain if you do. This is because, as the specification says, JavaScript does not check that strings are valid UTF-16, it only assumes they are.
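A short sketch of what such careless slicing looks like in practice (how the broken string renders depends on where you display it; here it is made visible by encoding it to UTF-8):

```js
const s = 'a𝌆b';                 // 'a', U+1D306 (a surrogate pair), 'b'
console.log(s.length);           // 4 code units
const cut = s.substr(0, 2);      // 'a' plus the lone high surrogate \uD834
console.log(cut.length);         // 2, and no error is raised
console.log(Buffer.from(cut, 'utf8'));
// <Buffer 61 ef bf bd>: the lone surrogate is invalid UTF-16 and becomes U+FFFD
```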

In your question you ask

Does [Node.js] handle UTF-8 with all possible codepoints correctly, or doesn't it?

Since all possible UTF-8 codepoints are converted to UTF-16 (as one or two 16-bit “characters”) on input before anything else happens, and vice versa on output, the answer depends on what you mean by “correctly”; but if you accept JavaScript's interpretation of “correctly”, the answer is “yes”.
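As a minimal round-trip sketch (the byte values are just the UTF-8 encoding of U+1D306, used here as an example):

```js
const utf8 = Buffer.from([0xf0, 0x9d, 0x8c, 0x86]);  // UTF-8 bytes of U+1D306
const str = utf8.toString('utf8');                   // decode UTF-8 to a (UTF-16) JS string
console.log(str === '𝌆', str.length);                // true 2
const back = Buffer.from(str, 'utf8');               // encode back to UTF-8
console.log(back.equals(utf8));                      // true, the round trip is lossless
```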

answered Oct 10 '22 by Walter Tross