
What characters are grouped with Array.from?

I've been playing around with JS and can't figure out how JS decides which elements to add to the created array when using Array.from(). For example, the emoji 👍 has a length of 2, as it is made of two code points, but Array.from() treats these two code points as one, giving an array with one element:

const emoji = '👍';
console.log(Array.from(emoji)); // Output: ["👍"]

However, some other characters also have two code points, such as षि (which also has a .length of 2). Array.from doesn't "group" this character and instead produces two elements:

const str = 'षि';
console.log(Array.from(str)); // Output: ["ष", "ि"]

My question is: What determines whether the character is broken up (like in example two) or treated as one single element (like in example one) when the character consists of two code points?

Shnick asked Feb 04 '20 08:02


3 Answers

Array.from first tries to invoke the iterator of its argument, if it has one. Strings do have iterators, so it invokes String.prototype[Symbol.iterator]. Here is how that prototype method is described in the specification:

  1. Let O be ? RequireObjectCoercible(this value).
  2. Let S be ? ToString(O).
  3. Return CreateStringIterator(S).

Looking up CreateStringIterator eventually takes you to 21.1.5.2.1 %StringIteratorPrototype%.next ( ), which does:

  1. Let cp be ! CodePointAt(s, position).
  2. Let resultString be the String value containing cp.[[CodeUnitCount]] consecutive code units from s beginning with the code unit at index position.
  3. Set O.[[StringNextIndex]] to position + cp.[[CodeUnitCount]].
  4. Return CreateIterResultObject(resultString, false).
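The steps above can be approximated in plain JavaScript. This is only a rough sketch, not the engine's implementation: iterateCodePoints is a hypothetical name, and it leans on the built-in String.prototype.codePointAt to decide how many code units to consume at each position.

```javascript
// Hypothetical helper: mimics the string iterator by advancing
// one code point (one or two UTF-16 code units) at a time.
function* iterateCodePoints(s) {
  let position = 0;
  while (position < s.length) {
    const cp = s.codePointAt(position);
    // Code points above 0xFFFF occupy two code units; everything else,
    // including an unpaired surrogate, occupies one.
    const codeUnitCount = cp > 0xFFFF ? 2 : 1;
    yield s.slice(position, position + codeUnitCount);
    position += codeUnitCount;
  }
}

console.log([...iterateCodePoints('👍षि')]); // ["👍", "ष", "ि"]
console.log(Array.from('👍षि'));             // ["👍", "ष", "ि"]
```

Both calls should produce the same elements, since Array.from goes through the string iterator.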

The CodeUnitCount is what you're interested in. This number comes from CodePointAt:

  1. Let first be the code unit at index position within string.
  2. Let cp be the code point whose numeric value is that of first.
  3. If first is not a leading surrogate or trailing surrogate, then

    a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.

  4. If first is a trailing surrogate or position + 1 = size, then

    a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.

  5. Let second be the code unit at index position + 1 within string.

  6. If second is not a trailing surrogate, then

    a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.

  7. Set cp to ! UTF16DecodeSurrogatePair(first, second).

  8. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.
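As a rough illustration, the CodePointAt steps can be written out in JavaScript. The function name codePointAtSketch and the plain-object return shape are assumptions made for this sketch; the surrogate-pair arithmetic in step 7 is the standard UTF-16 decoding formula.

```javascript
// Hypothetical helper mirroring the CodePointAt steps; not the engine's actual code.
function codePointAtSketch(string, position) {
  const size = string.length;
  const isLeading = (u) => u >= 0xD800 && u <= 0xDBFF;
  const isTrailing = (u) => u >= 0xDC00 && u <= 0xDFFF;

  // Steps 1-2: read the code unit at the current position.
  const first = string.charCodeAt(position);

  // Step 3: an ordinary code unit stands for itself.
  if (!isLeading(first) && !isTrailing(first)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  // Step 4: a trailing surrogate on its own, or a leading surrogate at the very end.
  if (isTrailing(first) || position + 1 === size) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Steps 5-6: a leading surrogate that is not followed by a trailing surrogate.
  const second = string.charCodeAt(position + 1);
  if (!isTrailing(second)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Steps 7-8: decode the surrogate pair into a single code point.
  const codePoint = (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
  return { codePoint, codeUnitCount: 2, isUnpairedSurrogate: false };
}

console.log(codePointAtSketch('👍', 0).codeUnitCount); // 2
console.log(codePointAtSketch('षि', 0).codeUnitCount); // 1
```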

So, when iterating over a string with Array.from, CodePointAt returns a CodeUnitCount of 2 only when the code unit at the current position is a leading surrogate that is immediately followed by a trailing surrogate, that is, the start of a surrogate pair. The code units that are interpreted as surrogates are described here:

Such operations apply special treatment to every code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF (defined by the Unicode Standard as a leading surrogate, or more formally as a high-surrogate code unit) and every code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF (defined as a trailing surrogate, or more formally as a low-surrogate code unit) using the following rules..:

षि is not a surrogate pair:

console.log('षि'.charCodeAt()); // First character code: 2359, or 0x937
console.log('षि'.charCodeAt(1)); // Second character code: 2367, or 0x93F

But 👍's code units are:

console.log('👍'.charCodeAt()); // 55357, or 0xD83D
console.log('👍'.charCodeAt(1)); // 56397, or 0xDC4D

The first code unit of '👍' is, in hex, D83D, which is within the leading-surrogate range of 0xD800 to 0xDBFF, and the second, DC4D, is within the trailing-surrogate range of 0xDC00 to 0xDFFF. In contrast, both code units of 'षि' are much lower and fall outside either range. So 'षि' gets split apart, but '👍' doesn't.

षि is composed of two separate characters: ष, Devanagari Letter Ssa, and ि, Devanagari Vowel Sign I. When next to each other in this order, they are rendered as a single combined glyph, despite being two separate characters.
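One way to see the difference is to list the code point behind each element that Array.from produces. This is a small sketch; codePointsOf is a hypothetical helper name, and the U+XXXX formatting is just a display choice.

```javascript
// Hypothetical helper: formats the code point of each element Array.from yields.
function codePointsOf(s) {
  return Array.from(
    s,
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(codePointsOf('षि')); // ["U+0937", "U+093F"]
console.log(codePointsOf('👍')); // ["U+1F44D"]
```

The Devanagari string holds two code points that happen to render together, while the emoji is a single code point that needs two code units.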

In contrast, the two code units of 👍 only make sense together, as a single code point. If you try to use a string with either code unit without the other, you'll get a nonsense symbol:

console.log('👍'[0]);
console.log('👍'[1]);
CertainPerformance answered Nov 10 '22 19:11


UTF-16 (the encoding used for strings in JS) uses 16-bit code units. Every Unicode code point that fits in 16 bits (U+0000 to U+FFFF) is represented as one code unit; everything else is represented as two code units, known as a surrogate pair. The iterator of strings iterates over code points, not code units.

UTF-16 on Wikipedia
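To make the encoding concrete, here is a minimal sketch of how a code point maps to UTF-16 code units. The name toUtf16CodeUnits is hypothetical, and the function assumes its input is a valid code point in the range 0 to 0x10FFFF.

```javascript
// Hypothetical helper: returns the UTF-16 code unit(s) for one code point.
// Assumes codePoint is a valid Unicode code point (0 to 0x10FFFF).
function toUtf16CodeUnits(codePoint) {
  // Code points up to 0xFFFF fit in a single 16-bit code unit.
  if (codePoint <= 0xFFFF) {
    return [codePoint];
  }
  // Otherwise split the 20-bit offset into a leading and a trailing surrogate.
  const offset = codePoint - 0x10000;
  const leading = 0xD800 + (offset >> 10);
  const trailing = 0xDC00 + (offset & 0x3FF);
  return [leading, trailing];
}

const toHex = (units) => units.map((u) => u.toString(16));

console.log(toHex(toUtf16CodeUnits(0x1F44D))); // ["d83d", "dc4d"]
console.log(toHex(toUtf16CodeUnits(0x0937)));  // ["937"]
```

The two values for U+1F44D match what charCodeAt reports for the two halves of 👍.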

Jonas Wilms answered Nov 10 '22 20:11


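It's all about the code behind the characters. JS strings are UTF-16: षि is two code points, each stored in a single 16-bit code unit, so Array.from gives two elements, while 👍 is one code point stored in two code units (a surrogate pair), so Array.from gives one. Gotta check the list of the characters: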

http://www.fileformat.info/info/charset/UTF-8/list.htm

http://www.fileformat.info/info/charset/UTF-16/list.htm

// Print the UTF-16 code units of s as 4-digit hex values, concatenated.
function displayHexUnicode(s) {
  console.log(s.split("").reduce((hex, c) => hex + c.charCodeAt(0).toString(16).padStart(4, "0"), ""));
}

displayHexUnicode('षि'); // 0937093f

// forEach returns undefined, so call it directly instead of logging its result.
Array.from('षि').forEach(x => displayHexUnicode(x)); // 0937, then 093f

displayHexUnicode('👍'); // d83ddc4d

Array.from('👍').forEach(x => displayHexUnicode(x)); // d83ddc4d

For the function that displays the hex code:

Javascript: Unicode string to hex

Orelsanpls answered Nov 10 '22 19:11