
How to get the nth (Unicode) character from a string in JavaScript

Suppose we have a string with some (astral) Unicode characters:

const s = 'Hi 👋 Unicode!'

The [] operator and .charAt() method don't work for getting the 4th character, which should be "👋":

> s[3]
'οΏ½'
> s.charAt(3)
'οΏ½'

The .codePointAt() method does get the correct value for the 4th character, but unfortunately it returns a number, which has to be converted back to a string using String.fromCodePoint():

> String.fromCodePoint(s.codePointAt(3))
'👋'

Similarly, converting the string into an array using spread syntax yields valid Unicode characters, so that's another way of getting the 4th one:

> [...s][3]
'👋'

But I can't believe that going from string to number and back to string, or having to split the string into an array, are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?

> s.simpleMethod(3)
'👋'

Note: I know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode code point (no combining characters, no grapheme clusters, etc.).

Update: the String.fromCodePoint(str.codePointAt(n)) method is not really viable, since the nth position there doesn't take previous astral symbols into account: String.fromCodePoint('πŸ‘‹πŸ™ˆ'.codePointAt(1)) // => 'οΏ½'


(I feel kinda dumb asking this, like I'm probably missing something obvious. But previous answers to this question don't work on strings with Unicode symbols in astral planes.)

epidemian asked Sep 11 '17 14:09



1 Answer

The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:

const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
  console.log(symbol);
}

So to get a specific code point based on its index from a string:

const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
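
If building a whole array just to read one element feels wasteful, the same string iterator can be walked only as far as needed. This is a minimal sketch rather than a built-in: the helper name codePointCharAt is hypothetical, and returning undefined for out-of-range indexes is an assumption made for illustration.

```javascript
// Hypothetical helper (not part of the language): returns the nth code point
// of `str` as a string, or undefined when n is out of range.
function codePointCharAt(str, n) {
  if (!Number.isInteger(n) || n < 0) return undefined;
  let index = 0;
  // for...of uses the string iterator, so each step yields one whole code point.
  for (const symbol of str) {
    if (index === n) return symbol;
    index++;
  }
  return undefined;
}

codePointCharAt('Hi 👋 Unicode!', 3); // '👋'
codePointCharAt('👋🙈', 1); // '🙈'
```

This still costs time proportional to n, because the position of a code point in a UTF-16 string cannot be known without scanning the code units before it.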

Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
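
In runtimes that ship Intl.Segmenter, one way to do that segmentation might look like the sketch below. The helper name graphemeAt and the 'en' default locale are assumptions made for illustration, and the exact cluster boundaries depend on the Unicode data bundled with the runtime.

```javascript
// Hypothetical helper: returns the nth grapheme cluster of `str`,
// or undefined when n is out of range. Requires Intl.Segmenter support.
function graphemeAt(str, n, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let index = 0;
  // segment() returns an iterable of { segment, index, input } objects.
  for (const { segment } of segmenter.segment(str)) {
    if (index === n) return segment;
    index++;
  }
  return undefined;
}

// The family emoji written out explicitly: four emoji joined by three ZWJs.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
[...family].length; // 7 code points
graphemeAt(family, 0) === family; // the whole sequence comes back as one cluster
```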

Do you actually need to get the 4th code point in the string, though? What's your use case?

Mathias Bynens answered Sep 20 '22 17:09