Suppose we have a string with some (astral) Unicode characters:
const s = 'Hi π Unicode!'
The []
operator and .charAt()
method don't work for getting the 4th character, which should be "π":
> s[3]
'οΏ½'
> s.charAt(3)
'οΏ½'
The .codePointAt() method does get the correct value for the 4th character, but unfortunately it's a number and has to be converted back to a string using String.fromCodePoint():
> String.fromCodePoint(s.codePointAt(3))
'👋'
Similarly, converting the string into an array using spread syntax yields valid Unicode characters, so that's another way of getting the 4th one:
> [...s][3]
'👋'
But I can't believe that going from string to number and back to string, or having to split the string into an array, are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?
> s.simpleMethod(3)
'👋'
Note: I know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode code point (no combining characters, no grapheme clusters, etc).
Update: the String.fromCodePoint(str.codePointAt(n))
method is not really viable, since the n
th position there doesn't take previous astral symbols into account: String.fromCodePoint('ππ'.codePointAt(1)) // => 'οΏ½'
(I feel kinda dumb asking this; like i'm probably missing something obvious. But previous answers to this questions don't work on strings with Unicode simbols on astral planes.)
Use the charAt() method to get the nth character in a string, e.g. str.charAt(1) gets the second character in the string. The only parameter charAt() takes is the index of the character to be returned. If the index does not exist in the string, an empty string is returned. Note that the index counts UTF-16 code units, not code points, so this only works as expected for characters in the Basic Multilingual Plane.
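A minimal sketch of that behaviour follows. The variable names are only illustrative, and the astral symbol is written as an escape (U+1F44B) so the example does not depend on file encoding.

```javascript
// charAt() with an in-range and an out-of-range index.
const plain = 'abc';
console.log(plain.charAt(1));  // 'b'
console.log(plain.charAt(10)); // '' (empty string, not undefined)

// With an astral symbol, charAt() returns only half of a surrogate pair.
const greeting = 'Hi \u{1F44B}';
const half = greeting.charAt(3);
console.log(half === '\uD83D'); // true: a lone high surrogate
```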
Unicode in JavaScript source code
In JavaScript, identifiers and string literals can be expressed in Unicode via a Unicode escape sequence. The general syntax is \uXXXX, where XXXX stands for four hexadecimal digits. For example, the letter o is denoted as '\u006F' in Unicode.
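As a small illustration of the escape syntax: four hex digits only reach U+FFFF, so an astral symbol needs either a surrogate pair or the ES2015 code point escape \u{...}. The sketch below uses U+1F44B as an arbitrary astral example, and the variable names are made up for illustration.

```javascript
// '\u006F' is the escape for the letter o (U+006F).
const letterO = '\u006F';
console.log(letterO === 'o'); // true

// U+1F44B lies outside the BMP: write it as a surrogate pair
// or with the ES2015 code point escape.
const viaSurrogates = '\uD83D\uDC4B';
const viaCodePoint = '\u{1F44B}';
console.log(viaSurrogates === viaCodePoint); // true
console.log(viaCodePoint.length);            // 2 (UTF-16 code units)
```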
The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:
const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
  console.log(symbol);
}
So to get a specific code point based on its index from a string:
const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
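If building a full array just to read one symbol feels wasteful, the same iterator can be walked until the wanted index is reached. This is only a sketch: nthCodePoint is a hypothetical helper name, not a built-in, and returning an empty string for an out-of-range index is an assumption chosen to mirror charAt().

```javascript
// Hypothetical helper: returns the symbol at code point index n,
// or '' when n is out of range (mirroring charAt()).
function nthCodePoint(str, n) {
  let index = 0;
  // for...of uses the string iterator, so each step is one code point.
  for (const symbol of str) {
    if (index === n) return symbol;
    index++;
  }
  return '';
}

const text = 'Hi \u{1F44B} Unicode!';
console.log(nthCodePoint(text, 3) === '\u{1F44B}'); // true
console.log(nthCodePoint(text, 4));                 // ' '
console.log(nthCodePoint(text, 100));               // ''
```

Unlike String.fromCodePoint(str.codePointAt(n)), the index here counts code points, so earlier astral symbols do not shift the result.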
Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
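One way to do text segmentation in current engines is Intl.Segmenter with grapheme granularity. The sketch below assumes an environment that ships it (recent browsers, Node.js 16 or later) with reasonably up-to-date Unicode data; the 'en' locale and the variable names are arbitrary choices for illustration.

```javascript
// Family emoji: four people joined by three ZERO WIDTH JOINERs.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const sentence = `Hi ${family}!`;

// The string iterator splits the sequence into its code points.
console.log([...family].length); // 7

// Grapheme segmentation keeps the whole sequence together.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(segmenter.segment(sentence), (part) => part.segment);
console.log(graphemes.length);        // 5
console.log(graphemes[3] === family); // true
```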
Do you actually need to get the 4th code point in the string, though? What's your use case?