
Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript natively treats strings as sequences of 16-bit code units (UCS-2 or UTF-16), so a single JavaScript "character" cannot hold a Unicode character outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
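To illustrate the problem, here is a minimal sketch (the variable names are only for illustration) of what a naive split does to a character outside the BMP:

```javascript
// U+1D468 lies outside the BMP, so UTF-16 stores it as the
// surrogate pair \uD835 \uDC68.
const astral = '\uD835\uDC68';

console.log(astral.length);          // 2 code units for a single code point

// split('') cuts between code units, so the pair is torn apart.
const units = astral.split('');
console.log(units.length);           // 2
console.log(units[0] === '\uD835');  // true: a lone high surrogate
console.log(units[1] === '\uDC68');  // true: a lone low surrogate
```

Neither half is a valid character on its own, which is why splitting by code unit is not enough here.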

I'm looking for a way to split a JavaScript string by codepoint, whether each codepoint requires one or two JavaScript "characters" (code units).

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
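As a small illustration of the distinction (a sketch only, with a hypothetical variable name), an accented letter written as a base letter plus a combining accent is one grapheme cluster but two codepoints:

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
const decomposed = 'e\u0301';

console.log(decomposed.length);                      // 2: two codepoints, both in the BMP
console.log(decomposed.charCodeAt(0).toString(16));  // "65"  (the base letter e)
console.log(decomposed.charCodeAt(1).toString(16));  // "301" (the combining accent)
```

A split by codepoint would return these as two separate items, while a split by grapheme cluster would keep them together.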

For the purposes of this question I do not require splitting by grapheme cluster.

hippietrail asked Jan 28 '14 05:01



2 Answers

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings. This respects astral Unicode characters, i.e. those outside the BMP that take a surrogate pair (two 16-bit code units) in a JavaScript string.
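A short sketch of how this behaves, assuming an ES2015-capable runtime (the sample string and variable names are only for illustration). `Array.from` also accepts a mapping function, which is handy if you want numeric codepoints rather than strings:

```javascript
const text = 'A\uD835\uDC68B';   // "A", U+1D468, "B"

// One array element per codepoint, not per code unit.
const chars = Array.from(text);
console.log(text.length);        // 4 code units
console.log(chars.length);       // 3 codepoints
console.log(chars[1].length);    // 2: the surrogate pair stays together

// The optional second argument maps each element as it is produced.
const codePoints = Array.from(text, ch => ch.codePointAt(0));
console.log(codePoints);         // [ 65, 119912, 66 ], i.e. 0x41, 0x1D468, 0x42
```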

John Frazer answered Oct 16 '22 06:10


Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:

const chars = [...text]

e.g., with:

const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]
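
Both the spread form and `Array.from` rely on the string iterator, which walks the string by codepoint. For anyone curious what that amounts to, or stuck without ES2015 iteration, here is a rough hand-written equivalent. `splitCodePoints` is a hypothetical helper written for this illustration, not something from either answer:

```javascript
// Hypothetical helper: walks the UTF-16 code units and keeps each
// high/low surrogate pair together as one array element.
function splitCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    const lo = str.charCodeAt(i + 1); // NaN when i is the last index
    const isPair =
      hi >= 0xD800 && hi <= 0xDBFF &&  // high (leading) surrogate
      lo >= 0xDC00 && lo <= 0xDFFF;    // low (trailing) surrogate
    if (isPair) {
      out.push(str.slice(i, i + 2));
      i++; // skip the low surrogate that was just consumed
    } else {
      out.push(str[i]); // BMP character, or an unpaired surrogate left as-is
    }
  }
  return out;
}

const sample = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A';
console.log(splitCodePoints(sample).length); // 6
```

In practice the built-in forms are shorter and less error-prone; the loop is only meant to show the pairing logic.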
Brett Zamir answered Oct 16 '22 06:10