Is there a regular expression which matches a single grapheme cluster?

Graphemes are the user-perceived characters of a text, which in Unicode may consist of several code points.

From Unicode® Standard Annex #29:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
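
To make the distinction concrete, here is a small sketch in JavaScript comparing UTF-16 code units and code points for strings that a reader would perceive as one character each. The variable and helper names are only for illustration.

```javascript
// "G" followed by U+0300 COMBINING GRAVE ACCENT: one user-perceived character.
const gGrave = "G\u0300";
// Emoji ZWJ sequence: U+1F486, U+200D (ZWJ), U+2642, U+FE0F.
const massage = "\u{1F486}\u200D\u2642\uFE0F";

// String length counts UTF-16 code units; spreading iterates code points.
const codeUnits = (text) => text.length;
const codePoints = (text) => [...text].length;

console.log(codeUnits(gGrave), codePoints(gGrave));   // 2 2
console.log(codeUnits(massage), codePoints(massage)); // 5 4
```

Neither count equals 1, which is why matching "one character" needs grapheme cluster boundaries rather than code unit or code point boundaries.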

Is there a regex I can use (in JavaScript) that will match a single grapheme cluster? For example:

"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"💆‍♂️foo".match(/*?*/)[0] === "💆‍♂️"
brainkim asked Nov 07 '18 21:11


1 Answer

Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

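\X is the closest, but it does not exist in any version of JavaScript through ES6. \P{M}\p{M}*+ approximates \X, but does not exist in that form either, since JavaScript has no possessive quantifiers. If you have Unicode property escapes (ES2018, natively or via transpilation), you can use /(\P{Mark})(\p{Mark}*)/gu.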

But even still, that isn't sufficient. <== Read that link for all the gory details.
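
As a rough illustration of both the approximation and where it falls short, here is a minimal sketch. It assumes a runtime with Unicode property escapes under the `u` flag; `firstApproxCluster` is a hypothetical helper name, not part of any library.

```javascript
// Approximation of \X: one non-mark code point, then any number of marks.
// Needs the `u` flag and Unicode property escapes (ES2018).
const approxCluster = /^\P{M}\p{M}*/u;

// Hypothetical helper: returns the first approximate cluster, or null.
function firstApproxCluster(text) {
  const match = approxCluster.exec(text);
  return match ? match[0] : null;
}

// Handled: a base code point followed by combining marks.
console.log(firstApproxCluster("G\u0300foo") === "G\u0300");           // true
console.log(firstApproxCluster("\u0BA8\u0BBFbaz") === "\u0BA8\u0BBF"); // true

// Not handled: CRLF and emoji ZWJ sequences are cut off too early.
console.log(firstApproxCluster("\r\n") === "\r");                                  // true
console.log(firstApproxCluster("\u{1F486}\u200D\u2642\uFE0Ffoo") === "\u{1F486}"); // true
```

The last two cases fail because a line feed and a zero-width joiner are not in the Mark category, so the pattern stops after the first code point.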

A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
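
The segmentation proposal mentioned above is exposed as Intl.Segmenter in runtimes that have since implemented it. Assuming such a runtime (this is an assumption, so check your targets), a minimal sketch of taking the first grapheme cluster could look like the following. `firstGrapheme` is a hypothetical helper name.

```javascript
// Assumes Intl.Segmenter is available in the runtime.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: returns the first grapheme cluster, or "" for empty input.
function firstGrapheme(text) {
  for (const { segment } of segmenter.segment(text)) {
    return segment; // the first segment is the first cluster
  }
  return "";
}

console.log(firstGrapheme("aa") === "a");
console.log(firstGrapheme("\r\n") === "\r\n");
console.log(
  firstGrapheme("\u{1F486}\u200D\u2642\uFE0Ffoo") === "\u{1F486}\u200D\u2642\uFE0F"
);
```

This is not a regex, so it does not plug into String.prototype.match, but it covers the CRLF and emoji ZWJ cases that the mark-based pattern misses.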

bishop answered Sep 28 '22 09:09