Find decomposed unicode characters to be replaced by precomposed equivalents

Question

I'm targeting ECMAScript 2019, which has quite decent Unicode support. I'm creating a simple text viewer where I want to inform the user that a certain decomposed character could be converted into a precomposed one. How can I find these characters?

An example. Consider the letter below. First we have the decomposed version, then the precomposed.

ä - 0061 0308 in UTF-16
ä - 00E4 in UTF-16
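The difference can be inspected directly in code. This is only a small sketch: it uses explicit escapes so that an editor or clipboard that normalizes text cannot blur the two forms, and the helper name `codeUnits` is made up for illustration.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function codeUnits(str) {
  return Array.from({ length: str.length }, (_, i) =>
    str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
}

const decomposed = "a\u0308";  // "a" followed by COMBINING DIAERESIS
const precomposed = "\u00E4";  // LATIN SMALL LETTER A WITH DIAERESIS

console.log(codeUnits(decomposed));   // [ '0061', '0308' ]
console.log(codeUnits(precomposed));  // [ '00E4' ]
console.log(decomposed === precomposed);                   // false
console.log(decomposed.normalize("NFC") === precomposed);  // true
```

The last line shows that NFC normalization turns the decomposed form into the precomposed one, which is what makes the two comparable.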

Mixing these seemingly identical characters causes problems. When the user searches for "ä", not all expected occurrences are found, as this regex demonstrates:

[Screenshot of the regex result, captioned "3 matches?!"]

Here we got three matches. Confusing! Similarly, searching this text for "ä" would only give one match.
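The mismatch can be reproduced without a regex tester. The sample string below is made up for illustration (it is not the text from my viewer), and `countMatches` is a hypothetical helper:

```javascript
// Made-up sample: the first and third "ä" are decomposed, the second is precomposed.
const sample = "a\u0308 \u00E4 a\u0308";

// Hypothetical helper: count non-overlapping literal occurrences of needle.
const countMatches = (haystack, needle) => haystack.split(needle).length - 1;

console.log(countMatches(sample, "\u00E4"));                   // 1
console.log(countMatches(sample, "a\u0308"));                  // 2
console.log(countMatches(sample.normalize("NFC"), "\u00E4"));  // 3
```

Only after normalizing the text to NFC does a search for the precomposed character find all three occurrences.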

The question. To help the user understand what's going on, I want to highlight any decomposed characters that have a suitable precomposed version. Thus I need to find the start and end of these character groups.

How can I accomplish this?

Wiktor Stribiżew · Accepted Answer

You can match a letter followed by one or more diacritic marks, using Unicode property escapes in a RegExp (ECMAScript 2018+, u flag required):

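const re = /\p{Alphabetic}\p{M}+/ug;
// "a\u0308" is the decomposed ä, written as an escape so copy/paste cannot normalize it
const matches = "sta\u0308ndig".matchAll(re); // note: matchAll itself needs ES2020
console.log([...matches].map(x => [x.index, x.index + x[0].length]));
// => [ [2,4] ]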

Here,

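\p{Alphabetic} - matches any character with the Unicode Alphabetic property (letters, plus a few letter-like characters)
\p{M}+ - matches one or more combining marks (general category Mark, which includes diacritics).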
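Note that this pattern matches every letter + mark sequence, including ones with no precomposed equivalent. One possible way to narrow it down, sketched here as an assumption rather than a definitive method, is to keep only matches that get shorter under String.prototype.normalize("NFC"). The function name `findComposable` is hypothetical, and the sketch uses a re.exec() loop instead of matchAll so that it stays within ES2018/ES2019.

```javascript
// Sketch: return [start, end) ranges of letter+mark sequences that NFC can
// (at least partly) compose. `findComposable` is a made-up name.
function findComposable(text) {
  const re = /\p{Alphabetic}\p{M}+/gu;
  const ranges = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    const composed = m[0].normalize("NFC");
    // Heuristic: fewer code points after NFC means something was composed.
    // Comparing with !== would also flag sequences whose marks NFC merely reorders.
    if ([...composed].length < [...m[0]].length) {
      ranges.push([m.index, m.index + m[0].length]);
    }
  }
  return ranges;
}

console.log(findComposable("sta\u0308ndig")); // one range: [2, 4]
console.log(findComposable("q\u0308"));       // empty: no precomposed q with diaeresis
```

The ranges are in UTF-16 code units, the same indexing the original snippet reports, so they can be used directly for highlighting.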

Tags: javascript, regex, l33t
