Find decomposed unicode characters to be replaced by precomposed equivalents

I'm targeting ECMAScript 2019, where we have quite decent Unicode support. I'm creating a simple text viewer in which I want to inform the user that a certain decomposed character could be converted into a precomposed one. How can I find these characters?

An example. Consider the letter below. First we have the decomposed version, then the precomposed.

  • ä - 0061 0308 in UTF-16 (a followed by a combining diaeresis)
  • ä - 00E4 in UTF-16

Now, mixing these seemingly identical characters causes some problems. When the user searches for "ä", not all expected occurrences will be found, as this regex demonstrates:

[Image: 3 matches?!]

Here we got three matches. Confusing! Similarly, searching this text for "ä" would only give one match.
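The mismatch can be reproduced with a minimal sketch. The sample string and the helper name `countMatches` are assumptions made for illustration; they are not the text or regex from the screenshot above.

```javascript
// Hypothetical sample text: one precomposed "ä" (U+00E4) followed by two
// decomposed ones ("a" + U+0308). All three render identically.
const sampleText = "\u00E4 a\u0308 a\u0308";

// Count non-overlapping matches of a global regex in a string.
// String.prototype.match returns null when nothing matches, hence the fallback.
function countMatches(text, pattern) {
  return (text.match(pattern) || []).length;
}

// Searching for the precomposed form misses the decomposed occurrences.
const precomposedHits = countMatches(sampleText, /\u00E4/g);

// Searching for the decomposed form misses the precomposed occurrence.
const decomposedHits = countMatches(sampleText, /a\u0308/g);

// After NFC normalization every occurrence uses the precomposed form.
const normalizedHits = countMatches(sampleText.normalize("NFC"), /\u00E4/g);

console.log(precomposedHits, decomposedHits, normalizedHits); // 1 2 3
```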

The question. To help the user understand what's going on, I want to highlight any decomposed characters that have a suitable precomposed version. Thus I need to find the start and end of these character groups.

How can I accomplish this?

l33t asked Sep 07 '25 04:09



1 Answer

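You can match a letter followed by one or more diacritic marks using Unicode property escapes, which are available in ECMAScript 2018+ regular expressions (note that String.prototype.matchAll, used below, was only added in ECMAScript 2020):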

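const re = /\p{Alphabetic}\p{M}+/gu;
// "sta\u0308ndig" is "ständig" with a decomposed ä (a + U+0308 COMBINING DIAERESIS)
const matches = "sta\u0308ndig".matchAll(re);
// Map each match to its [start, end) range in UTF-16 code units
console.log([...matches].map(x => [x.index, x.index + x[0].length]));
// => [ [2,4] ]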

Here,

  • \p{Alphabetic} - matches any character with the Unicode Alphabetic property (roughly, any letter)
  • \p{M}+ - matches one or more combining marks (diacritics).
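Note that this pattern matches every letter + mark sequence, whether or not a precomposed equivalent exists. One possible way to narrow it down, sketched here as an assumption rather than as part of the original answer, is to normalize each match to NFC and keep only the sequences that become shorter. The function name `findComposableRanges` is hypothetical, and the loop uses `RegExp.prototype.exec` so that it does not rely on `matchAll`.

```javascript
// Return [start, end) ranges (in UTF-16 code units) of letter + mark sequences
// that NFC normalization would shorten, i.e. sequences for which at least one
// mark can be composed with its base letter.
function findComposableRanges(text) {
  const re = /\p{Alphabetic}\p{M}+/gu;
  const ranges = [];
  let match;
  while ((match = re.exec(text)) !== null) {
    const sequence = match[0];
    const composed = sequence.normalize("NFC");
    // A shorter NFC form means some base + mark pair has a precomposed character.
    if (composed.length < sequence.length) {
      ranges.push([match.index, match.index + sequence.length]);
    }
  }
  return ranges;
}

// "a" + U+0308 composes to U+00E4, so it is reported;
// "q" + U+0308 has no precomposed form, so it is skipped.
console.log(findComposableRanges("sta\u0308ndig q\u0308"));
// => [ [ 2, 4 ] ]
```

Comparing lengths rather than testing for inequality is a deliberate choice in this sketch: NFC may also reorder marks without composing anything, and such sequences would otherwise be flagged as well.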
Wiktor Stribiżew answered Sep 10 '25 09:09
