I'm targeting ECMAScript 2019, which has quite decent Unicode support. I'm creating a simple text viewer where I want to inform the user that a certain decomposed character could be converted into a precomposed one. How can I find these characters?
An example. Consider the letter below. First we have the decomposed version, then the precomposed.
ä - 0061 0308 in UTF-16
ä - 00E4 in UTF-16
Now, mixing these seemingly identical characters causes some problems. When the user searches for "ä", not all expected occurrences will be found, as this regex demonstrates:

Here we got three matches. Confusing! Similarly, searching this text for "ä" would only give one match.
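The mismatch can be reproduced with a minimal sketch. The sample string and the `countMatches` helper below are hypothetical, written only for illustration, and are not the text or regex from the original demonstration. Escape sequences are used so that the two forms survive copy and paste.

```javascript
// Hypothetical sample: the same visible letter in two encodings.
const decomposed = "a\u0308";   // "a" followed by COMBINING DIAERESIS
const precomposed = "\u00E4";   // LATIN SMALL LETTER A WITH DIAERESIS
const text = `${decomposed} vs ${precomposed}`;

// Hypothetical helper: count literal occurrences of needle in haystack.
const countMatches = (haystack, needle) => haystack.split(needle).length - 1;

console.log(decomposed === precomposed);      // false: different code units
console.log(countMatches(text, precomposed)); // finds only the precomposed one
console.log(countMatches(text, decomposed));  // finds only the decomposed one
```

Each search finds one occurrence, although a reader sees the same letter twice.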
The question. To help the user understand what's going on, I want to highlight any decomposed characters that have a suitable precomposed version. Thus I need to find the start and end of these character groups.
How can I accomplish this?
You can use a letter + a diacritic mark pattern with the ECMAScript 2018+ compliant RegExp:
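const re = /\p{Alphabetic}\p{M}+/ug;
// Note: String.prototype.matchAll is ES2020; under ES2019 use a re.exec() loop instead.
const matches = "sta\u0308ndig".matchAll(re); // "ä" in decomposed form: a + U+0308
console.log([...matches].map(x => [x.index, x.index + x[0].length]));
// => [ [2,4] ]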
Here,
\p{Alphabetic} - matches any letter
\p{M}+ - any one or more diacritic marks.
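The pattern above matches every letter followed by marks, including sequences that have no precomposed form. One possible way to narrow the result, sketched here under the assumption that "has a suitable precomposed version" means NFC normalization yields fewer code points, is to pass each match through `String.prototype.normalize`. The helper name `findComposableRanges` and the shape of the returned objects are hypothetical and not part of the original answer. The sketch uses a `RegExp.prototype.exec` loop so that it does not depend on `matchAll`.

```javascript
// Hypothetical helper: return the ranges of letter + mark sequences
// that NFC normalization would turn into a shorter sequence.
function findComposableRanges(text) {
  const re = /\p{Alphabetic}\p{M}+/gu;
  const ranges = [];
  let match;
  while ((match = re.exec(text)) !== null) {
    const decomposed = match[0];
    const composed = decomposed.normalize("NFC");
    // Compare code point counts, not string equality: NFC may also
    // reorder marks without composing anything.
    if ([...composed].length < [...decomposed].length) {
      ranges.push({
        start: match.index,                   // inclusive, in UTF-16 code units
        end: match.index + decomposed.length, // exclusive
        decomposed,
        composed,
      });
    }
  }
  return ranges;
}

// "a" + U+0308 composes to U+00E4 and is reported at [2, 4);
// "q" + U+0308 has no precomposed form and is skipped.
console.log(findComposableRanges("sta\u0308ndig and q\u0308"));
```

The `start` and `end` offsets can be used directly for highlighting, and `composed` holds the replacement that could be offered to the user.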