I'm targeting ECMAScript 2019, where we have quite decent Unicode support. I'm creating a simple text viewer where I want to inform the user that a certain decomposed character could be converted into a precomposed one. How can I find these characters?
An example. Consider the letter below. First we have the decomposed version, then the precomposed.
ä (decomposed)
- 0061 0308 in UTF-16
ä (precomposed)
- 00E4 in UTF-16
Now, mixing these seemingly identical characters causes some problems. When the user searches for "ä", not all expected occurrences will be found, as this regex demonstrates:
Here we got three matches. Confusing! Similarly, searching this text for "ä" would only give one match.
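The mismatch can be reproduced with a small sketch. The sample text below is hypothetical (it is not the text from the original demo) and assumes three precomposed "ä" and one decomposed "ä"; `countOf` is an illustrative helper, not a built-in.

```javascript
// Hypothetical sample: three precomposed "ä" (U+00E4) and one decomposed "a" + U+0308.
const sample = "B\u00E4r, K\u00E4se, M\u00E4dchen, sta\u0308ndig";

const precomposed = "\u00E4";
const decomposed = "a\u0308";

// Illustrative helper: counts non-overlapping occurrences of needle in text.
const countOf = (needle, text) => text.split(needle).length - 1;

// A plain search only finds the form it was given.
console.log(countOf(precomposed, sample)); // 3
console.log(countOf(decomposed, sample));  // 1

// After normalizing the text to NFC, all four occurrences use the same form.
console.log(countOf(precomposed, sample.normalize("NFC"))); // 4
```

Both forms render identically, which is why the differing counts look confusing to the user.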
The question. To help the user understand what's going on, I want to highlight any decomposed characters that have a suitable precomposed version. Thus I need to find the start and end of these character groups.
How can I accomplish this?
You can use a letter + a diacritic mark pattern with the ECMAScript 2018+ compliant RegExp:
const re = /\p{Alphabetic}\p{M}+/ug;
const matches = "sta\u0308ndig".matchAll(re); // "ä" in decomposed form: a + U+0308 (two UTF-16 code units)
console.log([...matches].map(x => [x.index, x.index + x[0].length]));
// => [ [2,4] ]
Here,
\p{Alphabetic} - matches any letter
\p{M}+ - matches one or more diacritic marks.
Note that the u flag is required for the \p{...} property escapes to work.
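The pattern matches any letter followed by combining marks, whether or not a precomposed form exists. If only sequences with a precomposed equivalent should be highlighted, one possible refinement is to check each match with `String.prototype.normalize("NFC")`. This is a minimal sketch, not part of the original answer: `findComposable` is a hypothetical helper, and it assumes "has a precomposed version" means the sequence collapses to a single code point under NFC.

```javascript
// Letter followed by one or more combining marks.
const composableRe = /\p{Alphabetic}\p{M}+/gu;

// Hypothetical helper: returns [start, end, composed] for every match
// whose NFC form is exactly one code point.
function findComposable(text) {
  const ranges = [];
  for (const m of text.matchAll(composableRe)) {
    const composed = m[0].normalize("NFC");
    // Spread iterates by code point, so this counts code points, not UTF-16 units.
    if ([...composed].length === 1) {
      ranges.push([m.index, m.index + m[0].length, composed]);
    }
  }
  return ranges;
}

// "a" + U+0308 composes to U+00E4; "q" + U+0308 has no precomposed form and is skipped.
console.log(findComposable("sta\u0308ndig q\u0308"));
```

The start and end offsets are UTF-16 code unit indices, so they can be passed directly to `String.prototype.slice` when building the highlight.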