I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:
'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.
I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/
Is there a shorter, better way to do this, or is matching Unicode combining diacritics with JavaScript regexes simply not that easy? Thank you.
To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. E.g. \u00E0 matches à, but only when it is encoded as the single code point U+00E0.
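A minimal sketch of that last point, with variable names made up for illustration: the same visible à can be stored as one code point or as a base letter plus a combining accent, and \u00E0 only matches the first form.

```javascript
// The same visible character, stored two different ways.
const precomposed = '\u00E0';   // à as the single code point U+00E0
const decomposed = 'a\u0300';   // a followed by U+0300 COMBINING GRAVE ACCENT

const re = /\u00E0/;

const matchesPrecomposed = re.test(precomposed); // true
const matchesDecomposed = re.test(decomposed);   // false: no U+00E0 in this string

console.log(matchesPrecomposed, matchesDecomposed);
```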
As mentioned in other answers, JavaScript regexes without the u flag have no support for Unicode character classes.
\u000d — Carriage return — \r. \u2028 — Line separator. \u2029 — Paragraph separator.
The u flag enables Unicode support in regular expressions. That means two things: characters that take 4 bytes (a surrogate pair) are handled correctly, as a single character rather than two 2-byte characters; and Unicode properties can be used in the search: \p{…}.
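A small sketch of both effects, assuming an engine that supports the u flag and \p{…} (ES2018 or later). The sample strings are made up for illustration.

```javascript
// Effect 1: a 4-byte character is one character only under the u flag.
const astral = '\u{1F600}';                 // U+1F600, stored as a surrogate pair
const oneCharWithoutU = /^.$/.test(astral); // false: '.' sees a single code unit
const oneCharWithU = /^.$/u.test(astral);   // true: '.' sees the whole code point

// Effect 2: Unicode properties. A letter followed by any number of
// combining marks is treated as one unit here.
const sample = 'e\u0323\u0301 o\u0300';     // e + dot below + acute, space, o + grave
const clusters = sample.match(/\p{L}\p{M}*/gu);

console.log(oneCharWithoutU, oneCharWithU, clusters.length);
```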
Normally the solution would be to use Unicode properties and/or scripts, but JavaScript only supports them natively in engines that implement the u flag with \p{…} (ES2018 and later); older engines do not support them.
But there exists the lib XRegExp that adds this support. With this lib you can use
\p{L}: to match any kind of letter from any language.
\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
So your character class would look like this:
[\p{L}\p{M}]+
that would match all possible letters that are in the Unicode table.
If you want to limit it, you can have a look at Unicode scripts and replace \p{L} with a script. Scripts collect all letters from certain languages, e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
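For comparison, here is a sketch of the same character classes using native regex syntax rather than XRegExp. It assumes an ES2018+ engine, where the u flag is required and scripts are written as \p{Script=Latin} rather than \p{Latin}. The sample text is made up for illustration.

```javascript
// "ilẹ́ 42 да": a Latin word with combining marks, digits, a Cyrillic word.
const text = 'ile\u0323\u0301 42 \u0434\u0430';

// Any letter or combining mark, from any script.
const anyLetters = text.match(/[\p{L}\p{M}]+/gu);

// Latin letters only. Combining marks have the script "Inherited",
// so \p{M} is kept in the class to pick up the diacritics.
const latinOnly = text.match(/[\p{Script=Latin}\p{M}]+/gu);

console.log(anyLetters.length, latinOnly.length);
```

Keeping \p{M} alongside the script is a deliberate choice here: without it the combining dot and accent would split the word.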
Usually this is made by combining an 'é' with a '\u0323' under dot diacritic
However, that isn't what you have here: 'ẹ́' is not U+0065,U+0323 but U+1EB9,U+0301 - combining an ẹ with an acute diacritic.
The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
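A minimal sketch of normalising before comparing, assuming an engine that provides String.prototype.normalize (ES2015 or later). The two spellings below are the ones discussed above, written with escapes so the code points are unambiguous.

```javascript
// Two spellings of the same visible character.
const acuteThenDot = '\u00E9\u0323'; // é followed by combining dot below
const dotThenAcute = '\u1EB9\u0301'; // ẹ followed by combining acute

const equalRaw = acuteThenDot === dotThenAcute; // false: different code points

// After normalising both sides to NFC they are the same string.
const left = acuteThenDot.normalize('NFC');
const right = dotThenAcute.normalize('NFC');
const equalNormalised = left === right; // true

// A pattern written in NFC now matches either input once it is normalised.
const pattern = /\u1EB9\u0301/;
const matchesAfterNormalising = pattern.test(left) && pattern.test(right);

console.log(equalRaw, equalNormalised, matchesAfterNormalising);
```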
I don't just want to match e. I want to match all combinations
Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
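As a sketch of that approach, assuming an engine with String.prototype.normalize and \p{…} support (ES2018 or later). The helper name stripDiacritics is hypothetical, chosen only for this example.

```javascript
// Hypothetical helper: decompose to NFD, then drop every combining mark.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

const plainE = stripDiacritics('\u1EB9\u0301');            // ẹ́ becomes "e"
const plainWord = stripDiacritics('\u1ECC\u0300\u1E63un'); // Ọ̀ṣun becomes "Osun"

console.log(plainE, plainWord);
```

Note that this removes all combining marks, so it suits accent-insensitive matching, not cases where the tone marks carry meaning you need to keep.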
Unfortunately normalisation was not available in JS before ES2015 added String.prototype.normalize, so in older engines you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking up characters based on Unicode properties like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.
Server-side languages (e.g. Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server, that would generally be easier.