JavaScript Regex + Unicode Diacritic Combining Characters

I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:

'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.

I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/

Is there a shorter, and thus better, way to do this, or does regex matching of Unicode combining diacritics in JavaScript not work this easily? Thank you.

asked Jun 28 '13 05:06

user2530580

2 Answers

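Normally the solution would be to use Unicode properties and/or scripts, but JavaScript regexes have not always supported them natively (property escapes such as \p{L} only arrived with ES2018 and need the u flag).

For engines without that support, the XRegExp library adds it. With this lib you can use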

\p{L}: to match any kind of letter from any language.

\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

So your character class would look like this:

[\p{L}\p{M}]+

This matches any run of letters and combining marks defined in Unicode.
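For what it's worth, engines that implement ES2018 accept the same class natively as long as the regex carries the u flag. A minimal sketch under that assumption; the helper name findWords and the sample string are made up for illustration and are not part of XRegExp:

```javascript
// Assumes an ES2018+ engine: \p{...} is only recognised with the `u` flag.
const lettersAndMarks = /[\p{L}\p{M}]+/gu;

// Hypothetical helper: return every run of letters plus combining marks.
function findWords(text) {
  return text.match(lettersAndMarks) || [];
}

// The same Yoruba letter written two ways:
// U+1EB9 U+0301, and e + U+0323 + U+0301, separated by digits.
const sample = '\u1EB9\u0301 12 e\u0323\u0301';
console.log(findWords(sample).length); // 2
```

The digits are skipped because they are neither \p{L} nor \p{M}, while both spellings of the letter are picked up whole.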

If you want to limit it, have a look at Unicode scripts and replace \p{L} with a script. Scripts group the letters used by particular writing systems, e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
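In native ES2018 regexes the script test is spelled \p{Script=Latin} rather than the \p{Latin} form shown for XRegExp. Combining marks such as U+0301 belong to the Inherited script, not Latin, so the sketch below keeps \p{M} in the class. It assumes an ES2018+ engine, and isLatinWord is a hypothetical name:

```javascript
// Assumes an ES2018+ engine with Unicode property escapes (`u` flag).
// Latin-script letters, plus any combining marks attached to them.
const latinWord = /^[\p{Script=Latin}\p{M}]+$/u;

// Hypothetical helper: true if the text is only Latin letters and marks.
function isLatinWord(text) {
  return latinWord.test(text);
}

console.log(isLatinWord('\u1EB9\u0301'));       // true  (U+1EB9 U+0301)
console.log(isLatinWord('\u043F\u0440\u0438')); // false (Cyrillic letters)
```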

answered Sep 21 '22 13:09

stema

Usually this is made by combining an 'é' with a '\u0323' under dot diacritic

However, that isn't what you have here:

'ẹ́'

That's not U+00E9,U+0323 (an é followed by a combining dot below) but U+1EB9,U+0301: an ẹ combined with an acute diacritic.
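A quick way to check which sequence a string really contains is to dump its code points. A small sketch, with the two spellings written as escapes so nothing depends on how the page was copied; codePoints is a hypothetical helper name:

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// What the question describes: é followed by a combining dot below.
console.log(codePoints('\u00E9\u0323').join(',')); // U+00E9,U+0323
// What was actually pasted: ẹ followed by a combining acute.
console.log(codePoints('\u1EB9\u0301').join(',')); // U+1EB9,U+0301
```

The two strings render the same but share no code points, which is why a pattern written for one spelling does not match the other.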

The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
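As a rough illustration of that idea, here is a sketch that assumes an engine with String.prototype.normalize (standardised in ES2015); on anything older a library would have to supply the normalisation step. The helper name sameText is made up:

```javascript
// Assumes String.prototype.normalize exists (ES2015+).
// Hypothetical helper: compare two strings after normalising both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const typed = '\u00E9\u0323'; // é + combining dot below
const found = '\u1EB9\u0301'; // ẹ + combining acute

console.log(typed === found);        // false
console.log(sameText(typed, found)); // true

// A pattern written in NFC then matches either spelling once the input is normalised.
console.log(/\u1EB9\u0301/.test(typed.normalize('NFC'))); // true
```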

I don't just want to match e. I want to match all combinations

Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
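A minimal sketch of that approach, assuming an engine that has both String.prototype.normalize (ES2015) and \p{M} property escapes (ES2018); stripDiacritics is a hypothetical name:

```javascript
// Assumes ES2015 normalize() and ES2018 \p{...} escapes are available.
// Hypothetical helper: decompose to NFD, then drop every combining mark.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripDiacritics('\u1EB9\u0301'));             // "e"
console.log(stripDiacritics('\u1ECC\u0300\u1E63un'));     // "Osun"

// Matching a base letter with whatever marks follow it also works on NFD text.
console.log(/^e\p{M}*$/u.test('\u00E9\u0323'.normalize('NFD'))); // true
```

Note that \p{M} removes every combining mark, not only the ones used in Yoruba, so this is only appropriate where a fully unaccented comparison is wanted.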

Unfortunately normalisation was not built into JS for a long time (String.prototype.normalize only arrived with ES2015), so on older engines you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking out characters based on Unicode properties, like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.

Server-side languages (e.g. Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server, that would generally be easier.

answered Sep 18 '22 13:09

bobince

Tags: javascript, regex, unicode, diacritics
