Concrete Javascript Regex for Accented Characters (Diacritics)

Tags:

javascript

regex

unicode

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.

This was my original version, until I wanted to add diacritic support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:

Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):

var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/

This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

My other approach was to use the `.` character class, to have a simpler expression:

var regex = /^.+,\s.+$/;

This would match for just about anything, at least in the form of: something, something. That's alright I suppose...

The last approach, which I just found might be simpler...

/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/

It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:

The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.

Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about out-of-Latin-character-set characters

Which of these three approaches is most suited for the task? Or are there better solutions?
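To make concerns 2 and 3 concrete, here is a minimal sketch that runs the second and third regexes from the question against a few sample inputs. The sample names are hypothetical and only chosen to show where the two patterns disagree.

```javascript
// Approaches 2 and 3 from the question, unchanged.
const dotRegex = /^.+,\s.+$/;
const rangeRegex = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Hypothetical sample inputs, written with \u escapes so file encoding does not matter.
const samples = [
  'M\u00FCller, J\u00FCrgen', // Müller, Jürgen
  "O'Brien, Se\u00E1n",       // apostrophe in the last name
  'van der Berg, Anna',       // multi-word last name
  '1234, !!!!',               // clearly not a name
];

// Record what each regex says about each input.
const results = samples.map((input) => ({
  input,
  dot: dotRegex.test(input),
  range: rangeRegex.test(input),
}));

console.table(results);
```

The `.` version accepts all four inputs, including the last one, while the range version accepts only the first: it rejects the apostrophe and the spaces as well as the digits.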


asked Dec 19 '13 19:12

Chris Cirefice

9 Answers

An easier way to accept all accents is this:

[A-zÀ-ú] // lowercase and uppercase letters, accented ones up to ú (also matches [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, plus û ü ý þ ÿ
[A-Za-zÀ-ÿ] // as above, but without [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but without × ÷

See Unicode Character Table for characters listed in numeric order.
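The difference between `A-z` and `A-Za-z` is easy to check. The sketch below (variable names are illustrative) tests the six ASCII characters that sit between `Z` and `a`, plus the multiplication sign, against the loosest and the strictest class from the list above.

```javascript
// The six ASCII characters between "Z" (U+005A) and "a" (U+0061).
const between = ['[', '\\', ']', '^', '_', '`'];

// A-z spans the punctuation above; \u00C0-\u00FF is the range written as À-ÿ.
const loose = /^[A-z\u00C0-\u00FF]+$/;
// Same as [A-Za-zÀ-ÖØ-öø-ÿ]: two letter ranges, × (U+00D7) and ÷ (U+00F7) skipped.
const strict = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

const looseHits = between.filter((ch) => loose.test(ch));
const strictHits = between.filter((ch) => strict.test(ch));

console.log(looseHits.length);            // 6
console.log(strictHits.length);           // 0
console.log(loose.test('\u00D7'));        // true  (× is accepted)
console.log(strict.test('\u00D7'));       // false
console.log(strict.test('M\u00FCller'));  // true  (Müller)
```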

answered Oct 03 '22 05:10

Maycow Moura

The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars

I added these code blocks (\u00C0-\u024F includes three adjacent blocks at once):

\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional

Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.

[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷

If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
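A quick check of the "Șenol" case, as a sketch: the two regexes are the range from the question and the extended range above (with × and ÷ excluded), and the variable names are made up for illustration.

```javascript
// Range from the question: Latin-1 letters plus Latin Extended-A.
const upToExtendedA = /^[a-zA-Z\u00C0-\u017F]+$/;
// Extended range from this answer: adds Latin Extended-B, skips × and ÷.
const upToExtendedB = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

const commaBelow = '\u0218enol'; // Șenol, S with comma below (U+0218)
const cedilla = '\u015Eenol';    // Şenol, S with cedilla (U+015E)

console.log(upToExtendedA.test(commaBelow)); // false, U+0218 is past U+017F
console.log(upToExtendedA.test(cedilla));    // true
console.log(upToExtendedB.test(commaBelow)); // true
console.log(upToExtendedA.test('\u00D7'));   // true, × slips through
console.log(upToExtendedB.test('\u00D7'));   // false
```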

answered Oct 03 '22 04:10

Chaim Leib Halbert

Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I'm seeing here is not diacritics, but whitespace. Some names consist of multiple words, e.g. because of titles. So you should go with the most generic pattern, that is, allowing everything but the comma that distinguishes first from last name:

/[^,]+,\s[^,]+/

But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
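One thing to watch for: the pattern as written has no `^` and `$`, so it also succeeds when only part of the input has the right shape. The sketch below compares it with an anchored variant; the anchors are my assumption about what a form validator wants, not part of the pattern above.

```javascript
// The pattern as written: matches anywhere inside the input.
const unanchored = /[^,]+,\s[^,]+/;
// Assumed variant for validation: the whole input must have the shape "last, first".
const anchored = /^[^,]+,\s[^,]+$/;

console.log(unanchored.test('a, b, c'));                 // true, "a, b" matches as a substring
console.log(anchored.test('a, b, c'));                   // false, a second comma is not allowed
console.log(anchored.test('van der Berg, Anna Maria'));  // true, multi-word names are fine
console.log(anchored.test('Garc\u00EDa, Jos\u00E9'));    // true, diacritics need no special handling
console.log(anchored.test('Smith,John'));                // false, the whitespace after the comma is required
```

Note that `\s` matches exactly one whitespace character of any kind (including a tab), so "Smith,  John" with two spaces still passes because `[^,]+` absorbs the second space.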

answered Oct 03 '22 05:10

Bergi

The XRegExp library has a plugin named Unicode that helps solve tasks like this.

<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
  var unicodeWord = XRegExp("^\\p{L}+$");

  unicodeWord.test("Русский"); // true
  unicodeWord.test("日本語"); // true
  unicodeWord.test("العربية"); // true
</script>

answered Oct 03 '22 04:10

thorn0

You can use this:

/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/

answered Oct 03 '22 04:10

alchn

/^[\pL\pM\p{Zs}.-]+$/u

Explanation:

\pL - matches any kind of letter from any language
\pM - matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Zs} - matches a whitespace character that is invisible, but does take up space
u - Pattern and subject strings are treated as UTF-8

Unlike other proposed regex (such as [A-Za-zÀ-ÖØ-öø-ÿ]), this will work with all language specific characters, e.g. Šš is matched by this rule, but not matched by others on this page.

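When this answer was written, JavaScript did not support these classes natively. Engines that implement ES2018 accept \p{L}, \p{M} and \p{Zs} in a regex with the u flag, but only in the braced form (\pL without braces is a syntax error there). For older engines you can use xregexp, e.g.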

const XRegExp = require('xregexp');

const isInputRealHumanName = (input: string): boolean => {
  return XRegExp('^[\\pL\\pM-]+ [\\pL\\pM-]+$', 'u').test(input);
};
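For engines with ES2018 Unicode property escapes, a native version might look like the sketch below. It assumes the "last, first" format from the question and reuses the character class from the top of this answer; `lastFirst` and `compiles` are illustrative names, not part of any library.

```javascript
// Native ES2018 property escapes: the u flag and the braced form are both required.
// Class contents: any letter, any combining mark, space separators, dot, hyphen.
const lastFirst = /^[\p{L}\p{M}\p{Zs}.-]+,\s[\p{L}\p{M}\p{Zs}.-]+$/u;

// Helper that reports whether a pattern compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    return false;
  }
}

console.log(compiles('\\pL', 'u'));    // false, the short form is rejected in Unicode mode
console.log(compiles('\\p{L}', 'u'));  // true

console.log(lastFirst.test('\u0160imi\u0107, \u017Deljko'));   // true  (Šimić, Željko)
console.log(lastFirst.test('Nguye\u0302\u0303n, An'));         // true  (combining marks are \p{M})
console.log(lastFirst.test('van der Berg, Anna'));             // true  (spaces are \p{Zs})
console.log(lastFirst.test('Smith, J0hn'));                    // false (digits are not letters)
```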

answered Oct 03 '22 05:10

Gajus

You can use this:

^([a-zA-Z]|[à-ú]|[À-Ú])+$

It matches any word made of letters, with or without accents. Note that the ranges à-ú and À-Ú also include ÷ and ×.

answered Oct 03 '22 03:10

Javier Pallarés

You can remove the diacritics from letters by using:

var str = "résumé"
str.normalize('NFD').replace(/[\u0300-\u036f]/g, '') // returns resume

This removes all the diacritical marks; you can then run your regex on the result.

Reference:

Searching and sorting text with diacritical marks in JavaScript
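Combined with the original ASCII-only regex from the question, the normalize approach could be sketched like this (`stripDiacritics` is a made-up helper name). One caveat worth knowing: letters such as ø, ß and æ have no canonical decomposition, so NFD leaves them untouched and they still fail an ASCII-only check.

```javascript
// Decompose to NFD, then drop the combining marks in U+0300 to U+036F.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// The original pattern from the question, before diacritic support was added.
const asciiName = /^[a-zA-Z]+,\s[a-zA-Z]+$/;

console.log(stripDiacritics('r\u00E9sum\u00E9'));                             // "resume"
console.log(asciiName.test(stripDiacritics('Garc\u00EDa, Jos\u00E9')));       // true
console.log(stripDiacritics('S\u00F8rensen') === 'S\u00F8rensen');            // true, ø is unchanged
console.log(asciiName.test(stripDiacritics('S\u00F8rensen, Lars')));          // false
```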

answered Oct 03 '22 03:10

Fawaz Ahmed

From Wikipedia: Basic Latin

For Latin letters, I use

/^[A-zÀ-ÖØ-öø-ÿ]+$/

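It avoids hyphens and most special characters. Note that the A-z range still admits [ \ ] ^ _ and `, which sit between Z and a; write A-Za-z to exclude them.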

answered Oct 03 '22 04:10

Phil
