RegEx with extended latin alphabet (ä ö ü è ß)

Tags:

javascript

regex

node.js

utf-8

I want to do some basic String testing in Node.js. Assume I have a form where users enter their name and I want to check if it's just rubbish or a real name.

Happily (or sadly for my check) I get users from all around the world, which means that their names contain non-English characters, like ä ö ü ß é. I used to use /[A-Za-z -]{2,}/ but this doesn't match names like "Jan Buschtöns".

Do I have to manually add every possible non-English but Latin character to my RegEx for it to work? I don't want a 100+ characters long RegEx like /[A-Za-z -äöüÄÖÜßéÉèÈêÊ...]{2,}/.

971

asked Jul 28 '12 19:07

buschtoens

2 Answers

Check http://www.regular-expressions.info/unicode.html and http://xregexp.com/plugins/

You would need to use \p{L} to match any letter character if you want to include Unicode.

In Unicode terms, the equivalent of \w is then [\p{L}\p{N}_].
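As a rough illustration of that class, here is a minimal sketch using native JavaScript rather than the XRegExp plugin linked above. It assumes an ES2018-capable engine, where \p{...} only works when the regex carries the u flag; the name unicodeWord is made up for this example.

```javascript
// Unicode-aware counterpart of \w: any letter, any number, or an underscore.
// Assumption: ES2018+ engine; the `u` flag is required for \p{...}.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

console.log(unicodeWord.test('Buschtöns')); // true
console.log(unicodeWord.test('名前_42'));    // true
console.log(unicodeWord.test('foo bar'));   // false, the space is not in the class
```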

176

answered Sep 20 '22 04:09

Ωmega

Update: As of ES2018, JavaScript supports Unicode property escapes such as \p{L}, which matches anything that Unicode considers to be a letter. They only work in regular expressions that carry the u flag. All modern browsers and current Node.js versions support this feature, so that's probably the way to go as long as you don't care about ancient browsers.
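A minimal sketch of how the name check from the question might look with property escapes. The helper looksLikeName, the allowed punctuation (space, apostrophe, hyphen) and the NFC normalization step are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: true if the input looks like a plausible name.
function looksLikeName(input) {
  // NFC turns "o" + combining diaeresis into the single letter "ö",
  // which \p{L} can then match.
  const name = input.normalize('NFC').trim();
  // One letter, then one or more letters, spaces, apostrophes or hyphens.
  return /^\p{L}[\p{L} '-]+$/u.test(name);
}

// Variant restricted to the Latin script, if other alphabets should be rejected.
const latinOnly = /^[\p{Script=Latin} '-]{2,}$/u;

console.log(looksLikeName('Jan Buschtöns')); // true
console.log(looksLikeName('Иван Петров'));   // true
console.log(looksLikeName('1234'));          // false
console.log(latinOnly.test('José García'));  // true
console.log(latinOnly.test('Иван Петров'));  // false
```

Both patterns are anchored with ^ and $ so that the whole input has to match, not just a two-character fragment of it.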

Old answer for pre-ES2018 browsers:

The answer depends on exactly what you want to do.

As you have noticed, [A-Za-z] only matches Latin letters without diacritics.

If you only care about German diacritics and the ß ligature, then you can just replace that part with [A-Za-zÄÖÜäöüß], e.g.:

/[A-Za-zÄÖÜäöüß -]{2,}/
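To see what this class accepts and rejects, here is a small sketch. The ^ and $ anchors are an addition for this example so that the whole string is tested; without them, any two matching characters in a row are enough.

```javascript
// German-only class from above, anchored (assumption) to test the full string.
const germanName = /^[A-Za-zÄÖÜäöüß -]{2,}$/;

console.log(germanName.test('Jan Buschtöns')); // true
console.log(germanName.test('José García'));   // false, é and í are not in the class

// The unanchored original still "matches", because "Jos" alone satisfies {2,}.
console.log(/[A-Za-zÄÖÜäöüß -]{2,}/.test('José García')); // true
```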

But that probably isn’t what you want to do. You probably want to match Latin letters with any diacritics, not just those used in German. Or perhaps you want to match any letters from any alphabet, not just Latin.

Other regular expression dialects have character classes to help you with problems like this, but unfortunately JavaScript’s regular expression dialect has very few character classes and none of them help you here.

(In case you don’t know, a “character class” is an expression that matches any character that is a member of a predefined group of characters. For example, \w is a character class that matches any ASCII letter, digit, or underscore, and . matches any character except line terminators.)
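A quick sketch of how those two built-in classes behave in plain JavaScript, without the u flag:

```javascript
// \w covers only A-Z, a-z, 0-9 and underscore.
console.log(/^\w+$/.test('Buschtons'));  // true
console.log(/^\w+$/.test('Buschtöns'));  // false
console.log('Buschtöns'.match(/\w+/g));  // ['Buscht', 'ns']

// The dot matches "ö" but not a line terminator.
console.log(/^.$/.test('ö'));            // true
console.log(/^.$/.test('\n'));           // false
```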

This means that you have to list out every range of UTF-16 code units that corresponds to a character that you want to match.

A quick and dirty solution might be to say [a-zA-Z\u0080-\uFFFF], or in full:

/[A-Za-z\u0080-\uFFFF -]{2,}/

This will match any letter in the ASCII range, but will also match any character at all that is outside the ASCII range. This includes all possible alphabetic characters with or without diacritics in any script. However, it also includes a lot of characters that are not letters. Non-letters in the ASCII range are excluded, but non-letters outside the ASCII range are included.
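The trade-off described above can be checked with a few inputs. This sketch anchors the pattern with ^ and $ (an addition for the example) so that the whole string is tested:

```javascript
// Quick and dirty: ASCII letters plus every code unit from U+0080 upwards.
const quickAndDirty = /^[A-Za-z\u0080-\uFFFF -]{2,}$/;

console.log(quickAndDirty.test('Jan Buschtöns')); // true
console.log(quickAndDirty.test('Иван Петров'));   // true
console.log(quickAndDirty.test('12'));            // false, ASCII digits are excluded
console.log(quickAndDirty.test('€→™'));           // true, although none of these are letters
console.log(quickAndDirty.test('😀'));            // true, the emoji is two surrogate code units
```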

The above might be good enough for your purposes, but if it isn’t then you will have to figure out which character ranges you need and specify those explicitly.
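As one possible example of explicit ranges, the sketch below covers the letters of the Latin-1 Supplement and Latin Extended-A blocks. The choice of blocks is an assumption for illustration; names that use letters from other blocks (Vietnamese, for instance) would need further ranges.

```javascript
// Assumed selection of ranges, for illustration only:
//   \u00C0-\u00D6, \u00D8-\u00F6, \u00F8-\u00FF  Latin-1 letters (skips × and ÷)
//   \u0100-\u017F                                Latin Extended-A
const latinName =
  /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u017F -]{2,}$/;

console.log(latinName.test('Jan Buschtöns')); // true
console.log(latinName.test('Łukasz Żółć'));   // true
console.log(latinName.test('3×4'));           // false
console.log(latinName.test('Иван'));          // false
```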

answered Sep 23 '22 04:09

Daniel Cassidy
