Include special characters like ö,ä,ü in regular expressions [duplicate]

I have the following Regular Expression that does this:

pattern='^.*(?=.{8,})(?=.*[a-zA-Z])(?=.*\d).*$'
  • at least 8 characters
  • at least 1 number
  • at least 1 letter (upper or lower case)

Unfortunately, special characters like the German ä, ö, ü are not included, so inputs like 1234567ä fail. Does anyone know how to get them into this expression? I guess the change belongs in the (?=.*[a-zA-Z]) section. Thank you in advance for your effort.

md7 asked Apr 01 '16 21:04


1 Answer

The answer depends on exactly what you want to do.

As you have noticed, [a-zA-Z] only matches Latin letters without diacritics.

If you only care about German diacritics and the ß ligature, then you can just replace that part with [a-zA-ZäöüÄÖÜß], e.g.:

pattern='^.*(?=.{8,})(?=.*[a-zA-ZäöüÄÖÜß])(?=.*\d).*$'
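To check the German-only variant quickly, here is a minimal sketch that runs the same expression as a JavaScript RegExp literal. The names germanPassword and isValidGerman are made up for this illustration and are not part of the question.

```javascript
// Same expression as the pattern above, written as a RegExp literal.
const germanPassword = /^.*(?=.{8,})(?=.*[a-zA-ZäöüÄÖÜß])(?=.*\d).*$/;

// Hypothetical helper: true when the value satisfies all three rules.
function isValidGerman(value) {
  return germanPassword.test(value);
}

console.log(isValidGerman("1234567ä"));       // true: ä now counts as a letter
console.log(isValidGerman("12345678"));       // false: no letter at all
console.log(isValidGerman("123ä"));           // false: fewer than 8 characters
console.log(isValidGerman("123456\u00E97"));  // false: é (U+00E9) is not in the class
```

The last call shows the limitation: a letter with a diacritic that is not listed in the class is still rejected.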

But that probably isn’t what you want to do. You probably want to match Latin letters with any diacritics, not just those used in German. Or perhaps you want to match any letters from any alphabet, not just Latin.

Other regular expression dialects have character classes to help you with problems like this, but unfortunately JavaScript’s regular expression dialect has very few character classes and, before ES2018 added Unicode property escapes such as \p{L} (which require the u flag), none of them help you here.

(In case you don’t know, a “character class” is an expression that matches any character that is a member of a predefined group of characters. For example, \w is a character class that matches any ASCII letter, digit, or underscore, and . is a character class that matches any character except a line terminator.)
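A small sketch of how those two classes behave in JavaScript may help. The variable names wordChar and anyChar are only for illustration.

```javascript
// \w covers only ASCII letters, digits and the underscore.
const wordChar = /^\w$/;
// . covers any single character except a line terminator.
const anyChar = /^.$/;

console.log(wordChar.test("a"));       // true
console.log(wordChar.test("_"));       // true
console.log(wordChar.test("\u00E4"));  // false: ä (U+00E4) is outside ASCII
console.log(anyChar.test("\u00E4"));   // true
console.log(anyChar.test("\n"));       // false: line terminators are excluded
```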

This means that you have to list out every range of UTF-16 code units that corresponds to a character that you want to match.

A quick and dirty solution might be to say [a-zA-Z\u0080-\uFFFF], or in full:

pattern='^.*(?=.{8,})(?=.*[a-zA-Z\u0080-\uFFFF])(?=.*\d).*$'

This will match any letter in the ASCII range, but will also match any character at all that is outside the ASCII range. This includes all possible alphabetic characters with or without diacritics in any script. However, it also includes a lot of characters that are not letters. Non-letters in the ASCII range are excluded, but non-letters outside the ASCII range are included.
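The trade-off can be seen in a short sketch. The names looseLetter and loosePassword are hypothetical and used here only to illustrate the point.

```javascript
// ASCII letters plus every UTF-16 code unit from U+0080 upwards.
const looseLetter = /[a-zA-Z\u0080-\uFFFF]/;
const loosePassword = /^.*(?=.{8,})(?=.*[a-zA-Z\u0080-\uFFFF])(?=.*\d).*$/;

console.log(looseLetter.test("ä"));  // true: Latin letter with a diacritic
console.log(looseLetter.test("я"));  // true: Cyrillic letter
console.log(looseLetter.test("!"));  // false: ASCII non-letter
console.log(looseLetter.test("€"));  // true, although the euro sign is not a letter

// Consequence: digits plus a non-ASCII symbol pass the "at least 1 letter" rule.
console.log(loosePassword.test("1234567€"));  // true
```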

The above might be good enough for your purposes, but if it isn’t then you will have to figure out which character ranges you need and specify those explicitly.
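As a rough sketch of the explicit approach, the class below lists ASCII letters plus the letters of the Latin-1 Supplement and Latin Extended-A blocks. That choice of blocks is an assumption made for this example, not a recommendation for every use case. The second expression is an alternative for engines that support ES2018, where \p{L} with the u flag matches a letter from any script. The names explicitRanges and anyLetter are made up here.

```javascript
// Explicit ranges: a-z, A-Z, U+00C0-U+00D6, U+00D8-U+00F6, U+00F8-U+017F.
// U+00D7 (×) and U+00F7 (÷) are skipped because they are not letters.
const explicitRanges =
  /^.*(?=.{8,})(?=.*[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F])(?=.*\d).*$/;

// ES2018 alternative: \p{L} is "any letter" and only works with the u flag.
const anyLetter = /^.*(?=.{8,})(?=.*\p{L})(?=.*\d).*$/u;

console.log(explicitRanges.test("1234567ä"));  // true
console.log(explicitRanges.test("1234567×"));  // false: multiplication sign
console.log(explicitRanges.test("1234567€"));  // false: outside the listed ranges
console.log(explicitRanges.test("1234567я"));  // false: Cyrillic is not listed
console.log(anyLetter.test("1234567я"));       // true: я is a letter
console.log(anyLetter.test("1234567€"));       // false: € is a symbol, not a letter
```

The explicit list stays predictable but has to be extended for every additional script, whereas \p{L} delegates that bookkeeping to the engine.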

Daniel Cassidy answered Nov 20 '22 15:11