
Checking for diacritics with a regular expression

Simple problem: an existing project allows me to add additional fields (with additional checks on those fields as regular expressions) to support custom input forms. I need to add a new form but cannot change how this project works. The form lets a visitor enter his first and last name plus initials, so the regex ^[a-zA-Z.]*$ worked just fine until now.
Then someone noticed that it doesn't accept letters with diacritics. A Turkish name like Ömür was rejected as invalid, but it needs to be accepted.

So I have two options:

  1. Remove the check completely, which would allow users to enter garbage.
  2. Write a regular expression that would also include diacritic letters but still no digits, spaces or other non-letters.

Since I cannot change the code of the project, I only have these two options. I would prefer option 2 but now wonder what the proper RegEx should be. (The project is written in C# 4.0.)

Asked by Wim ten Brink on Jan 19 '12 at 09:01



1 Answer

You can use the Unicode category escape for letters, \p{L} (this includes the A-Z and a-z ranges):

^[.\p{L}]*$

From regular-expressions.info:

\p{L} or \p{Letter}

Matches a single Unicode code point that has the property "letter". See Unicode Character Properties in the tutorial for a complete list of properties. Each Unicode code point has exactly one property. Can be used inside character classes.
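
To see how the pattern behaves, here is a minimal C# sketch. The NameCheck class and IsValidName helper are hypothetical names made up for illustration; they are not part of the project in the question, which only needs the pattern string itself. It also shows one caveat: \p{L} matches precomposed letters such as Ö (U+00D6), but not a base letter followed by a separate combining mark (category \p{M}), so decomposed input would need normalizing first.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NameCheck
{
    // Same pattern as in the answer: zero or more characters,
    // each of which is a period or a Unicode letter.
    private static readonly Regex NamePattern = new Regex(@"^[.\p{L}]*$");

    // Hypothetical helper for illustration only.
    public static bool IsValidName(string input)
    {
        return NamePattern.IsMatch(input);
    }

    public static void Main()
    {
        // "Ömür" written with escapes so the source file encoding does not matter.
        Console.WriteLine(IsValidName("\u00D6m\u00FCr")); // True
        Console.WriteLine(IsValidName("J.R.R."));         // True
        Console.WriteLine(IsValidName("Tolkien2"));       // False (digit)
        Console.WriteLine(IsValidName("Jan Smit"));       // False (space)

        // Decomposed form: O + U+0308 (combining diaeresis), u + U+0308.
        string decomposed = "O\u0308mu\u0308r";
        Console.WriteLine(IsValidName(decomposed));       // False (combining marks are not letters)

        // Composing to NFC first turns the pairs into single letters.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(IsValidName(composed));         // True
    }
}
```

If normalizing the input is not possible (as in the question, where only the pattern can change), one option would be to allow combining marks in the class as well: ^[.\p{L}\p{M}]*$. Note that, like the original pattern, the * quantifier also accepts an empty string; use + instead if the field is required.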

Answered by Oded on Oct 17 '22 at 02:10