
Regular expression for validating names and surnames?

Although this seems like a trivial question, I am quite sure it is not :)

I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English ones, I think that this would cut it:

^[a-z '-]+$

However, I also need to support these cases:

  • other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
  • different Unicode letter sets (accented letters, Greek, Japanese, Chinese, and so on)
  • no numbers, symbols, unnecessary punctuation, runes, etc.
  • titles, middle initials, suffixes are not part of this data
  • names are already separated from surnames.
  • we are prepared to force ultra rare names to be simplified (there's a person named '@' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
  • note that many countries have laws about names so there are standards to follow

Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?

I am looking for something similar to the many "email address" regexes that you can find on Google.

Sklivvz asked May 20 '09 16:05


4 Answers

I sympathize with the need to constrain input in this situation, but I don't believe it is possible: Unicode is vast and still expanding, and so is the subset of it used in names throughout the world.

Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.

Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
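
For the Little Bobby Tables part, a minimal sketch of the usual remedy (a parameterized query) might look like the following. It assumes Python's standard `sqlite3` module and an in-memory database; the `people` table and the `store_name` helper are hypothetical names made up for illustration.

```python
import sqlite3


def store_name(conn: sqlite3.Connection, name: str) -> None:
    # The "?" placeholder passes the name as data, never as SQL text,
    # so quotes and semicolons in the name cannot change the statement.
    conn.execute("INSERT INTO people (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT NOT NULL)")

hostile = "Robert'); DROP TABLE people;--"
store_name(conn, hostile)
store_name(conn, "D'Addario")

# Both values come back exactly as entered, and the table still exists.
stored = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY rowid")]
```

With this approach the storage layer places no restriction on which characters a name may contain, so any remaining restrictions come from rendering or downstream systems.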

Chris Cudmore answered Nov 12 '22 03:11


I would just allow everything (except an empty string) and assume users know what their own names are.

There are 2 common cases:

  1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
  2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.

In case (1), you can allow all characters because you're checking against a paper document.

In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
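
As a rough sketch, the "allow everything except an empty string" rule comes down to a single check. The function name `is_acceptable_name` is hypothetical, and treating whitespace-only input as empty is an assumption on top of what is said above.

```python
def is_acceptable_name(raw: str) -> bool:
    # Reject only input that is empty once surrounding whitespace is removed;
    # every other character sequence is accepted as-is.
    return bool(raw.strip())
```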

user9876 answered Nov 12 '22 01:11


I would think you would be better off excluding the characters you don't want with a regex. Trying to whitelist every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Foreman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, a regex only guarantees that the input matches the regex; it will not tell you that it is a valid name.
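
A minimal sketch of that exclusion approach, assuming Python's `re` module. The character set (ASCII digits plus the symbols listed above) and the names `UNWANTED` and `has_unwanted_characters` are illustrative, not a recommended list.

```python
import re

# Hypothetical blacklist: ASCII digits plus the symbols mentioned above.
# Inside a character class, "$" and a non-leading "^" are literal characters.
UNWANTED = re.compile(r"[0-9@#$%^]")


def has_unwanted_characters(name: str) -> bool:
    # True if at least one blacklisted character occurs anywhere in the name.
    return UNWANTED.search(name) is not None
```

Anything not on the list passes, including every letter in every script, which is the point of excluding rather than enumerating. The cost is that a name with a digit in it is rejected.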

EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:

s/[\<\>\"\'\%\;\(\)\&\+]//g;

"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)

v3.72, 2015-09-19 is a more recent version.
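
For readers not using Perl, the same substitution might be sketched in Python as follows. The character set is copied from the expression above; the names `XSS_FILTER` and `strip_filtered` are hypothetical.

```python
import re

# Same character set as the substitution above: < > " ' % ; ( ) & +
XSS_FILTER = re.compile(r"""[<>"'%;()&+]""")


def strip_filtered(text: str) -> str:
    # Delete every occurrence of a filtered character, like s/.../g with an
    # empty replacement.
    return XSS_FILTER.sub("", text)
```

Note that this filter also removes the apostrophe, so a name such as D'Addario would come out as DAddario. That is one reason to treat it as a starting point only.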

kscott answered Nov 12 '22 02:11


I'll try to give a proper answer myself:

The only punctuation marks that should be allowed in a name are the full stop, the apostrophe and the hyphen. I haven't seen any others in the list of corner cases.

Regarding numbers, there's only one case with an 8. I think I can safely disallow that.

Regarding letters, any letter is valid.

I also want to include space.

This would sum up to this regex:

^[\p{L} \.'\-]+$

This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.

So the validation code should be something like this (untested):

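var name = nameParam.Trim();
// Verbatim string (@"...") so that \p, \. and \- reach the regex engine as written;
// the apostrophe is allowed here and encoded below.
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("Invalid name.", "nameParam");
name = name.Replace("'", "&#39;");  // &apos; does not work in IE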

Can anyone think of a reason why a name should not pass this test, or of an XSS or SQL injection that could pass?


Complete tested solution:

using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var names = new string[]{"Hello World", 
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam+" ");
                var name = nameParam.Trim();
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}
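
For comparison, here is a rough Python sketch of the same whitelist. Python's standard `re` module does not support `\p{L}` or `\p{M}`, so this version checks each character's Unicode general category with `unicodedata` instead. The names `ALLOWED_PUNCTUATION` and `is_valid_name` are hypothetical, and results for rare characters may vary with the Unicode version bundled with the interpreter.

```python
import unicodedata

# Space, full stop, apostrophe and hyphen, as in the regex above.
ALLOWED_PUNCTUATION = {" ", ".", "'", "-"}


def is_valid_name(raw: str) -> bool:
    name = raw.strip()
    if not name:
        # The regex uses "+", so at least one character is required.
        return False
    for ch in name:
        category = unicodedata.category(ch)  # e.g. "Lu", "Lo", "Mn", "Mc"
        # Accept any letter (L*), any combining mark (M*), or allowed punctuation.
        if category[0] in ("L", "M") or ch in ALLOWED_PUNCTUATION:
            continue
        return False
    return True
```

As with the regex, an input made only of allowed punctuation (such as `' --`) still passes, so the check limits the character set without showing that the input is a plausible name.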
Sklivvz answered Nov 12 '22 03:11