
Regular expression to check valid property name in c#

Tags:

c#

regex

I need to validate user input for a property name to retrieve.

For example user can type "Parent.Container" property for windows forms control object or just "Name" property. Then I use reflection to get value of the property.

What I need is to check if user typed legal symbols of c# property (or just legal word symbols like \w) and also this property can be composite (contain two or more words separated with dot).

I have this as of now, is this a right solution?

^([\w]+\.)+[\w]+$|([\w]+)

I used Regex.IsMatch method and it returned true when I passed "?someproperty", though "\w" does not include "?"

Vasiliy asked Nov 14 '15 17:11


2 Answers

I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.

Clarifying what we want

First we need to decide which kind of "valid" we want: valid according to the runtime, or valid according to the language? Examples:

  • Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
  • For verbatim identifiers (@ prefix) the language treats the @ as part of the identifier, but the runtime doesn't see it.

Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.

"Valid" based on the runtime

C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA 334 spec. Its rule says:

The rules for identifiers given in this subclause correspond exactly to those recommended by the Unicode Standard Annex 15 except that underscore is allowed as an initial character (as is traditional in the C programming language), Unicode escape sequences are permitted in identifiers, and the “@” character is allowed as a prefix to enable keywords to be used as identifiers.

The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:

<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*

<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]

<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]

The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:

@"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
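
As a quick illustration, here is a minimal sketch of applying that pattern with Regex.IsMatch. The helper name IsRuntimeIdentifier and the sample inputs are my own, not from any spec, and the code assumes C# top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer text above.
const string RuntimeIdentifierPattern =
    @"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";

// Hypothetical helper: checks a single identifier, not a dotted path.
bool IsRuntimeIdentifier(string name) =>
    Regex.IsMatch(name, RuntimeIdentifierPattern, RegexOptions.CultureInvariant);

Console.WriteLine(IsRuntimeIdentifier("Name"));             // True
Console.WriteLine(IsRuntimeIdentifier("_private"));         // True
Console.WriteLine(IsRuntimeIdentifier("Foo\u0123Bar"));     // True (the string holds the letter itself, not an escape)
Console.WriteLine(IsRuntimeIdentifier("9lives"));           // False: a digit cannot start an identifier
Console.WriteLine(IsRuntimeIdentifier("?someproperty"));    // False
Console.WriteLine(IsRuntimeIdentifier("Parent.Container")); // False: the dot is not part of an identifier
```

Note that this checks one identifier at a time; a dotted path has to be split first.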

All the ugly details

The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforces it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0112\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input to be done with it.
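
To make the normalization point concrete, here is a small sketch using a different pair of strings (my own example, not from the spec): two spellings that render identically but only one of which is in Form C. string.Normalize() with no arguments uses Form C.

```csharp
using System;

// Same visible text, two encodings: precomposed U+00E9 versus
// 'e' followed by the combining acute accent U+0301.
string precomposed = "Caf\u00E9";
string decomposed = "Cafe\u0301";

Console.WriteLine(precomposed == decomposed);   // False: ordinal comparison sees different code units
Console.WriteLine(decomposed.IsNormalized());   // False: not in Form C

string normalized = decomposed.Normalize();     // Form C is the default
Console.WriteLine(normalized == precomposed);   // True
```

Normalizing the user's input up front means two inputs that look the same also compare the same.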

The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and El­mo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
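
If you would rather accept such input and compare it the way the spec describes, one option is to strip formatting characters before doing anything else. This is only a sketch: the helper name is made up, and I use the zero-width non-joiner (U+200C, also category Cf) as the sample character rather than the soft hyphen.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every character in Unicode category Cf.
string StripFormattingCharacters(string identifier) =>
    Regex.Replace(identifier, @"\p{Cf}", string.Empty);

string withJoiner = "El\u200Cmo";   // five characters, one of them invisible

Console.WriteLine(withJoiner.Length);                               // 5
Console.WriteLine(StripFormattingCharacters(withJoiner) == "Elmo"); // True
```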

The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
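
A separate check for that could be as small as the sketch below. The helper name is hypothetical, and whether you reject or merely warn on a match is up to you.

```csharp
using System;

// Hypothetical helper: true when the identifier contains two consecutive underscores.
bool ContainsReservedDoubleUnderscore(string identifier) =>
    identifier.IndexOf("__", StringComparison.Ordinal) >= 0;

Console.WriteLine(ContainsReservedDoubleUnderscore("__arglist")); // True
Console.WriteLine(ContainsReservedDoubleUnderscore("a__b"));      // True
Console.WriteLine(ContainsReservedDoubleUnderscore("my_value"));  // False
```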

Nesting, generics, etc.

Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.

"Valid" based on the language

This is just the previous regex but also allowing for:

  • Prefixed @
  • Unicode escape sequences

The first is a trivial modification: just add @?.

Unicode escape sequences take one of two forms: \u followed by four hex digits, or \U followed by eight, i.e. @"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}". (Use [0-9] rather than \d here: in .NET, \d also matches non-ASCII decimal digits.) We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass to convert the escape sequences: replace each escape sequence with the corresponding character.

So putting it all together, a check for whether a string is valid from a C# language standpoint would be:

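using System;
using System.Globalization;
using System.Text.RegularExpressions;

bool IsValidIdentifier(string input)
{
    if (input is null) { throw new ArgumentNullException(nameof(input)); }

    // Technically the input must be in normal form C. Implementations aren't required
    // to verify that though, so you could remove this check if your runtime doesn't
    // mind.
    if (!input.IsNormalized())
    {
        return false;
    }

    // Convert escape sequences to the characters they represent. The allowed forms are
    // \u plus four hex digits or \U plus eight. [0-9A-Fa-f] is used instead of \d because
    // \d also matches non-ASCII digits, which int.Parse would reject.
    MatchEvaluator replacer = (Match match) =>
        {
            string hex = match.Groups[1].Success ? match.Groups[1].Value : match.Groups[2].Value;
            var codepoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            // Leave escapes that don't name a valid Unicode scalar value untouched; the
            // leftover backslash then fails the identifier check below.
            bool isScalar = codepoint >= 0 && codepoint <= 0x10FFFF
                && (codepoint < 0xD800 || codepoint > 0xDFFF);
            return isScalar ? char.ConvertFromUtf32(codepoint) : match.Value;
        };
    var escapeSequencePattern = @"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})";
    var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer, RegexOptions.CultureInvariant);

    // Now do the real check.
    var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
    return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}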

Back to the original question

The asker is long gone, but I feel obliged to include an answer to the actual question:

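// Requires "using System.Linq;". Accepts one or more dot-separated identifiers,
// e.g. "Name" or "Parent.Container". An empty segment (as in "A..B") is rejected
// because the empty string is not a valid identifier.
string[] parts = input.Split('.');
return parts.All(IsValidIdentifier);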
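
Since the question goes on to read the property via reflection, here is a rough sketch of walking a validated dotted path. GetPropertyValue is a name I made up, the anonymous object just stands in for a real control, and this only handles public instance properties. Remember that reflection needs the runtime form of each name (no @ prefix, no escape sequences).

```csharp
using System;
using System.Reflection;

// Hypothetical helper: resolves "A.B.C" one segment at a time.
object GetPropertyValue(object root, string path)
{
    object current = root;
    foreach (string name in path.Split('.'))
    {
        if (current is null) { return null; }   // a null link ends the walk early

        PropertyInfo property = current.GetType().GetProperty(name);
        if (property is null)
        {
            throw new ArgumentException($"No public property '{name}' on {current.GetType().Name}.", nameof(path));
        }
        current = property.GetValue(current);
    }
    return current;
}

// Stand-in for a control with a Parent; not a real Windows Forms object.
var control = new { Name = "button1", Parent = new { Name = "panel1" } };

Console.WriteLine(GetPropertyValue(control, "Name"));        // button1
Console.WriteLine(GetPropertyValue(control, "Parent.Name")); // panel1
```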

Sources

ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7

Matt Tsōnto answered Sep 24 '22 02:09


Not the best, but this will work.

^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$

Note that
* Digits 0-9 are not allowed as the first character
* @ is allowed only as the first character of each name, not anywhere else (the compiler strips it off, though)
* _ is allowed

Edit

Looking at your requirement, the regex below will be more useful, since the input property name need not have @ in it.

^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$
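
To see why the pattern from the question accepted "?someproperty", it helps to run both patterns side by side. In the question's pattern the second alternative, ([\w]+), has no ^ or $ anchors, so IsMatch succeeds as soon as it finds any run of word characters anywhere in the input. A small sketch (variable names are mine):

```csharp
using System;
using System.Text.RegularExpressions;

string questionPattern = @"^([\w]+\.)+[\w]+$|([\w]+)";
string answerPattern = @"^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$";

// The unanchored alternative matches the "someproperty" part and ignores the "?".
Console.WriteLine(Regex.IsMatch("?someproperty", questionPattern));   // True

// The fully anchored pattern must account for every character.
Console.WriteLine(Regex.IsMatch("?someproperty", answerPattern));     // False
Console.WriteLine(Regex.IsMatch("Name", answerPattern));              // True
Console.WriteLine(Regex.IsMatch("Parent.Container", answerPattern));  // True
Console.WriteLine(Regex.IsMatch("Parent..Container", answerPattern)); // False
Console.WriteLine(Regex.IsMatch("1abc", answerPattern));              // False
```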

Arghya C answered Sep 25 '22 02:09