.Net regex: what is the word character \w?

Tags:

c#

.net

regex

Simple question:
What is the pattern for the word character \w in c#, .net?

My first thought was that it matches [A-Za-z0-9_] and the documentation tells me:

Character class    Description          Pattern     Matches
\w                 Matches any          \w          "I", "D", "A", "1", "3"
                   word character.                  in "ID A1.3"

which is not very helpful.
And \w seems to match äöü, too. What else? Is there a better (exact) definition available?

tanascius asked Jun 08 '10 14:06




3 Answers

From the documentation:

Word Character: \w

\w matches any word character. A word character is a member of any of the Unicode categories listed in the following table.

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    • This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.

If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
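
To make the difference between the two modes concrete, here is a minimal sketch. The class name WordCharDemo and the helper IsWordChar are hypothetical names chosen for illustration; only Regex.IsMatch and RegexOptions.ECMAScript come from .NET itself.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: wraps Regex.IsMatch for a single char.
public static class WordCharDemo
{
    // Returns true if the single character c matches \w under the given options.
    // RegexOptions.None gives the default Unicode-aware behavior;
    // RegexOptions.ECMAScript restricts \w to [a-zA-Z_0-9].
    public static bool IsWordChar(char c, RegexOptions options = RegexOptions.None) =>
        Regex.IsMatch(c.ToString(), @"\w", options);
}
```

With this helper, WordCharDemo.IsWordChar('ä') returns true, while WordCharDemo.IsWordChar('ä', RegexOptions.ECMAScript) returns false. The underscore and the ASCII digits match in both modes.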

See also

  • Unicode Character Database
  • Unicode Characters in the 'Punctuation, Connector' Category
polygenelubricants answered Sep 25 '22 15:09


Basically it matches everything that can be considered the intuitive definition of letter in various scripts – plus the underscore and a few other oddballs.

You can find a complete list (at least for the BMP) with the following tiny PowerShell snippet:

0..65535 | ?{([char]$_) -match '\w'} | %{ "$_`: " + [char]$_ }
Joey answered Sep 23 '22 15:09


After some research, it turns out that '\w' in .NET is equivalent to:

using System.Collections.Generic;
using System.Globalization;

public static class Extensions
{
    /// <summary>
    /// The Unicode categories whose characters '\w' matches.
    /// </summary>
    private static readonly HashSet<UnicodeCategory> _wordCategories = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.ConnectorPunctuation,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.NonSpacingMark,
    };

    /// <summary>
    /// Determines whether the specified character is a word character (equivalent to '\w').
    /// </summary>
    /// <param name="c">The character to test.</param>
    /// <returns>true if the character's Unicode category is one of the word categories.</returns>
    public static bool IsWord(this char c) => _wordCategories.Contains(char.GetUnicodeCategory(c));
}

I've written this as an extension method so that it is easy to use on any character c: just invoke c.IsWord(), which returns true if the character is a word character. This should be significantly quicker than using a Regex.

Interestingly, this doesn't appear to match the .NET specification: '\w' matches 938 'NonSpacingMark' characters, which are not mentioned there.

In total this matches 49,760 of the 65,535 characters, so the simple regexes often shown on the web are incomplete.
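
One way to check figures like these on your own runtime is to count the characters that '\w' matches, grouped by Unicode category. The following is a sketch only: the class name WordCategoryCensus and the method CountByCategory are hypothetical, and the exact counts depend on the Unicode version your .NET runtime ships with, so they may differ from the numbers quoted above.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET.
public static class WordCategoryCensus
{
    // Counts, per Unicode general category, the UTF-16 code units
    // (U+0000..U+FFFF) that the default '\w' matches.
    public static SortedDictionary<UnicodeCategory, int> CountByCategory()
    {
        var word = new Regex(@"\w");
        var counts = new SortedDictionary<UnicodeCategory, int>();

        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (!word.IsMatch(c.ToString()))
            {
                continue; // not a word character, skip it
            }

            UnicodeCategory category = char.GetUnicodeCategory(c);
            counts[category] = counts.TryGetValue(category, out int n) ? n + 1 : 1;
        }

        return counts;
    }
}
```

Printing each key and value of WordCategoryCensus.CountByCategory() shows which categories actually occur among the matches. If NonSpacingMark appears in the result, that confirms the observation above for your runtime.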

thargy answered Sep 25 '22 15:09