
How do I specify a wildcard (for ANY character) in a C# regex statement?

Tags:

c#

.net

regex

Trying to use a wildcard in C# to grab information from a webpage source, but I cannot seem to figure out what to use as the wildcard character. Nothing I've tried works!

The wildcard only needs to allow for numbers, but as the page is generated the same every time, I may as well allow for any characters.

Regex statement in use:

Regex guestbookWidgetIDregex = new Regex("GuestbookWidget(' INSERT WILDCARD HERE ', '(.*?)', 500);", RegexOptions.IgnoreCase);

If anyone can figure out what I'm doing wrong, it would be greatly appreciated!

Scott asked Jun 14 '10 00:06


2 Answers

The wildcard character is . (a dot).
To match any number of arbitrary characters, use .* (zero or more of any character) or .+ (one or more of any character).

Note that you need to escape your parentheses: write \\( and \\) in a regular string literal, or \( and \) in an @"" string.
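
A minimal sketch of these points, in case it helps. The call(...) input is a made-up sample, not the asker's page source:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input, not taken from the question.
string input = "call(abc, 123)";

// "." matches any single character except a newline.
bool singleChar = Regex.IsMatch("a1c", "a.c");               // true

// ".*" allows zero characters, ".+" requires at least one.
bool starOnEmpty = Regex.IsMatch("call()", @"call\(.*\)");   // true
bool plusOnEmpty = Regex.IsMatch("call()", @"call\(.+\)");   // false

// Escaped parentheses match literally; the unescaped pair captures.
Match m = Regex.Match(input, @"call\((.*)\)");
Console.WriteLine(m.Groups[1].Value);                        // abc, 123

// The same pattern in a regular string needs doubled backslashes.
bool samePattern = @"call\((.*)\)" == "call\\((.*)\\)";      // true
```

The @"" form is usually easier to read once a pattern contains more than one or two backslashes.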

SLaks answered Oct 27 '22 03:10


On the dot

In regular expressions, the dot . matches almost any character. The only characters it doesn't normally match are newline characters (in .NET, only \n). For the dot to match every character, you must enable what is called single-line mode (aka "dot all").

In C#, this is specified using RegexOptions.Singleline. You can also embed this as (?s) in the pattern.
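
A short sketch of the difference, assuming a made-up two-line input string:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-line input.
string text = "start\nend";

// By default the dot stops at \n, so the pattern cannot span both lines.
bool plain = Regex.IsMatch(text, "start.*end");                               // false

// RegexOptions.Singleline lets the dot match \n as well.
bool withOption = Regex.IsMatch(text, "start.*end", RegexOptions.Singleline); // true

// The inline (?s) modifier has the same effect as the option.
bool withInline = Regex.IsMatch(text, "(?s)start.*end");                      // true

Console.WriteLine($"{plain} {withOption} {withInline}");
```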

References

  • regular-expressions.info/The Dot Matches (Almost) Any Character

On metacharacters and escaping

The . isn't the only regex metacharacter. The full set is:

(   )   {   }   [   ]   ?   *   +   -   ^   $   .   |   \

Depending on where they appear, if you want these characters to be taken literally (e.g. . as a period), you may need to do what is called "escaping". This is done by preceding the character with a \.

Of course, a \ is also an escape character for C# string literals. To get a literal \, you need to double it in your string literal (i.e. "\\" is a string of length one). Alternatively, C# also has what is called @-quoted string literals, where escape sequences are not processed. Thus, the following two strings are equal:

"c:\\Docs\\Source\\a.txt"
@"c:\Docs\Source\a.txt"

Since \ is used a lot in regular expressions, @-quoting is often used to avoid excessive doubling.
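
Here is a small sketch of both levels of escaping. The "a-txt" and "f(x)" strings are arbitrary examples, and Regex.Escape is a standard .NET helper that the discussion above doesn't otherwise rely on:

```csharp
using System;
using System.Text.RegularExpressions;

// A regular literal needs "\\" for one backslash; a verbatim literal does not.
int backslashLength = "\\".Length;                                      // 1
bool samePath = "c:\\Docs\\Source\\a.txt" == @"c:\Docs\Source\a.txt";   // true

// Unescaped, "." is a wildcard; escaped, it matches only a literal period.
bool wildcardDot = Regex.IsMatch("a-txt", "a.txt");     // true
bool literalDot  = Regex.IsMatch("a-txt", @"a\.txt");   // false

// Regex.Escape adds the backslashes when the text is meant literally.
string escaped = Regex.Escape("f(x)");                  // f\(x\)
Console.WriteLine(escaped);
```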

References

  • regular-expressions.info/Metacharacters
  • MSDN - C# Programmer's Reference - string

On character classes

Regular expression engines allow you to define character classes, e.g. [aeiou] is a character class containing the 5 vowel letters. You can also use the - metacharacter to define a range, e.g. [0-9] is a character class containing all 10 digit characters.

Since digit characters are so frequently used, regex also provides a shorthand notation for it, which is \d. In C#, this will also match decimal digits from other Unicode character sets, unless you're using RegexOptions.ECMAScript where it's strictly just [0-9].
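
To illustrate, here is a minimal sketch. The Arabic-Indic digit is just one convenient example of a non-ASCII decimal digit:

```csharp
using System;
using System.Text.RegularExpressions;

// U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside ASCII.
string arabicThree = "\u0663";

bool vowel = Regex.IsMatch("e", "^[aeiou]$");   // true
bool range = Regex.IsMatch("7", "^[0-9]$");     // true

// By default \d covers every Unicode decimal digit.
bool unicodeDigit = Regex.IsMatch(arabicThree, @"^\d$");                        // true

// With RegexOptions.ECMAScript, \d is limited to [0-9].
bool ecmaDigit = Regex.IsMatch(arabicThree, @"^\d$", RegexOptions.ECMAScript);  // false

// An explicit range never matches the non-ASCII digit.
bool asciiOnly = Regex.IsMatch(arabicThree, "^[0-9]$");                         // false

Console.WriteLine($"{unicodeDigit} {ecmaDigit} {asciiOnly}");
```

If the input should only ever contain ASCII digits, writing [0-9] explicitly makes that intent clear without depending on the options in effect.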

References

  • regular-expressions.info/Character Classes
  • MSDN - Character Classes - Decimal Digit Character

Related questions

  • .NET regex: What is the word character \w

Putting it all together

It looks like the following will work for you:

      @-quoting          digits_      _____anything but ', captured
          |                   / \    /     \
new Regex(@"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);
                           \/                     \/
                         escape (              escape )

Note that I've modified the pattern slightly so that it uses a negated character class instead of reluctant (lazy) wildcard matching. This causes a slight difference in behavior if you allow ' to be escaped in your input string, but neither pattern handles that case perfectly. If you're not allowing ' to be escaped, however, this pattern is definitely better.
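
A quick way to try it, as a sketch only. The source string below is a made-up stand-in for the real page, so the ID and the captured value are assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical page fragment; the ID and the second argument are made up.
string source = "<script>GuestbookWidget('12345', 'abc-token', 500);</script>";

Regex guestbookWidgetIDregex = new Regex(
    @"GuestbookWidget\('\d*', '([^']*)', 500\);",
    RegexOptions.IgnoreCase);

Match match = guestbookWidgetIDregex.Match(source);
if (match.Success)
{
    // Groups[1] holds the text captured by ([^']*).
    Console.WriteLine(match.Groups[1].Value);   // abc-token
}
```

Checking match.Success first avoids reading an empty group when the page layout changes and the pattern no longer matches.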

References

  • regular-expressions.info/An Alternative to Laziness and Capturing Groups
polygenelubricants answered Oct 27 '22 02:10