Here is the input data:
*** INVOICE ***
THE BIKE SHOP
1 NEW ROAD, TOWNVILLE,
SOMEWHERE, UK, AB1 2CD
TEL 01234-567890
To: COUNTER SALE No: 243529 Page: 1
Date: 04/06/10 12:00
Ref: Aiden
Cust No: 010000
Here is a regex that works (Options: singleline, ignorewhitespace, compiled) - it matches immediately and the groups are properly populated:
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust
As soon as I add the 'N' out of Cust No into the rex, parsing the input hangs forever:
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust N
If I add something like "any character" :
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust .
It works, but as soon as I add a fixed character, the rex hangs again:
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust ..:
Can anyone advise why adding something so trivial would cause it to fall over? Can I enable some kind of tracing to watch the matching activity to see if it is getting stuck in a catastrophic backtrack?
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code.
Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, inclusive the newline character itself. Example: "a\+" matches "a+" and not a series of one or "a"s. ^ the caret is the anchor for the start of the string, or the negation symbol.
The $ number language element includes the last substring matched by the number capturing group in the replacement string, where number is the index of the capturing group. For example, the replacement pattern $1 indicates that the matched substring is to be replaced by the first captured group.
With RegexOptions.IgnorePatternWhitespace
, you're telling the engine to ignore whitespaces in your pattern. Thus, when you write Cust No
in the pattern, it really means CustNo
, which doesn't match the input. This is the cause of the problem.
From the documentation:
By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]
The
RegexOptions.IgnorePatternWhitespace
option, or thex
inline option, changes this default behavior as follows:
- Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as
\s
or"\ "
).
So instead of Cust No
, in IgnorePatternWhitespace
mode, you must write Cust\ No
, because otherwise it's interpreted as CustNo
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With