Adding a single character to my .NET RegEx causes it to hang

Tags: .net, regex, freeze

Here is the input data:

                                *** INVOICE ***                                

                              THE BIKE SHOP                              
                      1 NEW ROAD, TOWNVILLE,                       
                          SOMEWHERE, UK, AB1 2CD                          
                        TEL 01234-567890  

 To: COUNTER SALE                                   No:  243529 Page: 1

                                                    Date: 04/06/10 12:00

                                                    Ref:    Aiden   

 Cust No: 010000                 

Here is a regex that works (options: Singleline, IgnorePatternWhitespace, Compiled). It matches immediately and the groups are properly populated:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust 

As soon as I add the 'N' from "Cust No" to the regex, parsing the input hangs forever:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust N

If I add something like "any character":

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust .

It works, but as soon as I add a fixed character, the regex hangs again:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust ..:

Can anyone advise why adding something so trivial would cause it to fall over? Can I enable some kind of tracing to watch the matching activity to see if it is getting stuck in a catastrophic backtrack?

Matt asked Jun 04 '10 13:06


1 Answer

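With RegexOptions.IgnorePatternWhitespace, you're telling the engine to ignore whitespace in your pattern. Thus, when you write Cust No in the pattern, it really means CustNo, which doesn't match the input. This is the cause of the problem: since the pattern can no longer match, the engine most likely has to backtrack through every combination of the preceding lazy .*? and \W+ quantifiers before it can give up, and that looks like a hang.

It also explains your other tests. Cust . is read as Cust., where the dot matches the space, so it works. Cust ..: is read as Cust..:, which fails because the two dots consume " N" and the next input character is "o", not ":".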

From the documentation:

By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]

The RegexOptions.IgnorePatternWhitespace option, or the x inline option, changes this default behavior as follows:

  • Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as \s or "\ ").
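
The rule quoted above can be checked with a small C# sketch. The class and method names here are made up for illustration; only Regex.IsMatch and RegexOptions.IgnorePatternWhitespace come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternWhitespaceDemo
{
    // Hypothetical helper: tests a pattern with IgnorePatternWhitespace switched on.
    public static bool IsMatchIgnoringPatternWhitespace(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        const string input = "Cust No: 010000";

        // The unescaped space is dropped, so this pattern behaves like "CustNo".
        Console.WriteLine(IsMatchIgnoringPatternWhitespace(input, "Cust No"));    // False

        // An escaped space, \s, or a space inside a character class is kept.
        Console.WriteLine(IsMatchIgnoringPatternWhitespace(input, @"Cust\ No"));  // True
        Console.WriteLine(IsMatchIgnoringPatternWhitespace(input, @"Cust\sNo"));  // True
        Console.WriteLine(IsMatchIgnoringPatternWhitespace(input, "Cust[ ]No"));  // True
    }
}
```

Note that \s is looser than an escaped space: it also accepts tabs and line breaks.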

So instead of Cust No, in IgnorePatternWhitespace mode, you must write Cust\ No, because otherwise it's interpreted as CustNo.
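
As a rough sketch of the fix applied to the pattern from the question, the example below runs it against a shortened copy of the invoice. Two things are my own additions rather than part of the question: the class and member names are hypothetical, and the regex is built with a match timeout (the Regex constructor overload taking a TimeSpan, available from .NET Framework 4.5), so a pattern that backtracks excessively raises RegexMatchTimeoutException instead of hanging. RegexOptions.Compiled is left out for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoicePatternDemo
{
    // Shortened version of the invoice text from the question.
    public static readonly string Invoice = string.Join("\n", new[]
    {
        "   *** INVOICE ***   ",
        "",
        "   THE BIKE SHOP   ",
        "",
        " To: COUNTER SALE          No:  243529 Page: 1",
        "",
        "                           Date: 04/06/10 12:00",
        "",
        "                           Ref:    Aiden   ",
        "",
        " Cust No: 010000"
    });

    // The question's pattern, with the space in "Cust No" escaped.
    public const string FixedPattern = @"
        \W+INVOICE\W+
        (?<shopAddr>.*?)\W+
        To:\W+(?<custAddr>.*?)\W+
        No:\W+(?<invNo>\d+).*?
        Date:\W+(?<invDate>[0-9/ :]+)\W+
        Ref:\W+(?<ref>[\w ]*?)\W+
        Cust\ No";

    // Returns the match result, or null if the timeout elapsed first.
    public static Match TryMatch(string pattern, TimeSpan timeout)
    {
        var regex = new Regex(
            pattern,
            RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace,
            timeout);
        try
        {
            return regex.Match(Invoice);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    public static void Main()
    {
        Match m = TryMatch(FixedPattern, TimeSpan.FromSeconds(2));
        if (m != null && m.Success)
        {
            Console.WriteLine("invNo = " + m.Groups["invNo"].Value);
            Console.WriteLine("ref   = " + m.Groups["ref"].Value);
        }
        else
        {
            Console.WriteLine("No match, or the timeout elapsed.");
        }
    }
}
```

A null result from TryMatch is a reasonable hint that the pattern is stuck backtracking, which is cheaper than tracing the engine step by step.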

polygenelubricants answered Sep 27 '22 01:09