
Regex gets stuck for some records

Sometimes my Regex gets stuck on certain values, although it returns a result for most documents.

I am asking about the scenario in which it gets stuck.

  1- collection = Regex.Matches(document, pattern, RegexOptions.Compiled);
  2- if (collection.Count > 0) // This line
     {

I debugged the solution and tried to inspect the values of collection in the Watch window. For most properties I saw the following result:

Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation.

After that, execution got stuck on the 2nd line.

I can see there is some problem with the regex that sends it into what looks like an endless loop.

Question: I don't get any exception for this. Is there any way to get an exception after a timeout, so my tool can carry on with the rest of the work?

 Regex:      @"""price"">(.|\r|\n)*?pound;(?<data>.*?)</span>"

 Part of Document : </span><span>1</span></a></li>\n\t\t\t\t<li>\n\t\t\t\t\t<span class=\"icon icon_floorplan touchsearch-icon touchsearch-icon-floorplan none\">Floorplans: </span><span>0</span></li>\n\t\t\t\t</ul>\n\t\t</div>\n    </div>\n\t</div>\n<div class=\"details clearfix\">\n\t\t<div class=\"price-new touchsearch-summary-list-item-price\">\r\n\t<a href=\"/commercial-property-for-sale/property-47109002.html\">POA</a></div>\r\n<p class=\"price\">\r\n\t\t\t<span>POA</span>\r\n\t\t\t\t</p>\r\n\t<h2 class=\"address bedrooms\">\r\n\t<a id=\"standardPropertySummary47109002\"
Charlie asked Jun 04 '15 05:06


1 Answer

How do I get an exception for when a Regex search takes unreasonably long?

Please read below on setting a timeout on your regex searches.

MSDN: Regex.MatchTimeout Property

The MatchTimeout property defines the approximate maximum time interval for a Regex instance to execute a single matching operation before the operation times out. The regular expression engine throws a RegexMatchTimeoutException exception during its next timing check after the time-out interval has elapsed. This prevents the regular expression engine from processing input strings that require excessive backtracking. For more information, see Backtracking in Regular Expressions and Best Practices for Regular Expressions in the .NET Framework.

    using System;
    using System.Text.RegularExpressions;

    public class Example
    {
        public static void Main()
        {
            AppDomain domain = AppDomain.CurrentDomain;
            // Set a timeout interval of 2 seconds.
            domain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", TimeSpan.FromSeconds(2));
            Object timeout = domain.GetData("REGEX_DEFAULT_MATCH_TIMEOUT");
            Console.WriteLine("Default regex match timeout: {0}",
                                timeout == null ? "<null>" : timeout);

            // A Regex created without an explicit timeout picks up the default set above.
            Regex rgx = new Regex("[aeiouy]");
            Console.WriteLine("Regular expression pattern: {0}", rgx.ToString());
            Console.WriteLine("Timeout interval for this regex: {0} seconds",
                                rgx.MatchTimeout.TotalSeconds);
        }
    }

    // The example displays the following output: 
    //       Default regex match timeout: 00:00:02 
    //       Regular expression pattern: [aeiouy] 
    //       Timeout interval for this regex: 2 seconds
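The MSDN sample above only sets the application-wide default. As a minimal sketch of the per-instance route (assuming .NET 4.5 or later, where the `Regex(string, RegexOptions, TimeSpan)` constructor is available), the timeout can be passed directly and the `RegexMatchTimeoutException` caught. Note that `Regex.Matches` evaluates lazily, so the work, and therefore the exception, happens when `Count` is read, which is consistent with the hang on line 2 in the question. The names `SafeRegex` and `CountMatches` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Hypothetical helper: returns the number of matches,
    // or null if the regex engine exceeded the given timeout.
    public static int? CountMatches(string document, string pattern, TimeSpan timeout)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled, timeout);
        try
        {
            // Matches() is lazy; reading Count forces the full scan.
            MatchCollection collection = regex.Matches(document);
            return collection.Count;
        }
        catch (RegexMatchTimeoutException)
        {
            // The caller can log the offending document and move on.
            return null;
        }
    }
}
```

A caller could then skip any document for which `CountMatches` returns null and continue with the rest of the batch.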

Why does my Regex get stuck?

First of all, try to optimize your Regex and minimize backtracking where you can. stribizhev commented with an improvement, so kudos to him:

Another thing: your regex is actually equivalent to "price">[\s\S]*?pound;(?<data>.*?)</span> (C# declaration: @"""price"">[\s\S]*?pound;(?<data>.*?)</span>"). It is much better since there is much less backtracking. – stribizhev Jun 4 at 9:23
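A rough sketch of how the character-class variant of the question's pattern might be used to pull out the `data` group. The class name `PriceExtractor`, the method `ExtractPrice`, the 2-second timeout, and the sample inputs in the usage note are all assumptions for illustration, not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PriceExtractor
{
    // Same pattern as in the question, with (.|\r|\n) replaced by [\s\S].
    private const string Pattern = @"""price"">[\s\S]*?pound;(?<data>.*?)</span>";

    // Hypothetical helper: returns the captured price text, or null when nothing matches.
    public static string ExtractPrice(string html)
    {
        // The static overload with a TimeSpan guards against runaway matching.
        Match m = Regex.Match(html, Pattern, RegexOptions.None, TimeSpan.FromSeconds(2));
        return m.Success ? m.Groups["data"].Value : null;
    }
}
```

For a made-up fragment such as `<p class="price"><span>&pound;250,000</span>` this returns `250,000`; for a "POA" listing like the one in the question, which contains no `pound;`, it returns null.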

Secondly, if specific values cause problems, a first step in tracking them down is to handle one match per iteration instead of grabbing all matches with a one-liner.

MSDN: Match.NextMatch Method

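   using System;
   using System.Text.RegularExpressions;

   public class Example
   {
      public static void Main()
      {
         string pattern = "a*";
         string input = "abaabb";

         // Walk the matches one at a time instead of materializing them all.
         Match m = Regex.Match(input, pattern);
         while (m.Success) {
            Console.WriteLine("'{0}' found at index {1}.",
                              m.Value, m.Index);
            m = m.NextMatch();
         }
      }
   }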

To improve performance without changing the pattern, a common approach is to keep your Regex objects in a static class so that each one is instantiated only once, and to pass RegexOptions.Compiled when constructing them (which you have done). (Source)
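One possible shape for such a static holder, as a sketch only: the class name `Patterns`, the field name `Vowels`, and the 2-second timeout are illustrative assumptions, and the pattern is borrowed from the MSDN sample earlier in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Patterns
{
    // Constructed once, when the class is first used, and then reused.
    // RegexOptions.Compiled trades a slower construction for faster matching.
    public static readonly Regex Vowels = new Regex(
        "[aeiouy]", RegexOptions.Compiled, TimeSpan.FromSeconds(2));
}
```

Every caller then shares the same instance, for example `Patterns.Vowels.IsMatch(text)`, instead of constructing the Regex on each call.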

PS. It can be handy to have a reproducible way to deliberately trigger a timeout, i.e. a pattern and input that behave like an infinite loop. I'll share one below.

string pattern = @"/[a-zA-Z0-9]+(\[([^]]*(]"")?)+])?$";
string input = "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
William S answered Nov 10 '22 00:11