
Dynamically built regular expressions are running extremely slowly!

I'm generating regular expressions dynamically by walking an XML structure and building up the pattern from its node types. Each regular expression is part of a Layout type that I defined. I then parse a text file in which every line begins with an ID that points to a specific layout, and I try to match the data in that row against that layout's regex.

Sounds fine and dandy, right? The only problem is that matching is extremely slow. I have the expressions set to compiled to try to speed things up, but to no avail. What is baffling is that these expressions aren't that complex. I am by no means a regex guru, but I know enough about them to get things working.

Here is the code that generates the expressions...

StringBuilder sb = new StringBuilder();
//get layout id and memberkey in there...
sb.Append(@"^([0-9]+)[ \t]{1,2}([0-9]+)"); 
foreach (ColumnDef c in columns)
{
    sb.Append(@"[ \t]{1,2}");
    switch (c.Variable.PrimType)
    {
        case PrimitiveType.BIT:
            sb.Append("(0|1)");
            break;
        case PrimitiveType.DATE:
            sb.Append(@"([0-9]{2}/[0-9]{2}/[0-9]{4})");
            break;
        case PrimitiveType.FLOAT:
            sb.Append(@"([-+]?[0-9]*\.?[0-9]+)");
            break;
        case PrimitiveType.INTEGER:
            sb.Append(@"([0-9]+)");
            break;
        case PrimitiveType.STRING:
            sb.Append(@"([a-zA-Z0-9]*)");
            break;
    }
}
sb.Append("$");
_pattern = new Regex(sb.ToString(), RegexOptions.Compiled);

The actual slow part...

public System.Text.RegularExpressions.Match Match(string input)
{
    if (input == null)
        throw new ArgumentNullException("input");

    return _pattern.Match(input);
}

A typical "_pattern" may have about 40-50 columns, so I'll spare you the entire pattern. I wrap each column in a capturing group so that I can enumerate the groups in the Match object later on.

Any tips or modifications that could drastically help? Or is this slowness to be expected?

EDIT FOR CLARITY: Sorry, I don't think I was clear enough the first time around.

I use an XML file to generate a regex for each layout. I then run through a file for a data import. I need to make sure that each line in the file matches the pattern it says it's supposed to match. So each pattern could be checked many times, possibly thousands.

Nicholas Mancuso asked Apr 29 '09 19:04


2 Answers

You are parsing a 50-column CSV file (that uses tabs) with regex?

You should just remove duplicate tabs, then split the text on \t. Now you have all of your columns in an array. You can use your ColumnDef object collection to tell you what each column is.

Edit: Once you have things split up, you could optionally use regex to verify each value. This should be much faster than using the giant single regex.

Edit2: You also get the additional benefit of knowing exactly which column is badly formatted, so you can produce an error like "Syntax error in column 30 on line 12, expected date format."
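
A minimal sketch of the split-then-validate idea, assuming the row layout from the question (layout id, member key, then the data columns). The `RowValidator` class, its `Validate` method, and the error message format are hypothetical names made up for illustration; the `PrimitiveType` enum here is a stand-in for the asker's own type, and the per-type patterns are copied from the question's sub-patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stand-in for the question's PrimitiveType enum (assumed shape).
public enum PrimitiveType { BIT, DATE, FLOAT, INTEGER, STRING }

// Hypothetical helper: validates one row by splitting it into fields first.
public static class RowValidator
{
    // One small anchored pattern per type, mirroring the question's sub-patterns.
    private static readonly Dictionary<PrimitiveType, Regex> Validators =
        new Dictionary<PrimitiveType, Regex>
        {
            { PrimitiveType.BIT,     new Regex(@"^[01]$", RegexOptions.Compiled) },
            { PrimitiveType.DATE,    new Regex(@"^[0-9]{2}/[0-9]{2}/[0-9]{4}$", RegexOptions.Compiled) },
            { PrimitiveType.FLOAT,   new Regex(@"^[-+]?[0-9]*\.?[0-9]+$", RegexOptions.Compiled) },
            { PrimitiveType.INTEGER, new Regex(@"^[0-9]+$", RegexOptions.Compiled) },
            { PrimitiveType.STRING,  new Regex(@"^[a-zA-Z0-9]*$", RegexOptions.Compiled) },
        };

    private static readonly char[] Separators = { ' ', '\t' };

    // Returns null when the row is valid, otherwise a message naming the bad column.
    public static string Validate(string line, IList<PrimitiveType> columnTypes, int lineNumber)
    {
        if (line == null) throw new ArgumentNullException("line");

        // Runs of spaces/tabs count as a single delimiter.
        string[] fields = line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        int expected = columnTypes.Count + 2; // layout id + member key + data columns
        if (fields.Length != expected)
            return string.Format("Line {0}: expected {1} fields, found {2}.",
                                 lineNumber, expected, fields.Length);

        for (int i = 0; i < fields.Length; i++)
        {
            // The first two fields are the layout id and member key, both integers.
            PrimitiveType type = i < 2 ? PrimitiveType.INTEGER : columnTypes[i - 2];
            if (!Validators[type].IsMatch(fields[i]))
                return string.Format("Line {0}, column {1}: expected {2}, found \"{3}\".",
                                     lineNumber, i + 1, type, fields[i]);
        }
        return null;
    }
}
```

One design caveat: because runs of separators are collapsed, an empty STRING column cannot be represented this way, whereas the original pattern's `[a-zA-Z0-9]*` would accept one. If empty columns are legal in the data, the split step would need to change accordingly.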

JasonMArcher answered Nov 02 '22 23:11


Some performance thoughts:

  • use [01] instead of (0|1)
  • use non-capturing groups (?:expr) instead of capturing groups (if you really need grouping)
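
As a rough illustration of these two tweaks, the sketch below applies them to a two-field version of the question's pattern. The `BitPatterns` class and its field names are hypothetical; no speedup is claimed or measured here, the point is only that both patterns accept the same input while the tweaked one captures less.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container comparing the original style with the suggested tweaks.
public static class BitPatterns
{
    // Original style: every field captured, bit written as an alternation.
    public static readonly Regex Original =
        new Regex(@"^([0-9]+)[ \t]{1,2}(0|1)$", RegexOptions.Compiled);

    // Tweaked: the id is only validated, so its group is non-capturing;
    // the bit uses a character class instead of an alternation.
    public static readonly Regex Tweaked =
        new Regex(@"^(?:[0-9]+)[ \t]{1,2}([01])$", RegexOptions.Compiled);
}
```

With the tweaked pattern, `Groups[1]` of a successful match holds the bit value, because the id group no longer takes a capture slot. Any code that enumerates groups by index would have to be adjusted to match.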

Edit: Since your values appear to be separated by whitespace, why don't you split the line there?

Gumbo answered Nov 02 '22 23:11