
regex performance degrades

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I'm assuming that is not my problem, because my regexes start out fast and then slow down.

For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it has slowed down to the point where it's taking about 600ms. Does anyone know why?
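
For reference, here's roughly how I'm timing it (a simplified sketch; RunAllRegexes is a placeholder for my real matching code):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    static class Timing
    {
        // Placeholder for the actual ~10 regular expressions.
        static void RunAllRegexes(String s) { /* ... */ }

        // Measures elapsed time per 1000-string batch.
        public static void TimeBatches(IList<String> strings)
        {
            var stopwatch = new Stopwatch();
            for (int i = 0; i < strings.Count; i += 1000)
            {
                stopwatch.Restart();
                foreach (var s in strings.Skip(i).Take(1000))
                    RunAllRegexes(s);
                stopwatch.Stop();
                Console.WriteLine("Batch {0}: {1}ms", i / 1000, stopwatch.ElapsedMilliseconds);
            }
        }
    }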

It was worse, but I improved it by using instances of Regex instead of the static (cached) methods, and by compiling the expressions that I could.
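
Concretely, that change looked something like this (the pattern here is illustrative):

    using System.Text.RegularExpressions;

    static class Matchers
    {
        // One instance, compiled to IL once and reused across all of
        // the input strings, instead of the static
        // Regex.IsMatch(input, pattern) call that re-resolves the
        // pattern from the internal cache on every invocation.
        public static readonly Regex SaidRegex =
            new Regex(@"mike said (\w*)", RegexOptions.Compiled);
    }

    // Usage: Matchers.SaidRegex.IsMatch(line)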

Some of my regexes need to vary, e.g. depending on the user's name it might be mike said (\w*) or john said (\w*)

My understanding is that it is not possible to compile those regexes and pass in parameters (e.g. saidRegex.Match(inputString, userName)).
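
For illustration, this is the shape of what I do instead: build and cache one compiled Regex per user name (the dictionary cache here is just a sketch):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class UserRegexes
    {
        // One compiled Regex per user, built lazily and cached.
        private static readonly Dictionary<String, Regex> _saidRegexes =
            new Dictionary<String, Regex>();

        public static Regex GetSaidRegex(String userName)
        {
            Regex r;
            if (!_saidRegexes.TryGetValue(userName, out r))
            {
                // Regex.Escape guards against metacharacters in names.
                r = new Regex(Regex.Escape(userName) + @" said (\w*)",
                              RegexOptions.Compiled);
                _saidRegexes[userName] = r;
            }
            return r;
        }
    }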

Does anyone have any suggestions?

[Edited to accurately reflect speed - was per 1000 strings, not per string]

asked Feb 11 '13 by mike1952

People also ask

Is regex bad for performance?

In general, the longer regex is often the better regex: good regular expressions tend to be longer than bad ones because they use specific characters and character classes and have more structure. This makes good regular expressions run faster, because they predict their input more accurately.
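
A small illustration of that idea (both patterns are made up):

    using System.Text.RegularExpressions;

    // Loose: little structure, so the engine must try many positions
    // before it can reject a non-matching line.
    Regex loose = new Regex(@".*said\s*(\w*)");

    // Tighter: anchors, explicit alternatives, and bounded repetition
    // let the engine fail fast on input that cannot match.
    Regex tight = new Regex(@"^(mike|john) said (\w{1,20})$");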

Why is regex so slow?

The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don't need a runtime RE parser engine, since in many common use cases the pattern is already known at compile time.
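
That answer is about C++'s std::regex, but the same trade-off exists in C#: pay the parsing/compilation cost once rather than per match. On modern .NET (7+), the [GeneratedRegex] source generator even moves it to build time; a minimal sketch:

    using System.Text.RegularExpressions;

    public static partial class Patterns
    {
        // The source generator emits a specialized matcher at build
        // time, so no pattern parsing happens at runtime.
        [GeneratedRegex(@"(\w+) said (\w+)")]
        public static partial Regex SaidRegex();
    }

    // Usage: Patterns.SaidRegex().IsMatch(line)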

Is there anything faster than regex?

String operations will always be faster than regular expression operations, unless, of course, you write the string operations in an inefficient way. A regular expression has to be parsed, and code generated from it, before it can perform what amounts to the same string operations.
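
One practical way to exploit that (an illustrative sketch, not from the question): gate the regex behind a cheap string test so that most lines never reach it:

    using System.Text.RegularExpressions;

    static class Prefilter
    {
        private static readonly Regex SaidRegex =
            new Regex(@"(\w+) said (\w+)", RegexOptions.Compiled);

        public static Match MatchSaid(String line)
        {
            // Cheap substring check first; the regex only runs on
            // lines that can possibly match.
            if (!line.Contains(" said "))
                return Match.Empty;
            return SaidRegex.Match(line);
        }
    }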

Is regex resource intensive?

Regular expression processors can be resource intensive, i.e. they can cause high CPU usage. This can cause the File Reader to do more work than necessary. Limit the scope, and change an unbounded match to a range instead: {0,10} will look for between 0 and 10 characters.
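
Applied to the patterns in this question, that advice would look like (illustrative):

    using System.Text.RegularExpressions;

    // Unbounded: \w* can consume arbitrarily long runs of word
    // characters before the engine gives up.
    Regex unbounded = new Regex(@"mike said (\w*)");

    // Bounded: assuming no captured word exceeds 20 characters,
    // the quantifier caps how far the engine will scan.
    Regex bounded = new Regex(@"mike said (\w{0,20})");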


1 Answer

This may not be a direct answer to your question about Regex performance degradation, which is somewhat fascinating. However, after reading all of the commentary and discussion above, I'd suggest the following:

Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:

Player_Name | Monetary_Value

If you were to create a database table containing these values per row, then catch each new row as it is created, parse it, and append it to the table, you could easily run any kind of analysis or calculation against the data without having to parse the 25M rows again and again (which is a waste).

Additionally, on the first run, if you were to break the 25M records down into 100,000-record blocks and run the algorithm 250 times (100,000 x 250 = 25,000,000), you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.

In other words - consider the following:

  1. Create a database table as follows:

    CREATE TABLE PlayerActions (
        RowID          INT PRIMARY KEY IDENTITY,
        Player_Name    VARCHAR(50) NOT NULL,
        Monetary_Value MONEY       NOT NULL
    )
    
  2. Create an algorithm that breaks your 25m rows down into 100k chunks. The example below assumes LINQ / EF5.

    public void ParseFullDataSet(IEnumerable<String> dataSource) {
        var rowCount = dataSource.Count();

        // Integer division gives the number of full 100k chunks;
        // add one more chunk for any partial remainder.
        var setCount = rowCount / 100000;
        if (rowCount % 100000 != 0)
            setCount++;

        for (int i = 0; i < setCount; i++) {
            var set = dataSource.Skip(i * 100000).Take(100000);
            ParseSet(set);
        }
    }
    
    public void ParseSet(IEnumerable<String> dataSource) {
        // Assume here that RegexFactory.Generate() reflects your
        // RegEx generator, and that 'db' is your EF5 data context.
        String pattern = RegexFactory.Generate();

        // Compile the expression once per chunk instead of
        // re-resolving it for every input string.
        Regex regex = new Regex(pattern, RegexOptions.Compiled);

        foreach (String data in dataSource) {
            Match match = regex.Match(data);
            if (match.Success) {
                String playerName = match.Groups[1].Value;

                // Might want to add error handling here.
                decimal monetaryValue = Convert.ToDecimal(match.Groups[2].Value);

                db.PlayerActions.Add(new PlayerAction() {
                    // ID = ..., // Set at DB layer using Auto_Increment
                    Player_Name = playerName,
                    Monetary_Value = monetaryValue
                });

                // If not using Entity Framework, use another method
                // to insert a row into your database table.
            }
        }

        // Save once per chunk rather than once per row; per-row
        // SaveChanges() calls would dominate the runtime.
        db.SaveChanges();
    }
    
  3. Run the above one time to get all of your pre-existing data loaded up.

  4. Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:

    ParseSet(new List<String>() { newValue });
    

    or if multiples are created at once, call:

    ParseSet(newValues); // Where newValues is an IEnumerable<String>
    

Now you can do whatever computational analysis or data mining you want against the data, without having to worry about parsing the 25m rows on the fly.
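
For instance, a per-player total then becomes a simple query (a sketch against the hypothetical EF context from step 2):

    // Total monetary value per player, computed by the database
    // rather than by re-parsing the raw strings.
    var totals = db.PlayerActions
        .GroupBy(a => a.Player_Name)
        .Select(g => new {
            Player = g.Key,
            Total = g.Sum(a => a.Monetary_Value)
        })
        .ToList();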

answered Oct 05 '22 by Troy Alford