
regex performance degrades

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I'm assuming that is not my problem, because my regexes start out fast and then slow down.

For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it has slowed down to the point where it's taking about 600ms. Does anyone know why?
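
For reference, here's roughly how I'm timing it (a simplified sketch; RunAllRegexes is a placeholder for my real matching code):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    static class Timing
    {
        // Placeholder for the actual ~10 regular expressions.
        static void RunAllRegexes(String s) { /* ... */ }

        // Measures elapsed time per 1000-string batch.
        public static void TimeBatches(IList<String> strings)
        {
            var stopwatch = new Stopwatch();
            for (int i = 0; i < strings.Count; i += 1000)
            {
                stopwatch.Restart();
                foreach (var s in strings.Skip(i).Take(1000))
                    RunAllRegexes(s);
                stopwatch.Stop();
                Console.WriteLine("Batch {0}: {1}ms", i / 1000, stopwatch.ElapsedMilliseconds);
            }
        }
    }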

It was worse, but I improved it by using instances of Regex instead of the static (cached) methods, and by compiling the expressions that I could.
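
Concretely, that change looked something like this (the pattern here is illustrative):

    using System.Text.RegularExpressions;

    static class Matchers
    {
        // One instance, compiled to IL once and reused across all of
        // the input strings, instead of the static
        // Regex.IsMatch(input, pattern) call that re-resolves the
        // pattern from the internal cache on every invocation.
        public static readonly Regex SaidRegex =
            new Regex(@"mike said (\w*)", RegexOptions.Compiled);
    }

    // Usage: Matchers.SaidRegex.IsMatch(line)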

Some of my regexes need to vary, e.g. depending on the user's name it might be mike said (\w*) or john said (\w*)

My understanding is that it is not possible to compile those regexes and pass in parameters (e.g. saidRegex.Match(inputString, userName)).
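
For illustration, this is the shape of what I do instead: build and cache one compiled Regex per user name (the dictionary cache here is just a sketch):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class UserRegexes
    {
        // One compiled Regex per user, built lazily and cached.
        private static readonly Dictionary<String, Regex> _saidRegexes =
            new Dictionary<String, Regex>();

        public static Regex GetSaidRegex(String userName)
        {
            Regex r;
            if (!_saidRegexes.TryGetValue(userName, out r))
            {
                // Regex.Escape guards against metacharacters in names.
                r = new Regex(Regex.Escape(userName) + @" said (\w*)",
                              RegexOptions.Compiled);
                _saidRegexes[userName] = r;
            }
            return r;
        }
    }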

Does anyone have any suggestions?

[Edited to accurately reflect speed - was per 1000 strings, not per string]

asked Feb 11 '13 by mike1952

People also ask

Is regex bad for performance?

In general, the longer regex is often the better regex: good regular expressions tend to be longer than bad ones because they use specific characters and character classes and have more structure. This makes good regular expressions run faster, because they predict their input more accurately.
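
A small illustration of that idea (both patterns are made up):

    using System.Text.RegularExpressions;

    // Loose: little structure, so the engine must try many positions
    // before it can reject a non-matching line.
    Regex loose = new Regex(@".*said\s*(\w*)");

    // Tighter: anchors, explicit alternatives, and bounded repetition
    // let the engine fail fast on input that cannot match.
    Regex tight = new Regex(@"^(mike|john) said (\w{1,20})$");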

Why is regex so slow?

The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don't need a runtime RE parser engine, since in many common use cases the pattern is already known at compile time.
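
That answer is about C++'s std::regex, but the same trade-off exists in C#: pay the parsing/compilation cost once rather than per match. On modern .NET (7+), the [GeneratedRegex] source generator even moves it to build time; a minimal sketch:

    using System.Text.RegularExpressions;

    public static partial class Patterns
    {
        // The source generator emits a specialized matcher at build
        // time, so no pattern parsing happens at runtime.
        [GeneratedRegex(@"(\w+) said (\w+)")]
        public static partial Regex SaidRegex();
    }

    // Usage: Patterns.SaidRegex().IsMatch(line)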

Is there anything faster than regex?

String operations will always be faster than regular expression operations, unless, of course, you write the string operations in an inefficient way. A regular expression has to be parsed, and code generated from it, before it can perform what amounts to the same string operations.
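
One practical way to exploit that (an illustrative sketch, not from the question): gate the regex behind a cheap string test so that most lines never reach it:

    using System.Text.RegularExpressions;

    static class Prefilter
    {
        private static readonly Regex SaidRegex =
            new Regex(@"(\w+) said (\w+)", RegexOptions.Compiled);

        public static Match MatchSaid(String line)
        {
            // Cheap substring check first; the regex only runs on
            // lines that can possibly match.
            if (!line.Contains(" said "))
                return Match.Empty;
            return SaidRegex.Match(line);
        }
    }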

Is regex resource intensive?

Regular expression processors can be resource intensive, i.e. they can cause high CPU usage. This can cause the File Reader to do more work than necessary. Limit the scope, and change an unbounded match to a range instead: {0,10} will look for between 0 and 10 characters.
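
Applied to the patterns in this question, that advice would look like (illustrative):

    using System.Text.RegularExpressions;

    // Unbounded: \w* can consume arbitrarily long runs of word
    // characters before the engine gives up.
    Regex unbounded = new Regex(@"mike said (\w*)");

    // Bounded: assuming no captured word exceeds 20 characters,
    // the quantifier caps how far the engine will scan.
    Regex bounded = new Regex(@"mike said (\w{0,20})");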


1 Answer

This may not be a direct answer to your question about Regex performance degradation, which is somewhat fascinating. However, after reading all of the commentary and discussion above, I'd suggest the following:

Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:

Player_Name | Monetary_Value

If you were to create a database table containing these values per row, then catch each new row as it is created, parse it, and append it to the table, you could easily run any kind of analysis or calculation against the data without having to parse the 25M rows again and again (which is a waste).

Additionally, on the first run, if you were to break the 25M records down into 100,000-record blocks and run the algorithm 250 times (100,000 x 250 = 25,000,000), you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.

In other words - consider the following:

  1. Create a database table as follows:

    CREATE TABLE PlayerActions (
        RowID          INT PRIMARY KEY IDENTITY,
        Player_Name    VARCHAR(50) NOT NULL,
        Monetary_Value MONEY       NOT NULL
    )
    
  2. Create an algorithm that breaks your 25m rows down into 100k chunks. The example below assumes LINQ / EF5.

    public void ParseFullDataSet(IEnumerable<String> dataSource) {
        var rowCount = dataSource.Count();

        // Integer division gives the number of full 100k chunks;
        // add one more chunk for any partial remainder.
        var setCount = rowCount / 100000;
        if (rowCount % 100000 != 0)
            setCount++;

        for (int i = 0; i < setCount; i++) {
            var set = dataSource.Skip(i * 100000).Take(100000);
            ParseSet(set);
        }
    }
    
    public void ParseSet(IEnumerable<String> dataSource) {
        // Assume here that RegexFactory.Generate() reflects your
        // RegEx generator, and that 'db' is your EF5 data context.
        String pattern = RegexFactory.Generate();

        // Compile the expression once per chunk instead of
        // re-resolving it for every input string.
        Regex regex = new Regex(pattern, RegexOptions.Compiled);

        foreach (String data in dataSource) {
            Match match = regex.Match(data);
            if (match.Success) {
                String playerName = match.Groups[1].Value;

                // Might want to add error handling here.
                decimal monetaryValue = Convert.ToDecimal(match.Groups[2].Value);

                db.PlayerActions.Add(new PlayerAction() {
                    // ID = ..., // Set at DB layer using Auto_Increment
                    Player_Name = playerName,
                    Monetary_Value = monetaryValue
                });

                // If not using Entity Framework, use another method
                // to insert a row into your database table.
            }
        }

        // Save once per chunk rather than once per row; per-row
        // SaveChanges() calls would dominate the runtime.
        db.SaveChanges();
    }
    
  3. Run the above one time to get all of your pre-existing data loaded up.

  4. Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:

    ParseSet(new List<String>() { newValue });
    

    or if multiples are created at once, call:

    ParseSet(newValues); // Where newValues is an IEnumerable<String>
    

Now you can do whatever computational analysis or data mining you want against the data, without having to worry about parsing the 25m rows on the fly.
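
For instance, a per-player total then becomes a simple query (a sketch against the hypothetical EF context from step 2):

    // Total monetary value per player, computed by the database
    // rather than by re-parsing the raw strings.
    var totals = db.PlayerActions
        .GroupBy(a => a.Player_Name)
        .Select(g => new {
            Player = g.Key,
            Total = g.Sum(a => a.Monetary_Value)
        })
        .ToList();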

answered Oct 05 '22 by Troy Alford