We have a Windows Service, written in .NET 4.0, that parses large text files made up of lines of comma-separated values (several million lines, each with between 5 and 10 values). Reading is not a problem: we read each line, split it into a key/value collection, and process the values. To validate the values we use data parallelism to pass the values, which are basically an array of strings in specific formats, to a method that performs Regex validation on each individual value.
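As a rough sketch of the setup described above (all class, method, and pattern names here are illustrative, not the actual service code), the per-line validation might look something like this, with a single compiled Regex instance shared across all worker threads:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class LineValidator
{
    // One compiled instance, created once and reused everywhere.
    // Regex instance methods such as IsMatch are thread-safe.
    private static readonly Regex ClientIdentityRegex =
        new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

    public static bool[] ValidateLines(string[] lines)
    {
        var results = new bool[lines.Length];

        // Data parallelism over the lines: each line is split into its
        // comma-separated values and every value is validated.
        Parallel.For(0, lines.Length, i =>
        {
            string[] values = lines[i].Split(',');
            bool allValid = true;
            foreach (string value in values)
                allValid &= ClientIdentityRegex.IsMatch(value);
            results[i] = allValid;
        });

        return results;
    }
}
```

In the real service each position in a line presumably maps to a different expression; the single pattern here just keeps the sketch short.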
Up until now we have used static regular expressions: not the static Regex.IsMatch method, but a static Regex property constructed with RegexOptions.Compiled, as detailed below.
private static Regex clientIdentityRegEx = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);
Using this approach we had a fairly stable memory footprint: memory increased only marginally with the number of values per line, and processing time was roughly linear in the total number of lines.
To allow the regular expressions to be used in other projects, targeting varying framework versions, we recently moved the static Regex properties into a common utilities project that is now compiled against the .NET 2.0 CLR (the actual regular expressions have not changed). The number of Regex properties exposed has grown from 25 or so to about 60. Since making this change we have started running into memory issues: memory usage is three or more times that of the original project. When we profile the running service, the memory appears to be "leaking" from Regex.IsMatch, not from any specific Regex but from various ones depending on which are called.
I found the following comment on an old MSDN blog post from one of the BCL team, relating to the .NET 1.0/1.1 Regex.
There are even more costs for compilation that should be mentioned, however. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back. In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey. But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.
I will add that we have profiled "most" of the common Regex calls and cannot replicate the issue individually.
Is this a known issue with the .NET 2.0 CLR?
In the article the writer states "But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly". What is likely to be a reasonable finite number of expressions used in this manner, and is exceeding it likely to be a cause?
Update: In line with the answer from @Henk Holterman, are there any best practices for benchmarking regular expressions, specifically Regex.IsMatch, other than using sheer brute force by volume and parameter format?
Answer: Henk's answer of "The scenario calls for a limited, fixed number of RegEx objects" was pretty much spot on. We added the static Regex properties to the class one at a time until we isolated the expressions with a notable increase in memory usage; these were migrated to separate static classes, which seems to have solved some of the memory issues.
It appears, although I cannot confirm this, that there is a difference in compiled Regex behaviour between the .NET 2.0 CLR and the .NET 4.0 CLR, as the memory issues do not occur when the solution is compiled solely for the .NET 4.0 framework. (Any confirmations?)
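A sketch of the isolation described above (class names and the second pattern are illustrative, not the actual utilities code). One plausible reason this helps is that a class's static initializer runs lazily, on first use of that class, so expressions moved into their own static class only pay the Reflection.Emit cost of RegexOptions.Compiled if that class is actually touched, instead of whenever the single shared utilities class is loaded:

```csharp
using System.Text.RegularExpressions;

// Frequently used expressions stay in the common class.
static class CommonPatterns
{
    public static readonly Regex ClientIdentity =
        new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);
}

// Expressions that showed the memory growth live in their own class, so
// their static initializer (and the IL-emission cost of Compiled) only
// runs if this class is actually used by the calling project.
static class RarelyUsedPatterns
{
    public static readonly Regex SortCode =
        new Regex("^[0-9]{2}-[0-9]{2}-[0-9]{2}$", RegexOptions.Compiled);
}
```

A project that never references RarelyUsedPatterns never triggers its initializer, which matches the observation that memory grew with which expressions were called.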
The scenario calls for a limited, fixed number of RegEx objects. That shouldn't leak. You should verify that in the new situation the RegEx objects are still being reused.
The other possibility is the increased number of expressions (60, up from 25). Could just one of them be a little more complex, leading to excessive backtracking?
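To make the backtracking point concrete, here is a classic pathological pattern (not one of the expressions from the question): nested quantifiers such as (a+)+ force the engine to try exponentially many ways to partition the input before it can conclude there is no match.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // On a string of 'a's followed by 'b', this pattern backtracks
        // through roughly 2^n partitions of the 'a's before failing.
        var pathological = new Regex("^(a+)+$");

        string input = new string('a', 22) + "b";

        var sw = Stopwatch.StartNew();
        bool matched = pathological.IsMatch(input);   // false, but slow
        sw.Stop();

        Console.WriteLine($"Matched: {matched} in {sw.ElapsedMilliseconds} ms");
    }
}
```

Each extra 'a' roughly doubles the failure time, so a pattern like this can look fine in unit tests and still stall on production data. Note that .NET 4.0 has no built-in guard against this; the Regex match-timeout overloads were only added in .NET 4.5.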