
C# Multiple Regex Replaces on String - Too Much Memory

Basically what I would like to do is run multiple (15-25) regex replaces on a single string with the best possible memory management.

Overview: The program streams a text-only file (sometimes HTML) via FTP, appending to a StringBuilder to build one very large string. The file size ranges from 300KB to 30MB.

The regular expressions are semi-complex and need to see multiple lines of the file at once (identifying sections of a book, for example), so arbitrarily breaking the string, or running the replaces on every download loop, is out of the question.

A sample replace:

Regex re = new Regex("<A.*?>Table of Contents</A>", RegexOptions.IgnoreCase);
source = re.Replace(source, "");

With each run of a replace the memory skyrockets. I know this is because strings are immutable in C# and each replace has to make a copy; even calling GC.Collect() doesn't help enough for a 30MB file.

Any advice on a better approach, or on a way to perform multiple regex replaces in constant memory (make 2 copies, so 60MB in memory, perform the search, then discard one copy to get back to 30MB)?

Update:

There does not appear to be a simple answer, but for future people looking at this: I ended up using a combination of all the answers below to get it to an acceptable state:

  1. If possible, split the string into chunks; see manojlds's answer for a way to do that as the file is being read, looking for suitable end points.

  2. If you can't split as it streams, at least split it later if possible; see ChrisWue's answer for some external tools that may help with this by piping to files.

  3. Optimize the regexes: avoid greedy operators and limit what the engine has to do as much as possible; see Sylverdrag's answer.

  4. Combine the regexes when possible; this cuts down the number of replaces when the regexes do not depend on each other (useful in this case for cleaning bad input); see Brian Reichle's answer for a code sample.
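As an illustration of point 3, here is a minimal sketch of one way the sample replace from the question could be tightened. It is not taken from any of the answers: the class name TocLinkStripper and the method Strip are hypothetical, and the pattern is an assumption about what the input looks like. Replacing `.*?` with `[^>]*` keeps the engine inside a single opening tag, which limits how far it scans from each `<A`. Note that this changes behaviour slightly: the original pattern can start matching at an earlier `<A` on the same line, and `[^>]` also matches newlines, which `.` does not by default.

```csharp
using System.Text.RegularExpressions;

public static class TocLinkStripper
{
    // [^>]* cannot run past the '>' that closes the opening tag, so a failed
    // attempt is abandoned after one tag instead of after the rest of the line.
    // \b keeps the pattern from matching other tags that start with "A" (e.g. <ABBR>).
    // The Regex is built once and reused for every call.
    private static readonly Regex TocLink = new Regex(
        @"<A\b[^>]*>Table of Contents</A>",
        RegexOptions.IgnoreCase);

    // Returns source with every "Table of Contents" anchor removed.
    public static string Strip(string source)
    {
        return TocLink.Replace(source, "");
    }
}
```

Usage would be `source = TocLinkStripper.Strip(source);`. Each call still allocates a new string, so this reduces regex work rather than memory by itself.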

Thank you all!

WSkid asked Apr 16 '11 03:04



2 Answers

Depending on the nature of the regexes, you might be able to combine them into a single regular expression and use the overload of Replace() that takes a MatchEvaluator delegate to determine the replacement from the matched string.

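// One pass over the string instead of one pass per pattern.
Regex re = new Regex("First Pattern|Second Pattern|Super(Mega)*Delux", RegexOptions.IgnoreCase);

source = re.Replace(source, delegate(Match m)
{
    string value = m.Value;

    // Pick the replacement based on which alternative matched.
    if (value.Equals("first pattern", StringComparison.OrdinalIgnoreCase))
    {
        return "1st";
    }
    else if (value.Equals("second pattern", StringComparison.OrdinalIgnoreCase))
    {
        return "2nd";
    }
    else
    {
        // Anything else (the Super...Delux alternative) is removed.
        return "";
    }
});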

Of course, this falls apart if later patterns need to match on the result of earlier replacements.
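Comparing m.Value against literal strings only works while the alternatives are fixed text. A possible variation, sketched here under the assumption that real patterns are more complex, is to wrap each alternative in a named group and check which group succeeded. The group names (first, second, junk) and the class name CombinedReplace are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CombinedReplace
{
    // Each alternative sits in its own named group, so the evaluator can tell
    // which one matched without re-inspecting the matched text.
    private static readonly Regex Combined = new Regex(
        @"(?<first>First Pattern)|(?<second>Second Pattern)|(?<junk>Super(?:Mega)*Delux)",
        RegexOptions.IgnoreCase);

    // Applies all three replacements in a single pass over source.
    public static string Apply(string source)
    {
        return Combined.Replace(source, m =>
        {
            if (m.Groups["first"].Success) return "1st";
            if (m.Groups["second"].Success) return "2nd";
            return ""; // only the "junk" alternative is left; remove it
        });
    }
}
```

With this, `CombinedReplace.Apply("first pattern, SECOND PATTERN, SuperMegaMegaDelux!")` returns `"1st, 2nd, !"`. The same caveat applies: the alternatives are matched against the original text only, never against each other's output.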

Brian Reichle answered Sep 21 '22 20:09


Have a look at this post, which talks about searching a stream with regular expressions rather than first storing everything in a string, which consumes memory:

http://www.developer.com/design/article.php/3719741/Building-a-Regular-Expression-Stream-Search-with-the-NET-Framework.htm
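The following is not the technique from the linked article. It is a simpler sketch of the related idea of processing the file in chunks as it is read, assuming the text contains recognizable boundary lines (chapter headings, for example) that no pattern needs to match across. All names here (ChunkedReplace, Process, isBoundary) are hypothetical. Memory use is bounded by the largest chunk rather than by the whole file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class ChunkedReplace
{
    // Reads input line by line and starts a new chunk whenever isBoundary(line)
    // is true. Each finished chunk has every regex in removals applied to it
    // and is then written to output.
    // Assumption: no pattern needs to match across a boundary line.
    // Line endings are normalized to '\n' because ReadLine strips them.
    public static void Process(TextReader input, TextWriter output,
                               Regex[] removals, Func<string, bool> isBoundary)
    {
        var chunk = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // A boundary line closes the previous chunk and opens the next one.
            if (isBoundary(line) && chunk.Length > 0)
            {
                Flush(chunk, output, removals);
            }
            chunk.Append(line).Append('\n');
        }
        if (chunk.Length > 0)
        {
            Flush(chunk, output, removals); // last chunk has no following boundary
        }
    }

    private static void Flush(StringBuilder chunk, TextWriter output, Regex[] removals)
    {
        string text = chunk.ToString();
        foreach (Regex re in removals)
        {
            text = re.Replace(text, ""); // copies are only chunk-sized
        }
        output.Write(text);
        chunk.Clear();
    }
}
```

In the scenario from the question, input could be a StreamReader over the FTP response stream and output a StreamWriter over a temporary file, so the full 30MB string never has to exist in memory at once.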

manojlds answered Sep 20 '22 20:09