Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Has anyone implemented a Regex and/or Xml parser around StringBuilders or Streams?

I'm building a stress-testing client that hammers servers and analyzes responses using as many threads as the client can muster. I'm constantly finding myself throttled by garbage collection (and/or lack thereof), and in most cases, it comes down to strings that I'm instantiating only to pass them off to a Regex or an Xml parsing routine.

If you decompile the Regex class, you'll see that internally, it uses StringBuilders to do nearly everything, but you can't pass it a string builder; it helpfully dives down into private methods before starting to use them, so extension methods aren't going to solve it either. You're in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.

This is not a case of pedantic over-optimization-in-advance. I've looked at the Regex replacements inside a StringBuilder question and others. I've also profiled my app to see where the ceilings are coming from, and using Regex.Replace() now is indeed introducing significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine XML responses for errors and embedded diagnostic codes. I've already gotten rid of just about every other inefficiency that's throttling the throughput, and I've even cut a lot of the Regex overhead out by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences, but it seems to me that someone would have wrapped up a custom StringBuilder (or better yet, Stream) based Regex and Xml parsing utility by now.

Ok, so rant over, but am I going to have to do this myself?

Update: I found a workaround which lowered peak memory consumption from multiple gigabytes to a few hundred megs, so I'm posting it below. I'm not adding it as an answer because a) I generally hate to do that, and b) I still want to find out if someone takes the time to customize StringBuilder to do Regexes (or vice-versa) before I do.

In my case, I could not use XmlReader because the stream I am ingesting contains some invalid binary content in certain elements. In order to parse the XML, I have to empty out those elements. I was previously using a single static compiled Regex instance to do the replace, and this consumed memory like mad (I'm trying to process ~300 10KB docs/sec). The change that drastically reduced consumption was:

  1. I added the code from this StringBuilder Extensions article on CodeProject for the handy IndexOf method.
  2. I added a (very) crude WildcardReplace method that allows one wildcard character (* or ?) per invocation
  3. I replaced the Regex usage with a WildcardReplace() call to empty the contents of the offending elements

This is very unpretty and tested only as far as my own purposes required; I would have made it more elegant and powerful, but YAGNI and all that, and I'm in a hurry. Here's the code:

/// <summary>
/// Performs basic wildcard find and replace on a string builder, observing one of two 
/// wildcard characters: * matches any number of characters, or ? matches a single character.
/// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
/// will cause an exception.
/// All characters in <paramref name="replaceWith"/> are treated as literal parts of 
/// the replacement text.
/// </summary>
/// <param name="find"></param>
/// <param name="replaceWith"></param>
/// <returns></returns>
public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
    if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
        throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
    } 
    // are we matching one character, or any number?
    bool matchOneCharacter = find.Contains("?");
    string[] parts = matchOneCharacter ? 
        find.Split(new char[] { '?' }, StringSplitOptions.RemoveEmptyEntries) 
        : find.Split(new char[] { '*' }, StringSplitOptions.RemoveEmptyEntries);
    int startItemIdx; 
    int endItemIdx;
    int newStartIdx = 0;
    int length;
    while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) > 0 
        && (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) > 0) {
        length = (endItemIdx + parts[1].Length) - startItemIdx;
        newStartIdx = startItemIdx + replaceWith.Length;
        // With "?" wildcard, find parameter length should equal the length of its match:
        if (matchOneCharacter && length > find.Length)
            break;
        sb.Remove(startItemIdx, length);
        sb.Insert(startItemIdx, replaceWith);
    }
    return sb;
}
like image 320
Paul Smith Avatar asked Jul 18 '12 01:07

Paul Smith


3 Answers

Here try this. Everything's char based and relatively low level for efficiency. Any number of your *s or ?s can be used. However, your * is now and your ? is now . Around three days of work went into this to make it as clean as possible. You can even enter multiple queries on one sweep!

Example usage: wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be") results in "become".

////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////// Search for a string/s inside 'text' using the 'find' parameter, and replace with a string/s using the replace parameter
// ✪ represents multiple wildcard characters (non-greedy)
// ★ represents a single wildcard character
public StringBuilder wildcard(StringBuilder text, string find, string replace, bool caseSensitive = false)
{
    return wildcard(text, new string[] { find }, new string[] { replace }, caseSensitive);
}
public StringBuilder wildcard(StringBuilder text, string[] find, string[] replace, bool caseSensitive = false)
{
    if (text.Length == 0) return text;          // Degenerate case

    StringBuilder sb = new StringBuilder();     // The new adjusted string with replacements
    for (int i = 0; i < text.Length; i++)   {   // Go through every letter of the original large text

        bool foundMatch = false;                // Assume match hasn't been found to begin with
        for(int q=0; q< find.Length; q++) {     // Go through each query in turn
            if (find[q].Length == 0) continue;  // Ignore empty queries

            int f = 0;  int g = 0;              // Query cursor and text cursor
            bool multiWild = false;             // multiWild is ✪ symbol which represents many wildcard characters
            int multiWildPosition = 0;          

            while(true) {                       // Loop through query characters
                if (f >= find[q].Length || (i + g) >= text.Length) break;       // Bounds checking
                char cf = find[q][f];                                           // Character in the query (f is the offset)
                char cg = text[i + g];                                          // Character in the text (g is the offset)
                if (!caseSensitive) cg = char.ToLowerInvariant(cg);
                if (cf != '★' && cf != '✪' && cg != cf && !multiWild) break;        // Break search, and thus no match is found
                if (cf == '✪') { multiWild = true; multiWildPosition = f; f++; continue; }              // Multi-char wildcard activated. Move query cursor, and reloop
                if (multiWild && cg != cf && cf != '★') { f = multiWildPosition + 1; g++; continue; }   // Match since MultiWild has failed, so return query cursor to MultiWild position
                f++; g++;                                                           // Reaching here means that a single character was matched, so move both query and text cursor along one
            }

            if (f == find[q].Length) {          // If true, query cursor has reached the end of the query, so a match has been found!!!
                sb.Append(replace[q]);          // Append replacement
                foundMatch = true;
                if (find[q][f - 1] == '✪') { i = text.Length; break; }      // If the MultiWild is the last char in the query, then the rest of the string is a match, and so close off
                i += g - 1;                                                 // Move text cursor along by the amount equivalent to its found match
            }
        }
        if (!foundMatch) sb.Append(text[i]);    // If a match wasn't found at that point in the text, then just append the original character
    }
    return sb;
}
like image 107
Dan W Avatar answered Nov 14 '22 07:11

Dan W


XmlReader is a stream-based XML parser. See http://msdn.microsoft.com/en-us/library/756wd7zs.aspx

like image 29
Rafael Dowling Goodman Avatar answered Nov 14 '22 06:11

Rafael Dowling Goodman


The Mono project has switched the license for their core libraries to an MIT X11 license. If you need to create a regex library customized for performance in your particular application, you should be able to start with the latest code from Mono's implementation of the System library.

like image 1
Sam Harwell Avatar answered Nov 14 '22 08:11

Sam Harwell