Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A more efficient Regex or alternative?

Tags:

c#

.net

regex

I have a file with a little over a million lines.

 {<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceVolume> "693702"^^<xsd:long>}
 {<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceId> <uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8>}

Each line is a statement.

struct Statement
    string C;
    string S;
    string P;
    string O;
    string T;

Currently I'm using a TextReader in a while loop and parsing each line with a regular expression:

Regex lineParse = new Regex(@"[^<|\""]*\w[^>\""]*", RegexOptions.Singleline | RegexOptions.Compiled);

It takes quite awhile to do this parsing and I'm hoping someone can point me to a more efficient parsing strategy.

Some lines have 5 matchs and some 4. Here is how each line is parsed:

{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceVolume> "693702"^^<xsd:long>}

Statement()
    C = uri::rdfserver#null
    S = uri::d41d8cd98f00b204e9800998ecf8427e
    P = uri::TickerDailyPriceVolume
    O = 693702
    T = xsd:long

{<uri::rdfserver#null> <uri::d41d8cd98f00b204e9800998ecf8427e> <uri::TickerDailyPriceId> <uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8>}

Statement()
    C = uri::rdfserver#null
    S = uri::d41d8cd98f00b204e9800998ecf8427e
    P = uri::TickerDailyPriceId
    O = uri::20fb8f7d-30ef-dd11-a78d-001f29e570a8

Additional information from the comments: "The poor performance I was seeing is actually due to a conditional break-point that I had set in the code. Without that break-point everything is pretty fast. Still if anyone has any improvement ideas I would be interested" -Eric Schoonover

like image 232
Eric Schoonover Avatar asked Nov 28 '22 16:11

Eric Schoonover


1 Answers

The fastest (as shown below) is a simple string split:

line.Split(new char[] { '{', '<', '>', '}', ' ', '^', '"' },
           StringSplitOptions.RemoveEmptyEntries);

The next fastest is an anchored regular expression (ugly):

Regex lineParse
    = new Regex(@"^\{(<([^>]+)>\s*){3,4}(""([^""]+)""\^\^<([^>]+)>\s*)?\}$",
                RegexOptions.Compiled);
Match m = lineParse.Match(line);
if (m.Groups[2].Captures.Count == 3)
{
    Data data = new Data { C = m.Groups[2].Captures[0].Value,
        S = m.Groups[2].Captures[1].Value, P = m.Groups[2].Captures[2].Value,
        O = m.Groups[4].Value, T = m.Groups[5].Value };
} else {
    Data data = new Data { C = m.Groups[2].Captures[0].Value,
        S = m.Groups[2].Captures[1].Value, P = m.Groups[2].Captures[2].Value,
        O = m.Groups[2].Captures[3].Value, T = String.Empty };
}

Timings for 1M lines of random data (String.Split as the baseline):

Method                #1  Wall (Diff)       #2  Wall (Diff)
------------------------------------------------------------
line.Split                3.6s (1.00x)         3.1s (1.00x)
myRegex.Match             5.1s (1.43x)         3.3s (1.10x)
itDependsRegex.Matches    6.8s (1.88x)         4.4s (1.44x)
stateMachine              8.4s (2.34x)         5.6s (1.82x)
alanM.Matches             9.1s (2.52x)         7.8s (2.52x)
yourRegex.Matches        18.3s (5.06x)        12.1s (3.90x)

Updated to include @AlanM and @itdepends regular expressions. It appears that Regex.Matches is slower than Regex.Match, however, the more context clues you give the parser, the better it performs. The single negative character class used by @AlanM is the simplest to read, yet slower than the most cryptic (mine). Hats off to @itdepends for the simplest regex that produces the fastest time. Ok, and while I thought it would be crazy to write a state machine to parse the line, it actually doesn't perform poorly at all...kudos to @RexM for the suggestion. I've also added times from my Q6600 at home (#2) v. an older Xeon at work (#1).

like image 70
9 revs, 2 users 94% Avatar answered Dec 20 '22 17:12

9 revs, 2 users 94%