
Bad Regex performance while searching for times (xx:xx:xx)

I have to work through a large file (several MB) and remove comments from it that are marked by a time. An example :

blablabla  12:10:40 I want to remove this
blablabla some more
even more bla

After filtering, I would like it to look like this :

blablabla
blablabla some more
even more bla

The nicest way to do it should be using a Regex :

Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

Now this works perfectly for my purposes, but it's a bit slow. I'm assuming this is because the first two character classes, [012] and [0123456789], match a lot of the data (it's an ASCII file containing hexadecimal data, such as "0045ab0123"), so the Regex starts a candidate match on the first two characters far too often.

When I change the Regex to

Dataout = Regex.Replace(Datain, ":[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

I get an enormous speedup, probably because there are very few ':' characters in the file. Good! But I still need to check that the two characters before the first ':' are digits, and then delete the rest of the line.

So my question boils down to :

  • how can I make Regex search first for the least frequent character, ':', and only after finding a match check the two characters before it?

Or maybe there's even a better way?

wvl_kszen asked Nov 01 '22 00:11


1 Answer

You could use both of the regular expressions in the question. First match with the leading-colon expression to quickly find or rule out candidate lines. Only if that succeeds, run the full replace expression.

MatchCollection mc = Regex.Matches(Datain, ":[012345][0123456789]:[012345][0123456789].*");

// Regex.Matches never returns null; an empty collection means no candidate was found.
if ( mc.Count > 0 )
{
    Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
}
else
{
    Dataout = Datain;
}

A variation might be

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");
Regex changer = new Regex("[012][0123456789]:[012345][0123456789]:[012345][0123456789].*");

if ( finder.Match(Datain).Success)
{
    Dataout = changer.Replace(Datain, string.Empty);
}
else
{
    Dataout = Datain;
}

Another variation would be to use the finder as above. If the string is found then just check whether the previous two characters are digits.

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");

Match m = finder.Match(Datain);
if ( m.Success && m.Index > 1)
{
    if ( char.IsDigit(Datain[m.Index-1]) && char.IsDigit(Datain[m.Index-2]) )
    {
        // Keep only the text before the two-digit hour (empty when the time starts the string).
        Dataout = Datain.Substring(0, m.Index-2);
    }
    else
    {
        Dataout = Datain;
    }
}
else
{
    Dataout = Datain;
}
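
Note that the snippet above inspects only the first candidate and cuts off everything after it, so it fits a single line rather than a whole file held in one string, as in the question. The following is a minimal sketch of the same colon-first idea applied to a multi-line string. The class and method names (WholeTextStripper, Strip, IsHourPrefix) are hypothetical, and whether it is actually faster than the original single Regex.Replace on your data is something to measure rather than assume.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class WholeTextStripper
{
    // Candidate finder: ":mm:ss" plus the rest of the line ('.' stops at '\n').
    private static readonly Regex Finder =
        new Regex(":[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    // True when text[start] is 0-2 and text[start + 1] is 0-9, i.e. a plausible "hh".
    private static bool IsHourPrefix(string text, int start)
    {
        return text[start] >= '0' && text[start] <= '2'
            && text[start + 1] >= '0' && text[start + 1] <= '9';
    }

    // Removes every "hh:mm:ss<rest of line>" from text, leaving line breaks intact.
    public static string Strip(string text)
    {
        var result = new StringBuilder(text.Length);
        int copiedUpTo = 0;   // everything before this index is already handled
        int searchFrom = 0;   // where the next candidate search begins

        while (true)
        {
            Match m = Finder.Match(text, searchFrom);
            if (!m.Success) break;

            int start = m.Index - 2;  // position where the two-digit hour would begin
            if (start >= copiedUpTo && IsHourPrefix(text, start))
            {
                // Keep the text before the time, skip the time and the rest of the line.
                result.Append(text, copiedUpTo, start - copiedUpTo);
                copiedUpTo = m.Index + m.Length;
                searchFrom = copiedUpTo;
            }
            else
            {
                // False candidate: retry from the next character so overlaps are not missed.
                searchFrom = m.Index + 1;
            }
        }

        result.Append(text, copiedUpTo, text.Length - copiedUpTo);
        return result.ToString();
    }
}
```

Usage would be Dataout = WholeTextStripper.Strip(Datain);. Like the original expression, this leaves any whitespace in front of the time untouched. Unlike char.IsDigit, the hour check here accepts only 0-2 followed by 0-9, which mirrors the [012][0123456789] prefix from the question.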

In the second and third ideas, the finder and changer should be declared and initialised once, before reading any lines. There is no need to execute new Regex(...) inside the line-reading loop.
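
As a rough illustration of that last point, here is a sketch of the second idea with both expressions built once and reused for every line. The LineFilter class and its Filter method are hypothetical names, and the TextReader/TextWriter parameters are an assumption about how the file is read; adapt them to whatever I/O you already have.

```csharp
using System.IO;
using System.Text.RegularExpressions;

static class LineFilter
{
    // Both expressions are constructed once, not once per line.
    private static readonly Regex Finder =
        new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Compiled);
    private static readonly Regex Changer =
        new Regex("[0-2][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    // Copies input to output line by line, cutting each line at its first hh:mm:ss time.
    public static void Filter(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // Cheap colon-first test; the full expression runs only on candidate lines.
            output.WriteLine(Finder.IsMatch(line)
                ? Changer.Replace(line, string.Empty)
                : line);
        }
    }
}
```

For a file on disk this could be called with a StreamReader and a StreamWriter. Reading line by line also avoids holding the whole file in one string.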

AdrianHHH answered Nov 11 '22 17:11