I have to work through a large file (several MB) and remove comments from it that are marked by a timestamp. An example:
blablabla 12:10:40 I want to remove this
blablabla some more
even more bla
After filtering, I would like it to look like this :
blablabla
blablabla some more
even more bla
The nicest way to do it should be using a Regex:
Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
Now this works perfectly for my purposes, but it's a bit slow. I'm assuming this is because the first two character classes, [012] and [0123456789], match a lot of the data (it's an ASCII file containing hexadecimal data, like "0045ab0123"). So the Regex gets a partial match on the first two characters far too often.
When I change the Regex to
Dataout = Regex.Replace(Datain, ":[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
It gets an enormous speedup, probably because there are not many ':' characters in the file at all. Good! But I still need to check that the two characters before the first ':' are digits, and then delete the timestamp and the rest of the line.
So my question boils down to :
Or maybe there's even a better way?
You could use both of the regular expressions in the question. First a match with the leading colon expression to quickly find or exclude possible lines. If that succeeds then use the full replace expression.
// Cheap check first: the leading-colon pattern rarely matches in hex data.
MatchCollection mc = Regex.Matches(Datain, ":[012345][0123456789]:[012345][0123456789].*");
if ( mc.Count > 0 )
{
Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
}
else
{
Dataout = Datain;
}
A variation might be
Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");
Regex changer = new Regex("[012][0123456789]:[012345][0123456789]:[012345][0123456789].*");
if ( finder.Match(Datain).Success)
{
Dataout = changer.Replace(Datain, string.Empty);
}
else
{
Dataout = Datain;
}
Another variation would be to use the finder as above. If the string is found, then just check whether the previous two characters are digits.
Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");
Match m = finder.Match(Datain);
if ( m.Success && m.Index > 1)
{
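// m.Index points at the first ':'; the two characters before it must be digits.
if ( char.IsDigit(Datain[m.Index-1]) && char.IsDigit(Datain[m.Index-2]) )
{
// Keep only the text before the two-digit hour.
Dataout = m.Index-2 == 0 ? string.Empty : Datain.Substring(0, m.Index-2);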
}
else
{
Dataout = Datain;
}
}
else
{
Dataout = Datain;
}
In the second and third ideas the finder and changer should be declared and given values before reading any lines. There is no need to execute the new Regex(...) inside the line reading loop.
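As a rough illustration of that last point, the second idea might be arranged as in the sketch below, with both expressions built once and reused for every line. The names TimestampStripper, StripLine and StripFile, and the two path parameters, are made up for this example and do not come from the question. It assumes the file is processed one line at a time, and it writes the character classes as ranges ([0-5] instead of [012345]), which match the same characters.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class TimestampStripper
{
    // Both expressions are constructed once, outside any line-reading loop.
    // Finder is the cheap leading-colon pattern; Changer is the full pattern.
    static readonly Regex Finder =
        new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Compiled);
    static readonly Regex Changer =
        new Regex("[0-2][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    // Hypothetical helper: removes a timestamp and everything after it on the line.
    public static string StripLine(string line)
    {
        // Most lines contain no ":mm:ss" at all, so the full replace is skipped.
        return Finder.IsMatch(line) ? Changer.Replace(line, string.Empty) : line;
    }

    // Hypothetical helper: streams the input so the whole file is never held in memory.
    public static void StripFile(string inputPath, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(StripLine(line));
            }
        }
    }
}
```

Note that the text before the timestamp is kept exactly as it is, including any space in front of the time, so a call to TrimEnd() may be wanted if trailing whitespace matters.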