Can this regex be further optimized?

I wrote this regex to parse entries from srt files.

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

I don't know if it matters, but this is written in Scala (which uses the Java regex engine; the pattern is in a raw string literal so that I don't have to double the backslashes).

The \s{1,2} is used because some files use only line feeds (\n) as line breaks, while others use a carriage return followed by a line feed (\r\n). The leading (?s) enables DOTALL mode so that the third capturing group can also match line breaks.

My program basically splits an srt file using \n\r?\n as the delimiter and uses Scala's pattern matching to read each entry for further processing:

val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

def apply(string: String): Entry = string match {
  case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
    timeFormat.parse(end), text);
}
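For illustration, the splitting step described above might look like the sketch below. The names SplitEntries and split are hypothetical and not taken from the original program. Note that in a CRLF file each entry except possibly the last keeps a trailing \r after the split, which is why the pattern ends with \r?$.

```scala
object SplitEntries {
  // Blank-line delimiter from the question: LF, optional CR, LF.
  private val Delimiter = "\n\r?\n"

  // Splits raw .srt content into one string per entry,
  // dropping chunks that contain only whitespace.
  def split(content: String): Array[String] =
    content.split(Delimiter).filter(_.trim.nonEmpty)
}
```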

Sample entries:

One line:

1073
01:46:43,024 --> 01:46:45,015
I am your father.

Two Lines:

160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>

The thing is, the profiler shows that this parsing method is by far the most time-consuming operation in my application (which does intensive time math and can even re-encode the file several times faster than it can read and parse the entries).

Can any regex wizards help me optimize it? Or should I sacrifice the succinctness of regex pattern matching and try an old-school java.util.Scanner approach?

Cheers,

Anthony Accioly asked Jan 18 '23 22:01


1 Answer

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

In Java, $ matches at the end of input or just before a line terminator that immediately precedes the end of input, whereas \z matches unambiguously at the end of input only. Scala's Regex is backed by java.util.regex, so the same semantics apply: the \r? in \r?$ is redundant and a plain $ would do just as well. If you want to allow only a CR at the end, and not a CRLF, then \r?\z might be better.
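A small sketch of the difference between the two anchors, using java.util.regex directly with default flags (no MULTILINE, no UNIX_LINES). The object name EndAnchors and the helper found are made up for this illustration.

```scala
import java.util.regex.Pattern

object EndAnchors {
  // True if the regex matches anywhere in the input (default flags).
  def found(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  def main(args: Array[String]): Unit = {
    val line = "I am your father.\r" // entry text with a trailing CR

    // $ also matches just before a final line terminator, here the CR.
    println(found("""father\.$""", line))     // true
    // \z matches only at the very end, so the CR gets in the way.
    println(found("""father\.\z""", line))    // false
    // Allowing an optional CR before \z makes it match again.
    println(found("""father\.\r?\z""", line)) // true
  }
}
```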

With (?s) in effect, the \r? after (.+) is redundant as well: because + is greedy, the . will always expand to include the \r. If you do not want the \r included in the third capturing group, make the match lazy: (.+?) instead of (.+).
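To see the effect on the captured text, here is a minimal sketch that reduces the pattern to its last group. GreedyVsLazy and its two helper methods are hypothetical names used only for this comparison.

```scala
object GreedyVsLazy {
  // Same tail as the entry pattern, reduced to the text group.
  private val Greedy = """(?s)(.+)\r?\z""".r
  private val Lazy   = """(?s)(.+?)\r?\z""".r

  // Returns the text captured by the greedy variant, if the input matches.
  def greedyText(s: String): Option[String] = s match {
    case Greedy(text) => Some(text)
    case _            => None
  }

  // Returns the text captured by the lazy variant, if the input matches.
  def lazyText(s: String): Option[String] = s match {
    case Lazy(text) => Some(text)
    case _          => None
  }
}
```

With the input "I am your father.\r", the greedy version captures the trailing \r as part of the text, while the lazy version leaves it to the \r? that follows the group.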

Maybe something like this:

(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z
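A self-contained sketch that applies this pattern to the two sample entries from the question. The Entry class here is a simplified, hypothetical stand-in for the asker's class: it keeps the timestamps as raw strings instead of parsing them, and parse returns an Option rather than throwing on input that does not match.

```scala
object SrtEntryParser {
  // Hypothetical, simplified stand-in for the question's Entry class.
  final case class Entry(start: String, end: String, text: String)

  // The pattern suggested above, unchanged.
  private val EntryRegex =
    """(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z""".r

  // Returns None instead of throwing a MatchError on malformed input.
  def parse(raw: String): Option[Entry] = raw match {
    case EntryRegex(start, end, text) => Some(Entry(start, end, text))
    case _                            => None
  }

  def main(args: Array[String]): Unit = {
    // LF-only entry, as produced by splitting a Unix-style file.
    val oneLine = "1073\n01:46:43,024 --> 01:46:45,015\nI am your father."
    // CRLF entry with the trailing CR left over after splitting.
    val twoLines = "160\r\n00:20:16,400 --> 00:20:19,312\r\n" +
      "<i>Help me, Obi-Wan Kenobi.\r\nYou're my only hope.</i>\r"

    println(parse(oneLine))
    println(parse(twoLines))
    println(parse("not an entry"))
  }
}
```

Whether this is actually faster than the original pattern would still need to be checked with the profiler; the sketch only shows that the trailing \r no longer ends up in the captured text.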

Other fine high-performance alternatives to regular expressions that run on the JVM and/or the CLR include JavaCC and ANTLR. For a Scala-only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html

Mike Samuel answered Jan 25 '23 22:01