Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improve performance of a regexp

My software allows users to use regexp to prepare files. I am in the process of adding a default regexp library with common expressions that can be re-used to prepare a variety of formats. One common task is to remove crlf in specific parts of the files, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a rexep that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing intensive and with files above 500kb, it can take 30+ seconds. (regex is compiled, in this case, uncompiled is much slower)

It's not a big issue, but I wonder is there is a better way to achieve the same results with Regex.

Thanks in advance for your suggestions.

like image 207
Sylverdrag Avatar asked Dec 06 '25 00:12

Sylverdrag


1 Answers

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.

like image 148
Alan Moore Avatar answered Dec 07 '25 13:12

Alan Moore