My software lets users apply regular expressions to prepare files. I am adding a default regex library of common expressions that can be reused to prepare a variety of formats. One common task is to remove CRLF line breaks in specific parts of a file, but not in others. For instance, this:
<TU>Lorem
Ipsum</TU>
<SOURCE>This is a sentence
that should not contain
any line break.
</SOURCE>
Should become:
<TU>Lorem
Ipsum</TU>
<SOURCE>This is a sentence that should not contain any line break.
</SOURCE>
I have a regex that does the job pretty nicely:
(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))
The problem is that it is processor-intensive: with files above 500 KB, it can take 30+ seconds. (The regex is compiled in this case; uncompiled, it is much slower.)
It's not a big issue, but I wonder if there is a better way to achieve the same result with regex.
Thanks in advance for your suggestions.
Try this:
\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)
It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.
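To make that concrete, here is a minimal sketch in Python rather than the .NET flavor the question appears to use. The helper name `join_source_lines`, the sample string, and the choice of a single space as the replacement are my own assumptions for illustration. Python only accepts atomic groups from version 3.11 on, so the sketch falls back to ordinary non-capturing groups on older versions.

```python
import re

# The pattern from the answer, unchanged. Atomic groups (?>...) need Python 3.11+.
ATOMIC = r"\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)"

try:
    CRLF_IN_SOURCE = re.compile(ATOMIC)
except re.error:
    # Assumption: on older Pythons, swap in plain non-capturing groups.
    # For this pattern they match the same text, but may backtrack more on failure.
    CRLF_IN_SOURCE = re.compile(ATOMIC.replace("(?>", "(?:"))


def join_source_lines(text: str, replacement: str = " ") -> str:
    """Replace every CRLF that the pattern matches (hypothetical helper)."""
    return CRLF_IN_SOURCE.sub(replacement, text)


# Sample modeled on the question, with explicit CRLF line endings.
sample = (
    "<TU>Lorem\r\nIpsum</TU>\r\n"
    "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
    "any line break.\r\n</SOURCE>\r\n"
)

result = join_source_lines(sample)
print(result)
```

Two things to be aware of. The line break directly before `</SOURCE>` also matches, so it is replaced too; if you want to keep it, you would need to exclude that position separately. Also, as written, `[^<>]*` never steps past a `>`, so a line break that is followed by another complete tag before `</SOURCE>` is left untouched.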