Improve performance of a regexp

Question

My software lets users prepare files with regular expressions. I am adding a default regexp library of common expressions that can be reused to prepare a variety of formats. One common task is to remove CRLF line breaks in specific parts of a file, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a regexp that does the job pretty nicely:

    (?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing-intensive: on files above 500 KB it can take 30+ seconds. (The regex is compiled; in this case, the uncompiled version is much slower.)

It's not a big issue, but I wonder if there is a better way to achieve the same result with Regex.

Thanks in advance for your suggestions.

Alan Moore · Accepted Answer

Try this:


    \r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to check whether the match is between <SOURCE> and </SOURCE>. It does that by looking ahead for a </SOURCE>; if it finds a <SOURCE> first, it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.
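As a rough illustration of how this pattern might be wired into C#, here is a minimal sketch. The class name SourceLineBreaks, the method Remove, and the sample input are assumptions made up for this example; only the pattern itself comes from the answer. RegexOptions.Compiled is used because the question reports that the compiled regex is faster in this case.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the name is not from the original post.
public static class SourceLineBreaks
{
    // Pattern copied from the answer: match \r\n only when the lookahead
    // reaches </SOURCE> without first running into <SOURCE>.
    private static readonly Regex BreakInsideSource = new Regex(
        @"\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)",
        RegexOptions.Compiled);

    // Hypothetical helper: deletes every CRLF for which the lookahead succeeds.
    public static string Remove(string text)
    {
        return BreakInsideSource.Replace(text, string.Empty);
    }

    public static void Main()
    {
        // Assumed sample input, modelled on the example in the question.
        string input =
            "<TU>Lorem \r\nIpsum</TU>\r\n" +
            "<SOURCE>This is a sentence\r\nthat should not contain\r\n" +
            "any line break.\r\n</SOURCE>\r\n";

        Console.WriteLine(Remove(input));
    }
}
```

With this sample, the line break inside <TU> and the ones between elements are left alone, while every \r\n between <SOURCE> and </SOURCE> is deleted, including the one directly before </SOURCE>. The sketch deletes the breaks outright; if the joined lines should be separated by a space, as in the question's expected output, passing " " instead of string.Empty to Replace is one option.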

Tags: c#, .net, regex, optimization

Sylverdrag
