Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex match anything up to word - without non-greedy operators

Tags:

c#

regex

I want to match anything up to a specific word (e.g., the closing comment in C */), however, due to performance reasons I don't want to make use of non-greedy operators.

E.g., to match C comments: /\*.*?\*/ is too slow for my files. Is there any possibility to improve performance?

like image 871
D.R. Avatar asked Nov 24 '15 14:11

D.R.


2 Answers

Sure, use unrolling-the-loop technique:

/\*[^*]*(?:\*(?!/)[^*]*)*\*/

See regex demo

The unrolling the loop technique is based on the hypothesis that in most case, you kown in a repeteated alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

Which could means something like, match the normal case, if you find a special case, matched it than match the normal case again. You notice that part of this syntax could potentialy lead to a super-linear match. To avoid a neverending match to append, the following rules shoud be carefully applied:

  • the start of the special case and the normal case must be mutually exclusive
  • special must always match at least one character
  • the special expression must be atomic: be careful to the fact that ( special normal* )* could be reduced to (special)*, which if special is special*, this became similar to (a*)* which is a undeterministic expression.

C# pattern declaration (using a verbatim string literal):

var pattern = @"/\*[^*]*(?:\*(?!/)[^*]*)*\*/";

The regex breakdown:

  • /\* - literal /*
  • [^*]* - 0 or more character other than *
  • (?:\*(?!/)[^*]*)* - 0 or more sequences of...
    • \*(?!/) - a literal * not followed by /
    • [^*]* - 0 or more character other than *
  • \*/ - literal */

Here is a graph showing how efficient the 3 potentially identical regexps are (tested at regexhero.net*):

enter image description here

* Tested against /* Comment * Typical * Comment */

like image 194
Wiktor Stribiżew Avatar answered Oct 15 '22 15:10

Wiktor Stribiżew


Try this:

/\*(?:[^*]|\*(?!/))*\*/

I don't know whether it is faster than stribizhev's answer.

like image 42
Sebastian Schumann Avatar answered Oct 15 '22 13:10

Sebastian Schumann