
Correctly parsing comments with a regex

Tags: c++, c, regex

I am writing a compiler and am having trouble handling multi-line comments (/* */). The problem is my regex: it finds an opening comment token (/*) but then accepts any closing comment token (*/), even one that belongs to a later comment, so everything between the first /* and the last */ is removed.

A second issue is that a comment marker inside a string literal is still treated as a comment. I haven't tried to handle this case yet, but some help would be appreciated.

The regex I am using is:

[/][*](.|\n)*[*][/]

Examples:

input:

int main(/* text */) {
   int i = 0;
   /* hello world */
   return 1;
} 

output:

int main(

   return 1;
} 

And then for the strings, an input would be:

 int main() {
       printf("/* hi there */\n");
       return 1;
    } 

output:

int main() {
      printf("\n");
       return 1;
} 
QQMusic asked Dec 03 '25 15:12



1 Answer

I'm not sure what regex library you're using, but you need what's called a non-greedy match.

Try this:

\/\*(.|\n)*?\*\/

The ? after the * quantifier makes the match non-greedy (lazy): it stops at the first */ it can reach instead of the last one.

You can visualize this working here.

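Note that this is Perl-Compatible Regular Expression (PCRE) syntax, which I am assuming you're using. If you're using POSIX regular expressions, this won't work, because POSIX defines no non-greedy quantifiers.

You also don't need to put / and * inside a character class ([...]); escaping them works just as well. Strictly, only * has to be escaped; / only needs it where / is used as the pattern delimiter.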

You also can use the PCRE_DOTALL flag to make . match \n or \r as well, which can simplify your regex.

PCRE_DOTALL
   If this bit is set, a dot metacharacter in the pattern matches a
   character of any value, including one that indicates a newline. However,
   it only ever matches one character, even if newlines are coded as CRLF.
   Without this option, a dot does not match when the current position is
   at a newline. This option is equivalent to Perl's /s option, and it can
   be changed within a pattern by a (?s) option setting. A negative class
   such as [^a] always matches newline characters, independent of the
   setting of this option.

Then, our regex would be:

\/\*.*?\*\/
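If you turn out to be on std::regex rather than PCRE (again an assumption on my part), there is no dot-all flag among its syntax options, so the simplified pattern above won't span lines there. A common workaround is the class [\s\S], which matches any character including \n and \r. A small sketch, with strip_comments_dotall being a hypothetical name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: [\s\S] stands in for a dot that also matches
// newlines, since std::regex offers no equivalent of PCRE_DOTALL.
std::string strip_comments_dotall(const std::string& src) {
    static const std::regex comment(R"(/\*[\s\S]*?\*/)");
    return std::regex_replace(src, comment, "");
}
```

Unlike (.|\n), this form also covers comments that contain \r, for example in files with CRLF line endings.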

You can also make the entire regex ungreedy by using the PCRE_UNGREEDY flag:

PCRE_UNGREEDY
   This option inverts the "greediness" of the quantifiers so that they
   are not greedy by default, but become greedy if followed by "?". It is
   not compatible with Perl. It can also be set by a (?U) option setting
   within the pattern.

With both PCRE_DOTALL and PCRE_UNGREEDY set, this will work:

\/\*.*\*\/
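As for the second part of the question (comment markers inside string literals), none of the patterns above handle it. One possible approach, sketched here with std::regex as an assumed library, is to match string and character literals first and put them back unchanged, so a /* inside quotes is never reached by the comment branch. The name strip_comments_keep_strings is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the first alternative (capture group 1) matches a
// double-quoted or single-quoted literal, including escaped characters;
// the second alternative matches a /* ... */ comment.
// Replacing with "$1" keeps literals as they were and drops comments,
// because group 1 is empty whenever the comment alternative matched.
std::string strip_comments_keep_strings(const std::string& src) {
    static const std::regex token(
        R"re(("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|/\*[\s\S]*?\*/)re");
    return std::regex_replace(src, token, "$1");
}
```

This is only a sketch: it does not know about // comments, raw string literals, or backslash line continuations, so for a real compiler a hand-written scanner or a lexer generator is probably the safer route.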
Will answered Dec 06 '25 05:12



