
Java - regular expression finding comments in code

A little fun with Java this time. I want to write a program that reads code from standard input (line by line, for example), such as:

// some comment
class Main {
    /* blah */
    // /* foo
    foo();
    // foo */
    foo2();
    /* // foo2 */
}

finds all the comments in it, and removes them. I'm trying to use regular expressions, and so far I have something like this:

private static String ParseCode(String pCode)
{
    String MyCommentsRegex = "(?://.*)|(/\\*(?:.|[\\n\\r])*?\\*/)";
    return pCode.replaceAll(MyCommentsRegex, " ");
}

but it doesn't work in all cases, e.g.:

System.out.print("We can use /* comments */ inside a string of course, but it shouldn't start a comment");

Any advice, or ideas other than regex? Thanks in advance.

brovar asked Nov 01 '09 12:11




1 Answer

You may have already given up on this by now but I was intrigued by the problem.

I believe this is a partial solution...

Native regex:

//.*|("(?:\\[^"]|\\"|.)*?")|(?s)/\*.*?\*/

In Java:

String clean = original.replaceAll( "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/", "$1 " );

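This appears to correctly handle comment markers embedded in string literals, as well as escaped quotes inside strings. The trick is that string literals are matched as well: they are captured in group 1 and put back by $1, so only the comments are actually dropped. I threw a few inputs at it to check, but not exhaustively.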
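As a quick check, here is a minimal sketch that wraps the one-liner in a small class and runs it on two inputs from this thread. The names StripCommentsDemo and stripComments are hypothetical, chosen only for illustration; the pattern and the "$1 " replacement are the ones given above.

```java
public class StripCommentsDemo {
    // Pattern from the answer: line comment | string literal (group 1) | block comment.
    // The (?s) flag only affects the part after it, so only block comments span lines.
    static final String COMMENT_OR_STRING =
        "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/";

    // Hypothetical helper name, not part of the original answer.
    static String stripComments(String source) {
        // "$1 " re-emits a matched string literal followed by one space.
        // For a comment match, group 1 did not participate, so only the space is written.
        return source.replaceAll(COMMENT_OR_STRING, "$1 ");
    }

    public static void main(String[] args) {
        // The string literal survives (with an extra space after it);
        // the trailing line comment is replaced by a single space.
        System.out.println(stripComments("print(\"keep /* this */\"); // drop this"));

        // The block comment becomes one space, so the tokens stay separated.
        System.out.println(stripComments("int/* some comment */foo = 5;")); // int foo = 5;
    }
}
```

Note that the first call leaves a space between the closing quote and the ")", since string matches go through the same replacement as comments.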

There is one compromise: every "..." string literal in the code will end up with a space after it, because the same replacement ("$1 ") is applied to strings and comments alike. Solving that while keeping this a single replaceAll call would be very difficult, given the need to cleanly handle:

int/* some comment */foo = 5;

A simple Matcher.find/appendReplacement loop could check whether group(1) matched before replacing with a space, and would only be a handful of lines of code. Still simpler than a full-blown parser, maybe. (I could add the matcher loop too if anyone is interested.)
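For anyone who wants to try it, a loop along those lines might look like the sketch below. This is an illustration under assumptions, not tested exhaustively: the class and method names (CommentStripper, stripComments) are made up here, and the pattern is the same partial one as above, so it still does not handle things like char literals containing a quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentStripper {
    // Same pattern as above: line comment | string literal (group 1) | block comment.
    private static final Pattern COMMENT_OR_STRING = Pattern.compile(
        "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/");

    // Hypothetical helper: removes comments, leaves string literals untouched.
    public static String stripComments(String source) {
        Matcher m = COMMENT_OR_STRING.matcher(source);
        // StringBuffer is used because the StringBuilder overloads need Java 9+.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String literal = m.group(1);
            // Group 1 is non-null only when the match was a string literal:
            // keep it as-is; otherwise the match was a comment, so emit one space.
            String replacement = (literal != null) ? literal : " ";
            // quoteReplacement stops '\' and '$' inside the literal from being
            // interpreted as escapes or group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("int/* some comment */foo = 5;")); // int foo = 5;
        System.out.println(stripComments("print(\"keep /* this */\"); // drop this"));
    }
}
```

Compared with the replaceAll one-liner, the only behavioral difference is that string literals no longer pick up a trailing space; comments are still replaced by a single space so that adjacent tokens do not merge.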

PSpeed answered Oct 14 '22 11:10