
Python regex question: stripping multi-line comments but maintaining a line break

I'm parsing a source code file, and I want to remove all line comments (i.e. starting with "//") and multi-line comments (i.e. /* ... */). However, if a multi-line comment contains at least one line break (\n), I want the output to have exactly one line break in its place.

For example, the code:

qwe /* 123
456 
789 */ asd

should turn exactly into:

qwe
asd

and not "qweasd" or:

qwe

asd

What would be the best way to do so? Thanks


EDIT: Example code for testing:

comments_test = "hello // comment\n"+\
                "line 2 /* a comment */\n"+\
                "line 3 /* a comment*/ /*comment*/\n"+\
                "line 4 /* a comment\n"+\
                "continuation of a comment*/ line 5\n"+\
                "/* comment */line 6\n"+\
                "line 7 /*********\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ?? /*********\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ??"

Expected results:

hello 
line 2 
line 3  
line 4
line 5
line 6
line 7
line ??
line ??
Roee Adler asked Dec 03 '22 08:12


1 Answer

import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start,mid,end = match.group(1,2,3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

How the pattern works:

  • (^)? matches (as an empty string) if the comment starts at the beginning of a line, as long as the MULTILINE flag is used. Otherwise the group is None.
  • [^\S\n]* matches any run of whitespace characters except newlines. We don't want to consume line breaks if the comment starts on its own line.
  • /\*(.*?)\*/ matches a multi-line comment and captures its content. The match is lazy, so it stops at the first */ instead of spanning two or more comments. The DOTALL flag makes . match newlines.
  • //[^\n]* matches a single-line comment. We can't use . here because of the DOTALL flag.
  • ($)? matches (as an empty string) if the comment stops at the end of a line, as long as the MULTILINE flag is used. Otherwise the group is None.
  • In the actual pattern, the leading / shared by both comment forms is factored out in front of the (?:...) group.
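
The building blocks described above can be tried in isolation. The following is a minimal Python 3 sketch; the sample strings and the names horizontal_ws, sample and anchored are made up for illustration and are not part of the answer's code.

```python
import re

# [^\S\n] is a double negative: "not non-whitespace and not newline",
# i.e. any whitespace character except a line break.
horizontal_ws = re.compile(r'[^\S\n]+')
print(horizontal_ws.findall("a \t b\n c"))   # [' \t ', ' ']

# Lazy vs. greedy matching of block comments under DOTALL.
sample = "/* a */ x /* b\nc */"
print(re.findall(r'/\*(.*?)\*/', sample, re.DOTALL))  # [' a ', ' b\nc ']
print(re.findall(r'/\*(.*)\*/', sample, re.DOTALL))   # [' a */ x /* b\nc ']

# An optional anchor group is '' when the anchor matched and None when
# the group was skipped, which is what comment_replacer tests for.
anchored = re.compile(r'(^)?x($)?', re.MULTILINE)
m = anchored.search("ax\n")
print(m.group(1), repr(m.group(2)))  # None ''
```

The greedy variant swallows the code between the two comments, which is why the answer's pattern uses .*? instead.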

Examples:

>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
...                  "line 2 /* a comment */\n"
...                  "line 3 /* a comment*/ /*comment*/\n"
...                  "line 4 /* a comment\n"
...                  "continuation of a comment*/ line 5\n"
...                  "/* comment */line 6\n"
...                  "line 7 /*********\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ?? /*********\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ??")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"
"line ??"

Edits:

  • Updated to new specification.
  • Added another example.
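
One caveat worth checking before running this on real source files: the pattern has no notion of string literals, so comment markers inside quotes are treated as comments too. The Python 3 sketch below restates the answer's pattern and functions so that it runs on its own; the second input is a hypothetical line chosen only to show the effect.

```python
import re

# Pattern and replacer restated from the answer above.
comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start, mid, end = match.group(1, 2, 3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

# The example from the question collapses to exactly one line break.
print(repr(remove_comments("qwe /* 123\n456 \n789 */ asd")))   # 'qwe\nasd'

# Hypothetical input: "//" inside a string literal is stripped as well.
print(repr(remove_comments('url = "http://example.com";\n')))  # 'url = "http:\n'
```

If the input can contain such literals, a tokenizer-based approach that recognizes strings first is likely a safer choice than a single regex.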
Markus Jarderot answered Dec 20 '22 06:12