Python regex question: stripping multi-line comments but maintaining a line break

Question

I'm parsing a source code file, and I want to remove all line comments (i.e. starting with "//") and multi-line comments (i.e. /* ... */). However, if the multi-line comment has at least one line break in it (\n), I want the output to have exactly one line break instead.

For example, the code:

qwe /* 123
456 
789 */ asd

should turn exactly into:

qwe
asd

and not "qweasd" or:

qwe

asd

What would be the best way to do so? Thanks

EDIT: Example code for testing:

comments_test = "hello // comment\n"+\
                "line 2 /* a comment */\n"+\
                "line 3 /* a comment*/ /*comment*/\n"+\
                "line 4 /* a comment\n"+\
                "continuation of a comment*/ line 5\n"+\
                "/* comment */line 6\n"+\
                "line 7 /*********\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ?? /*********\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ??"

Expected results:

hello 
line 2 
line 3  
line 4
line 5
line 6
line 7
line ??
line ??

Markus Jarderot · Accepted Answer

import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start, mid, end = match.group(1, 2, 3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

(^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE flag is used.
[^\S\n] will match any whitespace character except a newline. We don't want to match line breaks if the comment starts on its own line.
/\*(.*?)\*/ will match a multi-line comment and capture the content. The matching is lazy, so a single match never spans two or more comments. The DOTALL flag makes . match newlines.
//[^\n]* will match a single-line comment. We can't use . here because of the DOTALL flag.
($)? will match if the comment stops at the end of a line, as long as the MULTILINE flag is used.
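The two less obvious pieces above, the [^\S\n] character class and the optional anchor groups, can be tried in isolation. This is only a small sketch; the names horizontal_ws and probe are made up for the demonstration and are not part of the answer's code.

```python
import re

# [^\S\n] is a double negative: "not non-whitespace and not newline",
# i.e. any whitespace character except "\n".
horizontal_ws = re.compile(r'[^\S\n]+')
print(horizontal_ws.findall("a \t b\n c"))   # [' \t ', ' ']

# An optional group wrapped around an anchor records whether the anchor
# matched: the group is '' when it did and None when it did not take part.
probe = re.compile(r'(^)?x', re.MULTILINE)
print([m.group(1) for m in probe.finditer("x ax\nx")])   # ['', None, '']
```

This is why comment_replacer tests start and end with "is not None" rather than for truthiness: a successful anchor group is the empty string, which is falsy.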

Examples:

>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
...                  "line 2 /* a comment */\n"
...                  "line 3 /* a comment*/ /*comment*/\n"
...                  "line 4 /* a comment\n"
...                  "continuation of a comment*/ line 5\n"
...                  "/* comment */line 6\n"
...                  "line 7 /*********\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ?? /*********\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ??")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"
"line ??"
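The interactive sessions above could also be kept as a repeatable check. The following is a minimal sketch: the pattern and replacer are copied from the answer, while the short inputs in the assertions are illustrative cases chosen here, one per branch of comment_replacer.

```python
import re

# Pattern and replacer copied from the answer.
comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start, mid, end = match.group(1, 2, 3)
    if mid is None:
        return ''        # single line comment
    elif start is not None or end is not None:
        return ''        # block comment touching a line boundary
    elif '\n' in mid:
        return '\n'      # block comment spanning lines
    else:
        return ' '       # inline block comment

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

# The example from the question: exactly one line break survives.
assert remove_comments("qwe /* 123\n456 \n789 */ asd") == "qwe\nasd"
# Inline block comment between two tokens collapses to one space.
assert remove_comments("a /* b */ c") == "a c"
# Block comment at the start of a line disappears entirely.
assert remove_comments("/* lead */code") == "code"
# Line comment is removed together with the whitespace before it.
assert remove_comments("code // tail\nnext") == "code\nnext"
```

If every assertion holds, the script finishes without printing anything.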

Edits:

Updated to new specification.
Added another example.

Tags: python, comments, regex, parsing

Asked by Roee Adler · 1 answer, by Markus Jarderot
