I'm parsing a source code file, and I want to remove all line comments (i.e. starting with "//") and multi-line comments (i.e. /* ... */). However, if a multi-line comment contains at least one line break (\n), I want the output to have exactly one line break in its place.
For example, the code:
qwe /* 123
456
789 */ asd
should turn exactly into:
qwe
asd
and not "qweasd" or (with two line breaks):
qwe

asd
What would be the best way to do so? Thanks
EDIT: Example code for testing:
comments_test = "hello // comment\n"+\
"line 2 /* a comment */\n"+\
"line 3 /* a comment*/ /*comment*/\n"+\
"line 4 /* a comment\n"+\
"continuation of a comment*/ line 5\n"+\
"/* comment */line 6\n"+\
"line 7 /*********\n"+\
"********************\n"+\
"**************/\n"+\
"line ?? /*********\n"+\
"********************\n"+\
"********************\n"+\
"********************\n"+\
"********************\n"+\
"**************/\n"+\
"line ??"
Expected results:
hello
line 2
line 3
line 4
line 5
line 6
line 7
line ??
line ??
import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start, mid, end = match.group(1, 2, 3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

(^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE flag is used.
[^\S\n]* will match any run of whitespace characters except newlines. We don't want to match line breaks if the comment starts on its own line.
/\*(.*?)\*/ will match a multi-line comment and capture its content. The match is lazy, so it doesn't span two or more comments. The DOTALL flag makes "." match newlines.
//[^\n]* will match a single-line comment. We can't use "." here because of the DOTALL flag.
($)? will match if the comment stops at the end of a line, as long as the MULTILINE flag is used.

Examples:
>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
... "line 2 /* a comment */\n"
... "line 3 /* a comment*/ /*comment*/\n"
... "line 4 /* a comment\n"
... "continuation of a comment*/ line 5\n"
... "/* comment */line 6\n"
... "line 7 /*********\n"
... "********************\n"
... "**************/\n"
... "line ?? /*********\n"
... "********************\n"
... "********************\n"
... "********************\n"
... "********************\n"
... "**************/\n"
... "line ??")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"
"line ??"
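One caveat: the pattern knows nothing about string literals, so a "//" or "/*" inside a quoted string (a URL, for instance) is treated as the start of a comment. Below is a minimal sketch of one way around that, assuming the source only uses C-style double-quoted strings with backslash escapes. The names string_or_comment_re, string_aware_replacer and remove_comments_keep_strings are made up for this illustration and are not part of the answer above. The idea is to add an alternative that matches a string literal first and returns it unchanged.

```python
import re

# Hypothetical variant of the pattern above. Group 1 captures a
# double-quoted string literal (assumption: C-style backslash escapes).
# Groups 2-4 are the same three groups as in the original pattern.
string_or_comment_re = re.compile(
    r'("(?:\\.|[^"\\\n])*")'
    r'|(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE,
)

def string_aware_replacer(match):
    literal, start, mid, end = match.group(1, 2, 3, 4)
    if literal is not None:
        # string literal: keep it exactly as written
        return literal
    if mid is None:
        # single line comment
        return ''
    if start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    # multi line comment in the middle of a line
    return '\n' if '\n' in mid else ' '

def remove_comments_keep_strings(text):
    return string_or_comment_re.sub(string_aware_replacer, text)

print(remove_comments_keep_strings('url = "http://example.com"; // trailing\n'))
# prints: url = "http://example.com";
```

This still does not cover single-quoted character literals or raw strings. If the input language has those, a real tokenizer is probably the safer choice.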