 

Using regex to remove comments from source files

I'm making a program to automate the writing of some C code (code that parses strings into enumerations with the same name). C's handling of strings is not that great, so some people have been nagging me to try Python.

I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT comments from a string. Here is the code:

    import re

    def removeComments(string):
        re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of block comments (/* COMMENT */) from string
        re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (//COMMENT\n) from string

So I tried this code out.

    str = "/* spam * spam */ eggs"
    removeComments(str)
    print str

And it apparently did nothing.

Any suggestions as to what I've done wrong?
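The likely reason nothing happens: `re.sub` returns a new string rather than modifying its argument (Python strings are immutable), and the function above discards both results and returns nothing. A minimal corrected sketch, keeping the original two patterns and written for Python 3 (the snake_case name and the `source` variable are illustrative choices, not from the original):

```python
import re

def remove_comments(text):
    # re.sub returns a new string, so each result must be kept.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # /* block */ comments
    text = re.sub(r"//.*?\n", "", text)  # // comments, including the trailing newline
    return text

source = "/* spam * spam */ eggs"
print(repr(remove_comments(source)))  # prints ' eggs'
```

The caller also has to use the return value, since `source` itself is never changed. These patterns do not protect comment-like text inside string literals.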

There's a saying I've heard a couple of times:

If you have a problem and you try to solve it with Regex you end up with two problems.


EDIT: Looking back at this years later (after a fair bit more parsing experience):

I think regex may have been the right solution, and the simple regex used here was "good enough". I may not have emphasized this enough in the question: this was for a single specific file that had no tricky situations. I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex (e.g. require that the file only use // single-line comments) than to complicate the regex into an unreadable symbol soup.
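Under that restriction a single substitution is enough. A minimal sketch for Python 3, assuming the input never contains `//` inside a string literal (the function name `strip_line_comments` and the sample input are hypothetical):

```python
import re

def strip_line_comments(source):
    # Delete from "//" to the end of each line; the newline itself is kept.
    # Assumes "//" never occurs inside a string literal in the input.
    return re.sub(r"//[^\n]*", "", source)

code = "enum Color { RED, // first\n  GREEN // second\n};\n"
print(strip_line_comments(code))
```

Keeping the newline means the line count of the file does not change, which makes it easier to match compiler messages back to the original source.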

Lyndon White asked Feb 23 '10 14:02



1 Answer

What about "//comment-like strings inside quotes"?

OP is asking how to do it using regular expressions; so:

    import re

    def remove_comments(string):
        pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
        # first group captures quoted strings (double or single)
        # second group captures comments (//single-line or /* multi-line */)
        regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
        def _replacer(match):
            # if the 2nd group (capturing comments) is not None,
            # it means we have captured a non-quoted (real) comment string.
            if match.group(2) is not None:
                return "" # so we will return empty to remove the comment
            else: # otherwise, we will return the 1st group
                return match.group(1) # captured quoted-string
        return regex.sub(_replacer, string)

This WILL remove:

  • /* multi-line comments */
  • // single-line comments

Will NOT remove:

  • String var1 = "this is /* not a comment. */";
  • char *var2 = "this is // not a comment, either.";
  • url = 'http://not.comment.com';

Note: This will also work for JavaScript source.
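A quick check of the cases listed above, as a sketch for Python 3. The function is repeated in condensed form so the snippet runs on its own; the module-level `PATTERN` name and the sample inputs are illustrative:

```python
import re

PATTERN = re.compile(
    r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)",
    re.MULTILINE | re.DOTALL,
)

def remove_comments(text):
    # Keep quoted strings (group 1) unchanged; drop real comments (group 2).
    return PATTERN.sub(
        lambda m: "" if m.group(2) is not None else m.group(1), text
    )

samples = [
    "int x = 1; /* multi-line\n comment */",
    "int y = 2; // single-line comment",
    'char *var2 = "this is // not a comment, either.";',
    "url = 'http://not.comment.com';",
]
for sample in samples:
    print(repr(remove_comments(sample)))
```

The first two samples lose their comments and the last two come back unchanged. One limitation to be aware of: the quoted-string alternatives stop at the first closing quote character, so a literal containing an escaped quote (for example `\"`) followed by `//` is not protected by this pattern.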

Onur Yıldırım answered Sep 23 '22 10:09
