
Matching one-line JavaScript comments (//) with re

I'd like to filter out (mostly one-line) comments from (mostly valid) JavaScript using Python's re module. For example:

// this is a comment
var x = 2 // and this is a comment too
var url = "http://www.google.com/" // and "this" too
url += 'but // this is not a comment' // however this one is
url += 'this "is not a comment' + " and ' neither is this " // only this

I've been trying this for more than half an hour without any success. Can anyone please help me?

EDIT 1:

foo = 'http://stackoverflow.com/' // these // are // comments // too //

EDIT 2:

bar = 'http://no.comments.com/'
Attila O. Avatar asked Jan 25 '10 23:01



2 Answers

My regex powers had gone a bit stale, so I used your question to refresh what I remember. It became a fairly large regex, mostly because I also wanted to filter multi-line comments.

import re

reexpr = r"""
    (                           # Capture code
        "(?:\\.|[^"\\])*"       # String literal
        |
        '(?:\\.|[^'\\])*'       # String literal
        |
        (?:[^/\n"']|/[^/*\n"'])+ # Any code besides newlines or string literals
        |
        \n                      # Newline
    )|
    (/\*  (?:[^*]|\*(?!/))*  \*/)        # Multi-line comment (may end in **/)
    |
    (?://(.*)$)                 # One-line comment
    """
rx = re.compile(reexpr, re.VERBOSE | re.MULTILINE)

This regex captures three different groups: one for code and two for comment contents. Below is an example of how to extract them.

code = r"""// this is a comment
var x = 2 * 4 // and this is a comment too
var url = "http://www.google.com/" // and "this" too
url += 'but // this is not a comment' // however this one is
url += 'this "is not a comment' + " and ' neither is this " // only this

bar = 'http://no.comments.com/' // these // are // comments
bar = 'text // string \' no // more //\\' // comments
bar = 'http://no.comments.com/'
bar = /var/ // comment

/* comment 1 */
bar = open() /* comment 2 */
bar = open() /* comment 2b */// another comment
bar = open( /* comment 3 */ file) // another comment 
"""

parts = rx.findall(code)
print('*' * 80, '\nCode:\n\n', '\n'.join([x[0] for x in parts if x[0].strip()]))
print('*' * 80, '\nMulti line comments:\n\n', '\n'.join([x[1] for x in parts if x[1].strip()]))
print('*' * 80, '\nOne line comments:\n\n', '\n'.join([x[2] for x in parts if x[2].strip()]))
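
Since the question asks to filter the comments out rather than list them, the same idea (match string literals first, so nothing inside them can start a comment) can be turned into a substitution. The following is only a minimal sketch and not the regex above: TOKEN_RX and strip_comments are hypothetical names, and it assumes the source has no regex literals that contain quotes, // or /*.

```python
import re

# Hypothetical helper, not part of the answer above. String literals are
# matched first and captured, so a // or /* inside them is never seen as
# the start of a comment.
TOKEN_RX = re.compile(
    r"""
    (                        # group 1: string literal, kept as-is
        "(?:\\.|[^"\\\n])*"
      | '(?:\\.|[^'\\\n])*'
    )
    | /\*.*?\*/              # multi-line comment, dropped
    | //[^\n]*               # one-line comment, dropped
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_comments(source):
    """Return source with // and /* */ comments removed."""
    # Group 1 is None when a comment matched, so the comment is replaced
    # by an empty string; a string literal is put back unchanged.
    return TOKEN_RX.sub(lambda m: m.group(1) or "", source)


if __name__ == "__main__":
    sample = (
        'var url = "http://www.google.com/" // and "this" too\n'
        "url += 'but // this is not a comment' // however this one is\n"
    )
    print(strip_comments(sample))
```

Whitespace that preceded a removed comment is left in place; stripping trailing spaces per line afterwards is an easy follow-up if that matters.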
driax Avatar answered Oct 08 '22 10:10



It might be easier to parse if you had explicit semi-colons.

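In any case, this works for the five lines in your original example. It relies on the greedy .* to skip to the last // on each line, so it does not handle the lines added in your edits: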

import re

rx = re.compile(r'.*(//(.*))$')

lines = ["// this is a comment", 
    "var x = 2 // and this is a comment too",
    """var url = "http://www.google.com/" // and "this" too""",
    """url += 'but // this is not a comment' // however this one is""",
    """url += 'this "is not a comment' + " and ' neither is this " // only this""",]

for line in lines:
    print(rx.match(line).groups())

Output of the above:

('// this is a comment', ' this is a comment')
('// and this is a comment too', ' and this is a comment too')
('// and "this" too', ' and "this" too')
('// however this one is', ' however this one is')
('// only this', ' only this')
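
To see where this greedy pattern stops being reliable, here is a quick check against the two lines from the question's edits. It reuses the same rx; the names edit_lines and results are just for illustration.

```python
import re

rx = re.compile(r'.*(//(.*))$')

# The two lines added in the question's EDIT 1 and EDIT 2.
edit_lines = [
    "foo = 'http://stackoverflow.com/' // these // are // comments // too //",
    "bar = 'http://no.comments.com/'",
]

# The greedy .* always backs off to the LAST // on the line.
results = [rx.match(line).group(1) for line in edit_lines]

for line, found in zip(edit_lines, results):
    print(repr(found), "<-", line)
```

The first line yields only the trailing // instead of the whole comment, and the second reports part of the URL as a comment even though the line has none, so the pattern needs to know about string literals to cover those cases.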

I'm not sure what you're doing with the JavaScript after removing the comments, but JSMin might help. It removes comments well enough anyway, and there is an implementation in Python.

Seth Avatar answered Oct 08 '22 08:10
