I am using Python to go through a file and remove any comments. A comment is defined as a hash and anything to the right of it, as long as the hash isn't inside double quotes. I currently have a solution, but it seems sub-optimal:
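import re
filelines = []
r = re.compile('(".*?")')  # splits out double-quoted substrings
for line in f:  # f is an already-open file object
    m = r.split(line)
    nline = ''
    for token in m:
        # A hash in an unquoted token starts the comment.
        if token.find('#') != -1 and token[0] != '"':
            nline += token[:token.find('#')]
            break
        else:
            nline += token
    filelines.append(nline)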
Is there a way to find the first hash not within quotes without for loops (i.e., through regular expressions)?
Examples:
' "Phone #":"555-1234" ' -> ' "Phone #":"555-1234" '
' "Phone "#:"555-1234" ' -> ' "Phone "'
'#"Phone #":"555-1234" ' -> ''
' "Phone #":"555-1234" #Comment' -> ' "Phone #":"555-1234" '
Edit: Here is a pure regex solution created by user2357112. I tested it, and it works great:
filelines = []
r = re.compile('(?:"[^"]*"|[^"#])*(#)')
for line in f:
    m = r.match(line)
    if m is not None:
        filelines.append(line[:m.start(1)])
    else:
        filelines.append(line)
See his reply for more details on how this regex works.
Edit2: Here's a version of user2357112's code that I modified to account for escape characters (\"). This code also eliminates the 'if' by including a check for end of string ($):
filelines = []
r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
for line in f:
    m = r.match(line)
    filelines.append(line[:m.start(1)])
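As a quick check, the Edit2 pattern can be wrapped in a small function and run against the examples from the question. This is only a sketch: `strip_comment` and `COMMENT_RE` are names made up for illustration, and the `None` branch reflects an assumption about what to do when a line contains an unterminated quote (in that case the pattern does not match at all, so the line is returned unchanged).

```python
import re

# Same pattern as in Edit2: quoted strings (with backslash escapes) or
# non-quote, non-hash characters, followed by the first unquoted hash
# or the end of the string.
COMMENT_RE = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')

def strip_comment(line):
    """Return line without its unquoted '#' comment (hypothetical helper)."""
    m = COMMENT_RE.match(line)
    if m is None:
        # Assumption: a line with an unterminated quote is left untouched.
        return line
    return line[:m.start(1)]

examples = [
    ' "Phone #":"555-1234" ',
    ' "Phone "#:"555-1234" ',
    '#"Phone #":"555-1234" ',
    ' "Phone #":"555-1234" #Comment',
]
for example in examples:
    print(repr(example), '->', repr(strip_comment(example)))
```

Note that when a comment is removed, everything from the hash onward is dropped, including any trailing newline on that line.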
r'''(?:        # Non-capturing group
      "[^"]*"  # A quote, followed by not-quotes, followed by a quote
      |        # or
      [^"#]    # not a quote or a hash
    )          # end group
    *          # Match quoted strings and not-quote-not-hash characters until...
    (\#)       # the comment begins! (the hash is escaped because a bare # starts a comment in verbose mode)
'''
This is a verbose regex, designed to operate on a single line, so make sure to use the re.VERBOSE flag and feed it one line at a time. It captures the first unquoted hash as group 1 if there is one, so you can use match.start(1) to get its index. It doesn't handle backslash escapes, so a backslash-escaped quote inside a string will be treated as the end of that string. This is untested.
You can remove comments using this script:
import re
print(re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*', lambda m: m.group(1) or '', '"Phone #"#:"555-1234"'))
The idea is to first capture the parts enclosed in double quotes and replace them with themselves, and only then look for a hash:
(?s)            # the dot matches newlines too
(               # open the capture group 1
    "           # "
    [^"\\]*     # all characters except a quote or a backslash,
                # zero or more times
    (?:         # open a non-capturing group
        \\.     # a backslash and any character
        [^"\\]* # all characters except a quote or a backslash
    )*          # repeat zero or more times
    "           # "
)               # close the capture group 1
|               # OR
#[^\n]*         # a hash and zero or more characters that are not a newline
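Because this pattern stops each comment at a newline, it can be applied to a whole block of text in one call rather than line by line. A rough sketch, with `PATTERN` and `remove_comments` as illustrative names only:

```python
import re

# Group 1 captures a double-quoted string (with backslash escapes);
# the second alternative matches a comment up to the end of the line.
PATTERN = re.compile(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*')

def remove_comments(text):
    """Drop unquoted '#' comments from text (hypothetical helper)."""
    # Quoted strings are put back unchanged; comments become ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

source = 'a = "x # y" # note\nb = 1 # two\n'
print(remove_comments(source))
```

Unlike the line-by-line versions, this keeps the newline after each removed comment. Because of `(?s)`, a backslash inside a quoted string can also escape a newline, so quoted strings may span lines.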