Before I start: I know there are better ways to do this than regex (like tokenizers), but that's not what this question is about. I'm already stuck using regex, and it already works the way I need it to, except for one special case, which is what I need advice on.
I need to scan through some JavaScript-like code and insert the new keyword in front of every object declaration. I already know the names of all objects that will need this keyword, and I know that none of them will have that keyword in the code before I start (so I don't need to deal with repeated new words or guess whether something is an object or not). For example, a typical line could look like this:
foo = Bar()
where I would already know that Bar is a 'class' and would need 'new' for object declaration. The following regex does the trick:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % classname, r'\1new \3', line)
It works like a charm, even making sure not to touch classname when it's inside a string. (The first portion of the regex makes sure there is an even number of quotes beforehand. It's a bit naive in that it will break with nested quotes, but I don't need to handle that case.) The problem is that class names can also have $ in them, so the following line is allowed as well if $Bar exists in allowed_classes:
foo = $Bar()
The above regex will ignore it because of the dollar sign. I figured escaping it would do the trick, but this logic seems to have no effect on the above line even if $Bar is one of the classes:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % re.escape(classname), r'\1new \3', line)
I also tried escaping it by hand using \ but that has no effect either. Can someone explain why converting $ to \$ isn't working, and what could fix it?
Thanks
The reason your current regex isn't working is that you have a \b just before your class name. \b matches word boundaries, that is, positions with a word character on one side and a non-word character (or the start or end of the string) on the other. For the string foo = Bar(), the \b will match between the space and the B, but for foo = $Bar(), the \b cannot match between the space and the $ because they are both non-word characters.
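The boundary behaviour described above can be checked directly. This is a small sketch, not part of the original code: the variable names and the third sample string (foo$Bar()) are made up for illustration.

```python
import re

# \b needs a word character on exactly one side of the position.
plain = re.search(r'\bBar\(', 'foo = Bar()')      # space | B : boundary
dollar = re.search(r'\b\$Bar\(', 'foo = $Bar()')  # space | $ : no boundary
glued = re.search(r'\b\$Bar\(', 'foo$Bar()')      # o | $     : boundary

print(plain is not None)   # True
print(dollar is not None)  # False
print(glued is not None)   # True
```

So escaping the $ is necessary but not sufficient: the escaped pattern is fine, it is the \b in front of it that can never succeed after a space.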
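To fix this, change \b to (?=\b|\B\$), and keep the re.escape(classname) call from your second attempt so that the $ in the class name is matched literally rather than as an end-of-line anchor. Here is the resulting regex:
for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)(?=\b|\B\$)(%s\s*\()' % re.escape(classname), r'\1new \3', line)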
By using a lookahead, you can handle both of the following cases:
- classname does not start with $, so we want a word boundary before trying to match classname; the \b inside of the lookahead handles this.
- classname does start with $, so if the next character is a $ we want to match. I used \B\$ so it will only match if the character before the $ is not a word character. This is probably unnecessary in most code, although $ is legal inside JS identifiers, so it does stop $Bar from matching inside a longer name such as my$Bar.
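Putting both cases together, a minimal runnable sketch might look like the following. The function name add_new, the PREFIX constant, and the sample class $Baz are assumptions made for this example; the pattern itself is the one from the question with the lookahead swapped in and the class name passed through re.escape.

```python
import re

# Prefix from the question: only accepts an even number of quotes
# before the match, so class names inside strings are left alone.
PREFIX = r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)'


def add_new(line, allowed_classes):
    """Insert 'new ' before the first call of each allowed class on the line."""
    for classname in allowed_classes:
        # re.escape keeps a leading $ literal; the lookahead replaces \b.
        pattern = PREFIX + r'(?=\b|\B\$)(%s\s*\()' % re.escape(classname)
        line = re.sub(pattern, r'\1new \3', line)
    return line


classes = ['Bar', '$Baz']  # hypothetical class list
print(add_new('foo = Bar()', classes))   # foo = new Bar()
print(add_new('foo = $Baz()', classes))  # foo = new $Baz()
print(add_new('x = "Bar()"', classes))   # x = "Bar()"
```

One caveat with this sketch: \b also matches between $ and B, so if both Bar and $Bar were in the list, the Bar pass would match inside $Bar() first. Processing the $-prefixed names first, or tightening the lookahead, would be needed for that case.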