
Escaping dollar signs in regex not working

Tags: python, regex

Before I start: I know there are better ways of doing this than regex (like tokenizers), but that's not what the question is about. I'm already stuck using regex, and it already works as I need it to, except for one special case, which is what I need advice on.

I need to scan through some JavaScript-like code and insert the new keyword in front of every object declaration. I already know the names of all objects that will need this keyword, and I know that none of them will have that keyword in the code before I start (so I don't need to deal with repeated new words or guess whether something is an object or not). For example, a typical line could look like this:

foo = Bar()

Here I already know that Bar is a 'class' and needs 'new' for object declaration. The following regex does the trick:

for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % classname, r'\1new \3', line)

It works like a charm, even making sure not to touch classname when it's inside a string (the first portion of the regex makes sure there is an even number of quotes beforehand; it's a bit naive in that it will break with nested quotes, but I don't need to handle that case). The problem is that class names can also have $ in them. So the following line is allowed as well if $Bar exists in allowed_classes:

foo = $Bar()

The above regex will ignore it, due to the dollar sign. I figured escaping it would do the trick, but this logic seems to have no effect on the above line even if $Bar is one of the classes:

for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)\b(%s\s*\()' % re.escape(classname), r'\1new \3', line)

I also tried escaping it by hand using \ but it has no effect either. Can someone explain why converting $ to \$ isn't working and what could fix it?

Thanks

Alexander Tsepkov asked Jan 15 '23 20:01


1 Answer

The reason your current regex isn't working is that you have a \b just before your class name. \b matches only at a word boundary, that is, at a position with a word character on one side and a non-word character (or the start or end of the string) on the other. For the string foo = Bar(), the \b will match between the space and the B, but for foo = $Bar(), the \b cannot match between the space and the $ because they are both non-word characters. Escaping the $ is still needed, but it does not help until the \b is dealt with.
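
The word-boundary behaviour described above can be checked in isolation. This is a minimal sketch that uses a simplified pattern (just the class name and the opening parenthesis, without the quote-counting prefix from the question); the variable names are only for illustration.

```python
import re

# Word characters are [A-Za-z0-9_]; '$' and ' ' are both non-word characters.
plain = re.search(r'\bBar\s*\(', 'foo = Bar()')          # boundary between ' ' and 'B'
dollar_b = re.search(r'\b\$Bar\s*\(', 'foo = $Bar()')    # no boundary between ' ' and '$'
dollar_no_b = re.search(r'\$Bar\s*\(', 'foo = $Bar()')   # matches once \b is removed

print(plain is not None)       # True
print(dollar_b is None)        # True
print(dollar_no_b.group(0))    # $Bar(
```

So the escaped \$ is fine on its own; it is the \b in front of it that prevents the match.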

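To fix this, change \b to (?=\b|\B\$) and keep the re.escape call, so that the $ in the class name is matched literally instead of being treated as an end-of-line anchor. Here is the resulting regex:

for classname in allowed_classes:
    line = re.sub(r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)(?=\b|\B\$)(%s\s*\()' % re.escape(classname), r'\1new \3', line)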

By using a lookahead, you can handle both of the following cases:

  • classname does not start with $, so we want a word boundary before trying to match classname; the \b inside the lookahead handles this
  • classname does start with $, so we want to match if the next character is a $. I used \B\$ so that this alternative only matches when the character before the $ is not a word character. Note that the \b alternative still succeeds between a word character and a $, so the lookahead as a whole does not reject something like x$Bar(); this only matters if identifiers like that appear in your code
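
Putting the two cases together, here is a runnable sketch of the lookahead fix. The function names add_new and add_new_strict, the PREFIX constant, and the class name $Baz are made up for this example. add_new_strict is a separate variant, not part of the answer above: it swaps the lookahead for a negative lookbehind, (?<![\w$]), under the assumption that $ should count as an identifier character, as it does in JavaScript.

```python
import re

# Prefix from the question: an even number of quote characters before the match.
PREFIX = r'^([^\'"]*(?:([\'"])[^\'"]*\2)*[^\'"]*)'


def add_new(line, allowed_classes):
    """Lookahead version: insert 'new ' before calls to the given classes."""
    for classname in allowed_classes:
        pattern = PREFIX + r'(?=\b|\B\$)(%s\s*\()' % re.escape(classname)
        line = re.sub(pattern, r'\1new \3', line)
    return line


def add_new_strict(line, allowed_classes):
    """Variant (an assumption, not the answer's regex): require that the
    character before the class name is neither a word character nor '$'."""
    for classname in allowed_classes:
        pattern = PREFIX + r'(?<![\w$])(%s\s*\()' % re.escape(classname)
        line = re.sub(pattern, r'\1new \3', line)
    return line


classes = ['Bar', '$Baz']
print(add_new('foo = Bar()', classes))     # foo = new Bar()
print(add_new('foo = $Baz()', classes))    # foo = new $Baz()
print(add_new('s = "Bar()"', classes))     # s = "Bar()"  (inside a string, untouched)

# Edge case where the two versions differ: '$Baz' glued to a word character.
print(add_new('x$Baz()', ['$Baz']))          # xnew $Baz()
print(add_new_strict('x$Baz()', ['$Baz']))   # x$Baz()
```

The strict variant also behaves differently when both Bar and $Bar are in the list: the lookahead version sees a word boundary between $ and B and matches Bar inside $Bar, while the lookbehind does not. Whether that matters depends on the class names actually in use.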
Andrew Clark answered Jan 21 '23 13:01