
Escaping invalid markdown using python regex

I've been trying to write some python to escape 'invalid' markdown strings.

This is for use with a python library (python-telegram-bot) which requires unused markdown characters to be escaped with a \.

My aim is to match lone *, _ and ` characters, as well as invalid hyperlinks (e.g. when no link target is provided), and escape them.

An example of what I'm looking for is:

*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, if values are nested, they should not be escaped - eg _hello*_ should remain unchanged.

My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():

import re

def parser(txt):
  match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
  for e in re.finditer(match_md, txt):
    # Only the named groups (lone characters) should be escaped.
    if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
      txt = txt[:e.start()] + '\\' + txt[e.start():]
  return txt

Note: the regex was written to match *text*, _text_, `text` and [text](url) first, and then a single *, _, ` or [, relying on the last (named) groups to catch only the leftover lone characters.

The issue here is, of course, that the offsets change as you insert more characters, so every later insertion lands in the wrong place. Surely there's a better way to do this than adding an offset counter?

I tried to use re.sub(), but I haven't been able to find out how to replace a specific group, nor have I had any luck using (?:) to 'not match' the valid markdown.

This was my re.sub attempt:

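import re

def test(txt):
  # Raw strings throughout, so the regex escapes reach re unmodified.
  match_md = r'(?:(\*)(.+?)(\*))|' \
             r'(?:(_)(.+?)(_))|' \
             r'(?:(`)(.+?)(`))|' \
             r'(?:(\[.+?\])(\(.+?\)))|' \
             r'(\*)|' \
             r'(`)|' \
             r'(_)|' \
             r'(\[)'
  # \g<0> is the whole match, so every match gets a backslash prefix.
  return re.sub(match_md, r'\\\g<0>', txt)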

This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
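
For reference, re.sub() also accepts a function as the replacement, which sidesteps both the offset bookkeeping and the need to 'not match' the valid markdown. The following is only a minimal sketch under the same patterns as above; the names MATCH_MD, escape_lone and the group name lone are made up for illustration, and like the original regex it can mis-pair delimiters in text such as a* b *c*.

```python
import re

# Valid spans come first; whatever is left over lands in the named group "lone".
MATCH_MD = re.compile(
    r'\*.+?\*'             # *text*
    r'|_.+?_'              # _text_
    r'|`.+?`'              # `text`
    r'|\[.+?\]\(.+?\)'     # [text](url)
    r'|(?P<lone>[*_`\[])'  # a lone *, _, ` or [
)

def escape_lone(match):
    # Only matches that hit the "lone" group get a backslash prefix;
    # valid spans are returned unchanged.
    if match.group('lone'):
        return '\\' + match.group(0)
    return match.group(0)

def parser(txt):
    return MATCH_MD.sub(escape_lone, txt)

print(parser('*hello*'))   # *hello*
print(parser('hello*'))    # hello\*
print(parser('_hello*_'))  # _hello*_
```

Because the replacement is computed per match, no positions in the original string ever need adjusting.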

A bonus would be if any \'s already in the string were escaped too, so that they don't interfere with the markdown present. This could be a source of errors, as the library would treat the following character as escaped, causing it to see the rest as invalid.

Thanks in advance!

SonOfLars asked Nov 07 '22 17:11


1 Answer

You are probably looking for a regular expression like this:

import re

def test(txt):
  match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
  # Raw replacement: keep group 1, add a literal backslash, then group 4.
  return re.sub(match_md, r'\g<1>\\\g<4>', txt)

Note that for clarity I only handle * and _ in this sample. You can easily extend the character lists in the [] brackets. Now let's take a look at how it works.

The idea is to crunch through strings that look like *foo_* or _bar*_ followed by text that doesn't contain any specials. The regex that matches such a string is ([_*]).+?\1[^_*]*: we match an opening delimiter, save it in \1, and move along the line until we see the same delimiter again (now closing). Then we consume any following text that doesn't contain a delimiter.
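
As a quick illustration of that inner pattern on its own, here is a small sketch; the sample string and the variable names inner and m are made up for the demonstration.

```python
import re

inner = r'([_*]).+?\1[^_*]*'
m = re.match(inner, '*foo_* and then _bar')

# The match covers the delimited part plus the plain text after it,
# and stops right before the next special character.
print(repr(m.group(0)))  # '*foo_* and then '
print(repr(m.group(1)))  # '*'
```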

Now we want to repeat that until no more delimited strings remain, which is done with (([_*]).+?\2[^_*]*)*. What's left on the right side now, if anything, is an isolated special, and that's what we need to mask. After the match we have the following submatches:

  • g<0> : the whole match
  • g<1> : submatch of ((([_*]).+?\3[^_*]*)*)
  • g<2> : submatch of (([_*]).+?\3[^_*]*)
  • g<3> : submatch of ([_*]) (hence the \3 above)
  • g<4> : submatch of ([_*]) (the one to mask)
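
To see the submatches on a concrete input, the groups can be printed for a sample string. This is just a sketch; the sample text is made up, and the expected values are given only for the two groups the substitution actually uses.

```python
import re

match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
sample = '*foo_* bar_'
m = re.search(match_md, sample)

# Print every group of the first match.
for i in range(5):
    print(i, repr(m.group(i)))

# Group 1 is the properly delimited prefix: '*foo_* bar'
# Group 4 is the isolated special to mask:  '_'
masked = re.sub(match_md, r'\g<1>\\\g<4>', sample)
print(masked)  # *foo_* bar\_
```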

What's left for you to do is find a way to handle the invalid hyperlinks, but that's another topic.

Update:
Unfortunately, this solution also masks valid markdown such as *hello* (=> \*hello\*), because the regex always needs one trailing special character to match. A workaround would be to append a special char to the end of the line and remove that masked char again once the substitution is done. OP might be looking for a better solution.
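
The workaround could look roughly like the sketch below. It assumes single-line input (the . in the regex does not cross newlines); the function name mask_unpaired and the choice of * as the appended character are only for illustration.

```python
import re

MATCH_MD = r'((([_*]).+?\3[^_*]*)*)([_*])'

def mask_unpaired(txt, sentinel='*'):
    # Append a special char so that properly paired text is consumed by
    # group 1 and the sentinel itself ends up as the masked group 4.
    masked = re.sub(MATCH_MD, r'\g<1>\\\g<4>', txt + sentinel)
    # The sentinel is the last character and is always masked,
    # so drop the trailing backslash plus sentinel.
    return masked[:-(len(sentinel) + 1)]

print(mask_unpaired('*hello*'))   # *hello*
print(mask_unpaired('hello*'))    # hello\*
print(mask_unpaired('_hello*_'))  # _hello*_
```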

yacc answered Nov 14 '22 21:11