
Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.

The program that handles the conversion is written in Python, so I tried to find a nice solution there. My first idea was to simply use .replace(), but replace can't match a character only when it is not preceded by a \. My second attempt was a regex, but I failed miserably at that.

The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches every single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing or adding backslashes in the second part of the regex.

Can anyone help with this regex?

EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!

Xudonax asked Oct 22 '25 04:10


2 Answers

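In [^\] the backslash escapes the ], so the character class does not end there. The whole pattern becomes a single negated character class that matches any character not listed in it, which is why it matches nearly everything. You want a negative lookbehind assertion: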

((?<!\\)[#\$%\^&_\{\}~\\])

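(?<!...) succeeds at a position only if the text immediately before that position does not match ..., and it consumes no characters itself. The backslash inside the lookbehind has to be doubled, otherwise it escapes the closing parenthesis. You can check this out in the Python docs for the re module.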
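As a minimal sketch of how the lookbehind could be used for the actual escaping, assuming the backslash itself is left out of the reserved set (as the asker's edit suggests) and using hypothetical names RESERVED and escape_latex:

```python
import re

# Reserved characters from the question, minus the backslash itself.
# The lookbehind (?<!\\) skips characters that already have a \ in front.
RESERVED = re.compile(r'(?<!\\)([#$%^&_{}~])')

def escape_latex(text):
    """Prefix each unescaped reserved character with a backslash."""
    # In the replacement, \\ is a literal backslash and \1 is the matched character.
    return RESERVED.sub(r'\\\1', text)

print(escape_latex('50% of $5 & a_b'))   # 50\% of \$5 \& a\_b
print(escape_latex(r'already \% done'))  # already \% done
```

Inside a character class most of these characters need no escaping, so the extra backslashes from the answer's pattern are dropped here; both spellings should behave the same.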

SethMMorton answered Oct 23 '25 19:10


Because \] is an escaped bracket, the regex ([^\][#\$%\^&_\{\}~\\]) is one negated character class: it matches anything that isn't found between the first [ and the last ], so it matches everything except what you want it to.
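A quick way to see this behaviour is to run the original pattern over a short sample. This is only an illustrative sketch; the sample string and variable names are made up:

```python
import re

# The pattern from the question, unchanged. Because \] is an escaped bracket,
# everything up to the final ] forms a single negated character class.
original = re.compile(r'([^\][#\$%\^&_\{\}~\\])')

sample = 'a_b & 50%'
# findall returns every character that is NOT one of the reserved characters.
print(original.findall(sample))  # ['a', 'b', ' ', ' ', '5', '0']
```

The reserved characters _, & and % are exactly the ones that get skipped.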

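Moving the parentheses around should fix your original regex: ([^\\])[#\$%\^&_\{\}~\\]. Note that this version also consumes the character in front of the reserved one, so it cannot match a reserved character at the very start of the string or two reserved characters in a row.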

I would try using a regex lookbehind instead, which checks the character preceding what you want to escape without including it in the match. I'm not a regex expert so perhaps there is a better pattern, but this should work: (?<!\\)[#\$%\^&_\{\}~\\].
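To compare the two patterns from this answer, here is a small sketch that applies both to the same input. The backslash is left out of the reserved set (per the asker's edit), and the variable names and sample text are assumptions for illustration:

```python
import re

# Variant 1: a consuming group for "any character except a backslash".
consuming = re.compile(r'([^\\])([#$%^&_{}~])')
# Variant 2: a zero-width negative lookbehind.
lookbehind = re.compile(r'(?<!\\)([#$%^&_{}~])')

text = '#1 costs $$5'

# The consuming variant needs a character in front, so it misses the leading #
# and the second $ (the first $ was already used up by the previous match).
print(consuming.sub(r'\1\\\2', text))  # #1 costs \$$5

# The lookbehind variant consumes nothing extra and catches all three.
print(lookbehind.sub(r'\\\1', text))   # \#1 costs \$\$5
```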

FastTurtle answered Oct 23 '25 18:10


