
How can I replace LaTeX $...$ and $$...$$ notations by something like <div>$...$</div>?

Tags: python, regex

I currently have the problem that Jekyll does not work well with Markdown and LaTeX. So I have a lot of articles with $\frac{some}{latex}$ or $$\int^e_v {en} more$$.

How can I replace $...$ by <span>$...$</span> and $$...$$ by <div>$$...$$</div>?

Things that make this task difficult are:

  • The ... might include newlines. In fact, ... might contain anything, except $
  • The first $ gets replaced by <span>$, but the second one by $</span>
  • $...$ and $$...$$ can be used in the same document (but always separated by at least one whitespace character)

Edit: I've just realized that I also need to handle escaping, so the task has one more difficulty:

  • \$ should not be matched in either of the two cases above.

Martin Thoma asked Nov 01 '25 05:11


2 Answers

I know you asked for regex, but you'll run into headaches if you handle the edge cases you mentioned by hand. (If other regex solutions are posted, compare this answer with theirs.) With a parser, it's simple to change the behavior of the double and single TeX markers and to handle escapes in the TeX code. Here is a very simple pyparsing example that does what you are looking for:

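from pyparsing import OneOrMore, QuotedString, Word, printables

# Inline ($...$) and display ($$...$$) math; a backslash escapes a literal $.
D1 = QuotedString("$", escChar='\\')
D2 = QuotedString("$$", escChar='\\')

# Wrap display math in <div> and inline math in <span>.
div_action = lambda x: "<div>$$%s$$</div>" % x[0]
span_action = lambda x: "<span>$%s$</span>" % x[0]
D1.setParseAction(span_action)
D2.setParseAction(div_action)
others = Word(printables)
# Try D2 before D1 so that "$$" is matched as a display-math delimiter.
grammar = OneOrMore(D2 | D1 | others).leaveWhitespace()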

And a use case:

S = r"$\LaTeX$ is worth $$x=\$3.40$$"
print(grammar.transformString(S))

Giving:

<span>$\LaTeX$</span> is worth <div>$$x=$3.40$$</div>

Hooked answered Nov 03 '25 21:11


We can accomplish this with a two-step replacement:

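import re
# Raw strings keep \frac, \int and \$ intact; the lines are joined with newlines.
str = "\n".join([
    r"$\frac{some}{latex}$$$\int^e_v {en} more$$\$\frac{some}{latex}$$$\int^e_v {en} more$$",
    r"$\frac{some}{latex}$",
    r"$$\int^e_v {en} more$$",
    r"\$\frac{some}{latex}$",
    r"$$\int^e_v {en} more$$",
])

# First step: wrap every unescaped $$...$$ in <div>...</div>.
str = re.sub(r'(?<![\\])\$\$([^\$]+)\$\$', r"<div>$$\g<1>$$</div>", str)
# Second step: wrap every remaining unescaped $...$ in <span>...</span>.
str = re.sub(r'(?<![\$\\])\$([^\$]+)(?:(?<!\<div\>)(?<!\\)\$)', r"<span>$\g<1>$</span>", str)
print(str)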

Explanation:

First step:

We replace only the $$ occurrences, wrapping each one as <div>$$\g<1>$$</div> (\g<1> is replaced by the first group captured by the regex).

str = re.sub(r'(?<![\\])\$\$([^\$]+)\$\$', r"<div>$$\g<1>$$</div>", str)

Note that we are using the regex (?<![\\])\$\$([^\$]+)\$\$, which works as follows:

  1. (?<![\\]) ... A negative lookbehind: the match must not be preceded by a \, so an escaped \$ is never treated as the start of a $$ block.
  2. ... \$\$ ... The match must start with $$.
  3. ... ([^\$]+) One or more characters other than $ (newlines included), placed in a capture group (...) so the replacement can refer to them later.
  4. ... \$\$ Finally, the match must end with $$.
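
To see the first step in isolation, here is a minimal sketch that applies the same pattern to a small sample. The names DISPLAY_MATH, wrap_display_math and sample are made up for this illustration and are not part of the code above.

```python
import re

# Step-one pattern from the answer: an unescaped $$...$$ pair with no $ inside.
DISPLAY_MATH = re.compile(r'(?<![\\])\$\$([^\$]+)\$\$')


def wrap_display_math(text):
    """Wrap every unescaped $$...$$ block in <div>...</div>."""
    return DISPLAY_MATH.sub(r'<div>$$\g<1>$$</div>', text)


# The body spans two lines; [^\$]+ matches newlines as well.
sample = "Area:\n$$\\int_0^1\nx\\,dx$$ done"
print(wrap_display_math(sample))
```

Inline $...$ pairs and a $$ that directly follows a backslash are left untouched by this step.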

Second step:

We replace only the single $ occurrences, wrapping each one as <span>$\g<1>$</span> (again, \g<1> is replaced by the first group captured by the regex).

str = re.sub(r'(?<![\$\\])\$([^\$]+)(?:(?<!\<div\>)(?<!\\)\$)', r"<span>$\g<1>$</span>", str)

Note that this time we are using the regex (?<![\$\\])\$([^\$]+)(?:(?<!\<div\>)(?<!\\)\$) (yes, a little bit harder), which works as follows:

  1. (?<![\$\\]) ... A negative lookbehind: the match must not be preceded by a \ or a $, so an escaped \$ and the second $ of a $$ pair are skipped.
  2. ... \$ ... The match must start with a single $.
  3. ... ([^\$]+) ... A capture group holding one or more characters other than $, for later reference.
  4. ... (?:(?<!\<div\>)(?<!\\)\$) The match must end with a $ that is not preceded by <div> [in the regex: (?<!\<div\>)] or by a \ [in the regex: (?<!\\)]. The non-capturing group (?:...) bundles these conditions into a single unit.
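
As a quick check of the second step, the sketch below runs the same pattern on text that has already been through the first step. INLINE_MATH, wrap_inline_math and after_step_one are names invented for this example.

```python
import re

# Step-two pattern from the answer: a single unescaped $...$ pair.
INLINE_MATH = re.compile(r'(?<![\$\\])\$([^\$]+)(?:(?<!\<div\>)(?<!\\)\$)')


def wrap_inline_math(text):
    """Wrap every unescaped $...$ pair in <span>...</span>."""
    return INLINE_MATH.sub(r'<span>$\g<1>$</span>', text)


# Text as it might look after step one: the $$ block is already inside a <div>.
after_step_one = r"<div>$$a+b$$</div> and $c$ but not \$5"
print(wrap_inline_math(after_step_one))
```

The $$ delimiters inside the <div> are not matched because [^\$]+ needs at least one non-$ character between the two dollar signs. Because the body excludes every $, an escaped \$ inside inline math is not supported by this pattern.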

Note: there may be more efficient ways to get this result.
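
One possible alternative, offered only as a sketch and not as part of the two-step approach above, is a single pass with one alternation and a replacement function. It assumes that any backslash followed by a character is an escape, both inside and outside math, which also allows \$ to appear inside a formula. The names MATH, wrap_math and _replace are hypothetical.

```python
import re

# Alternatives are tried in order at each position:
#   1. a backslash plus any character (for example \$) is consumed unchanged,
#   2. $$...$$ display math,
#   3. $...$ inline math.
# Inside math, the body is a run of escaped characters or characters
# other than $ and backslash, so \$ may appear in a formula.
MATH = re.compile(
    r"""
    (?P<escaped>\\.)
  | \$\$(?P<display>(?:\\.|[^$\\])+)\$\$
  | \$(?P<inline>(?:\\.|[^$\\])+)\$
    """,
    re.VERBOSE | re.DOTALL,
)


def _replace(match):
    """Build the replacement for one match of MATH."""
    if match.group("display") is not None:
        return "<div>$$" + match.group("display") + "$$</div>"
    if match.group("inline") is not None:
        return "<span>$" + match.group("inline") + "$</span>"
    return match.group(0)  # escaped character outside math: keep as is


def wrap_math(text):
    """Wrap $$...$$ in <div> and $...$ in <span> in a single pass."""
    return MATH.sub(_replace, text)


print(wrap_math(r"$\frac{a}{b}$ costs \$5, and $$x=\$3.40$$"))
```

Using a function as the replacement avoids any backslash processing in a replacement template, so the LaTeX source is copied through unchanged.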

Caio Oliveira answered Nov 03 '25 21:11


