I want to parse a LaTeX document and mark some of its terms with a special command. Specifically, I have a list of terms, say:
Astah
UML
use case
...
and I want to mark the first occurrence of Astah in the text with this custom command: \gloss{Astah}. So far, this works (using Python):
import re

for g in glossary:
    pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
    # count=1: wrap only the first occurrence of each term
    text = pattern.sub(start + r'\1' + end, text, 1)
and it works fine.
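But then I found out that it also marks terms in places where it should not: inside comments (anything after a %) and inside titles (\section{term} or \paragraph{term}). So I tried this: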
for g in glossary:
    pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)
but it matches terms inside comments which are preceded by other characters and it also matches terms inside titles.
Is it something about the "greediness" of regexes that I don't understand? Or maybe the problem is somewhere else?
As an example, if I have this text:
\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...
I would like to transform it into:
\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...
Similarly, the negated variant of the character class is written "[^ ]" (with ^ directly after the opening square bracket). It matches a single character that is not in the specified set of characters. For example, the regular expression [^abc] matches any single character except a, b, or c.
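A small sketch of the negated character class in Python's re module may help here. The sample strings are made up for illustration. It also shows one detail that matters for the LaTeX problem above: a negated class such as [^%] matches newlines too, unless the newline is excluded explicitly.

```python
import re

# [^abc] matches any single character other than a, b, or c
others = re.findall(r'[^abc]', 'abcd')

# A negated class also matches newlines: [^%]* runs across the line break
# and only stops at the first '%'
across_lines = re.match(r'[^%]*', 'a\nb%c').group()

# Adding \n to the class keeps the match on a single line
one_line = re.match(r'[^%\n]*', 'a\nb%c').group()

print(others, repr(across_lines), repr(one_line))
```

This is why a pattern meant to stay within one line usually excludes \n as well, as in [^%\n].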
A negative lookahead is written as a pair of parentheses whose opening parenthesis is followed by a question mark and an exclamation point: (?!...). For example, q(?!u) matches a q that is not followed by a u; the lookahead here contains the trivial regex u. Positive lookahead works the same way: q(?=u) matches a q that is followed by a u, without making the u part of the match.
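As a quick illustration, here is a minimal sketch of both lookahead forms in Python. The sample string 'quit Iraq' is an arbitrary choice; the point is only that the lookahead checks what follows without consuming it.

```python
import re

sample = 'quit Iraq'

# Positive lookahead: a q that is followed by u; the u is not part of the match
positive = re.search(r'q(?=u)', sample)

# Negative lookahead: a q that is NOT followed by u (here, the final q)
negative = re.search(r'q(?!u)', sample)

print(positive.group(), positive.start())
print(negative.group(), negative.start())
```

In both cases the matched text is just the single character q; only the position differs.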
"-" (the minus sign) indicates a range in a character class when it is not at the first position after the "[" opening bracket or the last position before the "]" closing bracket. Example: "[A-Z]" matches any uppercase ASCII letter. Example: "[A-Z-]" or "[-A-Z]" matches any uppercase ASCII letter or "-".
In formal (textbook) regular-expression notation, the operator + denotes union (alternation, written | in most regex engines); it has the lowest precedence and is left associative.
The trick here is to use a regex that starts matching at the start of the line, because that allows us to check if the word we're trying to match is preceded by a comment:
^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b
This requires the multi-line flag m. Occurrences of this regex are to be replaced with \1\\gloss{\2}.
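Put together in Python, the approach could look like the sketch below. The function name mark_first_occurrences is hypothetical, and the wrapper \gloss{...} is taken from the example in the question. The sketch assumes each glossary term should be matched literally (hence re.escape) and uses a replacement function instead of a template string, which avoids having to escape the backslash in \gloss.

```python
import re

def mark_first_occurrences(text, glossary):
    """Wrap the first non-comment, non-title occurrence of each term in \\gloss{}."""
    for term in glossary:
        pattern = re.compile(
            r'^([^%\n]*?)'            # text from the line start, with no '%' before the term
            r'(?<!\\section\{)'       # term must not directly follow \section{
            r'(?<!\\paragraph\{)'     # term must not directly follow \paragraph{
            r'\b(' + re.escape(term) + r')\b',
            re.I | re.M,
        )
        # count=1: only the first match of each term is replaced
        text = pattern.sub(
            lambda m: m.group(1) + r'\gloss{' + m.group(2) + '}',
            text,
            count=1,
        )
    return text

sample = (
    "\\section{Astah}\n"
    "Astah is a UML diagramming tool... bla bla...\n"
    "% use case:\n"
    "A use case is a...\n"
)
marked = mark_first_occurrences(sample, ["Astah", "UML", "use case"])
print(marked)
```

Note that the lookbehinds only protect a term that comes directly after \section{ or \paragraph{. A title such as \section{The Astah tool} would still be matched, so longer titles would need a different check.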