
Negative regular expression before specific term

I want to parse a LaTeX document and mark some of its terms with a special command. Specifically, I have a list of terms, say:

Astah
UML
use case
...

and I want to mark the first occurrence of each term in the text with a custom command, for example \gloss{Astah}. So far, I have this (using Python):

for g in glossary:
    pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(start + r'\1' + end, text, 1)

and it works fine.

But then I found out that:

  • I don't want to match terms following a LaTeX inline comment (so terms preceded by one or more %)
  • and I don't want to match terms inside a section title (that is, \section{term} or \paragraph{term})

So I tried this:

for g in glossary:
    pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)

but it matches terms inside comments which are preceded by other characters and it also matches terms inside titles.

Is it something about the "greediness" of regexes that I don't understand, or is the problem somewhere else?

As an example, if I have this text:

\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...

I would like to transform it into:

\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...

Giorgio asked Mar 04 '17 14:03



1 Answer

The trick here is to use a regex that starts matching at the start of the line, because that allows us to check if the word we're trying to match is preceded by a comment:

^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b

This requires the multi-line flag m (re.M in Python), so that ^ matches at the start of every line. Occurrences of this regex are to be replaced with \1\\gloss{\2}.
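
To show how this could slot into the loop from the question, here is a minimal sketch. The function name mark_first_occurrences and the sample text are made up for illustration, and re.escape is an addition of mine (the question's code does not use it) to guard against terms that contain regex metacharacters.

```python
import re

def mark_first_occurrences(text, glossary):
    """Wrap the first eligible occurrence of each term in \\gloss{...}."""
    for term in glossary:
        pattern = re.compile(
            r'^([^%\n]*?)'          # line prefix with no %, so we are not in a comment
            r'(?<!\\section\{)'     # term must not come right after \section{
            r'(?<!\\paragraph\{)'   # term must not come right after \paragraph{
            r'\b(' + re.escape(term) + r')\b',
            re.I | re.M,
        )
        # \\gloss emits a literal backslash; a bare \g would be parsed
        # as the start of a group reference in the replacement template.
        text = pattern.sub(r'\1\\gloss{\2}', text, count=1)
    return text

# Sample input taken from the question
sample = (
    "\\section{Astah}\n"
    "Astah is a UML diagramming tool... bla bla...\n"
    "% use case:\n"
    "A use case is a...\n"
)
result = mark_first_occurrences(sample, ["Astah", "UML", "use case"])
print(result)
```

On the sample text this should give the output requested in the question. Two caveats: the lookbehinds only reject a term that sits immediately after \section{ or \paragraph{, so a term later in a longer title would still be matched, and an escaped \% is treated like a comment start. As for the attempt in the question, (?!section{) is a lookahead placed right before the term, so it tests the term itself and never rejects anything; a lookbehind is what checks the text before the term.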

Aran-Fey answered Oct 05 '22 13:10