Negative regular expression before specific term

Tags: python, regex, latex

I want to parse a LaTeX document and mark some of its terms with a special command. Specifically, I have a list of terms, say:

Astah
UML
use case
...

and I want to mark the first occurrence of Astah in the text with this custom command: \gloss{Astah}. So far, this works (using Python):

for g in glossary:
    pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(start + r'\1' + end, text, 1)

But then I found out that:

- I don't want to match terms inside a LaTeX inline comment (that is, terms preceded by one or more % on the same line),
- and I don't want to match terms inside a section title (that is, \section{term} or \paragraph{term}).

So I tried this:

for g in glossary:
    pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)

but it still matches terms inside comments when the comment is preceded by other characters, and it also matches terms inside titles.

Is it something about the "greediness" of regexes that I don't understand? Or maybe the problem is somewhere else?

As an example, if I have this text:

\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...

I would like to transform it into:

\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...

asked Mar 04 '17 14:03

Giorgio

1 Answer

The trick here is to use a regex that starts matching at the start of the line, because that allows us to check if the word we're trying to match is preceded by a comment:

^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b

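This requires the multi-line flag (re.M in Python) so that ^ matches at the start of every line, and [^%\n] keeps the prefix on a single line and stops it at the first %. The two negative lookbehinds reject a term that comes directly after \section{ or \paragraph{. The (?!section{) lookahead in your attempt cannot do that, because a lookahead inspects the text after the current position, which is the term itself. Occurrences of this regex are to be replaced with \1\\gloss{\2}.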
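Plugged into the loop from the question, the pattern could look roughly like the sketch below. The function name mark_first_occurrences is made up for illustration, and two details are my own additions rather than part of the pattern above: re.escape (so a term containing regex metacharacters is matched literally) and a function as the replacement (so the backslash in \gloss needs no escaping in a replacement template).

```python
import re

def mark_first_occurrences(text, glossary):
    # Hypothetical helper: wraps the first eligible occurrence of each
    # glossary term in \gloss{...}.
    for term in glossary:
        pattern = re.compile(
            r'^([^%\n]*?)'            # lazy prefix: same line, no % before the term
            r'(?<!\\section\{)'       # term must not directly follow \section{
            r'(?<!\\paragraph\{)'     # term must not directly follow \paragraph{
            r'\b(' + re.escape(term) + r')\b',
            re.I | re.M,
        )
        # A function replacement sidesteps backslash handling in templates.
        text = pattern.sub(
            lambda m: m.group(1) + r'\gloss{' + m.group(2) + '}',
            text,
            count=1,
        )
    return text

sample = (
    "\\section{Astah}\n"
    "Astah is a UML diagramming tool... bla bla...\n"
    "% use case:\n"
    "A use case is a...\n"
)
print(mark_first_occurrences(sample, ["Astah", "UML", "use case"]))
```

One limitation to be aware of: the lookbehinds only look at the characters immediately before the term, so a title such as \section{The Astah tool} would still get marked. Likewise, an escaped \% in running text is treated as the start of a comment.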

answered Oct 05 '22 13:10

Aran-Fey
