
Python Regular expression only matches once

Tags: python, regex

I'm trying to create a simple Markdown-to-LaTeX converter, just to learn Python and basic regex, but I'm stuck trying to figure out why the code below doesn't work:

re.sub(r'\[\*\](.*?)\[\*\]: ?(.*?)$', r'\\footnote{\2}\1', s, flags=re.MULTILINE|re.DOTALL)

I want to convert something like:

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

to:

This is a note\footnote{some text} and this is another\footnote{other text}

This is what I got using the regex above:

This is a note\footnote{some text} and this is another[*]

[*]: other text

Why is the pattern only being matched once?

EDIT:

I tried the following lookahead assertion:

re.sub(r'\[\*\](?!:)(?=.+?\[\*\]: ?(.+?)$)', r'\\footnote{\1}', s, flags=re.DOTALL|re.MULTILINE)
# (?!:) prevents "[*]:" from being matched as a placeholder

Now it matches all the footnotes, but they're not matched correctly.

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

gives me

This is a note\footnote{some text} and this is another\footnote{some text}
[*]: some text
[*]: other text

Any thoughts about it?

asked Sep 08 '15 16:09 by Afonso Silva


1 Answer

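The reason is that you can't match the same characters several times. Once a character is matched, it is consumed by the regex engine and can't be reused for another match. Here the first match runs from the first placeholder to the end of the first note line, so it swallows the second placeholder on the way.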
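
To see the consumption concretely, here is a small sketch that runs the question's pattern over the question's sample string with `re.finditer`, which reports every non-overlapping match. The variable names are only illustrative.

```python
import re

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

pattern = re.compile(r'\[\*\](.*?)\[\*\]: ?(.*?)$', flags=re.MULTILINE | re.DOTALL)

matches = list(pattern.finditer(s))

# Only one match is found: it starts at the first placeholder and ends at
# the end of the first note line, so the second placeholder sits inside
# group 1 and is no longer available as the start of another match.
for m in matches:
    print(repr(m.group(1)))  # text between the placeholder and the note
    print(repr(m.group(2)))  # the note text
```

After that single match, the only `[*]` left in the string is the one that opens the second note line, and no further `[*]:` follows it, so the search stops.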

A general workaround is to capture the overlapping parts inside a lookahead assertion with capture groups. But it can't be done in your case, because the pattern has no way to tell which note belongs to which placeholder.
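
As a minimal illustration of the lookahead workaround in general (on a made-up digit string, not the footnote text), a capture group inside a zero-width lookahead lets matches overlap:

```python
import re

digits = "12345"  # illustrative input, not from the question

# A plain pattern consumes what it matches, so the pairs cannot overlap.
plain = re.findall(r'\d\d', digits)

# A lookahead is zero-width: the group captures two digits, but the engine
# advances only one character between matches, so the pairs overlap.
overlapping = re.findall(r'(?=(\d\d))', digits)

print(plain)
print(overlapping)
```

In the footnote case this does not help: every placeholder's lookahead finds the same first note, which is the behaviour shown in the question's EDIT.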

A simpler way is to extract all the notes into a list first and then replace each placeholder using a callback. Example:

import re

s='''This is a note[*] and this is another[*]
[*]: note 1
[*]: note 2'''

# text and notes are separated
[text,notes] = re.split(r'((?:\r?\n\[\*\]:[^\r\n]*)+$)', s)[:-1]

# this generator gives the next replacement string 
def getnote(notes):
    for note in re.split(r'\r?\n\[\*\]: ', notes)[1:]:
        yield r'\footnote{{{}}}'.format(note)

note = getnote(notes)

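# each placeholder pulls the next note from the generator
# (Python 3 syntax: next(note) and print() instead of note.next() and print res)
res = re.sub(r'\[\*\]', lambda m: next(note), text)
print(res)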
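
The same idea can be packaged as a reusable function. The sketch below is one possible Python 3 variant, not the exact code above: the name `convert_footnotes` is hypothetical, it assumes `\n` line endings, and leaving surplus placeholders untouched is a design choice made here for illustration.

```python
import re

def convert_footnotes(source):
    """Replace each [*] placeholder with a LaTeX footnote, in order."""
    # Definition lines look like "[*]: note text"; collect them in order.
    notes = re.findall(r'^\[\*\]: ?(.*)$', source, flags=re.MULTILINE)
    # Drop the definition lines (and the newline before each) from the body.
    body = re.sub(r'\n?^\[\*\]:.*$', '', source, flags=re.MULTILINE)
    remaining = iter(notes)

    def replace(match):
        # Assumption: if the notes run out, keep the placeholder as it is.
        note = next(remaining, None)
        if note is None:
            return match.group(0)
        return r'\footnote{' + note + '}'

    return re.sub(r'\[\*\]', replace, body)

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

print(convert_footnotes(s))
```

Because the replacement is a function rather than a template string, its return value is inserted literally, so the backslash in `\footnote` needs no extra escaping.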
answered Sep 21 '22 05:09 by Casimir et Hippolyte