How can I find all Markdown links using regular expressions?

Tags: python, regex, markdown

In Markdown there are two ways to place a link. One is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].

I'm trying to write a regular expression that can match both of these and, when it matches the second form, also capture the display string.

So far I have this:

(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])

But this doesn't seem to match either of my two test cases in Debuggex:

http://example.com
(Example)[http://example.com]

I'm really not sure why the first one isn't matched at the very least. Is it something to do with my use of the named group? If possible, I'd like to keep using it: this is a simplified expression for the link, and the real one is too long for me to feel comfortable duplicating it in two places in the same pattern.

What am I doing wrong? Or is this not doable at all?

EDIT: I'm doing this in Python, so I will be using its regex engine.

asked Aug 03 '14 21:08

Sam Kellett

1 Answer

The reason your pattern doesn't work is this part: (?<=\((.*)\)\[). Python's re module doesn't allow variable-length lookbehind, and the .* makes this lookbehind variable-length.
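The failure can be reproduced before any matching takes place, because the re module rejects the pattern at compile time. A minimal sketch, in which the helper name compile_error is made up for illustration:

```python
import re

# The pattern from the question, unchanged.
ASKER_PATTERN = r"(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])"


def compile_error(pattern):
    """Return the message re.compile raises for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# The variable-length lookbehind is rejected when the pattern is compiled.
print(compile_error(ASKER_PATTERN))

# A lookbehind with a fixed width is accepted, so this prints None.
print(compile_error(r"(?<=\(abc\)\[)x"))
```

This is why neither test case matches: the pattern never gets as far as being applied to them.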

You can get what you want more conveniently with the newer regex module for Python, a third-party package that offers many features the re module lacks.

Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])

An online demo

Pattern details:

(?|                                       # open a branch reset group
    # first case there is only the url
    (?<txt>                               # in this case, the text and the url  
        (?<url>                           # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                       # OR
    # the (text)[url] format
    \( ([^)]+) \)                         # this group will be named "txt" too 
    \[ (\g<url>) \]                       # this one "url"
)

This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across the members of an alternation. Since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name. The same goes for the ?<url> group.

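\g<url> is a reference to the named subpattern ?<url>. It acts like an alias, so there is no need to rewrite the subpattern in the second member. (This is the PCRE/Oniguruma spelling of a subroutine call. In the Python regex module, \g<url> is a backreference and the subroutine call is written (?&url), so you may need that form there.)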
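For contrast, the (?P=href) used in the question is a backreference in Python's re module: it matches the same text the group captured and does not re-apply the group's subpattern. A minimal sketch with a made-up one-digit group named d:

```python
import re

# (?P=d) must repeat the exact text captured by group "d";
# it does not mean "any digit again".
same_digit = re.compile(r"(?P<d>\d)-(?P=d)")

print(bool(same_digit.fullmatch("1-1")))  # True
print(bool(same_digit.fullmatch("1-2")))  # False
```

So a backreference cannot stand in for a shared URL subpattern, whichever engine is used.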

(?<=\P{P}) checks that the last character of the url is not a punctuation character (useful to avoid matching the closing square bracket, for example). (I'm not sure of the syntax; it may be \P{Punct}.)
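If installing the regex module is not an option, one alternative (not the approach used in this answer) is to keep the URL subpattern in a Python string and interpolate it into both branches of a standard re pattern. This is a minimal sketch, assuming the (text)[url] form from the question. The names URL, LINK and find_links are made up for illustration, and the URL subpattern is deliberately simplified:

```python
import re

# Simplified URL subpattern (an assumption, not the asker's real one):
# a scheme, then any run of characters that are neither whitespace nor "]",
# not ending in common sentence punctuation (fixed-width lookbehind).
URL = r"(?:ht|f)tps?://[^\s\]]+(?<![.,;:!?])"

# The subpattern is written once and interpolated into both branches.
LINK = re.compile(
    rf"\((?P<text>[^)]+)\)\[(?P<href>{URL})\s*\]"  # (text)[url] form
    rf"|(?P<bare>{URL})"                            # bare url
)


def find_links(source):
    """Return a list of (display_text, url) pairs found in source."""
    links = []
    for match in LINK.finditer(source):
        href = match.group("href") or match.group("bare")
        # A bare url has no display string, so the url doubles as the text.
        text = match.group("text") or href
        links.append((text, href))
    return links


print(find_links("http://example.com"))
print(find_links("(Example)[http://example.com]"))
```

Since re does not allow the same group name twice in one pattern, the two branches use different names (href and bare) and the function merges them afterwards. That merging step is what the branch reset group does inside the pattern in the regex version.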

answered Oct 10 '22 22:10

Casimir et Hippolyte
