I have Python code that parses Markdown text to extract the text inside the square brackets of a Markdown link, along with the URL.
import re
# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
# Raw f-string so that "\[" and "\(" reach the regex engine unchanged
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic
This fails to capture the full link because the URL pattern stops at the first closing parenthesis rather than the last one.
How can I make the regex match up to the final closing parenthesis?
For URLs like these, you need a recursive pattern, which the built-in re module does not support but the newer third-party regex module does:
import regex as re
data = """
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""
pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")
This yields
link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)
See a demo on regex101.com.
This cryptic little beauty boils down to
\[([^][]+)\]           # capture anything between "[" and "]" into group 1
(\(                    # open group 2 and match "("
    ((?:[^()]+|(?2))+) # match anything other than "(" or ")", or recurse into group 2;
                       # capture the content into group 3 (the URL)
\))                    # match ")" and close group 2
NOTE: This approach fails for URLs with unbalanced parentheses, e.g.
[some nasty description](https://google.com/()
#                                          ^^^
which are surely valid in Markdown. If you expect to encounter any such URLs, use a proper parser instead.
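If installing the third-party regex module is not an option, balanced parentheses can also be handled with the standard library alone: match the "[description](" prefix with re, then count parenthesis depth by hand. The following is only a minimal sketch of that idea, not a Markdown parser. The names extract_links and LINK_START are hypothetical and introduced here for illustration. The sketch assumes a link never spans more than one line, and it skips candidates whose parentheses never balance (such as the URL in the note above) rather than trying to recover them.

```python
import re

# Matches the "[description](" prefix of an inline link.
# The URL itself is scanned by hand so nested parentheses can be counted.
LINK_START = re.compile(r'\[([^\[\]]+)\]\(')


def extract_links(text):
    """Return (description, url) pairs; URLs may contain balanced parentheses."""
    links = []
    pos = 0
    while True:
        start = LINK_START.search(text, pos)
        if start is None:
            break
        depth = 1  # we are just past the opening "("
        i = start.end()
        # Assumption: a link destination never continues past a newline.
        while i < len(text) and depth > 0 and text[i] != '\n':
            if text[i] == '(':
                depth += 1
            elif text[i] == ')':
                depth -= 1
            i += 1
        if depth == 0:
            # text[i - 1] is the ")" that closed the link
            url = text[start.end():i - 1].strip()
            links.append((start.group(1), url))
            pos = i
        else:
            # Parentheses never balanced on this line: skip this candidate.
            pos = start.end()
    return links


data = """
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""
for description, url in extract_links(data):
    print(f"{description}: {url}")
# link to Google!: http://google.com
# a link: https://www.wiki.com/atopic_(subtopic)
```

The depth counter plays the role of the recursion in the regex pattern: each "(" inside the URL must be closed before the ")" that ends the link is accepted.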