I am trying to extract anchor text and the associated URLs from Markdown. I've seen this question, but its answer doesn't fully cover what I need.
In Markdown, there are two ways to insert a link:
[anchor text](http://my.url)
[anchor text][1]
[1]: http://my.url
My script looks like this (note that I am using regex, not re):
import regex
body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n [1]: http://yahoo.com"
rex = """(?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])"""
pattern = regex.compile(rex)
matches = regex.findall(pattern, body_markdown, overlapped=True)
for m in matches:
    print(m)
This produces the output:
('http://google.com', 'http://google.com')
('http://yahoo.com', 'http://yahoo.com')
My expected output is:
('inline link', 'http://google.com')
('non inline link', 'http://yahoo.com')
How can I properly capture the anchor text from Markdown?
Parse it into a structured format (e.g., HTML) and then use the appropriate tools to extract the link labels and addresses.
import markdown
from lxml import etree
body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n [1]: http://yahoo.com"
doc = etree.fromstring(markdown.markdown(body_markdown))
for link in doc.xpath('//a'):
    print(link.text, link.get('href'))
Which gets me:
inline link http://google.com
non inline link http://yahoo.com
The alternative is writing your own Markdown parser, which seems like the wrong place to focus your effort.
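One caveat with the snippet above: etree.fromstring expects a single root element, so Markdown that renders to several paragraphs will fail to parse, and link.text only returns the text before the first child element of the link. As a hedged sketch of the same "render to HTML, then extract" idea, the standard library's html.parser can collect the links instead. The names LinkCollector and extract_links are made up for illustration, and sample_html is a hand-written stand-in for what markdown.markdown() might return.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while we are inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return a list of (text, href) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed input: two top-level paragraphs, one link with nested markup.
sample_html = (
    '<p>This is an <a href="http://google.com">inline link</a>.</p>\n'
    '<p>This is a <a href="http://yahoo.com">non <em>inline</em> link</a>.</p>'
)
for text, href in extract_links(sample_html):
    print(text, href)
```

In real use you would pass the output of markdown.markdown(body_markdown) to extract_links instead of the hand-written string.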
You can do it with a couple of simple re patterns:
import re
INLINE_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
FOOTNOTE_LINK_TEXT_RE = re.compile(r'\[([^\]]+)\]\[(\d+)\]')
FOOTNOTE_LINK_URL_RE = re.compile(r'\[(\d+)\]:\s+(\S+)')
def find_md_links(md):
    """Return a dict mapping anchor text to URL for the links in markdown."""
    links = dict(INLINE_LINK_RE.findall(md))
    footnote_links = dict(FOOTNOTE_LINK_TEXT_RE.findall(md))
    footnote_urls = dict(FOOTNOTE_LINK_URL_RE.findall(md))
    # Resolve each footnote number to the URL it is defined with.
    for text, number in footnote_links.items():
        links[text] = footnote_urls[number]
    return links
Then you could use it like:
>>> body_markdown = """
... This is an [inline link](http://google.com).
... This is a [footnote link][1].
...
... [1]: http://yahoo.com
... """
>>> links = find_md_links(body_markdown)
>>> links
{'inline link': 'http://google.com', 'footnote link': 'http://yahoo.com'}
>>> list(links.values())
['http://google.com', 'http://yahoo.com']
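Two limits of the dict-based approach are worth keeping in mind: links that share the same anchor text overwrite each other, and only numeric footnote labels are recognised. The following is a rough sketch of one way around both, not a full Markdown parser. It assumes reference labels are matched case-insensitively, skips references that have no definition, and does not handle link titles or nested brackets. The function name find_md_link_pairs and the sample text are invented for illustration.

```python
import re

INLINE_RE = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
REFERENCE_RE = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')
DEFINITION_RE = re.compile(r'^[ ]{0,3}\[([^\]]+)\]:\s+(\S+)', re.MULTILINE)


def find_md_link_pairs(md):
    """Return (anchor text, url) tuples in document order."""
    # Map lower-cased labels to URLs, e.g. "[YH]: http://..." -> "yh".
    definitions = {label.lower(): url
                   for label, url in DEFINITION_RE.findall(md)}
    found = []  # (position, text, url), sorted by position at the end
    for m in INLINE_RE.finditer(md):
        found.append((m.start(), m.group(1), m.group(2)))
    for m in REFERENCE_RE.finditer(md):
        text, label = m.group(1), m.group(2)
        # An empty label, as in "[text][]", falls back to the text itself.
        url = definitions.get((label or text).lower())
        if url is not None:
            found.append((m.start(), text, url))
    return [(text, url) for _, text, url in sorted(found)]


sample = (
    "See [Google](http://google.com) and [Yahoo][yh].\n"
    "Again [Google](http://google.com/maps), plus [broken][nope].\n"
    "\n"
    "[YH]: http://yahoo.com\n"
)
for pair in find_md_link_pairs(sample):
    print(pair)
```

Returning a list of tuples keeps both "Google" links, which a dict keyed on anchor text would collapse into one.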