I'm looking for something like this:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print get_images_url_from_markdown(data)
that returns a list of image URLs from the text:
['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
Is there anything available, or do I have to scrape Markdown myself with BeautifulSoup?
Python-Markdown has an extensive Extension API. In fact, the Table of Contents Extension does essentially what you want with headings (instead of images) plus a bunch of other stuff you don't need (like adding unique id attributes and building a nested list for the TOC).
After the document is parsed, it is held in an ElementTree object, and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Be aware that images included as raw HTML will not be found this way, because raw HTML is not part of that tree; in that case you would need to parse the HTML output and extract them from it.
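For the raw-HTML case, one option is to scan the rendered HTML for `img` tags. Here is a minimal sketch using only the standard library's `html.parser`. The names `ImgSrcCollector` and `get_image_urls_from_html` are made up for illustration, and the sample `html` string stands in for whatever `md.convert(...)` returns.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags (<img ... />) are routed here by default as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)


def get_image_urls_from_html(html):
    """Return the src of each <img> found in an HTML string (hypothetical helper)."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.images


# Stand-in for rendered output: one Markdown-generated image, one raw-HTML image.
html = (
    '<p><img alt="image here" src="http://somewebsite.com/image1.jpg" /></p>\n'
    '<div><img src="http://anotherwebsite.com/image2.jpg"></div>'
)
print(get_image_urls_from_html(html))
```

Because this works on the final HTML, it picks up images regardless of whether they came from Markdown syntax or raw HTML, at the cost of parsing the output a second time.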
Start off by following this tutorial, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:
import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

# First create the treeprocessor
class ImgExtractor(Treeprocessor):
    def run(self, doc):
        "Find all images and append to markdown.images."
        self.markdown.images = []
        for image in doc.findall('.//img'):
            self.markdown.images.append(image.get('src'))

# Then tell markdown about it
class ImgExtExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        img_ext = ImgExtractor(md)
        md.treeprocessors.add('imgext', img_ext, '>inline')

# Finally create an instance of the Markdown class with the new extension
md = markdown.Markdown(extensions=[ImgExtExtension()])

# Now let's test it out:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print md.images
The above outputs:
[u'http://somewebsite.com/image1.jpg', u'http://anotherwebsite.com/image2.jpg']
If you really want a function which returns the list, just wrap that all up in one and you're good to go.
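The code above is written for Python 2 and an older Python-Markdown API. Below is a minimal sketch of such a wrapper for Python 3, assuming Python-Markdown 3.x, where `extendMarkdown` takes only `md`, processors are added with `register(item, name, priority)`, and the `Markdown` instance is available on the processor as `self.md`. The priority `15` is an assumption, chosen so the extractor runs after the built-in inline processor (registered at 20), which is what creates the `img` elements.

```python
import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class ImgExtractor(Treeprocessor):
    def run(self, root):
        """Store the src of every <img> element on the Markdown instance."""
        self.md.images = [img.get('src') for img in root.iter('img')]


class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        # Lower priority than 'inline' (20), so images already exist in the tree.
        md.treeprocessors.register(ImgExtractor(md), 'imgext', 15)


def get_images_url_from_markdown(text):
    """Return the image URLs found in a Markdown string."""
    # A fresh Markdown instance per call keeps results from leaking between calls.
    md = markdown.Markdown(extensions=[ImgExtExtension()])
    md.convert(text)
    return md.images


data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print(get_images_url_from_markdown(data))
```

As with the original version, images written as raw HTML are not part of the tree and will not appear in the returned list.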