
How can I get a list of image URLs from a Markdown file in Python?

I'm looking for something like this:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''

print(get_images_url_from_markdown(data))

that returns a list of image URLs from the text:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

Is there an existing tool for this, or do I have to scrape the Markdown myself with BeautifulSoup?

asked Mar 25 '15 at 15:03 by Doc


1 Answer

Python-Markdown has an extensive Extension API. In fact, the Table of Contents Extension does essentially what you want with headings (instead of images) plus a bunch of other stuff you don't need (like adding unique id attributes and building a nested list for the TOC).

After the document is parsed, it is held in an ElementTree object, and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Be aware that this will not find images written as raw HTML, because raw HTML is not parsed into the tree; in that case you would need to parse the HTML output and extract the images from there.
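For the raw-HTML case, one option is to scan the rendered HTML for img tags. The following is only a minimal sketch using the standard library's html.parser; the names ImgSrcCollector and image_urls_from_html are made up for illustration and are not part of Python-Markdown, and the sample HTML string stands in for whatever md.convert() returns.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def image_urls_from_html(html):
    """Return the src of each <img> in an HTML string, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Stand-in for rendered output that mixes a Markdown image and a raw HTML one.
sample_html = (
    '<p><img alt="image here" src="http://somewebsite.com/image1.jpg" /></p>\n'
    '<img src="http://anotherwebsite.com/image2.jpg">'
)
print(image_urls_from_html(sample_html))
```

Because this works on the final HTML, it picks up images regardless of whether they came from Markdown syntax or raw HTML, at the cost of parsing the output a second time.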

Start off by following this tutorial, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:

import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

# First create the treeprocessor

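class ImgExtractor(Treeprocessor):
    def run(self, doc):
        """Find all images and append their URLs to md.images."""
        # Python-Markdown 3.x exposes the Markdown instance as self.md
        # (it was self.markdown in 2.x).
        self.md.images = []
        for image in doc.findall('.//img'):
            self.md.images.append(image.get('src'))

# Then tell markdown about it

class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        # Python-Markdown 3.x: extendMarkdown no longer takes md_globals, and
        # register() replaces add(). Priority 15 places this after the built-in
        # 'inline' treeprocessor (priority 20), matching the old '>inline'.
        img_ext = ImgExtractor(md)
        md.treeprocessors.register(img_ext, 'imgext', 15)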

# Finally create an instance of the Markdown class with the new extension

md = markdown.Markdown(extensions=[ImgExtExtension()])

# Now let's test it out:

data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print(md.images)

The above outputs:

['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']

If you really want a function which returns the list, just wrap that all up in one and you're good to go.
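Wrapped up as a single function, that might look roughly like the sketch below. It assumes Python 3 and Python-Markdown 3.x (the register() method and the self.md attribute); the function name get_images_url_from_markdown is taken from the question, and the priority value 15 is a choice made here so the extractor runs after the built-in inline treeprocessor.

```python
import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class ImgExtractor(Treeprocessor):
    def run(self, root):
        # Record the src of every <img> element in the parsed tree.
        self.md.images = [img.get('src') for img in root.iter('img')]


class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        # Priority 15 is below 'inline' (20), so inline images already exist.
        md.treeprocessors.register(ImgExtractor(md), 'imgext', 15)


def get_images_url_from_markdown(text):
    """Return the URLs of all Markdown-syntax images found in text."""
    md = markdown.Markdown(extensions=[ImgExtExtension()])
    md.images = []  # default in case the treeprocessor never runs (empty input)
    md.convert(text)
    return md.images


data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print(get_images_url_from_markdown(data))
```

A fresh Markdown instance is created on every call so that results from one document cannot leak into the next; if you convert many documents, reusing one instance and calling md.reset() between conversions is an alternative.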

answered Oct 17 '22 at 06:10 by Waylan