Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check image urls using python-markdown

On a website I'm creating I'm using Python-Markdown to format news posts. To avoid issues with dead links and HTTP-content-on-HTTPS-page problems I'm requiring editors to upload all images to the site and then embed them (I'm using a markdown editor which I've patched to allow easy embedding of those images using standard markdown syntax).

However, I'd like to enforce the no-external-images policy in my code.

One way would be writing a regex to extract image URLs from the markdown sourcecode or even run it through the markdown renderer and use a DOM parser to extract all src attributes from img tags.

However, I'm curious if there's some way to hook into Python-Markdown to extract all image links or execute custom code (e.g. raising an exception if the link is external) during parsing.

like image 233
ThiefMaster Avatar asked May 08 '11 21:05

ThiefMaster


2 Answers

One approach would be to intercept the <img> node at a lower level just after Markdown parses and constructs it:

import re
from markdown import Markdown
from markdown.inlinepatterns import ImagePattern, IMAGE_LINK_RE

RE_REMOTEIMG = re.compile('^(http|https):.+')

class CheckImagePattern(ImagePattern):

    def handleMatch(self, m):
        node = ImagePattern.handleMatch(self, m)
        # check 'src' to ensure it is local
        src = node.attrib.get('src')
        if src and RE_REMOTEIMG.match(src):
            print 'ILLEGAL:', m.group(9)
            # or alternately you could raise an error immediately
            # raise ValueError("illegal remote url: %s" % m.group(9))
        return node

DATA = '''
![Alt text](/path/to/img.jpg)
![Alt text](http://remote.com/path/to/img.jpg)
'''

mk = Markdown()
# patch in the customized image pattern matcher with url checking
mk.inlinePatterns['image_link'] = CheckImagePattern(IMAGE_LINK_RE, mk)
result = mk.convert(DATA)
print result

Output:

ILLEGAL: http://remote.com/path/to/img.jpg
<p><img alt="Alt text" src="/path/to/img.jpg" />
<img alt="Alt text" src="http://remote.com/path/to/img.jpg" /></p>
like image 111
samplebias Avatar answered Oct 18 '22 23:10

samplebias


Updated with Python 3 and Python-Mardown 3

import re
from markdown import Markdown
from markdown.inlinepatterns import Pattern, IMAGE_LINK_RE

RE_REMOTEIMG = re.compile('^(http|https):.+')

class CheckImagePattern(Pattern):

    def handleMatch(self, m):
        node = Pattern.handleMatch(self, m)
        # check 'src' to ensure it is local
        src = node.attrib.get('src')
        if src and RE_REMOTEIMG.match(src):
            print 'ILLEGAL:', m.group(9)
            # or alternately you could raise an error immediately
            # raise ValueError("illegal remote url: %s" % m.group(9))
        return node

DATA = '''
![Alt text](/path/to/img.jpg)
![Alt text](http://remote.com/path/to/img.jpg)
'''

mk = Markdown()
# patch in the customized image pattern matcher with url checking
mk.inlinePatterns['image_link'] = CheckImagePattern(IMAGE_LINK_RE, mk)
result = mk.convert(DATA)
print result

Hope it's useful!

like image 38
KitKit Avatar answered Oct 19 '22 00:10

KitKit