I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:
import urllib
page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print open('urlgot.php').read()
which retrieves the source fine, but I also need to download any linked images.
I was thinking I could create a regular expression that searches for img src or similar in the downloaded source; however, I was wondering if there was a urllib function that would retrieve the images as well? Similar to the wget command:
wget -r --no-parent http://127.0.0.1/myurl.php
I don't want to use the os module to shell out to wget, as I want the script to run on all systems. For the same reason I can't use any third-party modules either.
Any help is much appreciated! Thanks
Don't use regex when there is a perfectly good parser built into Python:
from urllib.request import urlretrieve  # Py2: from urllib import urlretrieve
from html.parser import HTMLParser      # Py2: from HTMLParser import HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        # collect the src attribute of every <img> tag
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)
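One caveat with the snippet above: base_url + path only works when every src is a bare relative path. A minimal sketch using the standard library's urllib.parse.urljoin resolves relative and absolute src values against the page URL; the sample HTML and URLs here are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # stdlib, Py2: from urlparse import urljoin


class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.downloads.append(value)


# hypothetical page URL and HTML, stand-ins for the real download
base_url = 'http://127.0.0.1/myurl.php'
html = '<html><body><img src="pics/a.png"><img src="/logo.gif"></body></html>'

parser = ImgParser()
parser.feed(html)

# urljoin handles both relative ("pics/a.png") and root-relative ("/logo.gif") paths
urls = [urljoin(base_url, src) for src in parser.downloads]
print(urls)
```

Each resolved URL can then be passed to urlretrieve as in the answer above.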
Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
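For what that would look like, here is a minimal sketch with BeautifulSoup (note it is a third-party package, so it conflicts with the no-third-party constraint in the question); the sample HTML is made up:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# hypothetical page source containing an image and an iframe
html = '<html><body><img src="a.png"><iframe src="frame.html"></iframe></body></html>'

soup = BeautifulSoup(html, 'html.parser')

# image links to download
imgs = [tag['src'] for tag in soup.find_all('img')]

# frame/iframe sources that would need to be fetched and parsed recursively
frames = [tag.get('src') for tag in soup.find_all(['frame', 'iframe'])]

print(imgs, frames)
```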