How to retrieve a webpage in Python, including any images

Tags: python, urllib

I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:

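import urllib

# urlretrieve saves the page to disk and returns a (filename, headers) tuple
filename, headers = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print(open(filename).read())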

which retrieves the source fine, but I also need to download any linked images.

I was thinking I could write a regular expression that searches for img src or similar in the downloaded source; however, I was wondering whether there is a urllib function that would retrieve the images as well, similar to this wget command:

wget -r --no-parent http://127.0.0.1/myurl.php

I don't want to use the os module to run wget, because I want the script to run on all systems. For the same reason, I can't use any third-party modules either.

Any help is much appreciated! Thanks

Jingo asked Sep 05 '11 20:09



2 Answers

Don't use regex when there is a perfectly good parser built into Python:

import os
from urllib.parse import urljoin        # Py2: from urlparse
from urllib.request import urlretrieve  # Py2: from urllib
from html.parser import HTMLParser      # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.downloads.append(value)

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    # urljoin handles relative, root-relative and absolute src values
    url = urljoin(base_url, path)
    print(url)
    # save under the base name so no local directories are needed
    urlretrieve(url, os.path.basename(path))
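
Image src values can be relative, root-relative, or absolute, so it can help to resolve each one against the page URL before downloading anything. Here is a minimal sketch of that step using only the standard library. The names ImgSrcParser and collect_image_urls and the sample HTML are made up for illustration, and the code only builds the URLs; it does not download them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def collect_image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


# Made-up sample page covering the three kinds of src value.
sample = ('<p><img src="img/a.png"><img src="/b.gif">'
          '<IMG SRC="http://example.com/c.jpg"></p>')
urls = collect_image_urls(sample, 'http://127.0.0.1/myurl.php')
print(urls)
```

Each resulting URL could then be passed to urlretrieve together with a local file name of your choosing.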

Gringo Suave answered Sep 19 '22 18:09


Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
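
Since the question rules out third-party modules, the recursive frame handling can also be sketched with the standard library's HTMLParser instead of BeautifulSoup. This is only an illustration under assumptions: ResourceParser, find_images, and the fetch callable are hypothetical names, and the dict of pages stands in for the network so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceParser(HTMLParser):
    """Separates <img> sources from <frame>/<iframe> sources."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.frames = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get('src')
        if not src:
            return
        if tag == 'img':
            self.images.append(src)
        elif tag in ('frame', 'iframe'):
            self.frames.append(src)


def find_images(url, fetch, seen=None):
    """Return image URLs on url and, recursively, inside its frames.

    fetch is any callable mapping a URL to its HTML text; seen guards
    against frames that include each other.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)
    parser = ResourceParser()
    parser.feed(fetch(url))
    images = [urljoin(url, src) for src in parser.images]
    for frame_src in parser.frames:
        images.extend(find_images(urljoin(url, frame_src), fetch, seen))
    return images


# Offline stand-in for the network: a dict of made-up pages.
pages = {
    'http://127.0.0.1/index.html':
        '<img src="a.png"><iframe src="sub/inner.html"></iframe>',
    'http://127.0.0.1/sub/inner.html':
        '<img src="b.png"><iframe src="../index.html"></iframe>',
}
found = find_images('http://127.0.0.1/index.html', pages.__getitem__)
print(found)
```

Against a real site, fetch could be a small function that wraps urllib.request.urlopen and decodes the response body.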

Marcelo Cantos answered Sep 22 '22 18:09