I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:
import urllib
page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print open('urlgot.php').read()
which retrieves the source fine, but I also need to download any linked images.
I was thinking I could create a regular expression that searches for img src or similar in the downloaded source; however, I was wondering if there was a urllib function that would retrieve the images as well? Similar to the wget command:
wget -r --no-parent http://127.0.0.1/myurl.php
I don't want to use the os module to shell out to wget, as I want the script to run on all systems. For the same reason I can't use any third-party modules either.
Any help is much appreciated! Thanks
Don't use regex when there is a perfectly good parser built into Python:
from urllib.request import urlretrieve  # Py2: from urllib import urlretrieve
from html.parser import HTMLParser      # Py2: from HTMLParser import HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        # collect the src attribute of every <img> tag
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)
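One caveat with the snippet above: base_url + path only works when every src is a bare relative path. A minimal sketch using the standard library's urllib.parse.urljoin resolves relative and absolute src values against the page URL; the sample HTML and URLs here are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # stdlib, Py2: from urlparse import urljoin


class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.downloads.append(value)


# hypothetical page URL and HTML, stand-ins for the real download
base_url = 'http://127.0.0.1/myurl.php'
html = '<html><body><img src="pics/a.png"><img src="/logo.gif"></body></html>'

parser = ImgParser()
parser.feed(html)

# urljoin handles both relative ("pics/a.png") and root-relative ("/logo.gif") paths
urls = [urljoin(base_url, src) for src in parser.downloads]
print(urls)
```

Each resolved URL can then be passed to urlretrieve as in the answer above.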
Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.
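For what that would look like, here is a minimal sketch with BeautifulSoup (note it is a third-party package, so it conflicts with the no-third-party constraint in the question); the sample HTML is made up:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# hypothetical page source containing an image and an iframe
html = '<html><body><img src="a.png"><iframe src="frame.html"></iframe></body></html>'

soup = BeautifulSoup(html, 'html.parser')

# image links to download
imgs = [tag['src'] for tag in soup.find_all('img')]

# frame/iframe sources that would need to be fetched and parsed recursively
frames = [tag.get('src') for tag in soup.find_all(['frame', 'iframe'])]

print(imgs, frames)
```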