I want to download all the files from a web page, or rather all the image files. The urllib module seems to be what I need. There seems to be a method to download a file if you know its filename, but I don't:
urllib.urlretrieve('http://www.example.com/page', 'myfile.jpg')
Is there a method to download all the files from the page, and maybe return them as a list?
Here's a little example to get you started with using BeautifulSoup for this kind of exercise: you give this script a URL, and it will print out the URLs of the images referenced from that page, i.e. the src attribute of every img tag ending in .jpg or .png:
import sys, urllib, re, urlparse
from BeautifulSoup import BeautifulSoup

if len(sys.argv) != 2:
    print >> sys.stderr, "Usage: %s <URL>" % (sys.argv[0],)
    sys.exit(1)

url = sys.argv[1]
f = urllib.urlopen(url)
soup = BeautifulSoup(f)

# Find every <img> whose src ends in .jpg or .png (case-insensitive)
# and resolve it against the page URL to get an absolute URL.
for i in soup.findAll('img', attrs={'src': re.compile(r'(?i)\.(jpg|png)$')}):
    full_url = urlparse.urljoin(url, i['src'])
    print "image URL:", full_url
Then you can use urllib.urlretrieve to download each of the images pointed to by full_url, but at that stage you have to decide how to name them and what to do with the downloaded images, which isn't specified in your question.
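For example, here's a minimal sketch of the download step, assuming you're happy to name each file after the last segment of its URL path (the helper name and the fallback naming scheme are my own):

import os, urllib, urlparse

def save_image(full_url, dest_dir='.', fallback='image.jpg'):
    # Name the local file after the last path segment of the URL;
    # fall back to a fixed name if that segment is empty.
    name = os.path.basename(urlparse.urlparse(full_url).path) or fallback
    dest = os.path.join(dest_dir, name)
    urllib.urlretrieve(full_url, dest)
    return dest

Note that this will silently overwrite files whose URLs share a basename, so for a real crawl you'd want to de-duplicate the names somehow.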