I have a small project where I'm trying to download a series of wallpapers from a web page. I'm new to Python.
I'm using the urllib library, which is returning a long string of web page data which includes
<a href="http://website.com/wallpaper/filename.jpg">
I know that every image link I need to download starts with
'http://website.com/wallpaper/'
How can I search the page source for this portion of text and return the rest of the image link, ending with the ".jpg" extension?
r'http://website.com/wallpaper/ xxxxxx .jpg'
I'm thinking I could write a regular expression in which the xxxx portion is not evaluated: just check for the path and the .jpg extension, then return the whole string once a match is found.
Am I on the right track?
BeautifulSoup is pretty convenient for this sort of thing.
import re
import urllib3
from bs4 import BeautifulSoup

# Match hrefs that contain the wallpaper path and end in .jpg
href_regex = re.compile(r'website\.com/wallpaper/.*\.jpg$')

pool = urllib3.PoolManager()
response = pool.request('GET', 'http://your_website.com/')
# Pass the response body (bytes), not the response object, and name a parser
soup = BeautifulSoup(response.data, 'html.parser')

# One combined pattern is used because `jpg_list and site_list` does not
# intersect the two lists; it simply returns one of them
links = soup.find_all('a', href=href_regex)
result_list = [a.get('href') for a in links]
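If installing BeautifulSoup is not an option, the same filtering idea can be sketched with the standard library's html.parser instead (a different technique from the one above, not a drop-in replacement). The class name WallpaperLinkParser, the helper extract_wallpaper_links, and the sample HTML are made up for illustration; only the URL prefix comes from the question.

```python
from html.parser import HTMLParser

# Prefix taken from the question; everything else here is illustrative.
PREFIX = "http://website.com/wallpaper/"


class WallpaperLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like wallpaper images."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs
        href = dict(attrs).get("href")
        if href and href.startswith(PREFIX) and href.endswith(".jpg"):
            self.links.append(href)


def extract_wallpaper_links(html):
    """Return all matching wallpaper URLs found in an HTML string."""
    parser = WallpaperLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical page source standing in for the downloaded page
sample_html = """
<a href="http://website.com/wallpaper/sunset.jpg">Sunset</a>
<a href="http://website.com/about.html">About</a>
<a href="http://website.com/wallpaper/notes.txt">Notes</a>
<a href="http://other.com/wallpaper/x.jpg">Elsewhere</a>
"""

links = extract_wallpaper_links(sample_html)
print(links)
```

Checking the prefix and suffix with plain string methods avoids regular expressions entirely, which may be easier to read when the path is fixed.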
I think a very basic regex will do.
Like:
(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)
Here $1 (match.group(1) in Python) will return the whole string.
And if you use
(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)
then $1 will give the whole string and $2 will give the file name only.
Note: whether the slash needs escaping (\/) is language-dependent. In Python it does not, so a plain / works.
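In Python, the two-group pattern above might be used roughly like this. This is only a sketch: the slashes are left unescaped, the character class is simplified to [\w-]+ (since \w already covers digits and the underscore), and the names WALLPAPER_RE, find_wallpapers, and the sample string are invented for the example.

```python
import re

# Group 1 captures the whole URL, group 2 only the file name (without .jpg).
WALLPAPER_RE = re.compile(r'(http://website\.com/wallpaper/([\w-]+)\.jpg)')


def find_wallpapers(page_source):
    """Return a list of (url, name) tuples for every match in the text."""
    # With two capture groups, findall returns one tuple per match.
    return WALLPAPER_RE.findall(page_source)


# Hypothetical page source standing in for the string returned by urllib
sample = (
    '<a href="http://website.com/wallpaper/filename.jpg">'
    '<a href="http://website.com/wallpaper/other-one.jpg">'
    '<a href="http://website.com/index.html">'
)

matches = find_wallpapers(sample)
for url, name in matches:
    print(url, name)
```

Each tuple holds what the answer calls $1 and $2, so the URL can be passed on to whatever download step follows.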