I wrote some stupid code for learning just, but it doesn't work for any sites. here is the code:
import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup
class Founder:
def Find_all_links(self, url):
page_source = urllib2.urlopen(url)
a = page_source.read()
soup = Soup(a)
a = soup.findAll(href=re.compile(r'/.a\w+'))
return a
def Find_shortcut_icon (self, url):
a = self.Find_all_links(url)
b = ''
for i in a:
strre=re.compile('shortcut icon', re.IGNORECASE)
m=strre.search(str(i))
if m:
b = i["href"]
return b
def Save_icon(self, url):
url = self.Find_shortcut_icon(url)
print url
host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
opener = urllib2.build_opener()
icon = opener.open(url).read()
file = open(host+'.ico', "wb")
file.write(icon)
file.close()
print '%s icon successfully saved' % host
c = Founder()
print c.Save_icon('http://lala.ru')
The most strange thing is it works for site: http://habrahabr.ru http://5pd.ru
But doesn't work for most others that I've checked.
Beautiful Soup is a Python library aimed at helping programmers who are trying to scrape data from websites. To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml .
For web scraping to work in Python, we're going to perform three basic steps: Extract the HTML content using the requests library. Analyze the HTML structure and identify the tags which have our content. Extract the tags using Beautiful Soup and put the data in a Python list.
Thomas K's answer got me started in the right direction, but I found some websites that didn't say rel="shortcut icon", like 1800contacts.com that says just rel="icon". This works in Python 3 and returns the link. You can write that to file if you want.
from bs4 import BeautifulSoup
import requests
def getFavicon(domain):
if 'http' not in domain:
domain = 'http://' + domain
page = requests.get(domain)
soup = BeautifulSoup(page.text, features="lxml")
icon_link = soup.find("link", rel="shortcut icon")
if icon_link is None:
icon_link = soup.find("link", rel="icon")
if icon_link is None:
return domain + '/favicon.ico'
return icon_link["href"]
You're making it far more complicated than it needs to be. Here's a simple way to do it:
import urllib
page = urllib.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page)
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
f.write(icon.read())
In case anyone wants to use a single check with regex, the following works for me:
import re
from bs4 import BeautifulSoup
html_code = "<Some HTML code you get from somewhere>"
soup = BeautifulSoup(html_code, features="lxml")
for item in soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)}):
print(item.get('href'))
This will also account for occurrences of case sensitivity.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With