How to get a favicon using Beautiful Soup and Python

I wrote some simple code just for learning, but it doesn't work for most sites. Here is the code:

import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup

class Founder:
    def Find_all_links(self, url):
        page_source = urllib2.urlopen(url)
        a = page_source.read()
        soup = Soup(a)

        a = soup.findAll(href=re.compile(r'/.a\w+'))
        return a
    def Find_shortcut_icon (self, url):
        a = self.Find_all_links(url)
        b = ''
        for i in a:
            strre=re.compile('shortcut icon', re.IGNORECASE)
            m=strre.search(str(i))
            if m:
                b = i["href"]
        return b
    def Save_icon(self, url):
        url = self.Find_shortcut_icon(url)
        print url
        host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
        opener = urllib2.build_opener()
        icon = opener.open(url).read()
        file = open(host+'.ico', "wb")
        file.write(icon)
        file.close()
        print '%s icon successfully saved' % host
c = Founder()
print c.Save_icon('http://lala.ru')

The strangest thing is that it works for these sites: http://habrahabr.ru and http://5pd.ru

But it doesn't work for most of the others that I've checked.

asked Jan 12 '11 21:01

kurd

3 Answers

Thomas K's answer got me started in the right direction, but I found some websites that don't use rel="shortcut icon"; 1800contacts.com, for example, uses just rel="icon". The following works in Python 3 and returns the icon link, falling back to /favicon.ico when no link tag is found. You can write it to a file if you want.

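from bs4 import BeautifulSoup
import requests

def getFavicon(domain):
    # Add a scheme if the caller passed a bare domain such as "example.com"
    if not domain.startswith(('http://', 'https://')):
        domain = 'http://' + domain
    page = requests.get(domain)
    soup = BeautifulSoup(page.text, features="lxml")
    # Prefer rel="shortcut icon", then fall back to rel="icon"
    icon_link = soup.find("link", rel="shortcut icon")
    if icon_link is None:
        icon_link = soup.find("link", rel="icon")
    # No <link> tag found: fall back to the conventional /favicon.ico location
    if icon_link is None:
        return domain.rstrip('/') + '/favicon.ico'
    # Note: this href may be relative to the page URL
    return icon_link["href"]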
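
The href returned above is not always an absolute URL: it can be relative (such as /static/favicon.ico) or protocol-relative (starting with //). Here is a minimal sketch of resolving it against the page URL with the standard library's urllib.parse.urljoin. The helper name resolve_icon_url and the example.com URLs are made up for illustration and are not part of the answer's code.

```python
from urllib.parse import urljoin

def resolve_icon_url(page_url, href):
    """Return an absolute URL for a favicon href found on page_url."""
    # urljoin leaves absolute hrefs unchanged and resolves relative
    # or protocol-relative ones against the page URL.
    return urljoin(page_url, href)

# Hypothetical inputs, for illustration only
print(resolve_icon_url("http://example.com/blog/", "/static/favicon.ico"))
print(resolve_icon_url("http://example.com/blog/", "//cdn.example.com/favicon.ico"))
print(resolve_icon_url("http://example.com/blog/", "https://example.com/icon.png"))
```

The result can then be passed to requests.get to download the icon itself.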

answered Nov 03 '22 02:11

Joshua Stafford

You're making it far more complicated than it needs to be. Here's a simple way to do it:

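from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://5pd.ru/")
soup = BeautifulSoup(page, "html.parser")
# Find the <link rel="shortcut icon"> tag (None if the page has none)
icon_link = soup.find("link", rel="shortcut icon")
# This assumes the href is an absolute URL
icon = urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())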
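
The snippet above always writes to test.ico, while the question's code named the file after the host, extracted with a regular expression. Here is a small sketch of getting the host with urllib.parse.urlparse instead. The function name icon_filename and its default argument are hypothetical, and it assumes the icon really is an .ico file.

```python
from urllib.parse import urlparse

def icon_filename(url, default="favicon"):
    """Build a file name such as '5pd.ru.ico' from a page or icon URL."""
    # hostname is None when the URL has no scheme/network location
    host = urlparse(url).hostname
    return (host or default) + ".ico"

print(icon_filename("http://5pd.ru/"))
print(icon_filename("http://habrahabr.ru/favicon.ico"))
```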

answered Nov 03 '22 00:11

Thomas K

In case anyone wants to use a single check with regex, the following works for me:

import re

from bs4 import BeautifulSoup

html_code = "<Some HTML code you get from somewhere>"

soup = BeautifulSoup(html_code, features="lxml")

for item in soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)}):
    print(item.get('href'))

This also handles differences in letter case, such as rel="ICON" or rel="Shortcut Icon".
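
To see which rel values the pattern accepts, it can be tried on plain strings without any HTML. This is only a sketch of the regular expression on its own. The helper is_icon_rel and the sample values are made up for illustration, and it does not show how Beautiful Soup applies the pattern to attributes.

```python
import re

# Same pattern as in the snippet above
ICON_REL = re.compile("^(shortcut icon|icon)$", re.I)

def is_icon_rel(rel_value):
    """Return True if rel_value is 'icon' or 'shortcut icon', ignoring case."""
    return ICON_REL.search(rel_value) is not None

# Sample rel values, for illustration only
for rel in ["icon", "Shortcut Icon", "ICON", "apple-touch-icon", "stylesheet"]:
    print(rel, is_icon_rel(rel))
```

The anchors ^ and $ are what keep values such as apple-touch-icon from matching.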

answered Nov 03 '22 00:11

Beetle

Tags:

python

favicon

beautifulsoup