
How to get a favicon using Beautiful Soup and Python

I wrote some simple code just for learning, but it doesn't work for most sites. Here is the code:

import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup

class Founder:
    def Find_all_links(self, url):
        page_source = urllib2.urlopen(url)
        a = page_source.read()
        soup = Soup(a)

        a = soup.findAll(href=re.compile(r'/.a\w+'))
        return a
    def Find_shortcut_icon (self, url):
        a = self.Find_all_links(url)
        b = ''
        for i in a:
            strre=re.compile('shortcut icon', re.IGNORECASE)
            m=strre.search(str(i))
            if m:
                b = i["href"]
        return b
    def Save_icon(self, url):
        url = self.Find_shortcut_icon(url)
        print url
        host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
        opener = urllib2.build_opener()
        icon = opener.open(url).read()
        file = open(host+'.ico', "wb")
        file.write(icon)
        file.close()
        print '%s icon successfully saved' % host
c = Founder()
print c.Save_icon('http://lala.ru')

The strangest thing is that it works for these sites: http://habrahabr.ru and http://5pd.ru

But it doesn't work for most of the others I've checked.

asked Jan 12 '11 by kurd

People also ask

How do you call a beautiful soup in Python?

Beautiful Soup is a Python library aimed at helping programmers scrape data from websites. To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4. Beautiful Soup also relies on a parser; lxml is the preferred one if it is installed.

How do you use beautiful soup in Python for web scraping?

For web scraping to work in Python, we're going to perform three basic steps: (1) extract the HTML content using the requests library, (2) analyze the HTML structure and identify the tags that hold our content, and (3) extract those tags using Beautiful Soup and put the data in a Python list.
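The three steps above can be sketched as follows; the fetch step is replaced by a hypothetical literal page so the snippet is self-contained (in practice the HTML would come from requests.get(url).text):

```python
from bs4 import BeautifulSoup

# Step 1 (fetching with the requests library) is replaced here by a
# hypothetical literal page so the sketch is self-contained
html = "<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>"

# Steps 2 and 3: identify the tags that hold the content and extract
# their text into a Python list
soup = BeautifulSoup(html, "html.parser")
headings = [h.get_text() for h in soup.find_all("h1")]
print(headings)  # ['First', 'Second']
```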


3 Answers

Thomas K's answer got me started in the right direction, but I found some websites that don't say rel="shortcut icon", like 1800contacts.com, which says just rel="icon". This works in Python 3 and returns the link. You can write that to a file if you want.

from bs4 import BeautifulSoup
import requests

def getFavicon(domain):
    if 'http' not in domain:
        domain = 'http://' + domain
    page = requests.get(domain)
    soup = BeautifulSoup(page.text, features="lxml")
    icon_link = soup.find("link", rel="shortcut icon")
    if icon_link is None:
        icon_link = soup.find("link", rel="icon")
    if icon_link is None:
        return domain + '/favicon.ico'
    return icon_link["href"]
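One caveat: the returned href is often a site-relative path like "/favicon.ico", so before downloading it you need to resolve it against the page URL. A minimal sketch using the standard library's urljoin, with hypothetical values:

```python
from urllib.parse import urljoin

# Hypothetical values: getFavicon often returns a site-relative path
base = "http://example.com/some/page"
href = "/static/favicon.ico"
print(urljoin(base, href))  # http://example.com/static/favicon.ico
```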
answered Nov 03 '22 by Joshua Stafford

You're making it far more complicated than it needs to be. Here's a simple way to do it:

import urllib
from BeautifulSoup import BeautifulSoup  # this import was missing; with bs4, use: from bs4 import BeautifulSoup
page = urllib.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page)
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())
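Note that this answer is Python 2 (urllib.urlopen and the old BeautifulSoup module). A Python 3 sketch of the same parsing step, using a hypothetical literal page in place of urllib.request.urlopen(url).read():

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for urllib.request.urlopen(url).read();
# html.parser is used so the sketch needs no extra parser installed
page = '<html><head><link rel="shortcut icon" href="/favicon.ico"></head></html>'
soup = BeautifulSoup(page, "html.parser")
icon_link = soup.find("link", rel="shortcut icon")
print(icon_link["href"])  # /favicon.ico
```

From there, urllib.request.urlopen(icon_link["href"]) replaces Python 2's urllib.urlopen for the actual download.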
answered Nov 03 '22 by Thomas K


In case anyone wants to use a single check with regex, the following works for me:

import re
from bs4 import BeautifulSoup

html_code = "<Some HTML code you get from somewhere>"
soup = BeautifulSoup(html_code, features="lxml")

for item in soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)}):
    print(item.get('href'))

This also matches the rel value regardless of case (e.g. "Icon" or "SHORTCUT ICON").
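To see the single regex check in action, here is a self-contained sketch against a hypothetical snippet that mixes rel spellings (html.parser is used here so the sketch does not depend on lxml being installed):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet mixing rel spellings and a non-icon link
html_code = ('<link REL="Icon" href="/a.ico">'
             '<link rel="shortcut icon" href="/b.ico">'
             '<link rel="stylesheet" href="/c.css">')
soup = BeautifulSoup(html_code, "html.parser")

# The case-insensitive regex matches "Icon" and "shortcut icon",
# but not "stylesheet"
icons = [item.get('href') for item in
         soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)})]
print(icons)  # ['/a.ico', '/b.ico']
```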

answered Nov 03 '22 by Beetle