
Fetching the first image from a website that belongs to the post

I've written a program that fetches the desired information from a blog or any other page. The next thing I want to achieve is to retrieve the first image on that page that belongs to the respective post (just like Facebook does when a post is shared).

I was able to achieve this to some extent by fetching the first image with an alt attribute (since many websites don't set alt attributes on their logos, icons, etc., the first image that has one should belong to the post). But this does not work in some cases. Is there any other (better) way to achieve this? I'm using Python 2.7.9 and BeautifulSoup 4.

import time

import feedparser
import requests
from bs4 import BeautifulSoup

d = feedparser.parse('http://rss.cnn.com/rss/edition.rss')

for entry in d.entries:
    try:
        if entry.title is not None:
            print entry.title
            print ""
    except Exception, e:
        print e

    try:
        if entry.link is not None:
            print entry.link
            print ""
    except Exception, e:
        print e

    try:
        if entry.published[5:16] is not None:
            print entry.published[5:16]
            print ""
    except Exception, e:
        print e

    try:
        if  entry.category is not None:
            print entry.category
            print ""
    except Exception, e:
        print e

    try:
        if entry.get('summary', '') is not None:
            print entry.get('summary', '')
            print ""
    except Exception, e:
        print e

    time.sleep(5)

    r = requests.get(entry.link, headers = {'User-Agent' : 'Safari/534.55.3 '})
    soup = BeautifulSoup(r.text, 'html.parser') 

    # Print the first image that has an alt attribute and a .jpg/.png source
    for img in soup.findAll('img'):
        src = img.get('src', '')  # some <img> tags have no src attribute
        if img.has_attr('alt') and src.endswith(('.jpg', '.png')):
            print src
            break
Ahmed Dhanani asked Nov 08 '15 18:11



1 Answer

It is probably more practical to take a look at the opengraph module:

https://pypi.python.org/pypi/opengraph/0.5

and adapt it to your needs.

It will use og:image, or fetch the "first image" from the HTML code.

If you want to learn how it works, you can also read its source code. The module uses BeautifulSoup too.
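The og:image-first idea with a first-image fallback can be sketched roughly as follows. This is only an illustration, not the opengraph module's code: it uses the standard library's HTMLParser instead of BeautifulSoup so that it has no dependencies, and the names FirstImageParser and pick_post_image are made up for this example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class FirstImageParser(HTMLParser):
    """Hypothetical helper: records og:image and the first <img> src seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.og_image = None
        self.first_img = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('property') == 'og:image':
            # Keep only the first og:image declared by the page
            if self.og_image is None:
                self.og_image = attrs.get('content')
        elif tag == 'img' and self.first_img is None:
            # An <img> without src leaves first_img unset, so the next one is tried
            self.first_img = attrs.get('src')


def pick_post_image(html):
    """Return the og:image URL if present, else the first <img> src, else None."""
    parser = FirstImageParser()
    parser.feed(html)
    return parser.og_image or parser.first_img
```

With this, pick_post_image(r.text) could stand in for the alt-attribute loop in the question. The returned value may still be a relative URL.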

I needed the following monkeypatch to activate scraping as a fallback:

import re
from bs4 import BeautifulSoup
from opengraph import OpenGraph

def parser(self, html):
    """Collect og: properties from html; scrape the body as a fallback."""
    if not isinstance(html,BeautifulSoup):
        doc = BeautifulSoup(html, from_encoding='utf-8')
    else:
        doc = html
    ogs = doc.html.head.findAll(property=re.compile(r'^og'))
    for og in ogs:
        self[og[u'property'][3:]]=og[u'content']

    # Couldn't fetch all attrs from og tags, try scraping body
    if not self.is_valid() and self.scrape:
        for attr in self.required_attrs:
            if not hasattr(self, attr):
                try:
                    self[attr] = getattr(self, 'scrape_%s' % attr)(doc)
                except AttributeError:
                    pass


OpenGraph.parser = parser
OpenGraph.scrape = True   # workaround for some subtle bug in opengraph

You may need to handle relative URLs in the image sources, but that is quite straightforward using urljoin from urlparse:

import opengraph
from urlparse import urljoin  # Python 2; in Python 3 it lives in urllib.parse
...
page = opengraph.OpenGraph(url=link, scrape=True)
...
if page.is_valid():
    ...
    image_url = page.get('image', None)
    ...
    if image_url and not image_url.startswith('http'):
        image_url = urljoin(page['_url'], image_url)

(some checks are omitted from the code fragment for brevity)
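As a small standalone sketch of the URL handling, the helper below resolves an image source against the page URL. The name resolve_image_url is hypothetical. Since urljoin returns an absolute URL unchanged, the startswith('http') test is not strictly required, and the None check covers pages where no image was found.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def resolve_image_url(page_url, image_url):
    """Return an absolute image URL, or None if no image was found."""
    if not image_url:
        return None
    # urljoin handles relative, root-relative and protocol-relative sources,
    # and leaves already-absolute URLs untouched
    return urljoin(page_url, image_url)
```

For example, resolve_image_url(page['_url'], page.get('image', None)) could replace the last two lines of the fragment above.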

Roman Susi answered Nov 06 '22 18:11
