Fetching the first image from a website that belongs to the post

Question

I've written a program that fetches the desired information from a blog or any other page. Next, I want to retrieve the first image on that page that belongs to the respective post (just like Facebook does when a post is shared).

I was able to achieve this to some extent by fetching the first image with an alt attribute (since many websites don't set alt attributes on their logos, icons, etc., the first image that has one should belong to the post). But this does not work in some cases. Is there any other (better) way to achieve this? I'm using Python 2.7.9 and BeautifulSoup 4.

import time

import feedparser
import requests
from bs4 import BeautifulSoup

d = feedparser.parse('http://rss.cnn.com/rss/edition.rss')

for entry in d.entries:
    try:
        if entry.title is not None:
            print entry.title
            print ""
    except Exception, e:
        print e

    try:
        if entry.link is not None:
            print entry.link
            print ""
    except Exception, e:
        print e

    try:
        if entry.published[5:16] is not None:
            print entry.published[5:16]
            print ""
    except Exception, e:
        print e

    try:
        if  entry.category is not None:
            print entry.category
            print ""
    except Exception, e:
        print e

    try:
        if entry.get('summary', '') is not None:
            print entry.get('summary', '')
            print ""
    except Exception, e:
        print e

    time.sleep(5)

    r = requests.get(entry.link, headers = {'User-Agent' : 'Safari/534.55.3 '})
    soup = BeautifulSoup(r.text, 'html.parser') 

    for img in soup.findAll('img'):
        # Skip images without an alt attribute (usually logos and icons)
        if img.has_attr('alt'):
            src = img.get('src', '')  # some <img> tags have no src attribute
            if src.endswith(('.jpg', '.png')):
                print src
                break

Roman Susi · Accepted Answer

It is probably more practical to take a look at the opengraph module:

https://pypi.python.org/pypi/opengraph/0.5

and adapt it to your needs.

It will use the og:image property or, failing that, fetch the "first image" from the HTML code.

If you want to learn how it works, you can also read its source code. The module uses BeautifulSoup too.
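To illustrate the general idea (prefer og:image, otherwise fall back to the first image in the markup), here is a minimal sketch that uses only the standard library. It is not the opengraph module's actual code; the names PostImageFinder and find_post_image are made up for this example, and the compatibility import is only there so that it runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class PostImageFinder(HTMLParser):
    """Remember the first og:image value and the first <img> src seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.og_image = None
        self.first_img = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('property') == 'og:image':
            if self.og_image is None:
                self.og_image = attrs.get('content')
        elif tag == 'img' and self.first_img is None:
            # Stays None if this <img> has no src, so the next one is tried
            self.first_img = attrs.get('src')


def find_post_image(html):
    """Return the og:image URL if present, else the first <img> src, else None."""
    finder = PostImageFinder()
    finder.feed(html)
    return finder.og_image or finder.first_img


sample = ('<html><head><meta property="og:image" '
          'content="http://example.com/a.jpg"></head>'
          '<body><img src="/logo.png"></body></html>')
print(find_post_image(sample))  # http://example.com/a.jpg
```

Sites that set og:image usually choose it for exactly the "shared post" case from the question, which is why it is checked before any image in the body.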

I needed the following monkeypatch to activate scraping as a fallback:

import re
from bs4 import BeautifulSoup
from opengraph import OpenGraph

def parser(self, html):
    """Parse og: properties from html; optionally scrape the body for missing ones."""
    if not isinstance(html,BeautifulSoup):
        doc = BeautifulSoup(html, from_encoding='utf-8')
    else:
        doc = html
    ogs = doc.html.head.findAll(property=re.compile(r'^og'))
    for og in ogs:
        self[og[u'property'][3:]]=og[u'content']

    # Couldn't fetch all attrs from og tags, try scraping body
    if not self.is_valid() and self.scrape:
        for attr in self.required_attrs:
            if not hasattr(self, attr):
                try:
                    self[attr] = getattr(self, 'scrape_%s' % attr)(doc)
                except AttributeError:
                    pass


OpenGraph.parser = parser
OpenGraph.scrape = True   # workaround for some subtle bug in opengraph

You may need to handle relative URLs in the image sources, but that is quite straightforward using urljoin from the urlparse module:

import opengraph
from urlparse import urljoin
...
page = opengraph.OpenGraph(url=link, scrape=True)
...
if page.is_valid():
    ...
    image_url = page.get('image', None)
    ...
    if not image_url.startswith('http'):
        image_url = urljoin(page['_url'], page['image'])

(Some checks are omitted from the code fragment for brevity.)
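As a small illustration of how urljoin treats different kinds of image sources, here is a sketch with made-up URLs. The helper name absolute_image_url is hypothetical, and the import is written so that it works on both Python 2 (urlparse) and Python 3 (urllib.parse).

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2


def absolute_image_url(page_url, image_src):
    """Resolve image_src against page_url; absolute URLs pass through unchanged."""
    return urljoin(page_url, image_src)


page_url = 'http://example.com/blog/post.html'  # example value, not a real page

# Relative to the page's directory
print(absolute_image_url(page_url, 'img/a.jpg'))
# Relative to the site root
print(absolute_image_url(page_url, '/img/a.jpg'))
# Protocol-relative: takes the scheme from page_url
print(absolute_image_url(page_url, '//cdn.example.com/a.jpg'))
# Already absolute: returned as is
print(absolute_image_url(page_url, 'https://other.example/a.jpg'))
```

Since urljoin returns an already absolute URL unchanged, it can be applied to every image source without testing for an 'http' prefix first. That also covers protocol-relative sources such as '//cdn.example.com/a.jpg', which a startswith('http') test would not recognize as needing no path resolution.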

Tags: python, beautifulsoup, python-2.7, web-scraping

Ahmed Dhanani
