
Retrieving Facebook-like link summaries (title, summary, relevant images) using Python

I would like to replicate the functionality that Facebook uses to parse a link. When you submit a link into your Facebook status, their system goes out and retrieves a suggested title, summary and often one or more relevant images from that page, from which you can choose a thumbnail.

My application needs to accomplish this using Python, but I am open to any kind of a guide, blog post or experience of other developers which relates to this and might help me figure out how to accomplish it.

I would really like to learn from other people's experience before just jumping in.

To be clear, when given the URL of a web page, I want to be able to retrieve:

  1. The title: Probably just the <title> tag but possibly the <h1>, not sure.
  2. A one-paragraph summary of the page.
  3. A set of relevant images that could be used as a thumbnail. (The tricky part is filtering out irrelevant images such as banners or rounded corners.)

I may have to implement it myself, but I would at least want to know about how other people have been doing these kinds of tasks.

Ram Rachum asked Jul 21 '10 11:07



2 Answers

BeautifulSoup is well-suited to accomplish most of this.

Basically, you simply initialize the soup object, then do something like the following to extract what you are interested in:

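from bs4 import BeautifulSoup  # BeautifulSoup 4; `html` is assumed to hold the page source
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')     # the first <title> tag, or None if the page has none
images = soup.find_all('img')  # every <img> tag on the page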

You could then download each image from its URL using urllib2 (urllib.request in Python 3).
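If you would rather not install anything, here is a minimal sketch of the same extraction using only the standard library's html.parser in place of BeautifulSoup. The class name TitleAndImages and the helper parse_title_and_images are made up for illustration, and resolving relative src values against the page URL is my own assumption about what you would want before downloading.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImages(HTMLParser):
    """Collect the <title> text and the absolute URL of every <img> (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title_parts = []
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths such as "/a.png" against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts).strip()


def parse_title_and_images(html, base_url):
    """Return (title, list_of_image_urls) for an HTML string."""
    parser = TitleAndImages(base_url)
    parser.feed(html)
    parser.close()
    return parser.title, parser.image_urls
```

Fetching the page itself is left out on purpose; pass in whatever HTML string you downloaded along with the URL it came from.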

The title is fairly simple, but the images could be a bit more difficult since you have to download each one to get the relevant stats on them. Perhaps you could filter out most of the images based on size and number of colors? Rounded corners, as an example, are going to be small and only have 1-2 colors, generally.
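To make the size-and-colors idea concrete, here is a rough sketch of the filtering step on its own. It assumes you have already downloaded each image and measured its width, height and number of distinct colors with an imaging library of your choice. The thresholds are guesses, not tuned values, and the aspect-ratio check for banners is my own addition to the heuristic.

```python
from collections import namedtuple

# Stats assumed to be measured elsewhere, after downloading each image.
ImageStats = namedtuple("ImageStats", "url width height n_colors")

# Hypothetical thresholds; adjust them against real pages.
MIN_SIDE = 50           # rounded corners and spacers are usually tiny
MIN_COLORS = 16         # decorative images tend to use only a few colors
MAX_ASPECT_RATIO = 3.0  # banners are usually much wider than they are tall


def looks_like_thumbnail(stats):
    """Return True if the image passes all three heuristics."""
    shorter = min(stats.width, stats.height)
    longer = max(stats.width, stats.height)
    if shorter < MIN_SIDE:
        return False
    if stats.n_colors < MIN_COLORS:
        return False
    return longer / shorter <= MAX_ASPECT_RATIO


def rank_candidates(all_stats):
    """Keep plausible thumbnails and sort them largest area first."""
    keep = [s for s in all_stats if looks_like_thumbnail(s)]
    return sorted(keep, key=lambda s: s.width * s.height, reverse=True)
```

Sorting by area is just one way to pick a default; you could equally show every surviving candidate and let the user choose, as Facebook does.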

As for the page summary, that may be a bit more difficult, but I've been doing something like this:

  1. I use BeautifulSoup to remove all style, script, form, and head blocks from the HTML by calling .findAll, then .extract on each match.
  2. I grab the remaining text using: ''.join(soup.findAll(text=True))

In your application, perhaps you could use this "text" content as the page summary?
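The two steps above, plus a trim to summary length, could be sketched roughly like this. It uses the standard library's html.parser in place of BeautifulSoup so it runs without extra installs. The names VisibleText and summarize, the 200-character limit and the cut at a word boundary are all assumptions of mine, not part of the approach described above.

```python
from html.parser import HTMLParser

# Blocks whose contents should not count as page text.
SKIPPED_TAGS = {"style", "script", "form", "head"}


class VisibleText(HTMLParser):
    """Collect text that lies outside the skipped blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside any skipped block

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def summarize(html, max_chars=200):
    """Return the leading visible text of the page, cut at a word boundary."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    text = " ".join(" ".join(parser.chunks).split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + "..."
```

One caveat: this relies on the skipped blocks being explicitly closed, so a page that omits its closing head tag would come back empty.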

I hope this helps.

Donald Miner answered Nov 15 '22 09:11



Here's a complete solution: https://github.com/svven/summary

>>> import summary
>>> s = summary.Summary('http://stackoverflow.com/users/76701/ram-rachum')
>>> s.extract()
>>> s.title
u'User Ram Rachum - Stack Overflow'
>>> s.description
u'Israeli Python hacker.'
>>> s.image
https://www.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad?s=128&d=identicon&r=PG
>>>
ducu answered Nov 15 '22 10:11
