I would like to replicate the functionality that Facebook uses to parse a link. When you submit a link into your Facebook status, their system goes out and retrieves a suggested title, summary and often one or more relevant images from that page, from which you can choose a thumbnail.
My application needs to accomplish this using Python, but I am open to any kind of a guide, blog post or experience of other developers which relates to this and might help me figure out how to accomplish it.
I would really like to learn from other people's experience before just jumping in.
To be clear, when given the URL of a web page, I want to be able to retrieve the title (probably the <title> tag, but possibly the <h1>; not sure), a summary, and relevant images.
I may have to implement it myself, but I would at least like to know how other people have been doing these kinds of tasks.
BeautifulSoup is well-suited to accomplish most of this.
Basically, you simply initialize the soup object, then do something like the following to extract what you are interested in:
title = soup.findAll('title')
images = soup.findAll('img')
You could then download each of the images based on their url using urllib2.
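To make the title and image steps concrete, here is a minimal sketch of the same idea. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs; the names TitleAndImages and parse_page are made up for illustration, and the HTML is assumed to have been downloaded already.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImages(HTMLParser):
    """Collect the <title> text and every <img src> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def parse_page(html, base_url):
    """Return (title, list of absolute image URLs) for an HTML string."""
    parser = TitleAndImages(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.images
```

Calling parse_page(html, page_url) gives you the title and a list of absolute image URLs that you can then download one by one.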
The title is fairly simple, but the images could be a bit more difficult since you have to download each one to get the relevant stats on them. Perhaps you could filter out most of the images based on size and number of colors? Rounded corners, as an example, are going to be small and only have 1-2 colors, generally.
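As a rough illustration of the size filter, the sketch below reads the width and height straight out of a PNG header and rejects anything smaller than a threshold. It only understands PNG; other formats, and counting colors, would need an imaging library. The function names and the min_side value of 50 pixels are assumptions picked for the example, not tested thresholds.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the first bytes of a PNG, else None."""
    # A PNG starts with an 8-byte signature followed by the IHDR chunk:
    # 4-byte length, b"IHDR", then big-endian width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", data[16:24])


def looks_like_thumbnail(data, min_side=50):
    """Keep only PNG images whose width and height both reach min_side."""
    size = png_size(data)
    if size is None:
        return False
    width, height = size
    return width >= min_side and height >= min_side
```

Since only the first 24 bytes are inspected, you would not necessarily need to download a whole image before deciding to discard it.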
As for the page summary, that may be a bit more difficult, but I've been doing something like this:
Remove the blocks I don't want from the html by using: .findAll, then .extract
Grab the remaining text by using: .join(soup.findAll(text = True))
In your application, perhaps you could use this "text" content as the page summary?
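The summary steps above could be sketched roughly like this. Again it uses the standard library's html.parser instead of BeautifulSoup's findAll and extract, so it is an equivalent of the approach rather than the same calls. Which tags to skip (script, style and head here) and the 200-character cutoff are assumptions for the example, as are the names VisibleText and summarize.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is not inside script, style or head blocks."""

    SKIPPED = {"script", "style", "head"}  # assumed list, adjust to taste

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while outside every skipped block.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def summarize(html, max_chars=200):
    """Join the remaining text, collapse whitespace, and truncate it."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text[:max_chars]
```

Truncating at a fixed length is crude; cutting at the end of the first sentence or paragraph would probably read better.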
I hope this helps.
Here's a complete solution: https://github.com/svven/summary
>>> import summary
>>> s = summary.Summary('http://stackoverflow.com/users/76701/ram-rachum')
>>> s.extract()
>>> s.title
u'User Ram Rachum - Stack Overflow'
>>> s.description
u'Israeli Python hacker.'
>>> s.image
https://www.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad?s=128&d=identicon&r=PG
>>>