Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get information from a web page (title, pictures, heads, etc...)

In Facebook, when you add a link to your wall, it gets the title, pictures and part of the text. I've seen this behavior in other websites where you can add links, how does it work? does it has a name? Is there any javascript/jQuery extension that implements it?

And how is possible that facebook goes to another website and gets the html when it's, supposedly, forbidden to make a cross site ajax call ??

Thanks.

like image 648
vtortola Avatar asked Dec 04 '22 23:12

vtortola


1 Answers

Basic Methodology

When the fetch event is triggered (for example on Facebook pasting a URL in) you can use AJAX to request the url*, then parse the returned data as you wish.

Parsing the data is the tricky bit, because so many websites have varying standards. Taking the text between the title tags is a good start, along with possibly searching for a META description (but these are being used less and less as search engines evolve into more sophisticated content based searches).

Failing that, you need some way of finding the most important text on the page and taking the first 100 chars or so as well as finding the most prominent picture on the page.

This is not a trivial task, it is extremely complicated trying to derive semantics from such a liquid and contrasting set of data (a generic returned web page). For example, you might find the biggest image on the page, that's a good start, but how do you know it's not a background image? How do you know that's the image that best describes that page?

Good luck!

*If you can't directly AJAX third party URL's, this can be done by requesting a page on your local server which fetches the remote page server side with some sort of HTTP request.

Some Extra Thoughts

If you grab an image from a remote server and 'hotlink' it on your site, many sites seem to sometimes have 'anti hotlinking' replacement images when you try and display this image, so it might be worth comparing the requested image from your server page with the actual fetched image so you don't show anything nasty by accident.

A lot of title tags in the head will be generic and non descriptive, it would be better to fetch the title of the article (assuming an article type site) if there is one available as it will be more descriptive, finding this is difficult though!

If you are really smart, you might be able to piggy back off Google for example (check their T&C though). If a user requests a certain URL, you can google search it behind the scenes, and use the returned google descriptive text as your return text. If google changes their markup significantly though this could break very quickly!

like image 173
Tom Gullen Avatar answered Dec 07 '22 22:12

Tom Gullen