Getting the thumbnail image from a Web page

I have C# code for fetching images from direct URLs like http://i.imgur.com/QvkaduU.jpg, but how would I fetch the image from a Web page like this: http://imgur.com/gallery/QvkaduU?

Is there an "easy" way to do this, or will I have to fetch the HTML and write a C# parser that looks through it for the image that is bigger than all the others?

To clarify: if you paste http://imgur.com/gallery/QvkaduU (the HTML page) into, for example, Facebook's status update field, it finds the main image and makes a thumbnail out of it. This is exactly the behavior I'm looking for. How is this done? Do I have to write my own HTML parser, or is there an easier way?

Banshee asked Mar 19 '13 19:03


4 Answers

There is no easy way to get a "good" thumbnail image for an arbitrary URL.

Facebook's algorithm for doing so is fairly complex. Page developers are able to give it a hint by adding various meta tags to the <head>, including:

<meta property="og:image" content="http://url_to_your_image_here" />

or

<link rel="image_src" href="http://www.code-digital.co.uk/preview.jpg" />

(more on this)

... so if you wanted to replicate Facebook's algorithm, you would need to fetch the page source, parse it for any "hints" like the ones above (check that I haven't missed any other "hint" formats), and come up with a fallback algorithm for pages that don't include one.
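
As a minimal sketch of the "parse it for hints" step, the C# below looks for the two tags shown above and returns the first image URL it finds. The class and method names (ThumbnailHints, FindHint) are hypothetical, the regexes assume the attribute order used in the examples above, and this is not a reconstruction of Facebook's actual algorithm. A real HTML parser would be more robust than regular expressions.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: looks only for the two hint formats quoted in this answer.
public static class ThumbnailHints
{
    // <meta property="og:image" content="..."> (assumes property comes before content)
    private static readonly Regex OgImage = new Regex(
        @"<meta[^>]+property\s*=\s*[""']og:image[""'][^>]*\bcontent\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // <link rel="image_src" href="..."> (assumes rel comes before href)
    private static readonly Regex ImageSrc = new Regex(
        @"<link[^>]+rel\s*=\s*[""']image_src[""'][^>]*\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the hinted image URL resolved against the page URL,
    // or null when the page carries neither hint (the caller then needs a fallback).
    public static Uri FindHint(string html, Uri pageUri)
    {
        foreach (Regex pattern in new[] { OgImage, ImageSrc })
        {
            Match m = pattern.Match(html);
            if (m.Success)
            {
                // Attribute values may contain HTML entities such as &amp;
                string value = WebUtility.HtmlDecode(m.Groups[1].Value);
                return new Uri(pageUri, value);
            }
        }
        return null;
    }
}
```

The HTML itself could be fetched with something like `await new HttpClient().GetStringAsync(pageUrl)` and passed to FindHint together with the page's Uri. A null result means no hint was found and the fallback has to take over.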

A more realistic solution would be to use someone else's URL -> thumbnail system.

If you like Facebook's version, I think you should be able to request Facebook's thumbnail for a given URL via their API.

Other services which offer this sort of thing are:

  • http://webthumb.bluga.net/home (not free)
  • http://immediatenet.com/thumbnail_api.html (free, may have restrictive TOS)
  • https://www.google.com/search?q=get+thumbnail+for+url
Rich answered Oct 07 '22 17:10


If the QvkaduU part is always the same between the HTML page and the image, could you just do a string replacement? (This assumes the image is always a .jpg.)

string imageUrl = "http://imgur.com/gallery/QvkaduU".Replace("imgur.com/gallery", "i.imgur.com") + ".jpg";

overflowedstack answered Oct 07 '22 17:10


I would fetch the whole HTML source and use a regex to extract every <img ... src="..."> attribute as well as every inline CSS < ... style="... background-image: ...;"> property, then temporarily download all the files behind those links. Then I would (try to convert each one to a Bitmap and) check its pixel size; the largest picture should be the one you want.

A quick search will show you how to check the pixel size of an image and how to convert between image formats.

The regex to get all image links from an HTML source should be

<img[^>]+src=\"([^"]+)\".*?>|<[^>]+style=\"[^"]*background-image:\s*url\(\s*'?([^')]+?)\s*'?\)\s*;.*?> (not tested, but pretty sure)

The result will be in capture group 1 or 2 (index 0 is the whole match). Also, don't forget to resolve relative links against the page URL.
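
A rough sketch of that extraction step in C# might look like the following. The names ImageLinkExtractor and Extract are hypothetical, and the pattern is a simplified variant of the regex above (it drops the trailing `;.*?>` part and only handles double-quoted attributes), so treat it as a starting point and not a complete HTML parser. Relative links are resolved with Uri.TryCreate against the page URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects candidate image URLs from raw HTML.
public static class ImageLinkExtractor
{
    // Group 1: src of an <img> tag. Group 2: url(...) inside an inline background-image.
    private static readonly Regex ImagePattern = new Regex(
        @"<img[^>]+src=""([^""]+)""|<[^>]+style=""[^""]*background-image:\s*url\(\s*'?([^')]+?)\s*'?\)",
        RegexOptions.IgnoreCase);

    // Returns absolute image URLs in the order they appear in the document.
    public static List<Uri> Extract(string html, Uri pageUri)
    {
        var result = new List<Uri>();
        foreach (Match m in ImagePattern.Matches(html))
        {
            // Exactly one of the two groups participates in each match.
            string raw = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;

            // Resolves relative links against the page URL; skips anything unparsable.
            if (Uri.TryCreate(pageUri, raw, out Uri absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

Each returned URL could then be downloaded and measured to find the largest picture.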

Martin Braun answered Oct 07 '22 19:10


You're already on the right track. The most reliable way is to fetch the HTML, parse it, look for images, and then rank them by position and size. For instance, if the first image you find is big enough to make the thumbnail, use it; if it is too small, move on to the next image, and so on. I'd also advise using an image plugin like Timthumb (I think I've seen an ASP.NET version at some point) and caching the images, so that once you've looked up the thumbnail for a website you can serve it from the cache instead.
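
To illustrate the "first image that is big enough, then cache it" idea, here is a small C# sketch. ImageCandidate and ThumbnailPicker are hypothetical names, the minimum dimensions are whatever the caller decides, and the widths and heights are assumed to have been measured already (for example after downloading each image). The cache is a plain in-memory dictionary, which is only an assumption about where cached results would live.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of one image found on a page, with its measured pixel size.
public sealed class ImageCandidate
{
    public ImageCandidate(string url, int width, int height)
    {
        Url = url;
        Width = width;
        Height = height;
    }

    public string Url { get; }
    public int Width { get; }
    public int Height { get; }
}

// Hypothetical picker: first sufficiently large image wins, result is cached per page.
public sealed class ThumbnailPicker
{
    private readonly Dictionary<string, string> cache = new Dictionary<string, string>();
    private readonly int minWidth;
    private readonly int minHeight;

    public ThumbnailPicker(int minWidth, int minHeight)
    {
        this.minWidth = minWidth;
        this.minHeight = minHeight;
    }

    // candidates must be in document order; returns null if none is big enough.
    public string Pick(string pageUrl, IEnumerable<ImageCandidate> candidates)
    {
        if (cache.TryGetValue(pageUrl, out string cached))
        {
            return cached; // already looked up for this page
        }

        foreach (ImageCandidate c in candidates)
        {
            if (c.Width >= minWidth && c.Height >= minHeight)
            {
                cache[pageUrl] = c.Url;
                return c.Url;
            }
        }
        return null;
    }
}
```

Pages with no suitable image are deliberately not cached here, so a later call can try again with fresh candidates.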

Chibueze Opata answered Oct 07 '22 17:10