
How can I extract images from a site that I'm linking to?

If you're familiar with Reddit, you'll know that every post containing a picture gets a small thumbnail preview beside the title of the submission. How does Reddit go about doing that? Does it just check whether the link ends with .jpg, .png, .bmp, etc.?

vette982 asked Mar 28 '10 00:03



2 Answers

reddit tries to pull a thumbnail from any link, not just a direct image URL. It first applies set rules for specific sites, then falls back to one generic process for retrieving thumbnails from unknown URLs. All of this runs as an automated periodic task.

One of the (many) benefits of reddit is that the source code is open, and if you understand Python, you should check out /r2/lib/scraper.py for a more detailed look at how this process works.

Also, while StackOverflow is a great place to have programming-related questions answered, you might also want to check out reddit's own /r/redditdev for information on reddit development.

Hey there redditor!

gpmcadam answered Sep 22 '22 10:09



  1. Indeed, if the URL contains .jpg, .png, etc., use that.
  2. If the site is a popular domain (flickr.com, youtube.com, amazon.com, etc.), have a set of predefined rules to extract something you know will be relevant (be it the featured image, YouTube thumbnail, Amazon product image, etc.).
  3. Otherwise, if all you have to work with is some HTML, you'll have to dig it out yourself. You could choose the first image on the page, the biggest by size, or even the one you've algorithmically determined to be the most relevant (e.g. relatively big, inside what you think is the main body content).
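
To make the three steps concrete, here is a minimal Python sketch using only the standard library. It is not how reddit's scraper works; it only illustrates the order of the checks. The names `pick_thumbnail`, `ImageCollector`, `DOMAIN_RULES`, and `example_video_rule` are made up for this example, and the `video.example` / `thumbs.example` rule is a placeholder, not a real site's URL scheme. Step 3 assumes "biggest" means the largest declared `width` x `height` in the markup, which many pages don't provide; in that case it falls back to the first image.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def example_video_rule(parsed):
    """Hypothetical rule: video.example/watch?v=ID -> thumbs.example/ID.jpg."""
    video_id = parse_qs(parsed.query).get("v", [None])[0]
    return f"https://thumbs.example/{video_id}.jpg" if video_id else None


# Step 2 lookup table. The single entry is a made-up placeholder domain.
DOMAIN_RULES = {"video.example": example_video_rule}


class ImageCollector(HTMLParser):
    """Collects (src, declared_area) for every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        if not src:
            return
        try:
            area = int(attrs.get("width", 0)) * int(attrs.get("height", 0))
        except (TypeError, ValueError):
            area = 0  # missing or non-numeric dimensions
        self.images.append((src, area))


def pick_thumbnail(url, html=None):
    """Return a thumbnail URL for `url`, or None if nothing is found."""
    parsed = urlparse(url)

    # Step 1: the link itself points at an image.
    if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
        return url

    # Step 2: a predefined rule exists for this domain.
    rule = DOMAIN_RULES.get(parsed.netloc.lower())
    if rule:
        thumb = rule(parsed)
        if thumb:
            return thumb

    # Step 3: dig through the HTML; the largest declared area wins,
    # and ties (including all-zero areas) go to the first image.
    if html:
        collector = ImageCollector()
        collector.feed(html)
        if collector.images:
            src, _ = max(collector.images, key=lambda item: item[1])
            return urljoin(url, src)

    return None
```

Fetching the page is left out on purpose; you'd pass the downloaded markup in as `html`. Measuring the real pixel size of each image (rather than trusting the attributes) would need an extra download per candidate.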

If you have to resort to the last option, one technique I'd recommend is to extract multiple images, and A/B test them to find the one which has the best click-through rate. That way you can nearly always get the best one.
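
A rough sketch of that A/B idea in Python, under some assumptions of mine: candidates are shown at random until each has a minimum number of impressions, and then the one with the highest click-through rate is served from there on. The class name `ThumbnailTest` and the `min_impressions` threshold are hypothetical, and there is no statistical significance check here, so treat it as a starting point only.

```python
import random


class ThumbnailTest:
    """Tracks impressions and clicks per candidate thumbnail."""

    def __init__(self, candidates, min_impressions=100, rng=None):
        if not candidates:
            raise ValueError("need at least one candidate")
        if min_impressions < 1:
            raise ValueError("min_impressions must be at least 1")
        self.stats = {c: {"impressions": 0, "clicks": 0} for c in candidates}
        self.min_impressions = min_impressions
        self.rng = rng or random.Random()

    def record(self, candidate, clicked):
        """Log one impression of `candidate` and whether it was clicked."""
        entry = self.stats[candidate]
        entry["impressions"] += 1
        if clicked:
            entry["clicks"] += 1

    def winner(self):
        """Best click-through rate, or None while data is still thin."""
        if any(s["impressions"] < self.min_impressions
               for s in self.stats.values()):
            return None
        return max(
            self.stats,
            key=lambda c: self.stats[c]["clicks"] / self.stats[c]["impressions"],
        )

    def choose(self):
        """Serve the winner once known, otherwise a random candidate."""
        best = self.winner()
        if best is not None:
            return best
        return self.rng.choice(list(self.stats))
```

In use, you'd call `choose()` when rendering the link, then `record(candidate, clicked)` once you know whether that impression led to a click.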

Ashley Williams answered Sep 23 '22 10:09
