
How to scrape logos from websites?

First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me for scraping (css_parser, nokogiri, etc.); I'm using Ruby to do the scraping.

This is more of an overarching question about the best way to find a website's logo, starting with nothing but the website address.

The two solutions I've begun to create are these:

  1. Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
  2. The problem with the above is that Google doesn't seem to pick up CSS image-replaced logos (i.e., H1 text that is replaced with the logo image). The solution I've tentatively come up with is to pull down all the CSS files, scan for url() declarations, and then look for the words "header" or "logo" in the file names.

Solution two is problematic because of the many idiosyncrasies in how people write CSS for websites. Some use "header" instead of "logo" in the file name. Sometimes the file name is random and says nothing about a logo. Other times a matching file is simply the wrong image.
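For reference, the CSS-scanning idea in solution two could be sketched roughly like this. It is a minimal sketch using only Ruby's standard library, not a finished implementation: the method names, the keyword weights, and the list of image extensions are all hypothetical choices for illustration, and the regular expression only covers the common forms of url(). It ranks "logo" above "header" and drops images whose file names match neither, so it still misses randomly named logo files.

```ruby
require "uri"

# Hypothetical keyword weights; tune them against your own sample of sites.
KEYWORD_WEIGHTS = { "logo" => 2, "header" => 1 }.freeze
IMAGE_EXTENSIONS = %w[.png .gif .jpg .jpeg .svg].freeze

# Pull every url(...) target out of a stylesheet, quoted or unquoted.
def css_urls(css)
  css.scan(/url\(\s*["']?([^"')\s]+)["']?\s*\)/i).flatten
end

# Score a URL by the keywords found in its file name only.
def logo_score(url)
  name = File.basename(url.split(/[?#]/).first.to_s).downcase
  return 0 unless IMAGE_EXTENSIONS.include?(File.extname(name))

  KEYWORD_WEIGHTS.sum { |word, weight| name.include?(word) ? weight : 0 }
end

# Return absolute candidate URLs, best score first.
# Images with a score of zero are dropped.
def logo_candidates(css, stylesheet_url)
  css_urls(css)
    .reject { |u| u.start_with?("data:") } # skip inline data URIs
    .uniq
    .map { |u| [URI.join(stylesheet_url, u).to_s, logo_score(u)] }
    .select { |_, score| score > 0 }
    .sort_by { |_, score| -score }
    .map(&:first)
end
```

Relative paths are resolved against the stylesheet's own URL rather than the page URL, since that is how url() references in an external CSS file are interpreted.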

I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.

So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)

Thanks!

Keith Hanson asked Apr 09 '11 20:04



2 Answers

Check this API by Clearbit. It's super simple to use:

Just send a query to: https://logo.clearbit.com/[enter-domain-here]

For example: https://logo.clearbit.com/www.stackoverflow.com

and get back the logo image!

More about it here
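Since the question mentions Ruby, here is a minimal sketch of building that URL from an arbitrary website address. The method names `clearbit_logo_url` and `fetch_logo` are hypothetical helpers, not part of any Clearbit library, and the sketch assumes the endpoint still behaves the way this answer describes, which may no longer hold.

```ruby
require "uri"
require "net/http"

CLEARBIT_LOGO_BASE = "https://logo.clearbit.com/"

# Accepts a bare domain ("stackoverflow.com") or a full address
# ("https://www.stackoverflow.com/questions") and returns the logo URL.
def clearbit_logo_url(address)
  address = address.strip
  # URI.parse only fills in the host when a scheme is present.
  address = "http://#{address}" unless address =~ %r{\A[a-z][a-z0-9+.-]*://}i
  host = URI.parse(address).host
  raise ArgumentError, "no host in #{address.inspect}" if host.nil? || host.empty?

  CLEARBIT_LOGO_BASE + host.downcase
end

# Returns the image bytes, or nil when the service does not answer with 2xx.
# Assumption: the endpoint is reachable and returns the image directly.
def fetch_logo(address)
  response = Net::HTTP.get_response(URI(clearbit_logo_url(address)))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

A `nil` result from `fetch_logo` is a reasonable signal to fall back to another approach for that site.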

Anupam answered Oct 19 '22 16:10



I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was to load each webpage in WebKit so that images referenced from CSS or JavaScript were loaded as well. This technique gave me logos for ~40% of the websites.

Then I considered creating an app, like Nick suggested, to manually select the logo for the remaining websites. However, I realized it was more cost-effective to hand these to someone inexpensive (whom I found via Elance) to do the work manually.

So I suggest you don't bother trying to solve this properly with a fully technical solution - outsource the manual labour.

hoju answered Oct 19 '22 16:10
