Get relevant image and summary from URL

Tags:

java

android

I'm not sure how to define it but basically I want to retrieve a relevant image and text summary from a given URL.

For example, when a user pastes a link into the share box on Facebook, it immediately fetches the article title and/or a short block of text from the article itself, plus a relevant image. It never picks the wrong image (such as the site logo) or text from around the article rather than from it.

Same for Google+ and other social networks or services like these.

I started by assuming I need to read the page content using the code below. From there, how can I determine which image is the relevant one (from the article body) and which text is the article text?

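import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL oracle = new URL("http://www.oracle.com/");

// try-with-resources closes the reader even if reading fails part-way
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()))) {
    String inputLine;
    // Print the raw HTML of the page, one line at a time
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}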

I'm of course not asking for code here (unless someone has a snippet for example and is willing to share) but more for how to even approach this... where do I start?

Any help will be appreciated!

Lior Iluz asked Jul 24 '12 15:07


1 Answer

I can recommend Boilerpipe for raw text extraction. It uses some advanced algorithms to find the relevant text and remove the boilerplate surrounding it (menus, footers, etc.).

Regarding the image, apart from using meta tags as already suggested in the comments, you could use an HTML parser (like htmlparser) to extract all "img" tags and then apply some heuristics to select the best one. The heuristics I use are:

  • No image smaller than 30px; these are usually icons or ad-tracking images
  • The squarer the better; this avoids rulers and similar decorative strips
  • No standard, well-known banner sizes
  • The higher in the page the better
  • Near the content extracted by Boilerpipe (this is hard)

I've been using these heuristics in production for page scraping for some time and they give good results.
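To make the first four rules concrete, here is a minimal sketch of how they could be turned into a score. It is not my production code: the class names (ImageCandidate, ImagePicker), the list of banner sizes, the reading of "30px" as "either side under 30 pixels" and the equal weighting of squareness and page position are all assumptions made for illustration. The fifth rule (distance from the extracted text) is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical holder for one <img> tag whose pixel size is already known. */
final class ImageCandidate {
    final String src;
    final int width;
    final int height;
    final int order; // 0 = first <img> in document order

    ImageCandidate(String src, int width, int height, int order) {
        this.src = src;
        this.width = width;
        this.height = height;
        this.order = order;
    }
}

final class ImagePicker {
    // Assumption: "smaller than 30px" means either side is under 30 pixels.
    private static final int MIN_SIDE_PX = 30;

    // Assumption: a few common ad banner sizes; extend the list as needed.
    private static final Set<String> BANNER_SIZES = new HashSet<>(Arrays.asList(
            "728x90", "468x60", "300x250", "336x280", "160x600", "120x600"));

    /** Returns -1 for rejected images, otherwise a positive score (higher is better). */
    static double score(ImageCandidate c) {
        if (c.width < MIN_SIDE_PX || c.height < MIN_SIDE_PX) {
            return -1; // icon or tracking pixel
        }
        if (BANNER_SIZES.contains(c.width + "x" + c.height)) {
            return -1; // known banner size
        }
        // 1.0 for a perfect square, close to 0 for a long thin strip
        double squareness = (double) Math.min(c.width, c.height) / Math.max(c.width, c.height);
        // 1.0 for the first image in the page, decreasing further down
        double position = 1.0 / (1 + c.order);
        return squareness + position; // equal weights are an arbitrary choice
    }

    /** Returns the highest-scoring candidate, or null if every image was rejected. */
    static ImageCandidate pickBest(List<ImageCandidate> candidates) {
        ImageCandidate best = null;
        double bestScore = -1;
        for (ImageCandidate c : candidates) {
            double s = score(c);
            if (s > bestScore) {
                best = c;
                bestScore = s;
            }
        }
        return best;
    }
}
```

In practice the weights need tuning against real pages; a large photo further down the article can easily deserve to beat a small square thumbnail near the top.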

However, to apply these rules properly, you may need to download the images to get their size and/or parse their style attributes.
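For the size check on a server, one possible approach is to let javax.imageio read only the image header instead of decoding the whole picture. The sketch below assumes a standard JVM (javax.imageio is not part of the Android SDK); the class name ImageSize is made up for the example.

```java
import java.awt.Dimension;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

final class ImageSize {
    /**
     * Returns the pixel dimensions of the image in the stream,
     * or null if no installed ImageIO reader recognises the format.
     */
    static Dimension read(InputStream in) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            if (iis == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return null; // unknown or unsupported format
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the header is needed
                reader.setInput(iis, true, true);
                return new Dimension(reader.getWidth(0), reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        }
    }
}
```

It could be called as ImageSize.read(new URL(src).openStream()), where src is the absolute URL taken from the "img" tag. The request still costs a round trip per image, so it is worth skipping images whose width and height attributes already rule them out.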

If you plan to run this server-side, as a page-scraping service, that is fine. If you plan to do it on the fly on an Android device, it could be too heavy.
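On a device, the meta-tag route mentioned above is a much cheaper first pass, because it needs only the HTML that has already been downloaded. The sketch below assumes the page publishes Open Graph style tags such as og:image or og:title (many pages do not). It uses a regular expression only to stay dependency-free; a real HTML parser is more robust. The class name MetaTags is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaTags {
    // Matches each <meta ...> tag; a regex is a shortcut here, not a real HTML parser.
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the content attribute of the first meta tag whose property
     * (or name) attribute equals key, or null if there is none.
     * HTML entities in the value are not decoded.
     */
    static String content(String html, String key) {
        Matcher tags = META.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            String id = attr(tag, "property");
            if (id == null) {
                id = attr(tag, "name");
            }
            if (key.equalsIgnoreCase(id)) {
                return attr(tag, "content");
            }
        }
        return null;
    }

    /** Extracts a quoted attribute value from a single tag, or null if absent. */
    private static String attr(String tag, String name) {
        Matcher m = Pattern.compile(
                "\\s" + Pattern.quote(name) + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(tag);
        return m.find() ? m.group(2) : null;
    }
}
```

A caller could try MetaTags.content(html, "og:image") and MetaTags.content(html, "og:title") first, and fall back to the image heuristics and Boilerpipe only when those return null.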

Simone Gianni answered Oct 11 '22 06:10