I'm not sure what this is called, but basically I want to retrieve a relevant image and a text summary from a given URL.
For example, when a user pastes a link into the share box on Facebook, it immediately fetches the article title and/or a short block of text from the article itself, plus a relevant image. It never picks the wrong image (such as the site logo) or text from around the article rather than from the article itself.
Same for Google+ and other social networks or services like these.
I started by assuming I need to read the page content using the code below. How can I determine which image is the relevant one (from the article body) and which text is the article text?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Print the raw HTML of the page, line by line
URL oracle = new URL("http://www.oracle.com/");
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}
I'm of course not asking for code here (unless someone has a snippet and is willing to share), but more for how to even approach this. Where do I start?
Any help will be appreciated!
I can recommend Boilerpipe for raw text extraction. It uses some advanced algorithms to find the relevant text and remove the boilerplate surrounding it (menus, footers, etc.).
Regarding the image, apart from using meta tags as already suggested in the comments, you could use an HTML parser (like htmlparser) to extract all "img" tags, and then use some heuristics to select the best one. I'm using some heuristics like:
I've been using these heuristics in production for page scraping for some time and they give good results.
However, to properly apply these rules, you may need to download images to get their size and/or parse style attributes.
If you are planning to run this server side, as a page-scraping service, that's fine. If you are planning to do it on the fly on an Android device, it could be too heavy.
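To make the meta-tag and "img" ideas a bit more concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions, not the code I run in production: the class name PreviewSketch and its methods are made up for this example, it uses regular expressions instead of a real HTML parser (so it will miss unquoted attributes or tags containing ">" inside attribute values), and the "largest declared width x height" rule is an assumed stand-in heuristic. It first looks for Open Graph meta tags such as og:image, and only falls back to the "img" tags if none is present. It uses only the declared width/height attributes, so no image has to be downloaded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreviewSketch {
    // Match whole <meta ...> and <img ...> tags; attributes are parsed separately.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Value of a quoted attribute inside a single tag, or null if it is absent.
    static String attr(String tag, String name) {
        Matcher m = Pattern.compile(
                "\\s" + Pattern.quote(name) + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE).matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    // Content of the first <meta> tag whose property equals e.g. "og:title".
    static String metaContent(String html, String property) {
        Matcher m = META_TAG.matcher(html);
        while (m.find()) {
            if (property.equalsIgnoreCase(attr(m.group(), "property"))) {
                return attr(m.group(), "content");
            }
        }
        return null;
    }

    // Integer attribute value, or 0 when missing or not a plain number ("100%").
    static int intAttr(String tag, String name) {
        String v = attr(tag, name);
        try {
            return v == null ? 0 : Integer.parseInt(v.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Assumed heuristic: the <img> with the largest declared width * height.
    static String largestDeclaredImage(String html) {
        String best = null;
        long bestArea = 0;
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            String src = attr(m.group(), "src");
            long area = (long) intAttr(m.group(), "width") * intAttr(m.group(), "height");
            if (src != null && area > bestArea) {
                best = src;
                bestArea = area;
            }
        }
        return best;
    }

    // Prefer the publisher's own og:image; otherwise fall back to the heuristic.
    static String pickImage(String html) {
        String og = metaContent(html, "og:image");
        return og != null ? og : largestDeclaredImage(html);
    }
}
```

You would pass in the HTML you already downloaded, e.g. PreviewSketch.pickImage(html) for the image and PreviewSketch.metaContent(html, "og:title") or "og:description" for the text. Images without declared dimensions are ignored by the fallback, which is exactly the case where you would have to download them to get their real size.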