Extracting Main Content (Highest Text Density) From a News Article Web Page [closed]

I want to write code that extracts the main story from a news website. News pages contain the main story alongside ads, reviews, and copyright notices. I want to get only the main story, the way boilerpipe does, and I want to understand how that is done.

So I am looking for information about how this process works.

Sudhanshu

Sudhanshu Gupta asked Mar 02 '12 12:03



2 Answers

The boilerpipe website contains the source code, quickstart instructions, and links to the original scientific paper and to the corresponding conference presentation video:

http://code.google.com/p/boilerpipe/

This should give you a fairly comprehensive picture of how it works and how you can apply it in your scenario.

Best,

Christian
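
For readers who want a feel for the general idea before reading the paper, here is a minimal sketch of a text-density heuristic. It is not boilerpipe's code or its exact algorithm; it only illustrates the kind of shallow text features (words per block, share of words inside links) that such extractors rely on. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds `min_words=15` and `max_link_density=0.3` are assumptions picked for illustration, not tuned values.

```python
from html.parser import HTMLParser

# Assumed tag lists: tags that start a new text block, and tags whose
# content is never article text.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "blockquote", "pre", "br"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, word_count, link_word_count)
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of script/style/noscript

    def _flush(self):
        # Close the current block, if it holds any words, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words)
            )
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=15, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, n_words, n_link_words in collector.blocks:
        link_density = n_link_words / n_words  # n_words is never 0 here
        if n_words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Navigation bars and footers tend to be short and link-heavy, while article paragraphs are long runs of plain text, which is why these two features alone already separate many pages reasonably well. Real pages will need more than this, for example looking at neighbouring blocks, so treat the sketch as a starting point only.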

Christian Kohlschütter answered Sep 20 '22 03:09



We tried a lot of open-source tools for this, such as Readability and Beautiful Soup, but after testing the Diffbot API we decided to use it for AppMarkt. It is fast and extracts news articles really well across various languages.

Andrei Bourdine answered Sep 22 '22 03:09
