
Body text extraction from websites, e.g. extract only the article heading and text, not all the text on the site

I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this.

So for example for a news article I would like to identify the heading and all the text, but not the comments section and so on.

Are there any algorithms for that out there? Thank you!

Scoox asked Apr 21 '11 15:04



1 Answer

In the computer science literature this problem is usually called page segmentation or boilerplate detection. See the report "Boilerplate Detection using Shallow Text Features" and its related blog post. I also have a few reports and software sites bookmarked that address the problem, and there is a related Stack Overflow question.
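
To give a rough feel for what "shallow text features" means in practice, here is a minimal sketch using only the Python standard library. It is not the classifier from the report: it only illustrates two features of that kind, the number of words in a block and its link density (the share of words inside `<a>` tags). The names `BlockCollector` and `extract_main_text`, the list of block tags, and the thresholds `min_words=10` and `max_link_density=0.3` are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td",
              "article", "section", "ul", "ol", "table", "tr", "br"}
# Tags whose text is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts linked vs. total words."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, word_count, link_word_count)
        self._words = []      # words of the block currently being built
        self._link_words = 0  # how many of those words sit inside <a> tags
        self._in_link = 0     # nesting depth of open <a> tags
        self._skip = 0        # nesting depth of open script/style tags

    def _flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Return the blocks that look like running text rather than boilerplate."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, n_words, n_link_words in parser.blocks:
        link_density = n_link_words / n_words  # n_words > 0 for stored blocks
        if n_words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_main_text(page_source)` on a typical article page would be expected to drop navigation bars and link-heavy footers while keeping the long paragraphs. Note that this simple word-count rule also drops short headings, so the article title would need separate handling (for example, always keeping the first `<h1>`), and real comment sections made of long plain-text blocks would need additional features to be filtered out.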

Jeff Kubina answered Oct 25 '22 19:10
