
How can I extract only the main textual content from an HTML page?

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content: many pages have no article, only links with short descriptions pointing to the full texts (this is common on news portals), and I don't want to discard these short texts.

So if you know an API that returns the different textual parts (the blocks) separately, rather than merged into a single text (everything in one text is not useful to me), please let me know.


The Question

I have downloaded some pages from random sites, and now I want to analyze the textual content of each page.

The problem is that a web page has a lot of other content, such as menus, advertising, banners, etc.

I want to exclude everything that is not related to the content of the page.

Taking this page as an example, I want neither the menus at the top nor the links in the footer.

Important: All pages are HTML and come from many different sites. I need suggestions on how to exclude this content.

At the moment, I am thinking of excluding content inside elements with "menu" and "banner" classes, as well as runs of consecutive words that look like proper names (first letter capitalized).
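The capitalized-words idea can be tried on plain-text lines without any library. The following is only a minimal sketch of that heuristic, not a tested classifier: the class name `MenuLineHeuristic`, its methods, the three-word minimum and the 0.8 ratio are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class MenuLineHeuristic {

    // Hypothetical cut-offs; they would need tuning on real pages.
    private static final int MIN_WORDS = 3;
    private static final double MIN_CAPITALIZED_RATIO = 0.8;

    /** Returns true when almost every word of an unpunctuated line starts with a capital letter. */
    static boolean looksLikeMenu(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        // Sentences usually end with punctuation; menu rows usually do not.
        char last = trimmed.charAt(trimmed.length() - 1);
        if (last == '.' || last == '?' || last == '!') {
            return false;
        }
        String[] words = trimmed.split("\\s+");
        if (words.length < MIN_WORDS) {
            return false;
        }
        long capitalized = Arrays.stream(words)
                .filter(w -> Character.isUpperCase(w.charAt(0)))
                .count();
        return (double) capitalized / words.length >= MIN_CAPITALIZED_RATIO;
    }

    /** Keeps only the lines that the heuristic does not flag as menu-like. */
    static List<String> dropMenuLines(List<String> lines) {
        return lines.stream()
                .filter(line -> !looksLikeMenu(line))
                .collect(Collectors.toList());
    }
}
```

Note that a headline written in title case would also be flagged by this rule, so on its own it is likely too aggressive for news portals.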

The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not with an external application (if that is possible).

I tried an approach that parses the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering

Renato Dinhani asked Aug 11 '11 05:08


2 Answers

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url); 

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml); 

You can also pass a java.io.Reader, which lets you feed in HTML from almost any source, such as a file, a network stream, or an in-memory string.
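To make the Reader option concrete, here is a small sketch of two common ways to obtain a `java.io.Reader` using only the JDK. The class `HtmlReaders` and its method names are hypothetical, and UTF-8 is an assumption; a real page may declare a different charset. The `readAll` helper exists only so the sketch can be checked without Boilerpipe on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlReaders {

    /** Wraps HTML that is already held in memory. */
    static Reader fromString(String html) {
        return new StringReader(html);
    }

    /** Opens a saved page, decoding it as UTF-8 (an assumption, see above). */
    static Reader fromFile(Path file) {
        try {
            return Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains a Reader into a String and closes it. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

In real use the Reader would be handed to Boilerpipe rather than drained, and the caller is responsible for closing it afterwards.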

Kurt Kaylor answered Sep 21 '22 19:09


You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods, feel free to play around!
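Since the question's update asks for the blocks to stay separate rather than merged into one string, the loop above could collect each block's text into one of two lists. The sketch below shows only that grouping step. To stay runnable without the library it uses a hypothetical stand-in class `Block` that mimics just the two TextBlock methods named above, `getText()` and `isContent()`; `BlockGrouping` and `groupByContent` are likewise invented names.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class BlockGrouping {

    /** Hypothetical stand-in for boilerpipe's TextBlock, limited to two methods. */
    static final class Block {
        private final String text;
        private final boolean content;

        Block(String text, boolean content) {
            this.text = text;
            this.content = content;
        }

        String getText() {
            return text;
        }

        boolean isContent() {
            return content;
        }
    }

    /**
     * Groups block texts by classification: key true holds content blocks,
     * key false holds the rest. Each block stays a separate list entry,
     * in document order.
     */
    static Map<Boolean, List<String>> groupByContent(List<Block> blocks) {
        return blocks.stream().collect(Collectors.partitioningBy(
                Block::isContent,
                Collectors.mapping(Block::getText, Collectors.toList())));
    }
}
```

With real boilerpipe objects the same grouping would be done inside the for loop, using block.isContent() to pick the list and block.getText() as the entry.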

Christian Kohlschütter answered Sep 23 '22 19:09