How can I extract only the main textual content from an HTML page?

Update

Boilerpipe appears to work really well, but I realized that I need more than the main content: many pages don't have an article, only links with a short description of the full texts (this is common on news portals), and I don't want to discard these short texts.

So if an API can do this, that is, return the different textual parts (blocks) separately rather than as a single text (everything in one text is not useful), please let me know.

The Question

I have downloaded some pages from random sites, and now I want to analyze the textual content of each page.

The problem is that a web page has a lot of content such as menus, advertising, banners, etc.

I want to exclude everything that is not related to the content of the page.

Taking this page as an example, I don't want the menus at the top or the links in the footer.

Important: All pages are HTML and come from many different sites. I need suggestions on how to exclude this content.

At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, as well as consecutive words that look like proper names (first letter capitalized).

The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not in an external application (if possible).

I tried parsing the HTML content as described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
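
The class-based exclusion idea mentioned in the question can be sketched as a recursive walk that skips any subtree whose class attribute contains a blocked name. This is only an illustration under assumptions: `Node` is a hypothetical stand-in for the element type a real HTML parser would provide, and the names in `EXCLUDED` are guesses that will not match every site.

```java
import java.util.List;
import java.util.Set;

final class ClassFilterSketch {

    // Hypothetical stand-in for a parsed HTML element; a real parser supplies its own type.
    record Node(String tag, String cssClass, String ownText, List<Node> children) {}

    // Assumed block list; real sites use many other names for the same things.
    static final Set<String> EXCLUDED = Set.of("menu", "banner", "footer");

    // Collects the text of a subtree, skipping elements that carry an excluded class.
    static String visibleText(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString().trim();
    }

    private static void collect(Node node, StringBuilder out) {
        // A class attribute may hold several space-separated names.
        for (String name : node.cssClass().trim().split("\\s+")) {
            if (EXCLUDED.contains(name)) {
                return; // drop the whole subtree
            }
        }
        if (!node.ownText().isBlank()) {
            out.append(node.ownText().trim()).append('\n');
        }
        for (Node child : node.children()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        Node page = new Node("body", "", "", List.of(
            new Node("div", "menu", "Home | News | Sports", List.of()),
            new Node("div", "story", "Short teaser for an article.", List.of()),
            new Node("div", "footer", "Contact us", List.of())));
        System.out.println(visibleText(page));
    }
}
```

Because class names differ from site to site, a fixed list like this is fragile when the pages come from many different sites.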

437

asked Aug 11 '11 05:08

Renato Dinhani

2 Answers

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

You can also pass a Reader, which opens up many more input sources.
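
As a rough illustration of why a Reader is flexible, the sketch below builds one from an in-memory string and one from a page saved to disk, using only the JDK. The class and method names are hypothetical, the UTF-8 charset is an assumption, and `drain` merely stands in for the point where the Reader would be handed to the extractor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReaderSourcesSketch {

    // Reader over HTML that is already held in memory.
    static Reader fromString(String html) {
        return new StringReader(html);
    }

    // Reader over a page saved to disk; UTF-8 is assumed and should match the page.
    static Reader fromFile(Path file) {
        try {
            return Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in consumer: reads everything, where an extractor would otherwise take over.
    static String drain(Reader reader) {
        try (reader) {
            StringWriter out = new StringWriter();
            reader.transferTo(out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("page", ".html");
        Files.writeString(tmp, "<p>saved page</p>");
        System.out.println(drain(fromString("<p>in memory</p>")));
        System.out.println(drain(fromFile(tmp)));
        Files.delete(tmp);
    }
}
```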

162

answered Sep 21 '22 19:09

Kurt Kaylor

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

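    // imports for boilerpipe 1.x (package names assumed; other versions may differ)
    import java.io.Reader;
    import org.xml.sax.InputSource;
    import de.l3s.boilerpipe.document.TextBlock;
    import de.l3s.boilerpipe.document.TextDocument;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;
    import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

    Reader reader = ...;
    InputSource is = new InputSource(reader);

    // parse the document into boilerpipe's internal data structure
    TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

    // perform the extraction/classification process on "doc"
    ArticleExtractor.INSTANCE.process(doc);

    // iterate over all blocks (= segments as "ArticleExtractor" sees them)
    for (TextBlock block : doc.getTextBlocks()) {
        // block.isContent() tells you if it's likely to be content or not
        // block.getText() gives you the block's text
    }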

TextBlock has some more exciting methods, feel free to play around!
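
One way to use per-block results, sketched here with plain JDK types, is to sort blocks into main content, teaser-like blocks, and the rest instead of joining everything into one string. `Block` is a hypothetical stand-in for a classified block (it is not boilerpipe's `TextBlock`), and both thresholds are arbitrary assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockSplitSketch {

    // Hypothetical stand-in for a classified block: text, content flag, share of linked words.
    record Block(String text, boolean isContent, double linkDensity) {}

    // Result holder: main content, teaser-like blocks, and everything else.
    record Split(List<Block> content, List<Block> teasers, List<Block> boilerplate) {}

    // Assumed thresholds; tune them against real pages.
    static final int MIN_TEASER_WORDS = 5;
    static final double MAX_TEASER_LINK_DENSITY = 0.5;

    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static Split split(List<Block> blocks) {
        List<Block> content = new ArrayList<>();
        List<Block> teasers = new ArrayList<>();
        List<Block> boilerplate = new ArrayList<>();
        for (Block b : blocks) {
            if (b.isContent()) {
                content.add(b);
            } else if (wordCount(b.text()) >= MIN_TEASER_WORDS
                    && b.linkDensity() <= MAX_TEASER_LINK_DENSITY) {
                teasers.add(b);     // long enough and not mostly link text
            } else {
                boilerplate.add(b); // very short or mostly link text, e.g. menus
            }
        }
        return new Split(content, teasers, boilerplate);
    }

    public static void main(String[] args) {
        Split result = split(List.of(
            new Block("Home News Sports Contact", false, 1.0),
            new Block("Council approves new budget after a long debate", false, 0.3),
            new Block("Full article text goes here.", true, 0.0)));
        System.out.println("content: " + result.content().size());
        System.out.println("teasers: " + result.teasers().size());
        System.out.println("boilerplate: " + result.boilerplate().size());
    }
}
```

Keeping the three groups separate addresses the case from the question's update, where short descriptions next to links should survive even though they are not the main article.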

answered Sep 23 '22 19:09

Christian Kohlschütter
