Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following HTML:

<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>

I have come up with a few ideas, such as traversing the HTML document tree to its leaves, adding up the length of the text, and only seeing what other text the parent has if the parent gives us more content than the children do.

Has anyone ever tried something like this, or know of an algorithm that can be applied? It doesn't have to be solid, but as long as it can guess a container that contains most of the page content text (for articles or blog posts, for example), that would be awesome.

like image 881
Max Avatar asked Dec 05 '22 07:12

Max


1 Answers

One word: Boilerpipe

like image 192
Max Avatar answered Dec 28 '22 07:12

Max