Scraping largest block of text from HTML document

I am working on an algorithm that, given an HTML file, tries to pick out the parent element most likely to contain the majority of the page's content text. For example, it would pick the div with id "content" in the following HTML:

<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>

I have come up with a few ideas, such as traversing the HTML document tree down to its leaves, adding up the length of the text, and only counting a parent's own text when the parent contributes more content than its children do.

Has anyone tried something like this, or does anyone know of an algorithm that can be applied? It doesn't have to be bulletproof. As long as it can guess a container that holds most of the page's content text (for articles or blog posts, for example), that would be awesome.

881

asked Dec 05 '22 07:12

Max

1 Answer

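One word: Boilerpipe. It is a Java library built for this job: it strips the boilerplate from an HTML page and returns the main text content.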
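
If a Java dependency is not an option, the heuristic described in the question can be roughed out with nothing but the Python standard library. The sketch below is not how Boilerpipe works internally. It is only a minimal illustration under some assumptions of mine: text inside inline tags such as `<b>` is credited to the nearest enclosing block element, `script` and `style` content is ignored, and the element with the most text of its own wins. The names `LargestTextBlockFinder` and `find_main_container`, and the tag sets, are made up for this example.

```python
from html.parser import HTMLParser

# Assumption: text inside these inline tags counts toward the enclosing block.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small", "span", "strong", "u"}
# Void tags have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class LargestTextBlockFinder(HTMLParser):
    """Tracks open elements and remembers the block with the most own text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, attrs dict, text length]
        self.best = None  # (tag, attrs dict, text length) of the best block

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, dict(attrs), 0])

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing currently open.
        if not any(entry[0] == tag for entry in self.stack):
            return
        # Close everything up to and including the matching element.
        while self.stack:
            entry = self.stack.pop()
            self.close_element(entry)
            if entry[0] == tag:
                break

    def handle_data(self, data):
        if any(entry[0] in SKIP_TAGS for entry in self.stack):
            return
        # Collapse whitespace so indentation does not inflate the score.
        length = len(" ".join(data.split()))
        # Credit the text to the nearest non-inline ancestor.
        for entry in reversed(self.stack):
            if entry[0] not in INLINE_TAGS:
                entry[2] += length
                break

    def close_element(self, entry):
        tag, attrs, length = entry
        if tag in INLINE_TAGS:
            return
        if self.best is None or length > self.best[2]:
            self.best = (tag, attrs, length)


def find_main_container(html):
    """Return (tag, attrs, text_length) for the best guess, or None."""
    finder = LargestTextBlockFinder()
    finder.feed(html)
    finder.close()
    # Elements left unclosed at end of input are still candidates.
    while finder.stack:
        finder.close_element(finder.stack.pop())
    return finder.best


SAMPLE = """<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>"""

tag, attrs, length = find_main_container(SAMPLE)
print(tag, attrs)  # div {'id': 'content'}
```

Scoring by raw text length is crude. A long navigation menu or comment thread can outscore a short article, so on real pages you would probably also want to penalize link-heavy blocks.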

192

answered Dec 28 '22 07:12

Max

Tags: html, text-extraction, html-content-extraction, screen-scraping
