What algorithms could I use to identify content on a web page

Question

I have a web page loaded in the browser, so both its DOM and its element positioning are accessible to me. I want to find the block element (or a sorted list of such elements) that most likely contains the main content, meaning a continuous block of text. The goal is to exclude things like menus, headers, and footers.

Gideon · Accepted Answer

My personal favorite is VIPS: a Vision-based Page Segmentation Algorithm.

Faruz · Answer

First, if you need to parse a web page, I would use HTMLAgilityPack to transform it into XML. This speeds everything up and lets you use a simple XPath expression to go directly to the BODY element.

After that, iterate over all the div elements (the Agility Pack can return all the DIV elements as a list) and extract whatever you need from each one.
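To make the "iterate over all the divs" step concrete, here is a minimal sketch in Python using only the standard library's `html.parser` rather than HTMLAgilityPack. The scoring rule (text length minus link-text length, credited to the innermost enclosing div) is an assumption chosen for illustration and is not part of the answer above; `DivTextScorer` and `rank_divs` are hypothetical names.

```python
from html.parser import HTMLParser


class DivTextScorer(HTMLParser):
    """Records, for every <div>, how much plain text and link text it holds.

    Text is credited only to the innermost enclosing <div>. This is an
    assumed heuristic for illustration, not an established algorithm.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict per <div>, in document order
        self.open_divs = []   # stack of indexes into self.blocks
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.blocks.append(
                {"id": dict(attrs).get("id"), "text_len": 0, "link_len": 0}
            )
            self.open_divs.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        # Ignore whitespace, script/style bodies, and text outside any div.
        if not length or self.skip_depth or not self.open_divs:
            return
        block = self.blocks[self.open_divs[-1]]
        block["link_len" if self.link_depth else "text_len"] += length


def rank_divs(html):
    """Return the div records sorted from most to least content-like."""
    scorer = DivTextScorer()
    scorer.feed(html)
    scorer.close()
    # Assumed score: plain text counts for a block, link text counts against it.
    return sorted(
        scorer.blocks,
        key=lambda b: b["text_len"] - b["link_len"],
        reverse=True,
    )
```

Calling `rank_divs(page_source)` returns one record per div, so the first entry is the candidate main-content block under this heuristic. Link-heavy blocks such as menus tend to sink to the bottom, but real pages will likely need extra signals (element position, nesting depth, class names) to be reliable.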

Tags: algorithm, html-content-extraction, webpage

VoY

