
What algorithms could I use to identify content on a web page?

I have a web page loaded in the browser (i.e. both its DOM and its element positions are accessible to me), and I want to find the block element (or a sorted list of such elements) that most likely contains the main content, meaning a continuous block of text. The goal is to exclude things like menus, headers, and footers.

VoY asked Jan 04 '10 12:01



2 Answers

This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm

Gideon answered Nov 06 '22 04:11



First, if you need to parse a web page, I would use HTMLAgilityPack to convert it to XML. That speeds everything up and lets you go directly to the BODY with a simple XPath expression.

After that, iterate over all the DIV elements (the Agility Pack can return them as a list) and extract whatever you need.
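
The "iterate over the DIVs and pick what you need" step might look roughly like the sketch below. It is only an illustration under several assumptions: it uses Python's standard `html.parser` as a stand-in for HTMLAgilityPack, it scores each DIV by the number of text characters it directly owns (text inside nested DIVs counts toward the inner DIV only), and the names `DivTextScorer` and `rank_divs` are hypothetical. Character count is just one possible heuristic for "most content", not a definitive one.

```python
from html.parser import HTMLParser

# Text inside these tags is never visible content, so it is ignored.
SKIP_TAGS = {"script", "style"}


class DivTextScorer(HTMLParser):
    """Hypothetical helper: counts text characters per DIV element."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one [label, text_length] entry per DIV seen
        self.open_divs = []   # stack of indices into self.blocks
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "div":
            attr_map = dict(attrs)
            # Label the DIV by id, then class, then its position in the page.
            label = (attr_map.get("id") or attr_map.get("class")
                     or "div#%d" % len(self.blocks))
            self.blocks.append([label, 0])
            self.open_divs.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        if self.skip_depth or not self.open_divs:
            return
        # Collapse whitespace so indentation does not inflate the score,
        # then credit the text to the innermost open DIV.
        text = " ".join(data.split())
        self.blocks[self.open_divs[-1]][1] += len(text)


def rank_divs(html):
    """Return (label, text_length) pairs, largest text block first."""
    scorer = DivTextScorer()
    scorer.feed(html)
    scorer.close()
    return sorted(((label, n) for label, n in scorer.blocks),
                  key=lambda block: block[1], reverse=True)
```

Calling `rank_divs(page_source)` gives the sorted list the question asks for. Menus and footers tend to land near the bottom because they hold little running text, although a link-heavy sidebar could still score well, so a real implementation would probably also weigh link density or element position.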

Faruz answered Nov 06 '22 04:11
