
Scraping only main contents of webpage (ignore header, footer & sidebars)

I am familiar with scraping and using XPath in PHP to parse the DOM and get what I want from a page. What I would like to hear are some suggestions on how I could programmatically ignore the header, footer, and sidebars on a page and extract only the main body content.

The situation is that there is no specific target site, so I cannot simply ignore specific IDs like #header and #footer, because every page is written slightly differently.

I know that Google does this, so I know it must be possible; I just don't really know where to start.

Thanks!

asked Mar 26 '13 by deweydb

2 Answers

There is no definitive way to determine it, but you can get reasonable results with heuristic methods. A suggestion:

Scrape two or more pages from the same website and start comparing them block by block, starting at the top level and going a few levels deep until the blocks are sufficiently equal. The comparison would not be == but a similarity index, for example with similar_text. Blocks above a certain percentage of similarity will most likely be header, footer, or menu. You will have to find out by experiment which threshold is useful.
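Here is a minimal sketch of that idea in PHP. The URLs are placeholders, the 90% threshold is an assumption you would tune by experiment, and it naively pairs top-level <body> children by position and only looks one level deep; a real version would recurse a few levels as described above.

```php
<?php
// Sketch of the block-comparison heuristic: fetch two pages from the
// same site, walk their top-level <body> children in parallel, and flag
// pairs whose markup is highly similar as probable boilerplate.

function loadBody(string $url): DOMNode
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);        // tolerate real-world HTML
    $doc->loadHTMLFile($url);
    libxml_clear_errors();
    return $doc->getElementsByTagName('body')->item(0);
}

$bodyA = loadBody('http://example.com/page1.html'); // placeholder URLs
$bodyB = loadBody('http://example.com/page2.html');

$blocksA = iterator_to_array($bodyA->childNodes);
$blocksB = iterator_to_array($bodyB->childNodes);

foreach ($blocksA as $i => $blockA) {
    if (!isset($blocksB[$i]) || $blockA->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    $htmlA = $blockA->ownerDocument->saveHTML($blockA);
    $htmlB = $blocksB[$i]->ownerDocument->saveHTML($blocksB[$i]);

    similar_text($htmlA, $htmlB, $percent); // similarity index in $percent

    // Blocks that barely change between pages are likely header/footer/menu;
    // the rest are candidates for the main content. 90% is an assumed cutoff.
    $label = ($percent > 90) ? 'boilerplate?' : 'content?';
    printf("block %d <%s>: %.1f%% similar => %s\n",
           $i, $blockA->nodeName, $percent, $label);
}
```

Note that whitespace-only text nodes between blocks will skew the positional pairing, so normalizing the DOM first, or comparing only element children, is a sensible refinement.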

answered Oct 12 '22 by Fabian Schmengler


There is no small or quick way to scrape content from a webpage. I have done a lot of this, and there is no simple rule. In the earlier HTML3/table-based design days, identification worked differently, and site design itself was limited: screen sizes were small, so the menu usually sat at the top and there was no room for left or right panels. Then came the era of table layouts with panels, and now we have floating content. Pages even use overflow:hidden, which makes it even harder to identify the body by word count and similar measures.

When an HTML file is written, the markup is never tagged as "content" or "menu". You can sometimes derive that from class names, but that is not universal. The content gets its size and position from CSS, so your parser alone can never reliably determine the body part of the page. If you use an embedded HTML viewer and DHTML/JS to locate the sizes of blocks after rendering, there might be some way to do it, but even that will never be universal. My suggestion is to build your own parser and improve it case by case, for example as sketched below.
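As one possible starting point for such a case-by-case parser, here is a hedged PHP sketch (my own illustration, not from the answer): it strips elements whose id or class contains common boilerplate hints, then guesses that the remaining <div> with the most text is the body. The keyword list, the placeholder URL, and the "longest text wins" rule are all assumptions to refine per site.

```php
<?php
// Case-by-case heuristic: drop likely boilerplate by id/class hints,
// then pick the densest remaining <div> as the main-content guess.

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile('http://example.com/article.html'); // placeholder URL
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Remove elements whose id/class contains common boilerplate hints.
// This keyword list is an assumption and is exactly the non-universal
// class-name heuristic the answer warns about.
foreach (['header', 'footer', 'nav', 'menu', 'sidebar'] as $hint) {
    $nodes = $xpath->query(
        "//*[contains(@id, '$hint') or contains(@class, '$hint')]"
    );
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Among the survivors, take the <div> with the most text as the body guess.
$best = null;
$bestLen = 0;
foreach ($xpath->query('//div') as $div) {
    $len = strlen(trim($div->textContent));
    if ($len > $bestLen) {
        $bestLen = $len;
        $best = $div;
    }
}

echo $best ? trim($best->textContent) : 'no candidate found';
```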

As for Google, it has built programs for most combinations of HTML designs. But even for Google, I think a universal parser is impossible.

answered Oct 13 '22 by thevikas