Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably python based)

Asked Jan 12 '11 by kefeizhou


1 Answer

There are a number of ways to do it, but none will always work. Here are the two easiest:

  • If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for the given site (this can't really be generalized across sites).
  • Use the arc90 readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/ . The short version of the algorithm is that it looks for divs with p tags within them. It won't work for some websites, but it is generally pretty good.
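Since the question asks for something Python-based, here is a minimal sketch of the "divs with p tags" idea using only the standard library: score each container element by how much text its paragraphs hold and keep the winner. This is a toy illustration, not the real arc90 algorithm, which also weighs class/id names, link density, and does DOM cleanup.

```python
# Minimal sketch of the arc90-style heuristic: credit the text length of
# each <p> to its innermost enclosing container, then return the text of
# the highest-scoring container. Stdlib only; names here are illustrative.
from html.parser import HTMLParser


class MainContentFinder(HTMLParser):
    CONTAINERS = {"div", "article", "section", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []    # indices of currently open containers
        self.scores = []   # total <p> text length per container
        self.texts = []    # paragraph text collected per container
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.stack.append(len(self.scores))
            self.scores.append(0)
            self.texts.append([])
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.stack:
            idx = self.stack[-1]  # innermost open container gets the credit
            self.scores[idx] += len(data.strip())
            self.texts[idx].append(data.strip())


def extract_main_text(html):
    """Return the paragraph text of the best-scoring container, or ''."""
    finder = MainContentFinder()
    finder.feed(html)
    if not finder.scores:
        return ""
    best = max(range(len(finder.scores)), key=finder.scores.__getitem__)
    return " ".join(t for t in finder.texts[best] if t)
```

For example, given a page with a nav div, a story div full of paragraphs, and a sidebar div with a one-word ad, `extract_main_text` returns the story text and discards the rest. For production use, a maintained port of readability (or a similar content-extraction library) is a better bet than hand-rolling this.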
Answered Oct 05 '22 by gte525u