Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as the Times or Bloomberg), I want to identify the main article content on that page and discard the other miscellaneous elements such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining (preferably Python-based)?

368

asked Jan 12 '11 17:01

kefeizhou

1 Answer

There are a number of ways to do it, but none will always work. Here are the two easiest:

1. If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for that site. (The conversion rule is site-specific and cannot really be generalized across sites.)
2. Use the arc90 Readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/ . In short, the algorithm looks for divs that contain p tags. It will not work for some websites, but it is generally pretty good.
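
To make the "divs with p tags" idea concrete, here is a rough sketch in Python using only the standard library. It is not the arc90 code and not a port of it, just a simplified heuristic in the same spirit: it groups paragraph text by the nearest enclosing div and keeps the div with the most paragraph text. The names `ParagraphScorer` and `extract_main_text` are made up for this example, and scoring by text length is my own assumption.

```python
from html.parser import HTMLParser


class ParagraphScorer(HTMLParser):
    """Collect <p> text, grouped by the nearest enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []    # ids of the <div> elements currently open
        self.next_div_id = 0
        self.paragraphs = {}   # div id -> list of paragraph strings
        self.current_p = None  # text chunks of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "p"):
            # A new block implicitly ends any paragraph that is still open.
            self._close_paragraph()
        if tag == "div":
            self.div_stack.append(self.next_div_id)
            self.next_div_id += 1
        elif tag == "p":
            self.current_p = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()
        elif tag == "div":
            self._close_paragraph()
            if self.div_stack:
                self.div_stack.pop()

    def handle_data(self, data):
        if self.current_p is not None:
            self.current_p.append(data)

    def _close_paragraph(self):
        if self.current_p is None:
            return
        # Collapse runs of whitespace so lengths are comparable.
        text = " ".join("".join(self.current_p).split())
        self.current_p = None
        if text and self.div_stack:
            self.paragraphs.setdefault(self.div_stack[-1], []).append(text)


def extract_main_text(html):
    """Return the paragraphs of the div holding the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    scorer._close_paragraph()  # flush a trailing unclosed <p>, if any
    if not scorer.paragraphs:
        return ""
    best = max(scorer.paragraphs.values(),
               key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)
```

Calling `extract_main_text(page_html)` on a page whose article body sits in one div returns that div's paragraphs separated by blank lines. Paragraphs outside any div are ignored, and pages that split the article across several sibling divs will only yield the largest piece, so treat this as a starting point rather than a general solution.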

101

answered Oct 05 '22 20:10

gte525u


Tags: python, html-parsing, web-scraping, webpage
