Extract News article content from stored .html pages

I am reading text from HTML files and doing some analysis on it. These .html files are news articles.

Code:

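 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # NLTK 2.x only; raises NotImplementedError in NLTK 3
 raw = unidecode(raw.decode('utf8'))  # transliterate the text to plain ASCII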

Now I want only the article content, without the rest of the page text such as advertisements and headings. How can I do this with reasonable accuracy in Python?

I know of tools like Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in Python.

EDIT: To make this clearer, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

asked May 20 '15 17:05

Abhishek Bhatia

1 Answer

Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with:

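import newspaper

# Load the stored page into a string
with open(filepath, encoding='utf8') as f:
    html = f.read()

article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # run the extraction on the supplied HTML

print(article.text)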
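
Since the question involves pages from many sources, the same calls can be wrapped in small helpers and applied to each stored file. The following is only a minimal sketch: the helper names `read_html` and `extract_article` are made up for illustration, UTF-8 with replacement characters is an assumed way of coping with pages in unknown encodings, and extraction quality will vary between sites.

```python
def read_html(path):
    """Read a stored page as text (hypothetical helper).

    Assumes UTF-8; undecodable bytes become U+FFFD instead of raising,
    which is a simplification for pages saved in other encodings.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def extract_article(html):
    """Return (title, text) for one page using newspaper (hypothetical helper)."""
    # Imported here so that read_html() can be used without newspaper installed.
    import newspaper

    article = newspaper.Article('')  # url argument is required but unused for local HTML
    article.set_html(html)
    article.parse()
    return article.title, article.text
```

Calling `extract_article(read_html(path))` for each stored file then gives the title and body text of that page, which can be passed on to the analysis step.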

answered Sep 29 '22 04:09

Harry

Tags: python, urllib2, bs4
