
Extract News article content from stored .html pages

I am reading text from html files and doing some analysis. These .html files are news articles.

Code:

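 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # strips tags; NLTK 2.x only, raises NotImplementedError in NLTK 3
 raw = unidecode(raw.decode('utf8'))  # Python 2: decode the bytes, then transliterate to ASCII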

Now I want only the article content, not the surrounding text such as advertisements and headings. How can I do this relatively accurately in Python?

I know of tools such as Jsoup (a Java API) and Boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf, but in Python.

EDIT: To make this concrete, please write sample code that extracts the content of the following page: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

Abhishek Bhatia asked May 20 '15 17:05



1 Answer

Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with:

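import newspaper

# Load the HTML into a string from the file, e.g. html = open(filepath).read()

article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # run the extraction on the supplied HTML
print(article.text)  # the extracted article body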
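
Since the question involves many stored pages from different sources, the same calls can be wrapped in small helpers and applied to each file. The following is only a minimal sketch, assuming the newspaper package is installed, the files are UTF-8 encoded, and `parse()` is called after `set_html()` as in the library's quickstart. The names `load_html`, `extract_article`, and `extract_from_dir` are hypothetical and not part of newspaper.

```python
from pathlib import Path


def load_html(path, encoding="utf8"):
    """Read a stored .html page into a string (the encoding is an assumption)."""
    # errors="replace" avoids a crash on bytes that are invalid in `encoding`
    return Path(path).read_text(encoding=encoding, errors="replace")


def extract_article(html):
    """Return (title, text) as extracted by newspaper from an HTML string."""
    import newspaper  # third-party; imported here so load_html works without it

    article = newspaper.Article('')  # url argument is required but not used
    article.set_html(html)
    article.parse()
    return article.title, article.text


def extract_from_dir(directory):
    """Yield (path, title, text) for every .html file directly inside `directory`."""
    for path in sorted(Path(directory).glob("*.html")):
        title, text = extract_article(load_html(path))
        yield path, title, text
```

Iterating over `extract_from_dir('pages')`, where `pages` is a placeholder directory name, would then give one (path, title, text) tuple per stored article. How accurate the extracted text is will depend on the source site, so it is worth spot-checking a few pages from each publisher.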
Harry answered Sep 29 '22 04:09