I am reading text from html files and doing some analysis. These .html files are news articles.
Code:
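import nltk
from unidecode import unidecode
with open(filepath, encoding='utf8') as f:
    html = f.read()
raw = nltk.clean_html(html)  # NLTK 2 only; removed in NLTK 3 (raises NotImplementedError)
raw = unidecode(raw)  # transliterate non-ASCII characters to ASCII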
Now I want just the article content, not the rest of the text such as advertisements, headings, etc. How can I do this relatively accurately in Python?
I know of tools like Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they're limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.
I am looking for something exactly like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.
EDIT: To make the goal concrete, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general
When we talk about web pages, this includes the HTML, JavaScript, menus, media, header, footer, … Automatically and correctly extracting content is not easy. Through this article, I propose to explore the problem and to discuss some tools and recommendations to achieve this task. Extracting text content from a web page might seem simple.
Article Extraction is the process of extracting article content from news articles, blogs, or web pages. This is a form of web scraping specific to news articles, press releases, etc.
A quick and easy way to do this: you can extract real-time news from news portals like WSJ, New York Times and Reuters using web data extraction tools. These tools can not only extract the articles, publish time, author name, etc, but also pull the image URLs, web page URLs from the news websites.
Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.
The quickstart only shows loading from a URL, but you can load from an HTML string with:
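import newspaper
# Load the HTML into a string from a file
with open(filepath, encoding='utf8') as f:
    html = f.read()
article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # extracts the title, authors, and body text from the HTML
print(article.text)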
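If installing a library isn't an option, a very rough fallback can be sketched with the standard library alone. This is not what Newspaper does internally; it is a simple heuristic of my own that assumes article body text lives in reasonably long <p> elements, while navigation, ads, and captions tend to be short. The names ParagraphCollector, extract_article_text, and min_words are made up for this illustration, and the threshold would need tuning per source.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, ignoring <script>/<style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None        # text pieces of the <p> being read, or None
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()       # an unclosed <p> ends when the next one starts
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None


def extract_article_text(html, min_words=10):
    """Return paragraphs with at least `min_words` words (assumed heuristic)."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(
        p for p in parser.paragraphs if len(p.split()) >= min_words
    )
```

Calling extract_article_text(html) on the string read from the file returns the longer paragraphs separated by blank lines. Expect it to miss short paragraphs and to let through long boilerplate such as cookie notices, which is why a dedicated extractor is usually the better choice for pages from many different sources.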