
Find "date" in generic webpage using Python

I want to extract the exact publication time of news articles published on the web.

Some web pages have nicely formatted headers from which I can extract "last-modified" or "publish-date". The information in the header is messy but usable. (By the way, metadata_parser helps a lot!)

But larger news agencies like BBC and CNN don't put date and time information in the HTML header, so I am trying to get the publication date and time from the HTML code itself.

For BBC, the date time is embedded like:

<div data-timestamp-inserted="true" class="date date--v2" data-seconds="1447658338" data-datetime="16 November 2015">16 November 2015</div>

For CNN, it is like:

<p class="update-time">Updated 0137 GMT (0937 HKT) November 16, 2015 <span id="js-pagetop_video_source" class="video__source top_source">| Video Source: <a href="http://www.cnn.com/">CNN</a></span></p>

For nytimes:

<p class="byline-dateline"><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" data-byline-name="AURELIEN BREEDEN" itemprop="name">AURELIEN BREEDEN</span>, </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><span class="byline-author" data-byline-name="KIMIKO DE FREYTAS-TAMURA" itemprop="name">KIMIKO DE FREYTAS-TAMURA</span> and </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person" itemid="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html"><a href="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html" rel="author" title="More Articles by KATRIN BENNHOLD"><span class="byline-author" data-byline-name="KATRIN BENNHOLD" itemprop="name">KATRIN BENNHOLD</span></a></span><time class="dateline" datetime="2015-11-16" itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>

As can be seen, almost every news agency has its own way of putting the date and time in the webpage.

My question is: is it possible to extract date and time information using some kind of fuzzy search in BeautifulSoup or a similar package, so that I don't have to write rules for each website?

Thanks!

Sean asked Nov 18 '15 04:11


2 Answers

The htmldate module does just that. It has been tested on different cases and features a series of robust heuristics, so you don't have to write code each time to scrape the date of the websites you're interested in.

It also uses dateparser to yield more precise results.

1. Install the package:

pip install htmldate

2. Retrieve a web page, parse it and output the date:

from htmldate import find_date

find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')

(disclaimer: I'm the author)

If the extraction doesn't work, feel free to file a bug report on the issues page.
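As a rough illustration of what a site-independent heuristic can look like, here is a minimal standard-library sketch run against shortened versions of the BBC and nytimes snippets from the question. It is not htmldate's actual implementation: the names DateAttributeFinder and find_machine_dates are made up for this example, and it only checks three machine-readable attributes (datetime on a time tag, content on itemprop="datePublished", and a Unix timestamp in data-seconds).

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Shortened copies of the snippets quoted in the question.
BBC_HTML = ('<div class="date date--v2" data-seconds="1447658338" '
            'data-datetime="16 November 2015">16 November 2015</div>')
NYT_HTML = ('<time class="dateline" datetime="2015-11-16" '
            'itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time>')


class DateAttributeFinder(HTMLParser):
    """Hypothetical helper: collect date candidates from tag attributes."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and attrs.get("datetime"):
            # HTML5 <time datetime="..."> element
            self.candidates.append(attrs["datetime"])
        elif attrs.get("itemprop") == "datePublished" and attrs.get("content"):
            # schema.org microdata on any other tag
            self.candidates.append(attrs["content"])
        elif (attrs.get("data-seconds") or "").isdigit():
            # BBC-style Unix timestamp, interpreted as UTC (an assumption)
            stamp = datetime.fromtimestamp(int(attrs["data-seconds"]), tz=timezone.utc)
            self.candidates.append(stamp.isoformat())


def find_machine_dates(html):
    """Return all date candidates found in the given HTML string."""
    finder = DateAttributeFinder()
    finder.feed(html)
    return finder.candidates


if __name__ == "__main__":
    print(find_machine_dates(NYT_HTML))
    print(find_machine_dates(BBC_HTML))
```

The CNN snippet from the question carries no such attribute, so a sketch like this returns an empty list for it; pages of that kind need the date to be parsed out of the visible text instead.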

adbar answered Oct 13 '22 11:10


In my experience and humble opinion, the best way to scrape generic information is with NER (Named-Entity Recognition) systems.

I would recommend using Scrapinghub's webstruct library:

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text data. This allows to define features that use HTML structure, and also to embed annotation results back into HTML.

Github repository: https://github.com/scrapinghub/webstruct

Documentation: http://webstruct.readthedocs.org/en/latest/

UPDATE:

As you need to scrape dates, you can also use Dateparser:

dateparser provides modules to easily parse localized dates in almost any string formats commonly found on web pages.

Github repository: https://github.com/scrapinghub/dateparser

Documentation: https://dateparser.readthedocs.org/en/latest/
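To show the kind of free-text strings involved, here is a small standard-library sketch that pulls the first English long-form date out of text such as the CNN and BBC lines from the question. It is only a stand-in for illustration and says nothing about how dateparser works internally; the names PATTERNS and first_date_in_text are hypothetical, only two date layouts are covered, and it assumes the default (English) locale for month names.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Each entry pairs a regex with the strptime format for its captured groups.
PATTERNS = [
    # US order, e.g. "November 16, 2015"
    (re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b"), "%B %d %Y"),
    # UK order, e.g. "16 November 2015"
    (re.compile(rf"\b(\d{{1,2}}) ({MONTHS}) (\d{{4}})\b"), "%d %B %Y"),
]


def first_date_in_text(text):
    """Return the first date matched by PATTERNS as a datetime.date, or None."""
    for pattern, fmt in PATTERNS:
        match = pattern.search(text)
        if match:
            return datetime.strptime(" ".join(match.groups()), fmt).date()
    return None


if __name__ == "__main__":
    print(first_date_in_text("Updated 0137 GMT (0937 HKT) November 16, 2015"))
    print(first_date_in_text("16 November 2015"))
```

A hand-written pattern list like this breaks as soon as a site abbreviates the month or uses another language, which is the gap a dedicated date-parsing library is meant to fill.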

Andrés Pérez-Albela H. answered Oct 13 '22 13:10