
Publishing date in newspaper library always returning None

I've been using the newspaper library lately. The only issue I've found is that article.publish_date always returns None.

from newspaper import Article

class NewsArticle:
    def __init__(self,url):
        self.article = Article(url)
        self.article.download()
        self.article.parse()
        self.article.nlp()

    def getKeywords(self):
        x = self.article.keywords
        for i in range(0,len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def getSummary(self):
        return self.article.summary.encode('ascii', 'ignore')

    def getAuthors(self):
        x = self.article.authors
        for i in range(0,len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def thumbnail_url(self):
        return self.article.top_image.encode('ascii', 'ignore')

    def date_made(self):
        print(self.article.publish_date)
        return self.article.publish_date

    def get_videos(self):
        x=self.article.movies
        for i in range(0,len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def get_title(self):
        return self.article.title.encode('ascii','ignore')

I'm going over a bunch of URLs. You can see I'm printing out the publish_date before returning it.

As I said before, this is what I get:

None
None
None
...

All the other functions are working as intended. The documentation on the site shows this example:

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

I'm pretty sure I'm doing the same thing. Can anyone spot what I'm doing wrong?

asked Oct 11 '15 at 18:10 by Eigenvalue


1 Answer

I'm 100% sure that you have solved this issue in the last 5ish years, but I wanted to throw in my knowledge on newspaper.

This Python library isn't perfect. It makes a best effort to harvest specific elements, such as an article's title, author names, and publish date. Even so, newspaper will miss content that isn't located where it is designed to look.

For example, this is from newspaper's extractor code:

3 strategies for publishing date extraction. The strategies are descending in accuracy and the next strategy is only attempted if a preferred one fails.

1. Pubdate from URL
2. Pubdate from metadata
3. Raw regex searches in the HTML + added heuristics

If newspaper does not find a date in the URL, it moves on to the meta tags, but it checks only these:

PUBLISH_DATE_TAGS = [
            {'attribute': 'property', 'value': 'rnews:datePublished',
             'content': 'content'},
            {'attribute': 'property', 'value': 'article:published_time',
             'content': 'content'},
            {'attribute': 'name', 'value': 'OriginalPublicationDate',
             'content': 'content'},
            {'attribute': 'itemprop', 'value': 'datePublished',
             'content': 'datetime'},
            {'attribute': 'property', 'value': 'og:published_time',
             'content': 'content'},
            {'attribute': 'name', 'value': 'article_date_original',
             'content': 'content'},
            {'attribute': 'name', 'value': 'publication_date',
             'content': 'content'},
            {'attribute': 'name', 'value': 'sailthru.date',
             'content': 'content'},
            {'attribute': 'name', 'value': 'PublishDate',
             'content': 'content'},
            {'attribute': 'pubdate', 'value': 'pubdate',
             'content': 'datetime'},
            {'attribute': 'name', 'value': 'publish_date',
             'content': 'content'},
]
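
The fallback order described above can be sketched roughly as follows. This is an illustration only, not newspaper's actual code: the URL pattern, the three-entry META_DATE_TAGS subset, and all function names are assumptions made for the example, and the third strategy (regex searches over the raw HTML) is left out. The URL strategy returns a datetime, while the meta strategy returns the raw string as found in the page.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical subset of the (attribute, value, content attribute) triples listed above.
META_DATE_TAGS = [
    ("property", "article:published_time", "content"),
    ("itemprop", "datePublished", "datetime"),
    ("name", "publish_date", "content"),
]

# Assumed URL pattern: /YYYY/MM/DD somewhere in the path.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(dict(attrs))


def date_from_url(url):
    """Strategy 1: look for a date embedded in the URL path."""
    match = URL_DATE.search(url)
    if not match:
        return None
    try:
        return datetime(*map(int, match.groups()))
    except ValueError:  # digits that do not form a real date, e.g. /2020/13/45/
        return None


def date_from_meta(html):
    """Strategy 2: return the raw value of the first known meta tag, if any."""
    collector = MetaCollector()
    collector.feed(html)
    for attribute, value, content_attr in META_DATE_TAGS:
        for meta in collector.metas:
            if meta.get(attribute) == value and meta.get(content_attr):
                return meta[content_attr]
    return None


def find_publish_date(url, html):
    # Try each strategy in order; stop at the first one that yields a value.
    return date_from_url(url) or date_from_meta(html)
```

A page whose date lives in a tag outside the list falls through every strategy and comes back as None, which is the behaviour described in the question.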

Fox News stores its dates in the meta tag section, but in a tag that newspaper doesn't query. To extract the dates from Fox News articles you would do this:

article_meta_data = article.meta_data

# look the key up directly; .get() returns None if the tag is missing
article_published_date = article_meta_data.get('dcterms.created')
print(article_published_date)
2020-10-11T12:51:53-04:00
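
If you are processing many URLs, this lookup can be folded into a small helper that only consults the meta tags when newspaper itself found nothing. The sketch below is a suggestion, not part of newspaper: the function name is made up, and apart from 'dcterms.created' the keys in FALLBACK_META_KEYS are hypothetical examples. It assumes the value is an ISO 8601 string like the one above and needs Python 3.7+ for datetime.fromisoformat.

```python
from datetime import datetime

# 'dcterms.created' is the key used above; the other keys are hypothetical examples.
FALLBACK_META_KEYS = ("dcterms.created", "dc.date", "date")


def publish_date_with_fallback(publish_date, meta_data, keys=FALLBACK_META_KEYS):
    """Return publish_date if set, else the first parseable date in meta_data."""
    if publish_date is not None:
        return publish_date
    for key in keys:
        value = meta_data.get(key)
        if not isinstance(value, str):
            continue  # key is missing, or holds a nested dict
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            continue  # not an ISO 8601 string; try the next key
    return None
```

With an article object it would be called as publish_date_with_fallback(article.publish_date, article.meta_data).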

Sometimes a source has its published dates in a section that newspaper doesn't look at. When this happens you have to wrap some additional code around newspaper to harvest the date.

For example, BBC stores its dates in a script tag of type application/ld+json. Newspaper isn't designed to query or extract from this script. To extract the dates from BBC articles you would do this:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(article.html, 'html.parser')
bbc_dictionary = json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))

date_published = [value for (key, value) in bbc_dictionary.items() if key == 'datePublished']
print(date_published)
['2020-10-11T20:11:33.000Z']
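
The snippet above assumes the page has exactly one ld+json script and that datePublished sits at the top level of it. A somewhat more defensive sketch is shown below. It is an assumption-laden illustration, not a BBC-specific recipe: all names are made up for the example, and to stay dependency-free it uses html.parser from the standard library instead of BeautifulSoup. It scans every ld+json block, skips malformed JSON, and searches nested objects and lists for the first datePublished value.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def find_key(node, key):
    """Depth-first search for the first non-None value stored under key."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def json_ld_date_published(html):
    """Return the first datePublished found in any ld+json block, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON
        found = find_key(data, "datePublished")
        if found is not None:
            return found
    return None
```

It would be called as json_ld_date_published(article.html) and returns the raw string, which you can then parse however you like.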

I published a Newspaper Usage Document on GitHub that discusses various collection strategies and other topics surrounding this library.

answered Sep 18 '22 at 14:09 by Life is complex