I've been using the newspaper library lately. The only issue I'm finding is that article.publish_date always returns None.
from newspaper import Article

class NewsArticle:
    def __init__(self, url):
        # Download, parse and run NLP once so every getter below has data.
        self.article = Article(url)
        self.article.download()
        self.article.parse()
        self.article.nlp()

    def getKeywords(self):
        return [k.encode('ascii', 'ignore') for k in self.article.keywords]

    def getSummary(self):
        return self.article.summary.encode('ascii', 'ignore')

    def getAuthors(self):
        return [a.encode('ascii', 'ignore') for a in self.article.authors]

    def thumbnail_url(self):
        return self.article.top_image.encode('ascii', 'ignore')

    def date_made(self):
        # Debug print: this is where None shows up for every URL.
        print(self.article.publish_date)
        return self.article.publish_date

    def get_videos(self):
        return [m.encode('ascii', 'ignore') for m in self.article.movies]

    def get_title(self):
        return self.article.title.encode('ascii', 'ignore')
I'm looping over a bunch of URLs. As you can see, I print publish_date before returning it.
This is what I get:
None
None
None
... (None for every remaining URL)
All the other functions work as intended. The documentation on the site shows this example:
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
As far as I can tell, I'm doing the same thing. Can anyone spot what I'm missing?
I'm sure you've solved this issue in the five or so years since you asked, but I wanted to share what I know about newspaper.
This Python library isn't perfect: it is designed to make a best effort at harvesting specific elements, such as an article's title, author names, publish date and several other items. Even with a best effort, newspaper will miss content that isn't in a place it's designed to look.
For example, this comment is from newspaper's extraction code:
3 strategies for publishing date extraction. The strategies are descending in accuracy and the next strategy is only attempted if a preferred one fails.
1. Pubdate from URL
2. Pubdate from metadata
3. Raw regex searches in the HTML + added heuristics
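To make the first strategy concrete, here is a minimal sketch of pulling a date out of a URL. It is not newspaper's own code: the pattern `URL_DATE_RE` and the function `date_from_url` are hypothetical names made up for this illustration, and the real library's regex is more permissive.

```python
import re
from datetime import datetime

# Hypothetical pattern (not newspaper's regex): YYYY/MM/DD or YYYY-MM-DD
# anywhere in the URL, not glued to other digits.
URL_DATE_RE = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url):
    """Return a datetime if the URL embeds a valid date, otherwise None."""
    match = URL_DATE_RE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # Digits that look like a date but are not one, e.g. month 13.
        return None
```

A URL without a date in its path gives None here, which is the case where the next strategy has to take over.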
If newspaper doesn't find a date in the URL, it moves on to the meta tags, but it only checks these:
PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'rnews:datePublished',
     'content': 'content'},
    {'attribute': 'property', 'value': 'article:published_time',
     'content': 'content'},
    {'attribute': 'name', 'value': 'OriginalPublicationDate',
     'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished',
     'content': 'datetime'},
    {'attribute': 'property', 'value': 'og:published_time',
     'content': 'content'},
    {'attribute': 'name', 'value': 'article_date_original',
     'content': 'content'},
    {'attribute': 'name', 'value': 'publication_date',
     'content': 'content'},
    {'attribute': 'name', 'value': 'sailthru.date',
     'content': 'content'},
    {'attribute': 'name', 'value': 'PublishDate',
     'content': 'content'},
    {'attribute': 'pubdate', 'value': 'pubdate',
     'content': 'datetime'},
    {'attribute': 'name', 'value': 'publish_date',
     'content': 'content'},
]
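As a rough illustration of how a list like this gets used, the sketch below scans HTML for elements whose attribute matches one of the entries and returns the first date string it finds. It uses only the standard library and a shortened copy of the list; `MetaDateFinder` and `find_meta_date` are hypothetical names, and this is not how newspaper implements the lookup internally.

```python
from html.parser import HTMLParser

# Shortened copy of the list above, for illustration only.
PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'article:published_time',
     'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished',
     'content': 'datetime'},
    {'attribute': 'name', 'value': 'publish_date',
     'content': 'content'},
]


class MetaDateFinder(HTMLParser):
    """Hypothetical helper: keeps the first date string from a known tag."""

    def __init__(self):
        super().__init__()
        self.date_string = None

    def handle_starttag(self, tag, attrs):
        if self.date_string is not None:
            return  # already found one, ignore the rest of the page
        attributes = dict(attrs)
        for known in PUBLISH_DATE_TAGS:
            if attributes.get(known['attribute']) == known['value']:
                value = attributes.get(known['content'])
                if value:
                    self.date_string = value
                    return


def find_meta_date(html):
    """Return the raw date string from the first matching tag, or None."""
    finder = MetaDateFinder()
    finder.feed(html)
    return finder.date_string
```

A page that only carries its date in a tag outside the list, such as one named dcterms.created, comes back as None from this kind of lookup.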
Fox News stores its dates in the meta tag section, but in a tag that newspaper doesn't query. To extract the date from a Fox News article, you could do this:
# article is a newspaper Article that has already been downloaded and parsed
article_published_date = article.meta_data.get('dcterms.created')
print(article_published_date)
2020-10-11T12:51:53-04:00
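The meta_data lookup above can be wrapped in a small fallback so the rest of your code always asks one function for the date. This is only a sketch: `publish_date_with_fallback` and its `extra_keys` parameter are hypothetical names, the only key taken from this answer is dcterms.created, and `datetime.fromisoformat` needs Python 3.7 or later.

```python
from datetime import datetime


def publish_date_with_fallback(publish_date, meta_data,
                               extra_keys=('dcterms.created',)):
    """Prefer newspaper's publish_date, else try extra meta_data keys.

    publish_date: the value of article.publish_date (may be None).
    meta_data:    the dict from article.meta_data.
    extra_keys:   keys to try in order; add more per site as you find them.
    """
    if publish_date is not None:
        return publish_date
    for key in extra_keys:
        raw = meta_data.get(key)
        if not isinstance(raw, str):
            continue  # missing key, or a non-string value
        try:
            return datetime.fromisoformat(raw)
        except ValueError:
            continue  # not an ISO 8601 string this parser accepts
    return None
```

With a real article you would call it as publish_date_with_fallback(article.publish_date, article.meta_data).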
Sometimes a source keeps its published dates in a section that newspaper doesn't look at. When this happens you have to wrap some additional code around newspaper to harvest the date.
For example, the BBC stores its dates in a script tag of type application/ld+json. Newspaper isn't designed to query or extract from this script. To extract the date from a BBC article, you could do this:
import json
from bs4 import BeautifulSoup

# article is a newspaper Article that has already been downloaded and parsed
soup = BeautifulSoup(article.html, 'html.parser')
bbc_dictionary = json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))
date_published = [value for (key, value) in bbc_dictionary.items() if key == 'datePublished']
print(date_published)
['2020-10-11T20:11:33.000Z']
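The snippet above assumes datePublished sits at the top level of the JSON-LD object. Some sites nest it inside a list or under a key such as @graph, so a recursive search over the decoded JSON can be a bit more forgiving. The function name `find_date_published` is hypothetical, and this is a sketch rather than a complete JSON-LD reader.

```python
import json


def find_date_published(node):
    """Return the first 'datePublished' value found in decoded JSON, or None."""
    if isinstance(node, dict):
        if 'datePublished' in node:
            return node['datePublished']
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # strings, numbers, booleans and null carry no keys
    for child in children:
        found = find_date_published(child)
        if found is not None:
            return found
    return None


# Made-up payload for illustration; a real one comes from the page's script tag.
sample = json.loads(
    '{"@graph": [{"@type": "WebSite"},'
    ' {"@type": "NewsArticle", "datePublished": "2020-10-11T20:11:33.000Z"}]}'
)
print(find_date_published(sample))
```

You would pass it the result of json.loads on the script contents, in place of the list comprehension over bbc_dictionary.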
I published a Newspaper Usage Document on GitHub that discusses various collection strategies and other topics surrounding this library.