Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract image using Newspaper from HTML

I can't download articles like one usually does to instantiate the Article object, like below:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.top_image

However, I can get the HTML from a request. Can I use this raw HTML and pass it somehow to Newspaper to extract the image from it? (below is an attempt, but doesn't work). Thanks

from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image
like image 813
notverygood Avatar asked Sep 19 '25 02:09

notverygood


1 Answers

The Python module Newspaper allows proxies to be used, but this feature is not listed within the module's documentation.


Proxies with Newspaper

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
           'http': "http://ip_address:port_number",
           'https': "https://ip_address:port_number"
          }

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg

Requests with Proxies and Newspaper

import requests
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.download(raw_html.content)
article.parse()
print(article.top_image) https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
like image 77
Life is complex Avatar answered Sep 21 '25 16:09

Life is complex