I'm using html2text in Python to get the raw text (tags included) of an HTML page from any URL, but I'm getting an error.
My code:
import html2text
import urllib2
proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>@<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)
The error:
Traceback (most recent call last):
File "t.py", line 8, in <module>
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Can anyone explain what I'm doing wrong?
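For anyone diagnosing the same failure: [Errno 110] usually means the proxy host/port could not be reached at all (wrong address, or blocked by a firewall), not a problem with html2text. A minimal sketch of making that visible, ported to Python 3's urllib.request names for illustration — the proxy address below is a placeholder, substitute your real <user>:<pass>@<ip>:<port>:

import socket
import urllib.error
import urllib.request

# Placeholder proxy on an unroutable TEST-NET address; replace with yours.
proxy = urllib.request.ProxyHandler({'http': 'http://user:secret@192.0.2.1:8080'})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPBasicAuthHandler())
urllib.request.install_opener(opener)

url = "http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851"
try:
    html = urllib.request.urlopen(url, timeout=10).read()
except urllib.error.URLError as e:
    # "[Errno 110] Connection timed out" surfaces here as e.reason:
    # the connection to the proxy itself never succeeded.
    print("request failed:", e.reason)
except socket.timeout:
    print("request timed out")

With an explicit timeout= and the URLError caught, the script reports the proxy failure instead of hanging and crashing.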
If you don't require SSL, this script should work in Python 2.7.x:
import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()
In Python 3.x, use urllib.request instead of urllib, because urllib2 was Python 2 only; in Python 3 it was merged into urllib. Note that the http:// prefix is required.
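The Python 3 version of the same fetch, for comparison — a sketch of the rename described above:

# Python 3: urllib.urlopen became urllib.request.urlopen
# after urllib and urllib2 were merged into the urllib package.
import urllib.request

url = "http://stackoverflow.com"
with urllib.request.urlopen(url) as f:
    # Python 3 returns bytes, so decode before printing.
    print(f.read().decode("utf-8", errors="replace"))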
EDIT: In 2020, you should use the third-party module requests instead. requests can be installed with pip.
import requests
print(requests.get("http://stackoverflow.com").text)