
How to get the raw HTML text of a given URL using Python

Tags: python, html

I'm using html2text in Python to get the raw text (tags included) of an HTML page from a given URL, but I'm getting an error.

My code:

import html2text
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>@<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)

The error:

Traceback (most recent call last):
  File "t.py", line 8, in <module>
    html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>

Can anyone explain what I'm doing wrong?

asked Feb 19 '15 at 15:02 by aquaman




1 Answer

If you don't require SSL, this script in Python 2.7.x should work:

import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()

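In Python 3.x, use urllib.request instead of urllib, because Python 2's urllib2 was merged into the urllib package in Python 3 (as urllib.request and urllib.error). Note that print is a function in Python 3, so the last line becomes print(f.read()).

The http:// prefix in the URL is required.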
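
For Python 3, the same idea could be sketched roughly as below. This is only an illustration, not the asker's exact setup: the function name fetch_html, its proxy_url and timeout parameters, and the fallback to UTF-8 when the server reports no charset are all assumptions made for the example. The proxy handling mirrors the ProxyHandler approach from the question.

```python
import urllib.request


def fetch_html(url, proxy_url=None, timeout=10):
    """Return the body at `url` as text, tags included.

    `fetch_html`, `proxy_url` and `timeout` are hypothetical names
    chosen for this sketch; they are not part of urllib itself.
    """
    handlers = []
    if proxy_url is not None:
        # Route both http and https requests through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)

    # timeout is in seconds; a slow or unreachable host raises an error
    # after this long instead of waiting for the OS default.
    with opener.open(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header if there is one;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling print(fetch_html("http://stackoverflow.com")) would then print the page source. If you are behind a proxy, pass its address as proxy_url; a wrong or unreachable proxy is one common cause of a "Connection timed out" error like the one in the question.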

EDIT: As of 2020, you should use the third-party requests module, which can be installed with pip.

import requests
print(requests.get("http://stackoverflow.com").text)

answered Oct 15 '22 at 19:10 by noɥʇʎԀʎzɐɹƆ