Using httplib2 and urllib2, I'm trying to fetch pages from this url, but all of them didn't work out and ended up with this exception.
content = conn.request(uri="http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1129, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 901, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 871, in _conn_request
response = conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1027, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
HTTP header was like this
http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902
GET /news/news_print.asp?artice_id=20110727092902 HTTP/1.1
Host: www.zdnet.co.kr
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ko-kr,ko;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: RMID=7d83495d4f336fe0; __utma=37206251.1552605885.1328771258.1328771258.1329070845.2; __utmz=37206251.1328771258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASPSESSIONIDCSQCQTDD=BCLEHPPDEPHEBJDLCFNDMKDN; __utmc=37206251; ASPSESSIONIDSSQCQQCB=MJPLMOJAFPDFCLONCANBIKHN; _EXEN=2
X-FireLogger: 1.2
HTTP/1.1 200 OK
Date: Mon, 13 Feb 2012 18:02:56 GMT
Content-Length: 19158
Content-Type: text/html;charset=UTF-8; Charset=UTF-8
Set-Cookie: ASPSESSIONIDSQSDQRDB=NGAIFHKAGDIOGEMANAOLLKKF; path=/
Cache-Control: private
Any clue?
The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more. Open the URL url, which can be either a string or a Request object. HTTPS requests do not do any verification of the server's certificate.
urllib2 is deprecated in python 3. x. use urllib instaed.
The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.
This works fine for me:
import urllib2
opener = urllib2.build_opener()
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}
opener.addheaders = headers.items()
response = opener.open("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
print response.headers
print response.read()
The website discards all requests that occur without a User-Agent
string.
For all the people that end up here with a similar problem after installing httplib2 0.8:
Version 0.8 has a regression with connection handling in relation with HTTP keep-alive. See the bug report: https://code.google.com/p/httplib2/issues/detail?id=250
There is a fix for this issue, but it has not been released so far. Until then just use httplib2 0.7.7.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With