This shell command succeeds:
$ curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)" http://fifa-infinity.com/robots.txt
and prints the contents of robots.txt. Omitting the user-agent option results in a 403 error from the server. Inspecting robots.txt shows that content under http://www.fifa-infinity.com/board is allowed for crawling. However, the following Python code fails:
import logging
import mechanize
from mechanize import Browser

ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'
br = Browser()
br.addheaders = [('User-Agent', ua)]  # same user-agent string that works with curl
br.set_debug_http(True)               # log raw HTTP request/response traffic
br.set_debug_responses(True)
logging.getLogger('mechanize').setLevel(logging.DEBUG)
br.open('http://www.fifa-infinity.com/robots.txt')
And the output on my console is:
No handlers could be found for logger "mechanize.cookies"
send: 'GET /robots.txt HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.fifa-infinity.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)\r\n\r\n'
reply: 'HTTP/1.1 403 Bad Behavior\r\n'
header: Date: Wed, 13 Feb 2013 15:37:16 GMT
header: Server: Apache
header: X-Powered-By: PHP/5.2.17
header: Vary: User-Agent,Accept-Encoding
header: Connection: close
header: Transfer-Encoding: chunked
header: Content-Type: text/html
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Bad Behavior
Strangely, using curl without setting the user-agent results in "403: Forbidden" rather than "403: Bad Behavior".
Am I somehow doing something wrong, or is this a bug in mechanize/urllib2? I don't see how simply getting robots.txt can be "bad behaviour"?
As verified by experiment, you need to add an Accept header specifying acceptable content types (any value will do, as long as an "Accept" header is present). For example, it works after changing:
br.addheaders = [('User-Agent', ua)]
to:
br.addheaders = [('User-Agent', ua), ('Accept', '*/*')]
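For reference, here is a minimal sketch of the corrected script, using the same URL and user-agent string as in the question; the only change from the original code is the added Accept header:

import mechanize

ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'

br = mechanize.Browser()
# Sending an Accept header alongside the User-Agent is what satisfies the server's filter.
br.addheaders = [('User-Agent', ua), ('Accept', '*/*')]

response = br.open('http://www.fifa-infinity.com/robots.txt')
print response.read()

With the header added, the request should return 200 and print the same robots.txt content that curl retrieved.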