Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales, so I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python 2.6.

I'm hoping for a workaround.

asked May 17 '10 by Diego


3 Answers

You need to tell mechanize to ignore robots.txt:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # tell mechanize not to honour robots.txt
answered by Yuda Prawira


You can try lying about your user agent (e.g., by pretending to be a human being rather than a robot) if you want to get into possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by certain classes of robots, such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?

answered by Alex Martelli


The code to make a working request (with robots.txt handling disabled and a browser-like User-Agent set):

import mechanize

url = 'http://www.barnesandnoble.com/'  # example page to fetch

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
resp = br.open(url)
print resp.info()  # headers
print resp.read()  # content
answered by Vladislav
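
A possible follow-up, since the question also mentions BeautifulSoup: once mechanize has fetched the page, the HTML can be handed to BeautifulSoup for parsing. This is only a sketch assuming BeautifulSoup 3 (the version whose import path is BeautifulSoup on Python 2.6); the URL and the title lookup are illustrative:

import mechanize
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

br = mechanize.Browser()
br.set_handle_robots(False)  # skip the robots.txt check, as in the answers above
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686) Firefox/3.0.1')]

resp = br.open('http://www.barnesandnoble.com/')  # example URL from the question
soup = BeautifulSoup(resp.read())
print soup.title  # illustrative: print the page's <title> tag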