I'm working with Python's Mechanize module. I've come across 3 different sites that cannot be opened by mechanize directly:
http://www.cpsc.gov/cpscpub/prerel/prhtml03/03059.html
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
Adding the following code allows mechanize to open and parse the Wikipedia article and the Google search results:
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
However, my workarounds are no match for the CPSC.gov website. When I try to open it with the mechanize Browser, Python freezes, to the point where even a keyboard interrupt doesn't stop it.
What's going on here?
In the case of the cpsc.gov site, it looks like there's a Refresh header that mechanize's HTTPRefreshProcessor isn't processing correctly. However, you can work around the problem by disabling refresh handling:
import mechanize
url = 'http://www.cpsc.gov/cpscpub/prerel/prhtml03/03059.html'
br = mechanize.Browser()
br.set_handle_refresh(False)
br.open(url)
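For context, a Refresh header value usually carries a delay in seconds, optionally followed by a target URL, e.g. `5; url=http://example.com/next`. Here is a minimal stdlib-only sketch of splitting such a value, just to illustrate what a client has to interpret. Note that `parse_refresh` is a hypothetical helper written for this answer, not mechanize's own parser, and the example.com URL is made up:

```python
def parse_refresh(value):
    """Split a Refresh header value into (delay_seconds, target_url).

    Hypothetical helper for illustration only; this is NOT how mechanize
    parses the header internally.
    """
    # The delay comes first, separated from the rest by a semicolon.
    delay_part, _, rest = value.partition(";")
    delay = float(delay_part.strip())  # raises ValueError on a bad delay

    rest = rest.strip()
    if not rest:
        # No URL given: the header asks to reload the current page.
        return delay, None

    # The remainder is expected to look like "url=<target>".
    key, sep, url = rest.partition("=")
    if not sep or key.strip().lower() != "url":
        raise ValueError("unrecognized Refresh header: %r" % value)

    # Some servers wrap the URL in quotes; drop them.
    return delay, url.strip().strip("\"'")
```

Calling `parse_refresh("5; url=http://example.com/next")` gives the delay and the target as a tuple. A client that honors the delay has to wait that long before following the redirect, so a malformed or very large value is one plausible way for a fetch to appear stuck. I haven't confirmed that this is what the cpsc.gov page sends.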