I am using the following code to connect to a website using a proxy:
import urllib2

proxy_support = urllib2.ProxyHandler({"http": "http://" + proxy})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
html = urllib2.urlopen(url).read()
I am rotating through a list of proxies, and they change frequently. Whenever I hit a bad proxy and the connection fails, the request goes out through my own IP instead.
I tested this by spamming requests to whatismyip, and occasionally my IP would show up.
Can I stop a connection BEFORE it goes out if it is trying to use my home IP?
I'll do my best to explain this, since I ran into the same problem myself.
If a proxy is set on the connection handler, urllib2 will initialize it, check it (correct address? user? password? port?), and use it for the connection.
If you look at the urllib2 source, the author even acknowledges that this is not optimal:
The opener will use several default handlers, including support
for HTTP and FTP. If there is a ProxyHandler, **it must be at the
front of the list of handlers.** (Yuck.)
So the proxy handler is called first: if there is a proxy it is used, and if not, no proxy is used.
BUT if there is any error (bad URL, bad proxy), it returns None to the connection handler.
The connection handler then connects directly, as if no proxy had been set.
Now back to your problem:
You can check the proxies before using them and discard the bad ones. But that still leaves the problem that some proxies will die or change WHILE your program is running.
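The pre-check could look roughly like the sketch below. It is only an illustration, not the code I actually used: the names probe_proxy and filter_working_proxies are made up, the test URL is whatever page you trust, and mapping "https" to the proxy as well is my assumption (the dictionary in the question only covers http URLs). It builds a private opener per proxy instead of calling install_opener, so nothing is changed globally. Note that a successful fetch only shows the proxy answered, not that it hides your IP.

```python
try:
    import urllib2 as request_lib          # Python 2, as in the question
except ImportError:
    import urllib.request as request_lib   # Python 3 equivalent


def probe_proxy(proxy, test_url, timeout=5):
    """Return True if test_url can be fetched through proxy ("host:port")."""
    handler = request_lib.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,  # assumption: same proxy for https URLs
    })
    # Private opener: nothing is installed globally.
    opener = request_lib.build_opener(handler)
    try:
        response = opener.open(test_url, timeout=timeout)
        try:
            response.read()
        finally:
            response.close()
        return True
    except IOError:
        # URLError, HTTPError and socket timeouts all derive from IOError/OSError.
        return False


def filter_working_proxies(proxies, probe):
    """Keep only the proxies for which probe(proxy) returns True."""
    return [p for p in proxies if probe(p)]
```

You would call it as filter_working_proxies(proxy_list, lambda p: probe_proxy(p, test_url)), where proxy_list and test_url are your own values.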
For this you can modify urllib2 to return a local proxy instead of None. In this scenario YOUR local proxy serves a default page for every request, so your program knows when it has hit a problematic proxy.
This is a hack, maybe even an ugly hack.
I did it and scraped the web happily afterwards.
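If you would rather not patch urllib2, a milder variant of the same "tell me when the proxy is bad" idea is to make your own code fail closed. The sketch below is not the hack described above but a different technique: any error while going through the proxy is turned into a marker value, and the request is never retried without a proxy. BAD_PROXY, make_opener, fetch_via_proxy and fetch_with_rotation are hypothetical names, and the "https" mapping is again an assumption. It cannot by itself prevent a fallback that happens inside the library, so you may still want to compare the reported IP against your own.

```python
try:
    import urllib2 as request_lib          # Python 2, as in the question
except ImportError:
    import urllib.request as request_lib   # Python 3 equivalent

BAD_PROXY = object()  # hypothetical marker: "this proxy failed, do not trust it"


def make_opener(proxy):
    """Build a private opener that routes http and https through proxy."""
    handler = request_lib.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,  # assumption: same proxy for https URLs
    })
    return request_lib.build_opener(handler)


def fetch_via_proxy(url, proxy, opener_factory=make_opener, timeout=10):
    """Fetch url through proxy; on any I/O error return BAD_PROXY.

    The function never retries the request without a proxy.
    """
    opener = opener_factory(proxy)
    try:
        response = opener.open(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except IOError:
        return BAD_PROXY


def fetch_with_rotation(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in turn; return (proxy, body) or (None, BAD_PROXY)."""
    for proxy in proxies:
        result = fetch(url, proxy)
        if result is not BAD_PROXY:
            return proxy, result
    return None, BAD_PROXY
```

When every proxy in the list fails you get (None, BAD_PROXY) back, and the caller decides what to do, rather than the request quietly going out some other way.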
Hope that helps.