I'm using the requests library to get a lot of webpages from somewhere. He's the pertinent code:
response = requests.Session()
retries = Retry(total=5, backoff_factor=.1)
response.mount('http://', HTTPAdapter(max_retries=retries))
response = response.get(url)
After a while it just hangs/freezes (never on the same webpage) while getting the page. Here's the traceback when I interrupt it:
File "/Users/Student/Hockey/Scrape/html_pbp.py", line 21, in get_pbp
response = r.read().decode('utf-8')
File "/anaconda/lib/python3.6/http/client.py", line 456, in read
return self._readall_chunked()
File "/anaconda/lib/python3.6/http/client.py", line 566, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/anaconda/lib/python3.6/http/client.py", line 612, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/anaconda/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt
Does anybody know what could be causing it? Or (more importantly) does anybody know a way to stop it if it takes more than a certain amount of time so that I could try again?
Seems like setting a (read) timeout might help you.
Something along the lines of:
response = response.get(url, timeout=5)
(This will set both connect and read timeout to 5 seconds.)
In requests
, unfortunately, neither connect nor read timeouts are set by default, even though the docs say it's good to set it:
Most requests to external servers should have a timeout attached, in case the server is not responding in a timely manner. By default, requests do not time out unless a timeout value is set explicitly. Without a timeout, your code may hang for minutes or more.
Just for completeness, the connect timeout is the number of seconds requests
will wait for your client to establish a connection to a remote machine, and the read timeout is the number of seconds the client will wait between bytes sent from the server.
Patching the documented "send" function will fix this for all requests - even in many dependent libraries and sdk's. When patching libs, be sure to patch supported/documented functions, otherwise you may wind up silently losing the effect of your patch.
import requests
DEFAULT_TIMEOUT = 180
old_send = requests.Session.send
def new_send(*args, **kwargs):
if kwargs.get("timeout", None) is None:
kwargs["timeout"] = DEFAULT_TIMEOUT
return old_send(*args, **kwargs)
requests.Session.send = new_send
The effects of not having any timeout are quite severe, and the use of a default timeout can almost never break anything - because TCP itself has timeouts as well.
On Windows the default TCP timeout is 240 seconds, TCP RFC recommend a minimum of 100 seconds for RTO*retry. Somewhere in that range is a safe default.
To set timeout globally instead of specifying in every request:
from requests.adapters import TimeoutSauce
REQUESTS_TIMEOUT_SECONDS = float(os.getenv("REQUESTS_TIMEOUT_SECONDS", 5))
class CustomTimeout(TimeoutSauce):
def __init__(self, *args, **kwargs):
if kwargs["connect"] is None:
kwargs["connect"] = REQUESTS_TIMEOUT_SECONDS
if kwargs["read"] is None:
kwargs["read"] = REQUESTS_TIMEOUT_SECONDS
super().__init__(*args, **kwargs)
# Set it globally, instead of specifying ``timeout=..`` kwarg on each call.
requests.adapters.TimeoutSauce = CustomTimeout
sess = requests.Session()
sess.get(...)
sess.post(...)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With