I'm web scraping a site, and sometimes when running the script I get this error:
ReadTimeout: HTTPSConnectionPool(host='...', port=443): Read timed out. (read timeout=10)
My code:
import requests
from time import sleep
from bs4 import BeautifulSoup

url = 'mysite.com'
all_links_page = []

# getHeaders() is my own helper that returns the request headers
page_one = requests.get(url, headers=getHeaders(), timeout=10)
sleep(2)

if page_one.status_code == requests.codes.ok:
    soup_one = BeautifulSoup(page_one.content.decode('utf-8'), 'lxml')
    page_links_one = soup_one.select("ul.product_list")
    for links_one in page_links_one:
        for li in links_one.select("li"):
            all_links_page.append(li.a.get("href").strip())
The answers I found were not satisfactory.
Increasing the timeout helped me; I set it to 120 seconds straight away. It turned out that the response from the server arrives within about 40 seconds.
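A minimal sketch of that fix, reusing url and getHeaders() from the question:

page_one = requests.get(url, headers=getHeaders(), timeout=120)  # generous timeout; this server answers in ~40 s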
Why do you have the timeout parameter in there? I would just eliminate it. The reason you get that error is that you set timeout=10, which says: if you don't receive a response from the server within 10 seconds, raise an error. So it's not necessarily the server calling you out. If no timeout is specified explicitly, requests does not time out (at least on your end).
page_one = requests.get(url, headers=headers)  # <-- don't use the timeout parameter
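Note that removing the timeout means the call can hang indefinitely if the server never responds. An alternative sketch keeps the timeout but retries on failure (the retry count of 3 and the 60-second timeout are arbitrary choices, and getHeaders() is the asker's helper):

import requests

page_one = None
for attempt in range(3):  # arbitrary retry budget
    try:
        page_one = requests.get(url, headers=getHeaders(), timeout=60)
        break
    except requests.exceptions.ReadTimeout:
        # server sent no data within 60 seconds; try again
        continue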
This exception might occur due to a timeout or to the available socket buffer memory:
import socket
from urllib3.connection import HTTPConnection

HTTPConnection.default_socket_options = (
    HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_SNDBUF, 1000000),  # 1 MB in bytes
        (socket.SOL_SOCKET, socket.SO_RCVBUF, 1000000)
    ])
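Since requests uses urllib3 under the hood, running this snippet before the first requests.get() call should make new connections pick up the larger socket buffers; whether that resolves the ReadTimeout depends on whether the bottleneck really is buffering rather than a slow server.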