Python requests is slow and takes very long to complete HTTP or HTTPS request

When requesting a web resource, website, or web service with the requests library, the request takes a long time to complete. The code looks similar to the following:

import requests
requests.get("https://www.example.com/")

This request takes over 2 minutes (exactly 2 minutes 10 seconds) to complete! Why is it so slow and how can I fix it?

vauhochzett asked Jun 26 '20 16:06


1 Answer

This problem can have several causes, each with its own solution. There are many answers on Stack Overflow covering the individual causes, so I will try to combine them all to save you the hassle of searching for them.

In my search I have uncovered the following layers to this:

First, try logging

For many problems, activating logging can help you uncover what goes wrong (source):

import logging
import http.client
import requests

# Print the raw HTTP request and response headers.
http.client.HTTPConnection.debuglevel = 1

# You must initialize logging, otherwise you'll not see debug output.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
# Recent requests versions log under "urllib3"; older versions that
# bundled urllib3 used "requests.packages.urllib3" instead.
requests_log = logging.getLogger("urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

requests.get("https://www.example.com")

In case the debug output does not help you solve the problem, read on.

If you only need to check if the server is up, try a HEAD or streaming request

It can be faster to not request all data, but to only send a HEAD request (source):

requests.head("https://www.example.com") 

Some servers don't support HEAD requests. In that case, you can try to stream the response instead (source):

requests.get("https://www.example.com", stream=True) 
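
The two approaches above can be combined into one check. The following is only a sketch: the helper name is_reachable is made up for this example, and treating the status codes 405 and 501 as "HEAD not supported" is an assumption that will not hold for every server.

```python
def is_reachable(session, url, timeout=5.0):
    """Return True if `url` answers with a non-error HTTP status.

    `session` is expected to be a requests.Session; any object with
    compatible head() and get() methods works as well.
    """
    response = session.head(url, timeout=timeout, allow_redirects=True)
    if response.status_code in (405, 501):
        # Assumed meaning: the server rejects HEAD. Fall back to a streamed
        # GET, which returns once the headers arrive and defers the body.
        response = session.get(url, stream=True, timeout=timeout)
        response.close()  # release the connection without reading the body
    return response.status_code < 400
```

You would call it as is_reachable(requests.Session(), "https://www.example.com").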

For multiple requests in a row, try utilizing a Session

If you send multiple requests in a row, you can speed them up by utilizing a requests.Session. A session keeps the connection to the server open and configured, and it also persists cookies as a nice benefit. Try this (source):

import requests
session = requests.Session()
for _ in range(10):
    session.get("https://www.example.com")
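
To see whether a session actually helps in your case, you can time both variants. This is a minimal sketch; time_calls is a hypothetical helper, not part of requests, and it accepts any callable that takes a URL.

```python
import time


def time_calls(fetch, url, count=10):
    """Call fetch(url) `count` times and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(count):
        fetch(url)
    return time.perf_counter() - start
```

Compare time_calls(requests.get, url) with time_calls(requests.Session().get, url). If the second number is clearly lower, connection setup is a relevant part of your delay.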

To parallelize your requests (try for > 10 requests), use requests-futures

If you send a very large number of requests, each request blocks execution until it completes, so they run one after another. You can parallelize this utilizing, e.g., requests-futures (idea from kederrac):

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

with FuturesSession() as session:
    futures = [session.get("https://www.example.com") for _ in range(10)]
    for future in as_completed(futures):
        response = future.result()

Be careful not to overwhelm the server with too many requests at the same time.
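
If you do not want the extra dependency, a similar effect can be sketched with the standard library alone. Note that this swaps requests-futures for a plain ThreadPoolExecutor; fetch_all and its max_workers default are made up for this example. The bounded pool is what keeps the number of simultaneous requests in check.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(fetch, urls, max_workers=4):
    """Run fetch(url) for every URL on a bounded thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for it. Duplicate URLs collapse into one entry.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one request fails
                results[url] = exc
    return results
```

For example, fetch_all(requests.get, list_of_urls) runs at most four requests at the same time.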

If this also does not solve your problem, read on...

The reason might not lie with requests, but with the server or your connection

In many cases, the reason might lie with the server you are requesting from. First, verify this by requesting any other URL in the same fashion:

requests.get("https://www.google.com") 

If this works fine, you can focus your efforts on the following possible problems:

The server only allows specific user-agent strings

The server might specifically block the default user-agent string that requests sends, or it might only accept a whitelist of user agents, or it might reject you for some other reason. To send a nicer user-agent string, try this (source):

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
requests.get("https://www.example.com", headers=headers)

The server rate-limits you

If this problem only occurs sometimes, e.g. after a few requests, the server might be rate-limiting you. Check the response to see if it reads something along those lines (e.g. "rate limit reached", "work queue depth exceeded" or similar; source).

Here, the solution is just to wait longer between requests, for example by using time.sleep().
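
A waiting strategy could look roughly like the following sketch. It assumes the server signals rate limiting with HTTP status 429 (Too Many Requests) and possibly a Retry-After header given in seconds; many servers do this, but yours may signal it differently. The helper name get_with_backoff and its parameters are made up for this example.

```python
import time


def get_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting and retrying while the server answers 429."""
    response = fetch(url)
    for attempt in range(retries):
        if response.status_code != 429:
            break
        # Prefer the server's Retry-After hint when it is a number of
        # seconds; otherwise double the wait on every attempt.
        hint = response.headers.get("Retry-After", "")
        delay = float(hint) if hint.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
        response = fetch(url)
    return response
```

Call it as get_with_backoff(requests.get, url). The sleep parameter only exists so the waiting can be replaced in tests.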

The server response is incorrectly formatted, leading to parsing problems

You can check this by not reading the response you receive from the server. If the code is still slow, this is not your problem, but if this fixed it, the problem might lie with parsing the response.

  1. In case some headers are set incorrectly, this can lead to parsing errors which prevent chunked transfer (source).
  2. In other cases, setting the encoding manually might resolve parsing problems (source).

To fix those, try:

r = requests.get("https://www.example.com")
r.raw.chunked = True  # Fix issue 1
r.encoding = 'utf-8'  # Fix issue 2
print(r.text)
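
To check whether reading the response is the slow part, you can measure the time until the headers arrive separately from the time spent on the body. This is a rough sketch; time_headers_and_body is a hypothetical helper, and `session` is assumed to be a requests.Session.

```python
import time


def time_headers_and_body(session, url, timeout=30.0):
    """Return (seconds until headers arrived, seconds spent reading the body)."""
    start = time.perf_counter()
    # stream=True makes get() return as soon as the headers are parsed.
    response = session.get(url, stream=True, timeout=timeout)
    headers_done = time.perf_counter()
    _ = response.content  # accessing .content downloads the whole body
    body_done = time.perf_counter()
    response.close()
    return headers_done - start, body_done - headers_done
```

If the second value dominates, the delay comes from transferring or decoding the body rather than from connecting to the server.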

IPv6 does not work, but IPv4 does

This might be the worst problem of all to find. An easy, albeit weird, way to check this is to add a timeout parameter as follows:

requests.get("https://www.example.com/", timeout=5) 

If this returns a successful response, the problem should lie with IPv6. The reason is that requests first tries an IPv6 connection. When that times out, it tries to connect via IPv4. By setting the timeout low, you force it to switch to IPv4 within a shorter amount of time.
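
You can also measure the connection time per address family directly from Python using only the standard library. This is a minimal sketch; connect_time is a made-up helper that only opens a plain TCP connection, without TLS or HTTP.

```python
import socket
import time


def connect_time(host, port, family, timeout=5.0):
    """Return seconds needed to open a TCP connection, or None on failure.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None  # no address of this family for the host
    _, socktype, proto, _, address = infos[0]
    start = time.perf_counter()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(address)
    except OSError:
        return None  # refused, unreachable, or timed out
    return time.perf_counter() - start
```

Compare connect_time("www.example.com", 443, socket.AF_INET6) with the same call using socket.AF_INET. If the IPv6 call returns None or takes much longer, IPv6 is the likely culprit.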

Verify by utilizing, e.g., wget or curl:

wget --inet6-only https://www.example.com -O - > /dev/null
# or
curl --ipv6 -v https://www.example.com

In both cases, we force the tool to connect via IPv6 to isolate the issue. If this times out, try again forcing IPv4:

wget --inet4-only https://www.example.com -O - > /dev/null
# or
curl --ipv4 -v https://www.example.com

If this works fine, you have found your problem! But how to solve it, you ask?

  1. A brute-force solution is to disable IPv6 completely.
  2. You may also disable IPv6 for the current session only.
  3. You may just want to force requests to use IPv4. (In the linked answer, you have to adapt the code to always return socket.AF_INET for IPv4.)
  4. If you want to fix this problem for SSH, here is how to force IPv4 for SSH. (In short, add AddressFamily inet to your SSH config.)
  5. You may also want to check if the problem lies with your DNS or TCP.
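
As an illustration of forcing IPv4 from within Python, here is a sketch that temporarily wraps socket.getaddrinfo so that only IPv4 addresses are resolved. Note that this is a different technique from the one in the linked answer: it affects every library in the process, including all threads, for as long as the block is active. The name ipv4_only is made up for this example.

```python
import contextlib
import socket


@contextlib.contextmanager
def ipv4_only():
    """Temporarily make every address lookup in this process IPv4-only."""
    original = socket.getaddrinfo

    def getaddrinfo_ipv4(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the requested family and always ask for IPv4 results.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = getaddrinfo_ipv4
    try:
        yield
    finally:
        socket.getaddrinfo = original  # restore normal lookups
```

Wrap the slow call in it, i.e. run requests.get("https://www.example.com/") inside a "with ipv4_only():" block. If the request becomes fast, that confirms the diagnosis.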

vauhochzett answered Oct 05 '22 17:10
