 

How does requests' stream=True option stream data one block at a time?

I'm using the following code to test how many seconds an HTTP connection can be kept alive:

    import time

    import requests  # third-party library

    start_time = time.time()
    try:
        r = requests.get(BIG_FILE_URL, stream=True)
        total_length = r.headers['Content-Length']
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            time.sleep(1)
    # ... except and more logic to report total time and percentage downloaded

To be sure Python doesn't just download everything at once and wrap it in a generator, I used tcpdump. The transfer does proceed at roughly one packet per second, but I couldn't find what makes the server send one block at a time, or how the requests library achieves that.

I've checked several Stack Overflow questions and the requests documentation, but all of these resources explain how to use the library to download large files; none of them explain the internals of the stream=True option.

My question is: what in the TCP protocol or the HTTP request headers makes the server send one block at a time instead of the whole file at once?

EDIT + possible answer:

After working with Wireshark, I found out the throttling comes from TCP flow control (the receive window), not from anything requests does on the wire. Between chunks the client simply doesn't read from the socket, so the kernel's receive buffer fills up, the advertised window shrinks toward zero, and the server has to pause sending until the next chunk is consumed.

That can cause unexpected behavior, since the window is usually much larger than a chunk and the chunks in the code don't correspond to actual packets.
Example: with a chunk_size of 1000 bytes and a default window of 64 KB (my default on Ubuntu 18.04), the server can push 64 chunks' worth of data immediately. If the body is smaller than 64 KB, the transfer may complete and the connection may close immediately. So this is not a good idea for keeping a connection alive.
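The flow control described in this edit can be reproduced without requests at all. The following is a minimal local sketch (the 4096-byte payload and variable names are arbitrary choices for illustration): a "server" thread pushes data as fast as it can while the "client" never reads, so the kernel buffers fill and sending stalls.

```python
import socket
import threading

# Minimal sketch of TCP flow control: the server pushes data while the
# client never reads. Once the client's kernel receive buffer and the
# server's send buffer fill up, the advertised receive window drops to
# zero and send() can make no further progress.
def run_server(listener, sent):
    conn, _ = listener.accept()
    conn.setblocking(False)     # a full buffer raises instead of blocking
    payload = b"x" * 4096
    try:
        while True:
            sent[0] += conn.send(payload)
    except BlockingIOError:
        pass                    # buffers full: the peer stopped reading
    finally:
        conn.close()

listener = socket.create_server(("127.0.0.1", 0))
sent = [0]
server = threading.Thread(target=run_server, args=(listener, sent))
server.start()

client = socket.create_connection(listener.getsockname())
server.join()                   # returns once the server can buffer no more
print("server buffered", sent[0], "bytes before stalling")
client.close()
listener.close()
```

The total buffered before the stall is exactly the point made above: it is governed by kernel buffer and window sizes, not by any application-level chunk size.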

Tom, asked Feb 21 '20 17:02


People also ask

What is stream=True in requests?

When stream=True is set on the request, this avoids reading the content at once into memory for large responses. The chunk size is the number of bytes it should read into memory. This is not necessarily the length of each item returned as decoding can take place. chunk_size must be of type int or None.

What is stream=True in Python?

Using stream=True sets the stage for you to read the response data in chunks, as opposed to having the entire response body downloaded in one go upfront.

How does requests work in Python?

The Python requests module has several built-in methods to make HTTP requests to a specified URI using GET, POST, PUT, PATCH or HEAD. An HTTP request either retrieves data from a specified URI or pushes data to a server. It works as a request-response protocol between a client and a server.
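As a small sketch of this (no network traffic is generated; example.com and the payload are placeholders), every verb goes through the same Request-to-PreparedRequest machinery before being sent:

```python
import requests  # third-party: pip install requests

# Prepare (but do not send) a request for each verb. requests sets a JSON
# Content-Type automatically when a json= body is supplied.
content_types = {}
for method in ("GET", "POST", "PUT", "PATCH", "HEAD"):
    body = {"key": "value"} if method in ("POST", "PUT", "PATCH") else None
    prepared = requests.Request(method, "http://example.com/resource", json=body).prepare()
    content_types[method] = prepared.headers.get("Content-Type")
    print(method, prepared.url, content_types[method])
```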

How do you request a session in Python?

The requests Session object allows you to persist specific parameters across requests to the same site. To get a Session object, call requests.Session(). The Session object can store parameters such as cookies and HTTP headers.
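A short sketch of that persistence (the API key and cookie values are made up for illustration; nothing is sent over the network):

```python
import requests  # third-party: pip install requests

# Headers and cookies stored on the Session are merged into every
# request prepared through it.
session = requests.Session()
session.headers.update({"X-Api-Key": "hypothetical-key"})
session.cookies.set("sessionid", "abc123")

prepared = session.prepare_request(
    requests.Request("GET", "http://example.com/profile"))
print(prepared.headers["X-Api-Key"])   # shared header applied
print(prepared.headers.get("Cookie"))  # cookie jar merged in
```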


2 Answers

This is not explained in the user documentation, so I went through the source code of requests. It turns out stream=True does not add anything to the outgoing request headers: it only tells requests not to read the response body immediately, so the data stays buffered in the socket until you iterate over it. (The place where requests sets headers['Transfer-Encoding'] = 'chunked' is for request bodies of unknown length, such as generators, not for stream=True downloads.) Whether the response itself uses chunked transfer encoding is the server's decision: in chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks" that the server sends out independently of one another. Hope this answers the question.
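To see where requests itself emits that header, here is a sketch that prepares (but never sends) two requests; the URLs are placeholders. A generator body has no known length, so requests falls back to chunked transfer encoding for the upload, while a plain GET carries no such header:

```python
import requests  # third-party: pip install requests

# requests emits 'Transfer-Encoding: chunked' only when the *request*
# body has no known length, e.g. a generator. A plain GET prepared with
# default options carries no such header, regardless of whether
# stream=True is later used when reading the response.
def body():
    yield b"part-1"
    yield b"part-2"

upload = requests.Request("POST", "http://example.com/upload", data=body()).prepare()
download = requests.Request("GET", "http://example.com/bigfile").prepare()

print(upload.headers.get("Transfer-Encoding"))
print(download.headers.get("Transfer-Encoding"))
```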

Tarique, answered Nov 15 '22 19:11


This question caught my curiosity, so I decided to go down this research rabbit hole. Here are some of my (open to corrections!) findings:

  • Client-to-server communication is standardized by the Open Systems Interconnection (OSI) model.

  • The transfer of data is handled by layer 4, the transport layer. TCP/IP always breaks the data into packets; an IP packet maxes out at roughly 65,535 bytes.

    Now what keeps Python from recombining all these packets into the original file before returning it?

    The requests iter_content method has a nested generator that wraps a urllib3 generator method: urllib3.response.HTTPResponse(...).stream(...)

    The 'chunk_size' parameter seems to set a buffer for how much data is read from the open socket into memory before it's handed back to the caller (who may then write it to the file system).

    Here's a copy of the iter_content method that was helpful:
def iter_content(self, chunk_size=1, decode_unicode=False):
        """Iterates over the response data.  When stream=True is set on the
        request, this avoids reading the content at once into memory for
        large responses.  The chunk size is the number of bytes it should
        read into memory.  This is not necessarily the length of each item
        returned as decoding can take place.

        chunk_size must be of type int or None. A value of None will
        function differently depending on the value of `stream`.
        stream=True will read data as it arrives in whatever size the
        chunks are received. If stream=False, data is returned as
        a single chunk.

        If decode_unicode is True, content will be decoded using the best
        available encoding based on the response.
        """

        def generate():
            # Special case for urllib3.
            if hasattr(self.raw, 'stream'):
                try:
                    for chunk in self.raw.stream(chunk_size, decode_content=True):
                        yield chunk
                except ProtocolError as e:
                    raise ChunkedEncodingError(e)
                except DecodeError as e:
                    raise ContentDecodingError(e)
                except ReadTimeoutError as e:
                    raise ConnectionError(e)
            else:
                # Standard file-like object.
                while True:
                    chunk = self.raw.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk

            self._content_consumed = True

        if self._content_consumed and isinstance(self._content, bool):
            raise StreamConsumedError()
        elif chunk_size is not None and not isinstance(chunk_size, int):
            raise TypeError("chunk_size must be an int, it is instead a %s." % type(chunk_size))
        # simulate reading small chunks of the content
        reused_chunks = iter_slices(self._content, chunk_size)

        stream_chunks = generate()

        chunks = reused_chunks if self._content_consumed else stream_chunks

        if decode_unicode:
            chunks = stream_decode_response_unicode(chunks, self)

        return chunks
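To round this off, here is a usage sketch of iter_content streaming a body to completion in fixed-size chunks. A throwaway local HTTP server stands in for BIG_FILE_URL, and the 100,000-byte payload and port choice are arbitrary:

```python
import http.server
import threading

import requests  # third-party: pip install requests

PAYLOAD = b"x" * 100_000

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/file" % server.server_address[1]
received = 0
with requests.get(url, stream=True) as r:
    # Each iteration pulls at most 8192 bytes off the socket into memory.
    for chunk in r.iter_content(chunk_size=8192):
        received += len(chunk)

server.shutdown()
print("received", received, "bytes")
```

Because stream=True defers the read, memory use stays near chunk_size even for very large bodies, which is the whole point of the generator machinery above.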
mark_s, answered Nov 15 '22 17:11