I'm using the following code to test how many seconds an HTTP connection can be kept alive:
import time

import requests

# BIG_FILE_URL and CHUNK_SIZE are defined elsewhere in the script.
start_time = time.time()
try:
    r = requests.get(BIG_FILE_URL, stream=True)
    total_length = r.headers['Content-length']
    for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
        time.sleep(1)
# ... except and more logic to report total time and percentage downloaded
To be sure Python doesn't just download everything at once and wrap it in a generator, I used tcpdump. The server does send approximately one packet per second, but I couldn't find out what makes it send one block at a time, or how the requests library achieves that.
I've checked several Stack Overflow questions and looked at the requests library documentation, but all the resources explain how to use the library to download large files; none of them explain the internals of the stream=True option.
My question is: what in the TCP protocol or the HTTP request headers makes the server send one block at a time rather than the whole file at once?
After working with Wireshark, I found out that Python implements this using TCP's sliding (receive) window: the client simply doesn't ACK further data until the next chunk is read.
That can cause some unexpected behavior, because the window may be much larger than the chunk size, so the chunks in your code don't necessarily correspond to actual packets.
Example: if you set the chunk size to 1000 bytes and the default window is 64 KB (my default on Ubuntu 18.04), 64 chunks' worth of data is sent immediately. If the body is smaller than 64 KB, the connection may close right away. So this is not a good way to keep a connection alive.
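To see the window size involved on your own machine, here is a minimal sketch (assuming a stock Linux box; the exact value is platform-dependent and kernel-tunable) that reads the default receive buffer size, which bounds how much data the server can push before the reader drains the socket:

import socket

# Read the kernel's default receive buffer size for a fresh TCP socket.
# This caps how much the server can send before the reader (e.g. the
# iter_content loop) drains the socket and window updates go out.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # e.g. 65536; varies by platform and sysctl settings
s.close()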
Using stream=True sets the stage for you to read the response data in chunks, as opposed to having the entire response body downloaded in one go up front.
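A minimal sketch of that difference (the httpbin.org URL and sizes here are only illustrative assumptions):

import requests

# stream=False (the default): the whole body is downloaded before get() returns.
eager = requests.get("https://httpbin.org/bytes/1024")
print(len(eager.content))  # 1024, already in memory

# stream=True: only the headers have been read; the body is pulled on demand.
lazy = requests.get("https://httpbin.org/bytes/1024", stream=True)
total = sum(len(chunk) for chunk in lazy.iter_content(chunk_size=256))
print(total)  # 1024, read from the socket as we iterate
lazy.close()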
This is not explained in the user documentation. By going through the source code of requests, I found out that if we set stream=True in requests.get(...), then headers['Transfer-Encoding'] = 'chunked' is set in the HTTP headers, specifying chunked transfer encoding. In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks", and the server sends the chunks out independently of one another.
Hope this answers the question.
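For reference, here is a small sketch of the chunked wire format described above (illustrative of the encoding itself, not of requests' internals; the encode_chunked helper is made up for this example):

# Each chunk is sent as its size in hex, CRLF, the raw bytes, CRLF;
# a zero-length chunk terminates the body.
def encode_chunked(chunks):
    body = b""
    for chunk in chunks:
        body += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return body + b"0\r\n\r\n"

print(encode_chunked([b"Hello, ", b"world!"]))
# b'7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n'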
This question caught my curiosity, so I decided to go down this research rabbit hole. Here are some of my (open to corrections!) findings:
The iter_content method has a nested generator which wraps a urllib3 generator method, urllib3.response.HTTPResponse(...).stream(...). Here is the source of the iter_content method, starting with a docstring that was helpful:

def iter_content(self, chunk_size=1, decode_unicode=False):
"""Iterates over the response data. When stream=True is set on the
request, this avoids reading the content at once into memory for
large responses. The chunk size is the number of bytes it should
read into memory. This is not necessarily the length of each item
returned as decoding can take place.
chunk_size must be of type int or None. A value of None will
function differently depending on the value of `stream`.
stream=True will read data as it arrives in whatever size the
chunks are received. If stream=False, data is returned as
a single chunk.
If decode_unicode is True, content will be decoded using the best
available encoding based on the response.
"""
def generate():
# Special case for urllib3.
if hasattr(self.raw, 'stream'):
try:
for chunk in self.raw.stream(chunk_size, decode_content=True):
yield chunk
except ProtocolError as e:
raise ChunkedEncodingError(e)
except DecodeError as e:
raise ContentDecodingError(e)
except ReadTimeoutError as e:
raise ConnectionError(e)
else:
# Standard file-like object.
while True:
chunk = self.raw.read(chunk_size)
if not chunk:
break
yield chunk
self._content_consumed = True
if self._content_consumed and isinstance(self._content, bool):
raise StreamConsumedError()
elif chunk_size is not None and not isinstance(chunk_size, int):
raise TypeError("chunk_size must be an int, it is instead a %s." % type(chunk_size))
# simulate reading small chunks of the content
reused_chunks = iter_slices(self._content, chunk_size)
stream_chunks = generate()
chunks = reused_chunks if self._content_consumed else stream_chunks
if decode_unicode:
chunks = stream_decode_response_unicode(chunks, self)
return chunks
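A short usage sketch of the two paths above (the URL is an illustrative assumption, and _content_consumed is a private attribute, poked at here only to show the mechanism): on the first pass, generate() pulls chunks from urllib3's HTTPResponse.stream(); once the body has been consumed and cached, iter_slices simply re-slices the cached bytes without touching the network.

import requests

r = requests.get("https://example.com/", stream=True)
print(hasattr(r.raw, 'stream'))   # True: r.raw is a urllib3 HTTPResponse
print(r._content_consumed)        # False: nothing read from the socket yet

data = r.content                  # consumes and caches the whole stream
print(r._content_consumed)        # True

for chunk in r.iter_content(chunk_size=512):
    pass                          # now served by iter_slices, no network I/O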