Content-length header not the same as when manually calculating it?

Tags:

python-requests

An answer here (Size of raw response in bytes) says:

Just take the len() of the content of the response:
>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671

However, doing that does not give the accurate content length. For example, consider this Python code:

import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text       : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content    : "+str(sys.getsizeof(r.content)))
        print("len r.text            : "+str(len(r.text)))
        print("len r.content         : "+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")

If we try to calculate the content length manually and compare it to the value in the header, we get a result that is much larger:

Correct Content Length: 51504
bytes of r.text       : 515142
bytes of r.content    : 257623
len r.text            : 257552
len r.content         : 257606

Why does len(r.content) not return the correct content length? And how can we manually calculate it accurately if the header is missing?

asked Jun 12 '18 20:06

1 Answer

The Content-Length header reflects the body of the response. That's not the same thing as the length of the text or content attributes, because the response could be compressed. requests decompresses the response for you.
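The size difference can be reproduced without any network access. The following is a minimal sketch using only the standard library gzip module; the body below is made-up sample data, not what any real server sends:

```python
import gzip

# Hypothetical HTML body, repeated so that it compresses well.
body = b"<p>Hello, world!</p>\n" * 1000

# Roughly what a server using Content-Encoding: gzip puts on the wire;
# Content-Length would describe this compressed payload.
compressed = gzip.compress(body)

# What a client holds after transparent decompression,
# comparable to r.content in requests.
decompressed = gzip.decompress(compressed)

print("compressed size  :", len(compressed))
print("decompressed size:", len(decompressed))
```

The compressed size is much smaller than the decompressed size, which is the same kind of gap the question observes between the header and len(r.content).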

You'd have to bypass a lot of internal plumbing to get the original, compressed, raw content, and then you'd have to access some more internals if you want the response object to still work correctly. The 'easiest' method is to enable streaming, then read from the raw socket:

from io import BytesIO

import requests

url = "https://stackoverflow.com"
r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection (still compressed)
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)

Demo:

>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length']   # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type']     # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content)              # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content)    # the decompressed binary content, byte count
258719
>>> len(r.text)       # the Unicode content decoded from UTF-8, character count
258658

This reads the full response into memory, so don't use this if you expect large responses! In that case, you could instead use shutil.copyfileobj() to copy the data from the r.raw file to a spooled temporary file (which will switch to an on-disk file once a certain size is reached), get the file size of that file, then stuff that file onto r.raw._fp.

A function that adds a Content-Length header to any response that is missing that header would look like this:

import requests
import shutil
import tempfile

def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1 MiB
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r

This accepts an existing session, and lets you specify the request method too. Adjust max_size as needed for your memory constraints. Demo on https://github.com, which lacks a Content-Length header:

>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814
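If only the byte count is needed and the body can be discarded afterwards, the data can also be counted chunk by chunk instead of being stored. This is a sketch under that assumption; count_bytes is a hypothetical helper, not part of requests, and it is exercised here with an in-memory BytesIO. With requests, the file-like object would presumably be r.raw from a stream=True request, which is consumed in the process:

```python
from io import BytesIO

def count_bytes(fileobj, chunk_size=64 * 1024):
    """Count the bytes readable from a file-like object without keeping them."""
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the stream.
            break
        total += len(chunk)
    return total

# Stand-in for a raw response stream.
sample = BytesIO(b"x" * 200_000)
print(count_bytes(sample))  # 200000
```

Memory use stays bounded by chunk_size, at the cost of not being able to read the body again afterwards.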

Note that if there is no Content-Encoding header present, or the value for that header is set to identity, and the Content-Length header is available, then you can just rely on Content-Length being the full size of the response body. That's because no compression has been applied in that case.
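That check could be sketched as a small helper. The name content_length_matches_body is hypothetical, and the function only inspects a mapping of headers; r.headers from requests is case-insensitive, whereas the plain dicts used here require the exact header spelling:

```python
def content_length_matches_body(headers):
    """Return True if Content-Length can be taken as the decoded body size."""
    if "Content-Length" not in headers:
        return False
    # A missing Content-Encoding is treated the same as "identity".
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    return encoding in ("", "identity")

print(content_length_matches_body({"Content-Length": "51504"}))
print(content_length_matches_body(
    {"Content-Length": "51504", "Content-Encoding": "gzip"}
))
```

The first call prints True and the second prints False, since a gzip-encoded body has to be decompressed before its length means anything for r.content.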

As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See What is the difference between len() and sys.getsizeof() methods in python?
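A quick illustration of the difference, using made-up sample values. The exact sys.getsizeof() numbers depend on the Python version and build, so none are shown here:

```python
import sys

data = b"abc"
text = "abc"

# len() counts the bytes or characters in the object.
print(len(data))  # 3
print(len(text))  # 3

# sys.getsizeof() also includes the object header and other overhead,
# so it is larger than len() for small bytes and str objects.
print(sys.getsizeof(data))
print(sys.getsizeof(text))
```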

answered Oct 06 '22 00:10

Martijn Pieters

Tags:

python

python-requests

Jonathan Laliberte
