How to determine the filename of content downloaded with HTTP in Python?

I download a file using the get function of Python's requests library. For storing the file, I'd like to determine the filename the way a web browser would for its 'save' or 'save as ...' dialog.

Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:

import re
# r is the response object returned by requests.get(...)
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)

But looking more closely at this topic, it isn't that easy:

According to RFC 6266 section 4.3, and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences. Further,

when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".

The value of filename*, though, is a bit more complicated still than that of filename.

Also, the RFC seems to allow for additional whitespace around the =.

Thus, for the examples listed in the RFC, I'd want the following results:

  Content-Disposition: Attachment; filename=example.html

filename: example.html

  Content-Disposition: INLINE; FILENAME= "an example.html"

filename: an example.html

  Content-Disposition: attachment;
                       filename*= UTF-8''%e2%82%ac%20rates

filename: € rates

  Content-Disposition: attachment;
                       filename="EURO rates";
                       filename*=utf-8''%e2%82%ac%20rates

filename: € rates here, too (not EURO rates, as filename* takes precedence)
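
To make the gap concrete, here is a quick check of the naive regular expression from above against these four header values. The helper name naive_filename and the headers list are made up for this illustration only:

```python
import re

def naive_filename(header_value):
    # Same pattern as the snippet above: case-sensitive, greedy,
    # and unaware of the filename* parameter.
    return re.findall("filename=(.+)", header_value)

headers = [
    'Attachment; filename=example.html',
    'INLINE; FILENAME= "an example.html"',
    "attachment; filename*= UTF-8''%e2%82%ac%20rates",
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates',
]

for h in headers:
    print(repr(h), '->', naive_filename(h))
```

Only the first header gives the wanted result. The second and third give an empty list (the pattern is case-sensitive, and "filename*=" does not contain "filename="), and the fourth gives a single match that runs to the end of the line, quotes and filename* parameter included.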

Now, I could easily adapt the regular expression to account for variable whitespace around the =, but having it handle all the other variations, too, would get rather unwieldy. (With the quoting and escaping, I'm not even sure a regular expression can cover all the cases. Maybe it can, as there is no brace-nesting involved.)

So do I have to implement a full-blown parser, or can I determine the filename according to RFC 6266 with a few calls to an HTTP library (maybe requests itself)? As RFC 6266 is part of the HTTP standard, I could imagine that some libraries specializing in HTTP already cover this. (So I've also asked on Software Recommendations SE.)

asked May 05 '16 21:05

das-g

2 Answers

The rfc6266 library appears to do exactly what you need. It can parse raw headers, requests responses, and urllib2 responses. It's on PyPI.

Some examples:

>>> import rfc6266, requests
>>> rfc6266.parse_headers('''Attachment; filename=example.html''').filename_unsafe
'example.html'
>>> rfc6266.parse_headers('''INLINE; FILENAME= "an example.html"''').filename_unsafe
'an example.html'
>>> rfc6266.parse_headers(
    '''attachment; '''
    '''filename*= UTF-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> rfc6266.parse_headers(
    '''attachment; '''
    '''filename="EURO rates"; '''
    '''filename*=utf-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> r = requests.get('http://example.com/€ rates')
>>> rfc6266.parse_requests_response(r).filename_unsafe
'€ rates'

As a note, though: this library does not like nonstandard whitespace in the header.

answered Oct 11 '22 07:10

Alyssa Haroldsen

If you don't really need the result decoded from UTF-8:

import re

def getFilename(s):
    # take everything after filename= or filename*= up to the next ';'
    fname = re.findall(r"filename\*?=([^;]+)", s, flags=re.IGNORECASE)
    print(fname[0].strip().strip('"'))

But if UTF-8 is a must:

import re
from urllib.parse import unquote

def getFilename(s):
    # prefer filename*, fall back to filename
    fname = re.findall(r"filename\*=([^;]+)", s, flags=re.IGNORECASE)
    if not fname:
        fname = re.findall(r"filename=([^;]+)", s, flags=re.IGNORECASE)
    if "utf-8''" in fname[0].lower():
        fname = re.sub("utf-8''", '', fname[0], flags=re.IGNORECASE)
        # Python 3: unquote() decodes the percent-escapes as UTF-8 by default
        fname = unquote(fname)
    else:
        fname = fname[0]
    # clean space and double quotes
    print(fname.strip().strip('"'))

# example
getFilename('Attachment; filename=example.html')
getFilename('INLINE; FILENAME= "an example.html"')

getFilename("attachment;filename*= UTF-8''%e2%82%ac%20rates")
getFilename("attachment; filename=\"EURO rates\";filename*=utf-8''%e2%82%ac%20rates")

getFilename("attachment;filename=\"_____ _____ ___ __ ____ _____ Hekayt Bent.2017.mp3\";filename*=UTF-8''%D8%A7%D8%BA%D9%86%D9%8A%D9%87%20%D8%AD%D9%83%D8%A7%D9%8A%D8%A9%20%D8%A8%D9%86%D8%AA%20%D9%84%D9%80%20%D9%85%D8%AD%D9%85%D8%AF%20%D8%B4%D8%AD%D8%A7%D8%AA%D8%A9%20Hekayt%20Bent.2017.mp3")

Result:

example.html
an example.html
€ rates
€ rates
اغنيه حكاية بنت لـ محمد شحاتة Hekayt Bent.2017.mp3
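
The regex above splits on every ;, including one inside a quoted filename, leaves backslash escapes in place, and assumes the charset is UTF-8. A slightly more careful sketch using only the standard library follows. The names filename_from_header and _PARAM are made up here, it does not handle RFC 2231 continuations (filename*0*), and the returned name is untrusted input that still needs sanitizing before it is used as a path:

```python
import re
from urllib.parse import unquote

# One ";name=value" parameter; the value is a quoted-string or a bare token.
_PARAM = re.compile(
    r';\s*(?P<name>[^\s=;]+)\s*=\s*'
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<token>[^\s;]+))'
)

def filename_from_header(value):
    """Return the filename from a Content-Disposition value, or None."""
    params = {}
    for m in _PARAM.finditer(value):
        name = m.group('name').lower()
        if m.group('quoted') is not None:
            # Undo backslash escapes inside the quoted-string.
            val = re.sub(r'\\(.)', r'\1', m.group('quoted'))
        else:
            val = m.group('token')
        params.setdefault(name, val)  # keep the first occurrence of a name

    ext = params.get('filename*')
    if ext is not None:
        # Extended value: charset'language'percent-encoded-text
        parts = ext.split("'", 2)
        if len(parts) == 3:
            charset, _language, encoded = parts
            try:
                return unquote(encoded, encoding=charset, errors='strict')
            except (LookupError, UnicodeDecodeError):
                pass  # unknown charset or invalid bytes: fall back to filename
    return params.get('filename')

print(filename_from_header('INLINE; FILENAME= "an example.html"'))
print(filename_from_header("attachment; filename*=iso-8859-1'en'%A3%20rates"))
print(filename_from_header('attachment; filename="a;b.txt"'))
```

Because filename* is looked up first, it wins over filename whenever both are present and it decodes cleanly, which matches the preference quoted from the RFC in the question.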

answered Oct 11 '22 06:10

ewwink
