How to get filename from Content-Disposition in headers

I am downloading a file with Mechanize and in response headers there is a string:

Content-Disposition: attachment; filename=myfilename.txt 

Is there a quick standard way to get that filename value? What I have in mind now is this:

filename = f[1]['Content-Disposition'].split('; ')[1].replace('filename=', '') 

But it looks like a quick'n'dirty solution.

Sergei Basharov asked Nov 07 '11 11:11


2 Answers

First get the value of the header using mechanize, then parse it with the standard library's cgi module.

To demonstrate:

    >>> import mechanize
    >>> browser = mechanize.Browser()
    >>> response = browser.open('http://example.com/your/url')
    >>> info = response.info()
    >>> header = info.getheader('Content-Disposition')
    >>> header
    'attachment; filename=myfilename.txt'

The header value can then be parsed:

    >>> import cgi
    >>> value, params = cgi.parse_header(header)
    >>> value
    'attachment'
    >>> params
    {'filename': 'myfilename.txt'}

params is a simple dict so params['filename'] is what you need. It doesn't matter whether the filename is wrapped in quotes or not.
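On Python 3 the cgi module was deprecated in 3.11 and removed in 3.13, so a rough equivalent can be sketched with the standard library's email.message.Message instead. The helper name filename_from_header is hypothetical, and the header value is assumed to have been fetched already, as in the session above.

```python
from email.message import Message


def filename_from_header(header):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    # Let the email package parse the parameters, including quoted values.
    msg['Content-Disposition'] = header
    return msg.get_filename()


print(filename_from_header('attachment; filename=myfilename.txt'))
print(filename_from_header('attachment; filename="my file.txt"'))
```

As with cgi.parse_header, the caller does not need to care whether the value was quoted; get_filename() returns None when no filename parameter is present.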

siebz0r answered Oct 08 '22 14:10


These regular expressions are based on the grammar from RFC 6266, but modified to accept headers without disposition-type, e.g. Content-Disposition: filename=example.html

i.e. [ disposition-type ";" ] disposition-parm ( ";" disposition-parm )* / disposition-type

It will handle filename parameters with and without quotes, and unquote quoted pairs from values in quotes, e.g. filename="foo\"bar" -> foo"bar
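The unquoting step can be shown in isolation. This is a minimal Python 3 sketch, assuming the outer double quotes have already been stripped by the parser; the function name unquote_quoted_string is made up for illustration.

```python
import re


def unquote_quoted_string(value):
    """Replace each quoted-pair (backslash + character) with the character itself."""
    # Raw strings matter here: a plain '\1' would be the control character
    # chr(1), not a backreference to the captured character.
    return re.sub(r'\\(.)', r'\1', value)


raw = r'foo\"bar'  # the text found between the outer double quotes
print(unquote_quoted_string(raw))
```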

It will handle filename* extended parameters, and it prefers a filename* extended parameter over a filename parameter regardless of the order in which they appear in the header.
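Decoding a filename* value comes down to splitting off the charset and language, then percent-decoding the rest in that charset. A minimal Python 3 sketch, assuming the ext-value has already been extracted from the header; decode_ext_value is a hypothetical helper, and the sample values are the ones used as examples in RFC 5987.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value):
    """Decode an RFC 5987 ext-value of the form charset'language'pct-encoded."""
    charset, _language, encoded = ext_value.split("'", 2)
    # Percent-decode the bytes and interpret them in the declared charset.
    return unquote(encoded, encoding=charset, errors='strict')


print(decode_ext_value("UTF-8''%e2%82%ac%20rates"))
print(decode_ext_value("iso-8859-1'en'%A3%20rates"))
```

The sketch does not validate the charset or the language tag; an unknown charset raises LookupError.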

It strips folder name information, e.g. /etc/passwd -> passwd, and it defaults to the basename from the URL path in the absence of a filename parameter (or header, or if the parameter value is empty string)
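The sanitising and fallback behaviour can be sketched separately from the parsing. This Python 3 version is an illustration only: safe_name is a hypothetical helper, the URLs are placeholders, and posixpath is used instead of os.path so that the result does not depend on the platform (it therefore treats only "/" as a separator).

```python
import posixpath
from urllib.parse import unquote, urlparse


def safe_name(name, url):
    """Drop directory components; fall back to the URL's last path segment."""
    if name:
        name = posixpath.basename(name)
    if not name:
        # No usable filename parameter: use the basename of the URL path.
        name = posixpath.basename(unquote(urlparse(url).path))
    return name


print(safe_name('/etc/passwd', 'http://example.com/files/report.pdf'))
print(safe_name('', 'http://example.com/files/report%20v2.pdf'))
```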

The token and qdtext regular expressions are based on the grammar from RFC 2616, the mimeCharset and valueChars regular expressions on the grammar from RFC 5987, and the language regular expression on the grammar from RFC 5646.

    import re, urllib
    from os import path
    from urlparse import urlparse

    # content-disposition = "Content-Disposition" ":"
    #                        disposition-type *( ";" disposition-parm )
    # disposition-type    = "inline" | "attachment" | disp-ext-type
    #                     ; case-insensitive
    # disp-ext-type       = token
    # disposition-parm    = filename-parm | disp-ext-parm
    # filename-parm       = "filename" "=" value
    #                     | "filename*" "=" ext-value
    # disp-ext-parm       = token "=" value
    #                     | ext-token "=" ext-value
    # ext-token           = <the characters in token, followed by "*">

    token = '[-!#-\'*+.\dA-Z^-z|~]+'
    qdtext = '[]-~\t !#-[]'
    mimeCharset = '[-!#-&+\dA-Z^-z]+'
    language = '(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}(?:-[A-Za-z]{3}){,2})?|[A-Za-z]{4,8})(?:-[A-Za-z]{4})?(?:-(?:[A-Za-z]{2}|\d{3}))?(?:-(?:[\dA-Za-z]{5,8}|\d[\dA-Za-z]{3}))*(?:-[\dA-WY-Za-wy-z](?:-[\dA-Za-z]{2,8})+)*(?:-[Xx](?:-[\dA-Za-z]{1,8})+)?|[Xx](?:-[\dA-Za-z]{1,8})+|[Ee][Nn]-[Gg][Bb]-[Oo][Ee][Dd]|[Ii]-[Aa][Mm][Ii]|[Ii]-[Bb][Nn][Nn]|[Ii]-[Dd][Ee][Ff][Aa][Uu][Ll][Tt]|[Ii]-[Ee][Nn][Oo][Cc][Hh][Ii][Aa][Nn]|[Ii]-[Hh][Aa][Kk]|[Ii]-[Kk][Ll][Ii][Nn][Gg][Oo][Nn]|[Ii]-[Ll][Uu][Xx]|[Ii]-[Mm][Ii][Nn][Gg][Oo]|[Ii]-[Nn][Aa][Vv][Aa][Jj][Oo]|[Ii]-[Pp][Ww][Nn]|[Ii]-[Tt][Aa][Oo]|[Ii]-[Tt][Aa][Yy]|[Ii]-[Tt][Ss][Uu]|[Ss][Gg][Nn]-[Bb][Ee]-[Ff][Rr]|[Ss][Gg][Nn]-[Bb][Ee]-[Nn][Ll]|[Ss][Gg][Nn]-[Cc][Hh]-[Dd][Ee]'
    valueChars = '(?:%[\dA-Fa-f][\dA-Fa-f]|[-!#$&+.\dA-Z^-z|~])*'
    dispositionParm = '[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\s*=\s*(?:({token})|"((?:{qdtext}|\\\\[\t !-~])*)")|[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\*\s*=\s*({mimeCharset})\'(?:{language})?\'({valueChars})|{token}\s*=\s*(?:{token}|"(?:{qdtext}|\\\\[\t !-~])*")|{token}\*\s*=\s*{mimeCharset}\'(?:{language})?\'{valueChars}'.format(**locals())

    # `result` (the HTTP response) and `url` (the requested URL) are expected
    # to be defined by the surrounding code.
    try:
        m = re.match('(?:{token}\s*;\s*)?(?:{dispositionParm})(?:\s*;\s*(?:{dispositionParm}))*|{token}'.format(**locals()), result.headers['Content-Disposition'])

    except KeyError:
        # No Content-Disposition header at all: use the basename of the URL path.
        name = path.basename(urllib.unquote(urlparse(url).path))

    else:
        # Many user agent implementations predating this specification do not
        # understand the "filename*" parameter.  Therefore, when both "filename"
        # and "filename*" are present in a single header field value, recipients
        # SHOULD pick "filename*" and ignore "filename"

        if not m:
            name = path.basename(urllib.unquote(urlparse(url).path))

        # Groups 5-8 belong to a later parameter, groups 1-4 to the first one.
        elif m.group(8) is not None:
            name = urllib.unquote(m.group(8)).decode(m.group(7))

        elif m.group(4) is not None:
            name = urllib.unquote(m.group(4)).decode(m.group(3))

        elif m.group(6) is not None:
            # Unquote quoted-pairs; r'\1' is a backreference, '\1' would be chr(1).
            name = re.sub('\\\\(.)', r'\1', m.group(6))

        elif m.group(5) is not None:
            name = m.group(5)

        elif m.group(2) is not None:
            name = re.sub('\\\\(.)', r'\1', m.group(2))

        else:
            name = m.group(1)

        # Recipients MUST NOT be able to write into any location other than one to
        # which they are specifically entitled

        if name:
            name = path.basename(name)

        else:
            name = path.basename(urllib.unquote(urlparse(url).path))

user916968 answered Oct 08 '22 15:10